[jira] [Created] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
NOT flushing memstore for region keep on printing for half an hour
------------------------------------------------------------------

                 Key: HBASE-3821
                 URL: https://issues.apache.org/jira/browse/HBASE-3821
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.1
            Reporter: zhoushuaifeng
             Fix For: 0.90.3


"NOT flushing memstore for region" kept printing in the regionserver for half an hour, so I restarted HBase. I suspect there is a deadlock or a cycle. I know that when splitting a region, doClose() of the region runs, sets writestate.writesEnabled = false, and may run a close preflush. This makes flushes fail and print "NOT flushing memstore for region", but it should finish after a while.

logs:

2011-04-18 16:28:27,960 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. because regionserver60020.cacheFlusher; priority=-1, compaction queue size=1
2011-04-18 16:28:30,171 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:30,171 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. has too many store files; delaying flush up to 9ms
2011-04-18 16:28:32,119 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
2011-04-18 16:28:32,285 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124206693, entries=5226, filesize=255913736. New hlog /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303124311822
2011-04-18 16:28:32,287 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1 hlogs to remove out of total 2; oldest outstanding sequenceid is 11037 from region 031f37c9c23fcab17797b06b90205610
2011-04-18 16:28:32,288 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old hlog file /hbase/.logs/linux253,60020,1303123943360/linux253%3A60020.1303123945481 whose highest sequenceid is 6052 to /hbase/.oldlogs/linux253%3A60020.1303123945481
2011-04-18 16:28:42,701 INFO org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 4 file(s), new file=hdfs://10.18.52.108:9000/hbase/ufdr/031f37c9c23fcab17797b06b90205610/value/4398465741579485290, size=281.4m; total size for store is 468.8m
2011-04-18 16:28:42,712 INFO org.apache.hadoop.hbase.regionserver.HRegion: completed compaction on region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610. after 1mins, 40sec
2011-04-18 16:28:42,741 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,770 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.: disabling compactions & flushes
2011-04-18 16:28:42,770 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,771 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., current region memstore size 105.6m
2011-04-18 16:28:42,818 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2011-04-18 16:28:42,846 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.
2011-04-18 16:28:42,849 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
..
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610., flushing=false, writesEnabled=false
2011-04-18 17:04:08,803 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on ufdr,,1303124043153.031f37c9c23fcab17797b06b90205610.

Mon Apr 18 17:04:24 IST 2011 Starting regionserver on linux253
ulimit -n 1024

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025103#comment-13025103 ]

zhoushuaifeng commented on HBASE-3821:
--------------------------------------

I think the problem is like this:

1. When splitting a region, it closes the parent region and sets writestate.writesEnabled = false:

{code}
private List<StoreFile> doClose(final boolean abort) throws IOException {
  synchronized (writestate) {
    // Disable compacting and flushing by background threads for this
    // region.
    writestate.writesEnabled = false;
{code}

2. If the memstore is large enough, a close preflush happens:

{code}
if (!abort && !wasFlushing && worthPreFlushing()) {
  LOG.info("Running close preflush of " + this.getRegionNameAsString());
  internalFlushcache();
}
this.closing.set(true);
lock.writeLock().lock();
{code}

3. An IOException happened, so the preflush failed and closing the parent failed:

{code}
createSplitDir(this.parent.getFilesystem(), this.splitdir);
this.journal.add(JournalEntry.CREATE_SPLIT_DIR);
List<StoreFile> hstoreFilesToSplit = this.parent.close(false);
if (hstoreFilesToSplit == null) {
{code}

4. Rollback of the split is called, but the split state is still CREATE_SPLIT_DIR, so only cleanupSplitDir() runs:

{code}
while (iterator.hasPrevious()) {
  JournalEntry je = iterator.previous();
  switch (je) {
  case CREATE_SPLIT_DIR:
    cleanupSplitDir(fs, this.splitdir);
    break;
  case CLOSED_PARENT_REGION:
{code}

5. What about writestate.writesEnabled? It stays false; nothing resets it. So even though the split is rolled back, no flush can succeed in the parent region.
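The five steps above can be condensed into a toy model. This is an illustrative sketch only (RegionModel, its journal, and the enum values are hypothetical stand-ins, not the actual HRegion/SplitTransaction code); it shows why the rollback must also restore writesEnabled, since the journal never records CLOSED_PARENT_REGION when the preflush throws:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the split/rollback sequence analyzed above.
class RegionModel {
    enum JournalEntry { CREATE_SPLIT_DIR, CLOSED_PARENT_REGION }

    boolean writesEnabled = true;
    final Deque<JournalEntry> journal = new ArrayDeque<>();

    // Mirrors doClose(): writes are disabled first, then the preflush runs
    // and may throw (step 3 in the analysis).
    void close(boolean preflushFails) {
        writesEnabled = false;                       // step 1
        if (preflushFails) {
            throw new RuntimeException("IOException during close preflush");
        }
        journal.push(JournalEntry.CLOSED_PARENT_REGION);
    }

    void startSplit(boolean preflushFails) {
        journal.push(JournalEntry.CREATE_SPLIT_DIR); // step 2's journal entry
        try {
            close(preflushFails);
        } catch (RuntimeException e) {
            rollback();                              // step 4
        }
    }

    // Proposed fix: rollback re-enables writes even when the journal never
    // reached CLOSED_PARENT_REGION, so the parent region can flush again.
    void rollback() {
        while (!journal.isEmpty()) {
            switch (journal.pop()) {
                case CREATE_SPLIT_DIR:
                    // cleanupSplitDir(...) would run here
                    break;
                case CLOSED_PARENT_REGION:
                    break;
            }
        }
        writesEnabled = true;                        // the missing reset (step 5)
    }
}
```

Without the final reset in rollback(), a failed preflush leaves writesEnabled false forever, which is exactly the endless "NOT flushing memstore" loop in the logs.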
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moaz Reyad updated HBASE-3629:
------------------------------

    Attachment: HBASE-3629.patch.zip

Here are the generated files using Thrift 0.6.1, plus the pom changes and small fixes in ThriftServer.java. Not sure if the hadoop-non-releases repository is still needed or whether it can also be removed from the pom.


Update our thrift to 0.6
------------------------

                 Key: HBASE-3629
                 URL: https://issues.apache.org/jira/browse/HBASE-3629
             Project: HBase
          Issue Type: Task
            Reporter: stack
            Assignee: Moaz Reyad
         Attachments: HBASE-3629.patch.zip, pom.diff


HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here.
[jira] [Commented] (HBASE-1744) Thrift server to match the new java api.
[ https://issues.apache.org/jira/browse/HBASE-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025153#comment-13025153 ]

Lars Francke commented on HBASE-1744:
-------------------------------------

https://issues.apache.org/jira/browse/THRIFT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024799#comment-13024799

Thrift is finally in the Maven repository!


Thrift server to match the new java api.
----------------------------------------

                 Key: HBASE-1744
                 URL: https://issues.apache.org/jira/browse/HBASE-1744
             Project: HBase
          Issue Type: Improvement
          Components: thrift
            Reporter: Tim Sell
            Assignee: Lars Francke
            Priority: Critical
             Fix For: 0.92.0
         Attachments: HBASE-1744.2.patch, HBASE-1744.preview.1.patch, thriftexperiment.patch


This mutateRows, etc. is a little confusing compared to the new, cleaner java client. Thinking of ways to make a thrift client that is just as elegant. Something like:

{code}
void put(1:Bytes table, 2:TPut put) throws (1:IOError io)
{code}

with:

{code}
struct TColumn {
  1:Bytes family,
  2:Bytes qualifier,
  3:i64 timestamp
}

struct TPut {
  1:Bytes row,
  2:map<TColumn, Bytes> values
}
{code}

This creates more verbose rpc than if the columns in TPut were just map<Bytes, map<Bytes, Bytes>>, but that is harder to fit timestamps into and still be intuitive from, say, python. Presumably the goal of a thrift gateway is to be easy first.
[jira] [Commented] (HBASE-1744) Thrift server to match the new java api.
[ https://issues.apache.org/jira/browse/HBASE-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025249#comment-13025249 ]

stack commented on HBASE-1744:
------------------------------

@Lars Some fellas figured it ahead of you (smile). See https://mail.google.com/mail/?shva=1#label/hbase-issues/12f8cf7db44d14fb
[jira] [Commented] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025257#comment-13025257 ]

Lars Francke commented on HBASE-3629:
-------------------------------------

Thanks for doing this! Please remove the central repository again, though. The finalName entry can be removed as well, as it is inherited anyway. I can't alter the patch right now but can do so tomorrow if needed. Also, the comment about the newer version doesn't apply anymore.
[jira] [Created] (HBASE-3822) region server stuck in waitOnAllRegionsToClose
region server stuck in waitOnAllRegionsToClose
----------------------------------------------

                 Key: HBASE-3822
                 URL: https://issues.apache.org/jira/browse/HBASE-3822
             Project: HBase
          Issue Type: Bug
            Reporter: Prakash Khemani


The regionserver is not able to exit because the rs thread is stuck here:

regionserver60020 prio=10 tid=0x2ab2b039e000 nid=0x760a waiting on condition [0x4365e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.waitOnAllRegionsToClose(HRegionServer.java:736)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:689)
        at java.lang.Thread.run(Thread.java:619)

===

In CloseRegionHandler.process() we do not call removeFromOnlineRegions() if there is an exception. (In this case I suspect there was a log-rolling exception because of another issue.)

{code}
// Close the region
try {
  // TODO: If we need to keep updating CLOSING stamp to prevent against
  // a timeout if this is long-running, need to spin up a thread?
  if (region.close(abort) == null) {
    // This region got closed. Most likely due to a split. So instead
    // of doing the setClosedState() below, let's just ignore and continue.
    // The split message will clean up the master state.
    LOG.warn("Can't close region: was already closed during close(): " +
      regionInfo.getRegionNameAsString());
    return;
  }
} catch (IOException e) {
  LOG.error("Unrecoverable exception while closing region " +
    regionInfo.getRegionNameAsString() + ", still finishing close", e);
}

this.rsServices.removeFromOnlineRegions(regionInfo.getEncodedName());
{code}

===

I think once we set the closing flag on the region, it won't take any more requests; it is as good as offline. Either we should refine the check in waitOnAllRegionsToClose(), or CloseRegionHandler.process() should remove the region from the online-regions set.

--
This message is automatically generated by JIRA.
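The second option in the report — having CloseRegionHandler.process() remove the region from the online set even when close() throws — can be sketched as follows. Region, RegionServices, and process() here are simplified stand-ins for the HBase types, not the actual handler code:

```java
import java.io.IOException;

// Sketch: run removeFromOnlineRegions() in a finally block so an
// IOException during close() cannot leave the region in the online set
// and wedge waitOnAllRegionsToClose(). The interfaces are illustrative
// stand-ins for the HBase types.
interface Region {
    Object close(boolean abort) throws IOException;
    String getEncodedName();
}

interface RegionServices {
    void removeFromOnlineRegions(String encodedName);
}

class CloseRegionSketch {
    static void process(Region region, RegionServices rsServices, boolean abort) {
        boolean closedBySplit = false;
        try {
            if (region.close(abort) == null) {
                // Already closed, most likely by a split; the split message
                // cleans up master state, so skip the removal here.
                closedBySplit = true;
            }
        } catch (IOException e) {
            // Unrecoverable, but still finish the close bookkeeping below.
        } finally {
            if (!closedBySplit) {
                rsServices.removeFromOnlineRegions(region.getEncodedName());
            }
        }
    }
}
```

The finally block preserves the existing split-path behavior (no removal when close() returns null) while guaranteeing the removal on both the success and exception paths.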
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025363#comment-13025363 ]

stack commented on HBASE-3821:
------------------------------

Excellent digging Zhou! Yes, if the preflush fails, we need to undo the work done by 'writestate.writesEnabled = false;'. If you have a patch, that'd be great. Was the failure to split a transitory error? Were you able to flush the memstore successfully later?
[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss
[ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025364#comment-13025364 ]

stack commented on HBASE-3065:
------------------------------

Ping Liyin!


Retry all 'retryable' zk operations; e.g. connection loss
---------------------------------------------------------

                 Key: HBASE-3065
                 URL: https://issues.apache.org/jira/browse/HBASE-3065
             Project: HBase
          Issue Type: Bug
            Reporter: stack
            Assignee: Liyin Tang
             Fix For: 0.92.0
         Attachments: HBase-3065[r1088475]_1.patch


The 'new' master refactored our zk code, tidying up all zk accesses and corralling them behind nice zk utility classes. One improvement was letting out all KeeperExceptions, letting the client deal. That's good generally because in the old days we'd suppress important zk state changes. But there is at least one case the new zk utility could handle for the application, and that's the class of retryable KeeperExceptions. The one that comes to mind is connection loss. On connection loss we should retry the just-failed operation. Usually the retry will just work. At worst, on reconnect, we'll pick up the expired session event.

Adding in this change shouldn't be too bad given the refactor of zk corralled all zk access into one or two classes only. One thing to consider, though, is how much we should retry. We could retry on a timer, or we could retry forever as long as the Stoppable interface is passed, so that if another thread has stopped or aborted the hosting service, we'll notice and give up trying. Doing the latter is probably better than some kind of timeout.

HBASE-3062 adds a timed retry on the first zk operation. This issue is about generalizing what is over there across all zk access.
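The retry-until-stopped policy described in the issue can be sketched generically. This is a hypothetical helper, not ZKUtil's actual API; RetryableException stands in for retryable KeeperExceptions such as connection loss:

```java
import java.util.function.Supplier;

// Generic sketch of the retry policy described above: retry a "retryable"
// failure (e.g. connection loss) indefinitely, but give up as soon as the
// hosting service is stopped. Stoppable and RetryableException are
// stand-ins for the HBase/ZooKeeper types.
class ZkRetrySketch {
    interface Stoppable { boolean isStopped(); }

    // Stand-in for e.g. KeeperException.ConnectionLossException.
    static class RetryableException extends RuntimeException {}

    static <T> T retryUntilStopped(Supplier<T> op, Stoppable host, long sleepMs) {
        while (!host.isStopped()) {
            try {
                return op.get();
            } catch (RetryableException e) {
                // Usually a plain retry just works; at worst, on reconnect
                // we pick up the expired-session event instead.
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IllegalStateException("host stopped before operation succeeded");
    }
}
```

Passing the Stoppable in (rather than a timeout) is the design the issue argues for: the loop exits promptly when another thread stops or aborts the hosting service.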
[jira] [Commented] (HBASE-3674) Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
[ https://issues.apache.org/jira/browse/HBASE-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025366#comment-13025366 ]

stack commented on HBASE-3674:
------------------------------

Committed to TRUNK. Thanks for the review, Prakash.


Treat ChecksumException as we would a ParseException splitting logs; else we replay split on every restart
----------------------------------------------------------------------------------------------------------

                 Key: HBASE-3674
                 URL: https://issues.apache.org/jira/browse/HBASE-3674
             Project: HBase
          Issue Type: Bug
          Components: wal
            Reporter: stack
            Assignee: stack
            Priority: Critical
             Fix For: 0.90.2
         Attachments: 3674-distributed.txt, 3674-v2.txt, 3674.txt


In short, a ChecksumException will fail log processing for a server, so we skip out w/o archiving logs. On restart, we'll then reprocess the logs -- hit the ChecksumException anew, usually -- and so on.

Here is the splitLog method (edited):

{code}
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  outputSink.startWriterThreads(entryBuffers);
  try {
    int i = 0;
    for (FileStatus log : logfiles) {
      Path logPath = log.getPath();
      long logLength = log.getLen();
      splitSize += logLength;
      LOG.debug("Splitting hlog " + (i++ + 1) + " of " + logfiles.length +
        ": " + logPath + ", length=" + logLength);
      try {
        recoverFileLease(fs, logPath, conf);
        parseHLog(log, entryBuffers, fs, conf);
        processedLogs.add(logPath);
      } catch (EOFException eof) {
        // truncated files are expected if a RS crashes (see HBASE-2643)
        LOG.info("EOF from hlog " + logPath + ". Continuing");
        processedLogs.add(logPath);
      } catch (FileNotFoundException fnfe) {
        // A file may be missing if the region server was able to archive it
        // before shutting down. This means the edits were persisted already
        LOG.info("A log was missing " + logPath +
          ", probably because it was moved by the now dead region server. Continuing");
        processedLogs.add(logPath);
      } catch (IOException e) {
        // If the IOE resulted from bad file format,
        // then this problem is idempotent and retrying won't help
        if (e.getCause() instanceof ParseException ||
            e.getCause() instanceof ChecksumException) {
          LOG.warn("ParseException from hlog " + logPath + ". continuing");
          processedLogs.add(logPath);
        } else {
          if (skipErrors) {
            LOG.info("Got while parsing hlog " + logPath +
              ". Marking as corrupted", e);
            corruptedLogs.add(logPath);
          } else {
            throw e;
          }
        }
      }
    }
    if (fs.listStatus(srcDir).length > processedLogs.size() + corruptedLogs.size()) {
      throw new OrphanHLogAfterSplitException(
          "Discovered orphan hlog after split. Maybe the " +
          "HRegionServer was not dead when we started");
    }
    archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  } finally {
    splits = outputSink.finishWritingAndClose();
  }
  return splits;
}
{code}

Notice how we'll only archive logs if we successfully split all logs. We won't archive 31 of 35 files if we happen to get a checksum exception on file 32. I think we should treat a ChecksumException the same as a ParseException; a retry will not fix it if HDFS could not get around the ChecksumException (it seems like in our case all replicas were corrupt).

Here is a play-by-play from the logs:

{code}
813572 2011-03-18 20:31:44,687 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 34 of 35: hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481, length=15065662
813573 2011-03-18 20:31:44,687 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://sv2borg170:9000/hbase/.logs/sv2borg182,60020,1300384550664/sv2borg182%3A60020.1300461329481
813617 2011-03-18 20:31:46,238 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: b[0, 512]=00cd00502037383661376439656265643938636463343433386132343631323633303239371d6170695f6163636573735f746f6b656e5f73746174735f6275636b6574000d9fa4d5dc012ec9c7cbaf000001006d005d0008002337626262663764626431616561366234616130656334383436653732333132643a32390764656661756c746170695f616e64726f69645f6c6f67676564
{code}
[jira] [Commented] (HBASE-3065) Retry all 'retryable' zk operations; e.g. connection loss
[ https://issues.apache.org/jira/browse/HBASE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025369#comment-13025369 ]

Liyin Tang commented on HBASE-3065:
-----------------------------------

Hi Stack, I am so sorry for the delay :) I will fix this and submit a new patch. Thanks for the review :)
[jira] [Commented] (HBASE-3777) Redefine Identity Of HBase Configuration
[ https://issues.apache.org/jira/browse/HBASE-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025368#comment-13025368 ] jirapos...@reviews.apache.org commented on HBASE-3777: -- bq. On 2011-04-25 20:05:54, Michael Stack wrote: bq. src/main/java/org/apache/hadoop/hbase/client/HTable.java, line 259 bq. https://reviews.apache.org/r/643/diff/3/?file=16912#file16912line259 bq. bq. Yeah, this is ugly its almost as though you should have a special method for it, one that does not up the counters? bq. bq. Karthick Sankarachary wrote: bq. Just a thought - how about if we hide the ugliness in HCM, like so: bq. bq.public abstract class ConnectableT { bq. public Configuration conf; bq. bq. public Connectable(Configuration conf) { bq.this.conf = conf; bq. } bq. bq. public abstract T connect(Connection connection); bq.} bq. bq.public static T T execute(ConnectableT connectable) { bq. if (connectable == null || connectable.conf == null) { bq.return null; bq. } bq. HConfiguration conf = connectable.conf; bq. HConnection connection = HConnectionManager.getConnection(conf); bq. try { bq.return connectable.connect(connection); bq. } finally { bq.HConnectionManager.deleteConnection(conf, false); bq. } bq.} bq. bq. That way, the HTable call would look somewhat prettier: bq. bq.HConnectionManager.execute(new ConnectableBoolean(conf) { bq. public Boolean connect(Connection connection) { bq.return connection.isTableEnabled(tableName); bq. } bq.}); bq. bq. Karthick Sankarachary wrote: bq. BTW, if we bypass the reference counters in this situation, there's a chance, albeit small, that the connection might get closed by someone else while this guy is still trying to talk to it, which could result in a connection is closed type of error. 
Your proposal is also ugly but I think less ugly than what we currently have so I would prefer it; it has the benefit of moving the ref counting back into HCM, not letting it out of the class (I'm fine w/ all your other comments Karthick) - Michael --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/643/#review543 --- On 2011-04-22 21:16:59, Karthick Sankarachary wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/643/ bq. --- bq. bq. (Updated 2011-04-22 21:16:59) bq. bq. bq. Review request for hbase and Ted Yu. bq. bq. bq. Summary bq. --- bq. bq. Judging from the javadoc in HConnectionManager, sharing connections across multiple clients going to the same cluster is supposedly a good thing. However, the fact that there is a one-to-one mapping between a configuration and connection instance, kind of works against that goal. Specifically, when you create HTable instances using a given Configuration instance and a copy thereof, we end up with two distinct HConnection instances under the covers. Is this really expected behavior, especially given that the configuration instance gets cloned a lot? bq. bq. Here, I'd like to play devil's advocate and propose that we deep-compare HBaseConfiguration instances, so that multiple HBaseConfiguration instances that have the same properties map to the same HConnection instance. In case one is concerned that a single HConnection is insufficient for sharing amongst clients, to quote the javadoc, then one should be able to mark a given HBaseConfiguration instance as being uniquely identifiable. bq. bq. bq. This addresses bug HBASE-3777. bq. https://issues.apache.org/jira/browse/HBASE-3777 bq. bq. bq. Diffs bq. - bq. 
bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq.src/main/java/org/apache/hadoop/hbase/client/HTablePool.java 88827a8 bq.src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java 9e3f4d1 bq. src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java d76e333 bq. src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java ed88bfa bq.
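The Connectable/execute idiom proposed in the review above can be sketched without any HBase dependencies. Everything here is an illustrative stand-in (the Connection class, checkEnabled, and so on), not the HConnectionManager API, and the conf plumbing from the proposal is omitted for brevity; the point is only the shape: the manager owns acquire/release of the shared resource, and the caller supplies just the work to run against it.

```java
// Minimal, HBase-free sketch of the execute-around idiom from the review.
// All names are hypothetical stand-ins, not the real HBase client classes.
public class ConnectableSketch {
    // Stands in for HConnection: tracks open/closed state.
    public static class Connection {
        boolean open = true;
        void close() { open = false; }
        boolean isTableEnabled(String table) { return open; }
    }

    // Stands in for HConnectionManager.Connectable<T>: the caller's work.
    public abstract static class Connectable<T> {
        public abstract T connect(Connection connection);
    }

    // Stands in for HConnectionManager.execute: acquire the connection, run
    // the callback, and release in a finally block so the reference counting
    // cannot leak even if connect() throws.
    public static <T> T execute(Connectable<T> connectable) {
        if (connectable == null) {
            return null;
        }
        Connection connection = new Connection();
        try {
            return connectable.connect(connection);
        } finally {
            connection.close();  // deleteConnection(conf, false) in the proposal
        }
    }

    // The call-site shape proposed for HTable.isTableEnabled.
    public static Boolean checkEnabled(final String tableName) {
        return execute(new Connectable<Boolean>() {
            public Boolean connect(Connection connection) {
                return connection.isTableEnabled(tableName);
            }
        });
    }
}
```

This keeps the ref counting inside one class, which is the benefit Stack calls out in the follow-up comment.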
[jira] [Commented] (HBASE-3823) NPE in ZKAssign.transitionNode
[ https://issues.apache.org/jira/browse/HBASE-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025399#comment-13025399 ] stack commented on HBASE-3823: -- Is this HBASE-3627 (fixed in 0.90.2)? NPE in ZKAssign.transitionNode -- Key: HBASE-3823 URL: https://issues.apache.org/jira/browse/HBASE-3823 Project: HBase Issue Type: Bug Reporter: Prakash Khemani This issue led to a region being multiply assigned. hbck output ERROR: Region realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. is listed in META on region server pumahbase107.snc5.facebook.com:60020 but is multiply assigned to region servers pumahbase150.snc5.facebook.com:60020, pumahbase107.snc5.facebook.com:60020 === 2011-04-25 09:11:36,844 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION java.lang.NullPointerException at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75) at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672) at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpened(ZKAssign.java:621) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:168) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619)

byte[] existingBytes = ZKUtil.getDataNoWatch(zkw, node, stat);
RegionTransitionData existingData = RegionTransitionData.fromBytes(existingBytes);

existingBytes can be null; transitionNode has to return -1 when it is. 
=== master logs 2011-04-25 05:24:03,250 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=hdfs://pumahbase002-snc5-dfs.data.facebook.com:9000/PUMAHBASE002-SNC5-HBASE/realtime_domain_imps_urls/e7a478b4bd164525052f1dedb832de0a/recovered.edits/57528037047 region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:19,246 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://pumahbase002-snc5-dfs.data.facebook.com:9000/PUMAHBASE002-SNC5-HBASE/realtime_domain_imps_urls/e7a478b4bd164525052f1dedb832de0a/recovered.edits/57528037047 (wrote 4342690 edits in 46904ms) 2011-04-25 09:09:26,134 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x32f7bb74e8a Creating (or updating) unassigned node for e7a478b4bd164525052f1dedb832de0a with OFFLINE state 2011-04-25 09:09:26,136 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. so generated a random one; hri=realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a., src=, dest=pumahbase107.snc5.facebook.com,60020,1303450731227; 70 (online=70, exclude=null) available servers 2011-04-25 09:09:26,136 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region realtime_domain_imps_urls,afbe,1295556905482.e7a478b4bd164525052f1dedb832de0a. 
to pumahbase107.snc5.facebook.com,60020,1303450731227 2011-04-25 09:09:26,139 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:44,045 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:09:59,050 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:14,054 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:29,055 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=pumahbase107.snc5.facebook.com,60020,1303450731227, region=e7a478b4bd164525052f1dedb832de0a 2011-04-25 09:10:44,060 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING,
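The null guard the report asks for can be sketched stand-alone as follows; readZnode and transitionNodeGuarded are hypothetical stand-ins for ZKUtil.getDataNoWatch and ZKAssign.transitionNode, with -1 as the failure code the comment mentions.

```java
// Sketch of the guard: getDataNoWatch can return null when the znode is
// gone, so bail out with -1 instead of handing null to fromBytes (which is
// where the NPE in the stack trace comes from).
public class TransitionGuard {
    // Stands in for ZKUtil.getDataNoWatch: may return null.
    static byte[] readZnode(boolean exists) {
        return exists ? new byte[] {1, 2, 3} : null;
    }

    public static int transitionNodeGuarded(boolean znodeExists) {
        byte[] existingBytes = readZnode(znodeExists);
        if (existingBytes == null) {
            return -1;  // node vanished; report failure rather than NPE
        }
        // RegionTransitionData.fromBytes(existingBytes) would run here.
        return existingBytes.length;
    }
}
```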
[jira] [Commented] (HBASE-3822) region server stuck in waitOnAllRegionsToClose
[ https://issues.apache.org/jira/browse/HBASE-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025443#comment-13025443 ] Prakash Khemani commented on HBASE-3822: The code snippet that I pointed out doesn't have a problem - that piece of code will remove the region from online regions even if there is an exception. Sorry for the confusion. I don't really know why the onlineRegions set was not cleaned up. region server stuck in waitOnAllRegionsToClose -- Key: HBASE-3822 URL: https://issues.apache.org/jira/browse/HBASE-3822 Project: HBase Issue Type: Bug Reporter: Prakash Khemani The regionserver is not able to exit because the rs thread is stuck here regionserver60020 prio=10 tid=0x2ab2b039e000 nid=0x760a waiting on condition [0x4365e000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:126) at org.apache.hadoop.hbase.regionserver.HRegionServer.waitOnAllRegionsToClose(HRegionServer.java:736) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:689) at java.lang.Thread.run(Thread.java:619) === In CloseRegionHandler.process() we do not call removeFromOnlineRegions() if there is an exception. (In this case I suspect there was a log-rolling exception because of another issue)

// Close the region
try {
  // TODO: If we need to keep updating CLOSING stamp to prevent against
  // a timeout if this is long-running, need to spin up a thread?
  if (region.close(abort) == null) {
    // This region got closed. Most likely due to a split. So instead
    // of doing the setClosedState() below, let's just ignore and continue.
    // The split message will clean up the master state.
    LOG.warn("Can't close region: was already closed during close(): " +
        regionInfo.getRegionNameAsString());
    return;
  }
} catch (IOException e) {
  LOG.error("Unrecoverable exception while closing region " +
      regionInfo.getRegionNameAsString() + ", still finishing close", e);
}
this.rsServices.removeFromOnlineRegions(regionInfo.getEncodedName());

=== I think since we set the closing flag on the region, it won't be taking any more requests; it is as good as offline. Either we should refine the check in waitOnAllRegionsToClose() or CloseRegionHandler.process() should remove the region from the online-regions set. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
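One way to make the cleanup unconditional, in the spirit of the last suggestion above, is a finally block. This is a minimal stand-alone sketch with illustrative names (a boolean flag standing in for the online-regions set, a RuntimeException standing in for the IOException path), not the actual CloseRegionHandler code.

```java
// Sketch: move the remove-from-online-regions step into finally so the
// region leaves the online set no matter how close() exits. Names and
// exception types are stand-ins for illustration only.
public class CloseSketch {
    public static boolean online;

    public static void process(boolean closeThrows) {
        online = true;
        try {
            if (closeThrows) {
                // stands in for the log-rolling exception during close
                throw new RuntimeException("close failed");
            }
        } catch (RuntimeException e) {
            // "still finishing close": swallow and fall through to cleanup
        } finally {
            online = false;  // removeFromOnlineRegions(...) in the real code
        }
    }
}
```

With this shape, waitOnAllRegionsToClose would see the region gone even on the exceptional path.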
[jira] [Resolved] (HBASE-3823) NPE in ZKAssign.transitionNode
[ https://issues.apache.org/jira/browse/HBASE-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani resolved HBASE-3823. Resolution: Duplicate Release Note: fixed in HBASE-3627 NPE in ZKAssign.transitionNode -- Key: HBASE-3823 URL: https://issues.apache.org/jira/browse/HBASE-3823 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Created] (HBASE-3824) region server timed out during open region
region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani When replaying a large log file, memstore flushes can happen. But there is no Progressable report being sent during memstore flushes. That can lead to the master timing out the region server during region open. === Another related issue and Jonathan's response So if a region server that is handed a region for opening has done part of the work ... it has created some HFiles (because the logs were so huge that the memstore got flushed while the logs were being replayed) ... and then it is asked to give up because the master thought the region server was taking too long to open the region. When the region server gives up on the region, will it make sure that it removes all the HFiles it had created for that region? Will need to check the code, but would it matter? One issue is whether it cleans up after itself (I'm guessing not). Another issue is whether the replay is idempotent (duplicate KVs across files shouldn't matter in most cases). 
[jira] [Commented] (HBASE-3484) Replace memstore's ConcurrentSkipListMap with our own implementation
[ https://issues.apache.org/jira/browse/HBASE-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025463#comment-13025463 ] Joe Pallas commented on HBASE-3484: --- This issue was cited by jdcryans as related to unfortunate performance seen in the following case: A test program fills a single row of a family with tens of thousands of sequentially increasing qualifiers. Then it performs random gets (or exists) of those qualifiers. The response time seen is (on average) proportional to the ordinal position of the qualifier. If the table is flushed before the random tests begin, then the average response time is basically constant, independent of the qualifier's ordinal position. I'm not sure that either of the two points in the description actually covers this case, but I don't know enough to say. Replace memstore's ConcurrentSkipListMap with our own implementation Key: HBASE-3484 URL: https://issues.apache.org/jira/browse/HBASE-3484 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Fix For: 0.92.0 By copy-pasting ConcurrentSkipListMap into HBase we can make two improvements to it for our use case in MemStore: - add an iterator.replace() method which should allow us to do upsert much more cheaply - implement a Set directly without having to do Map<KeyValue,KeyValue> to save one reference per entry It turns out CSLM is in public domain from its development as part of JSR 166, so we should be OK with licenses.
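The upsert cost the first bullet targets can be seen on a stock ConcurrentSkipListMap: replacing an entry whose key changes takes a remove plus a put, i.e. two O(log n) skip-list walks, which an iterator.replace() would collapse into one. This is a toy sketch with Long keys standing in for KeyValues, not the memstore code; the value reference stored alongside each key is also the per-entry overhead the proposed Set-based implementation would save.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model of a memstore upsert on a stock CSLM: two traversals where an
// iterator.replace() would need only one. Keys are stand-ins for KeyValues.
public class UpsertSketch {
    public static long upsertThenFirst(long oldKey, long newKey) {
        ConcurrentSkipListMap<Long, Long> memstore =
            new ConcurrentSkipListMap<Long, Long>();
        memstore.put(oldKey, oldKey);
        memstore.remove(oldKey);       // first O(log n) traversal
        memstore.put(newKey, newKey);  // second O(log n) traversal
        return memstore.firstKey();
    }
}
```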
[jira] [Commented] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025483#comment-13025483 ] Jean-Daniel Cryans commented on HBASE-3824: --- So what's the issue about exactly? We expect region server to time out opening AFAIK, so is the problem more about the idempotent nature of opening a region and then failing at doing it when it's assigned somewhere else? region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Resolved] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Khemani resolved HBASE-3824. Resolution: Not A Problem region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Resolved] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3629. -- Resolution: Fixed Hadoop Flags: [Reviewed] Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here.
[jira] [Commented] (HBASE-3824) region server timed out during open region
[ https://issues.apache.org/jira/browse/HBASE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025491#comment-13025491 ] Prakash Khemani commented on HBASE-3824: Probably not an issue. The memstore flush happens in the background and cannot cause the log-replay thread to block. My mistake. I will close this. region server timed out during open region -- Key: HBASE-3824 URL: https://issues.apache.org/jira/browse/HBASE-3824 Project: HBase Issue Type: Bug Reporter: Prakash Khemani 
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3629: - Release Note: Updated our thrift to 0.6.1. Incompatible change with previous HBase thrift. Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Fix For: 0.92.0 Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3629) Update our thrift to 0.6
[ https://issues.apache.org/jira/browse/HBASE-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3629: - Fix Version/s: 0.92.0 Resolved. Applied to TRUNK (with Lars Francke's suggested changes). Thanks for the patch Moaz (it's missing the Apache license on the generated files but I think that's ok -- until someone tells me otherwise). Thanks for the review Lars. Update our thrift to 0.6 Key: HBASE-3629 URL: https://issues.apache.org/jira/browse/HBASE-3629 Project: HBase Issue Type: Task Reporter: stack Assignee: Moaz Reyad Fix For: 0.92.0 Attachments: HBASE-3629.patch.zip, pom.diff HBASE-3117 was about updating to 0.5. Moaz Reyad over in that issue is trying to move us to 0.6. Let's move the 0.6 upgrade effort here. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-3741) OpenRegionHandler and CloseRegionHandler are possibly racing
[ https://issues.apache.org/jira/browse/HBASE-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-3741: -- Attachment: HBASE-3741-rsfix-v3.patch Takes care of what Stack mentioned in his review except for getRegionsInTransitionInRS that needs to be public (in the scope of RegionServerServices). OpenRegionHandler and CloseRegionHandler are possibly racing Key: HBASE-3741 URL: https://issues.apache.org/jira/browse/HBASE-3741 Project: HBase Issue Type: Bug Affects Versions: 0.90.1 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Blocker Fix For: 0.90.3 Attachments: HBASE-3741-rsfix-v2.patch, HBASE-3741-rsfix-v3.patch, HBASE-3741-rsfix.patch This is a serious issue about a race between regions being opened and closed in region servers. We had this situation where the master tried to unassign a region for balancing, failed, force unassigned it, force assigned it somewhere else, failed to open it on another region server (took too long), and then reassigned it back to the original region server. A few seconds later, the region server processed the first close and the region was left unassigned. 
This is from the master log: {quote} 11-04-05 15:11:17,758 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=sv4borg42,60020,1300920459477, load=(requests=187, regions=574, usedHeap=3918, maxHeap=6973) for region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 2011-04-05 15:12:10,021 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_CLOSE, ts=1302041477758 2011-04-05 15:12:10,021 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 ... 2011-04-05 15:14:45,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=CLOSED, ts=1302041685733 2011-04-05 15:14:45,783 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x42ec2cece810b68 Creating (or updating) unassigned node for 1470298961 with OFFLINE state ... 
2011-04-05 15:14:45,885 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961; plan=hri=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961, src=sv4borg42,60020,1300920459477, dest=sv4borg40,60020,1302041218196 2011-04-05 15:14:45,885 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 to sv4borg40,60020,1302041218196 2011-04-05 15:15:39,410 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_OPEN, ts=1302041700944 2011-04-05 15:15:39,410 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 state=PENDING_OPEN, ts=1302041700944 ... 
2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 so generated a random one; hri=stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961, src=, dest=sv4borg42,60020,1300920459477; 19 (online=19, exclude=null) available servers 2011-04-05 15:15:39,410 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region stumbles_by_userid2,\x00'\x8E\xE8\x7F\xFF\xFE\xE7\xA9\x97\xFC\xDF\x01\x10\xCC6,1266566087256.1470298961 to sv4borg42,60020,1300920459477 2011-04-05 15:15:40,951 DEBUG
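The direction of the fix mentioned above (a region-server-side set of regions in transition, which the open and close handlers consult so they can't race on the same region) can be sketched roughly as follows. Class and method names are illustrative stand-ins, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hedged sketch: the region server tracks regions currently in transition,
// and an open or close handler only proceeds if it can claim the region.
// A competing handler for the same region backs off instead of racing.
public class RegionsInTransition {
    private final ConcurrentMap<String, Boolean> inTransition = new ConcurrentHashMap<>();

    /** Try to claim the region for an open (true) or close (false). */
    boolean tryBegin(String encodedName, boolean opening) {
        // putIfAbsent returns null only when no handler holds the region.
        return inTransition.putIfAbsent(encodedName, opening) == null;
    }

    /** Release the claim once the open/close completes or fails. */
    void done(String encodedName) {
        inTransition.remove(encodedName);
    }

    public static void main(String[] args) {
        RegionsInTransition rit = new RegionsInTransition();
        System.out.println(rit.tryBegin("1470298961", true));  // true: open claimed
        System.out.println(rit.tryBegin("1470298961", false)); // false: close must wait
        rit.done("1470298961");
        System.out.println(rit.tryBegin("1470298961", false)); // true: now closable
    }
}
```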
[jira] [Resolved] (HBASE-3805) Log RegionState that are processed too late in the master
[ https://issues.apache.org/jira/browse/HBASE-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-3805. --- Resolution: Fixed Committed the additional logging. Log RegionState that are processed too late in the master -- Key: HBASE-3805 URL: https://issues.apache.org/jira/browse/HBASE-3805 Project: HBase Issue Type: Improvement Affects Versions: 0.90.2 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Minor Fix For: 0.90.3 Attachments: HBASE-3805.patch Working on all the weird delayed processing in the master, I saw that it was hard to figure out when a zookeeper event is processed too late. For example, cases where the processing of the events gets too slow and the master takes more than a minute after the event is triggered in the region server to get to its processing. We should at least print that out. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025520#comment-13025520 ] jirapos...@reviews.apache.org commented on HBASE-1502: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/647/ --- (Updated 2011-04-26 23:50:23.656187) Review request for hbase. Changes --- All tests pass now. I'd like to get this patch in soon. I'm currently spending a good bit of my time trying to keep this patch up with current TRUNK. I'd rather commit and then address issues after. This version of the patch does make one significant change though, in that it deprecates prewarmRegionCache. IMO this is a burdensome feature that is little used; I'd like to have it die off. Summary --- This patch does not completely remove heartbeats. It unburdens the heartbeat of control messages; now the heartbeat is used to send the master load only (at the most recent hackathon we had rough agreement that we'd keep the heartbeat to carry load)... if we miss some, no biggie. The RPC version changed on HMasterRegionInterface since the regionServerStartup and regionServerReport arguments have changed. We pass a String now instead of HServerAddress, so this should help with our DNS issues where the two sides disagree. Removed HMsg. HServerAddress has been sort_of_deprecated. It's in our API so we can't remove it easily (it's embedded inside HRegionLocation). Otherwise, we don't use it internally anymore. HServerInfo is deprecated. Server metadata is now available in the new class ServerName, and load lives apart from HSI now. Fixed up regionserver and master startup so they now look the same. New tests. Cruft cleanup. This addresses bug hbase-1502. 
https://issues.apache.org/jira/browse/hbase-1502 Diffs (updated) - src/main/java/org/apache/hadoop/hbase/ClusterStatus.java 26a8bef src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 src/main/java/org/apache/hadoop/hbase/HMsg.java 87beb00 src/main/java/org/apache/hadoop/hbase/HRegionLocation.java bd353b8 src/main/java/org/apache/hadoop/hbase/HServerAddress.java 7f8a472 src/main/java/org/apache/hadoop/hbase/HServerInfo.java 0b5bd94 src/main/java/org/apache/hadoop/hbase/HServerLoad.java 2372053 src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java 0d696ab src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1da9742 src/main/java/org/apache/hadoop/hbase/Server.java df396fa src/main/java/org/apache/hadoop/hbase/ServerName.java PRE-CREATION src/main/java/org/apache/hadoop/hbase/avro/AvroUtil.java d7a1e67 src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 src/main/java/org/apache/hadoop/hbase/catalog/MetaEditor.java c2ee031 src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 6e22cf5 src/main/java/org/apache/hadoop/hbase/catalog/RootLocationEditor.java aee64c5 src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a src/main/java/org/apache/hadoop/hbase/client/HConnection.java 2bb4725 src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 src/main/java/org/apache/hadoop/hbase/client/RetriesExhaustedWithDetailsException.java 6c62024 src/main/java/org/apache/hadoop/hbase/coprocessor/BaseMasterObserver.java 8df6aa4 src/main/java/org/apache/hadoop/hbase/coprocessor/MasterObserver.java d64817f src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java c22e342 src/main/java/org/apache/hadoop/hbase/executor/RegionTransitionData.java a55f9d6 src/main/java/org/apache/hadoop/hbase/io/HbaseObjectWritable.java d8f8463 src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java ec28de4 
src/main/java/org/apache/hadoop/hbase/ipc/HMasterRegionInterface.java 25139b3 src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 663cab5 src/main/java/org/apache/hadoop/hbase/ipc/WritableRpcEngine.java 2273e55 src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java 66a3345 src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 58fdb30 src/main/java/org/apache/hadoop/hbase/master/DeadServer.java 05600c4 src/main/java/org/apache/hadoop/hbase/master/HMaster.java 79a48ba src/main/java/org/apache/hadoop/hbase/master/LoadBalancer.java 6c92cbc src/main/java/org/apache/hadoop/hbase/master/MasterCoprocessorHost.java 4bb072e src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java 55e0162 src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 04befe9 src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java dada818
[jira] [Updated] (HBASE-3794) TestRpcMetrics fails on machine where region server is running
[ https://issues.apache.org/jira/browse/HBASE-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3794: - Resolution: Fixed Fix Version/s: 0.90.3 Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Committed branch and trunk. Thanks for the patch Alex. TestRpcMetrics fails on machine where region server is running -- Key: HBASE-3794 URL: https://issues.apache.org/jira/browse/HBASE-3794 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.90.2 Reporter: Ted Yu Assignee: Alex Newman Fix For: 0.90.3 Attachments: HBASE-3794.patch Since the whole test suite takes over an hour to run, I ran the tests on a Linux machine where a region server is running. Here is the consistent TestRpcMetrics failure I saw:
{code}
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.196 sec FAILURE!
testCustomMetrics(org.apache.hadoop.hbase.regionserver.TestRpcMetrics) Time elapsed: 0.079 sec ERROR!
java.net.BindException: Problem binding to /10.202.50.107:60020 : Address already in use
at org.apache.hadoop.hbase.ipc.HBaseServer.bind(HBaseServer.java:216)
at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.init(HBaseServer.java:283)
at org.apache.hadoop.hbase.ipc.HBaseServer.init(HBaseServer.java:1189)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.init(WritableRpcEngine.java:266)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getServer(WritableRpcEngine.java:233)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getServer(WritableRpcEngine.java:46)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getServer(HBaseRPC.java:379)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getServer(HBaseRPC.java:368)
at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:336)
at org.apache.hadoop.hbase.regionserver.TestRpcMetrics$TestRegionServer.init(TestRpcMetrics.java:58)
at org.apache.hadoop.hbase.regionserver.TestRpcMetrics.testCustomMetrics(TestRpcMetrics.java:119)
{code}
-- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3811) Allow adding attributes to Scan
[ https://issues.apache.org/jira/browse/HBASE-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025522#comment-13025522 ] stack commented on HBASE-3811: -- You could change Scan so it passes an optional Map. Add an attributes method to Scan. If attributes > 0, instantiate a Map and then serialize it out. On the other end, check for a null or non-null Map when deserializing (we don't want to carry the Map for every Scan, I'd say -- just carry it when attributes are present... but perhaps I'm doing premature optimization here). This could become more important now that we have CPs. Allow adding attributes to Scan --- Key: HBASE-3811 URL: https://issues.apache.org/jira/browse/HBASE-3811 Project: HBase Issue Type: Improvement Components: client Reporter: Alex Baranau Priority: Minor There's sometimes a need to add a custom attribute to a Scan object so that it can be accessed on the server side. An example of a case where it is needed is discussed here: http://search-hadoop.com/m/v3Jtb2GkiO. There might be other cases where it is useful, mostly about logging/gathering stats on the server side. An alternative to allowing arbitrary custom attributes on a scan could be adding some fixed field, like a type, to the class. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
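The suggestion above — serialize the attributes Map only when attributes are present, so plain Scans carry no extra payload — can be sketched like this. The class and method names are illustrative, not HBase's actual Scan API:

```java
import java.io.*;
import java.util.*;

// Hedged sketch: an optional, lazily created attributes Map that is only
// written out when non-empty. A count of 0 on the wire means "no Map", and
// the deserializer leaves the field null in that case.
public class OptionalAttributes {
    private Map<String, byte[]> attributes; // null until first attribute set

    public void setAttribute(String name, byte[] value) {
        if (attributes == null) attributes = new HashMap<>();
        attributes.put(name, value);
    }

    public byte[] getAttribute(String name) {
        return attributes == null ? null : attributes.get(name);
    }

    public void write(DataOutput out) throws IOException {
        if (attributes == null) {
            out.writeInt(0); // nothing to carry for this Scan
            return;
        }
        out.writeInt(attributes.size());
        for (Map.Entry<String, byte[]> e : attributes.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
            out.write(e.getValue());
        }
    }

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        if (n == 0) { attributes = null; return; } // stay null when absent
        attributes = new HashMap<>();
        for (int i = 0; i < n; i++) {
            String name = in.readUTF();
            byte[] value = new byte[in.readInt()];
            in.readFully(value);
            attributes.put(name, value);
        }
    }

    public static void main(String[] args) throws IOException {
        OptionalAttributes scanLike = new OptionalAttributes();
        scanLike.setAttribute("trace.id", "req-42".getBytes());
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        scanLike.write(new DataOutputStream(bos));
        OptionalAttributes copy = new OptionalAttributes();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(new String(copy.getAttribute("trace.id"))); // req-42
    }
}
```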
[jira] [Resolved] (HBASE-3210) HBASE-1921 for the new master
[ https://issues.apache.org/jira/browse/HBASE-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3210. -- Resolution: Fixed Assignee: Subbu M Iyer Hadoop Flags: [Reviewed] Committed to TRUNK. Thank you for the patch Subbu. HBASE-1921 for the new master - Key: HBASE-3210 URL: https://issues.apache.org/jira/browse/HBASE-3210 Project: HBase Issue Type: Improvement Reporter: Jean-Daniel Cryans Assignee: Subbu M Iyer Priority: Critical Fix For: 0.92.0 Attachments: HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one,_cluster_is_wedged.patch, HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one_cluster_is_wedged-2.patch, HBASE-3210-When_the_Master_s_session_times_out_and_there_s_only_one_cluster_is_wedged-3.patch HBASE-1921 was lost when writing the new master code. I guess it's going to be much harder to implement now, but I think it's a critical feature to have considering the reasons that brought me do it in the old master. There's already a test in TestZooKeeper which has been disabled a while ago. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-1502) Remove need for heartbeats in HBase
[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025526#comment-13025526 ] jirapos...@reviews.apache.org commented on HBASE-1502: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/647/#review568 --- Ship it! woohoo! glad HMsg is dead! src/main/java/org/apache/hadoop/hbase/ClusterStatus.java https://reviews.apache.org/r/647/#comment1206 Can just use this.liveServers.values() for here and below? src/main/java/org/apache/hadoop/hbase/HServerAddress.java https://reviews.apache.org/r/647/#comment1207 Where is this actually used now? Should point it out here so it's clear and so that when it goes away we know we can get rid of this. src/main/java/org/apache/hadoop/hbase/HServerInfo.java https://reviews.apache.org/r/647/#comment1208 i see webuiport below, does this TODO still apply? src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java https://reviews.apache.org/r/647/#comment1209 why String and not ServerName? because master has no startcode? (i see use of ServerName for master above tho) src/main/java/org/apache/hadoop/hbase/ServerName.java https://reviews.apache.org/r/647/#comment1210 awesome that this is tucked away in here now src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java https://reviews.apache.org/r/647/#comment1211 this is because HSA actually makes a connection or does the lookup? - Jonathan On 2011-04-26 23:50:23, Michael Stack wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/647/ bq. --- bq. bq. (Updated 2011-04-26 23:50:23) bq. bq. bq. Review request for hbase. bq. bq. bq. Summary bq. --- bq. bq. This patch does not completely remove heartbeats. It unburdens the heartbeat of control messages; now heartbeat is used to bq. send the master load only (At most recent hackathon we had rough agreement that we'd keep heartbeat to carry load)... 
if we miss some, no biggie. bq. bq. RPC version changed on HMasterRegionInfo since the regionServerStartup and regionServerReport arguments have changed. bq. We pass a String now instead of HServerAddress so this should help with our DNS issues where the two sides disagree. bq. bq. Removed HMsg. bq. bq. HServerAddress as been sort_of_deprecated. Its in our API so can't remove it easily (its embedded inside HRegionLocation). bq. Otherwise, we don't use it internally anymore. bq. bq. HServerInfo is deprecated. Server meta data is now available in new class ServerName and load lives apart from HSI now. bq. bq. Fixed up regionserver and master startup so they now look the same. bq. bq. New tests bq. bq. Cruft cleanup. bq. bq. bq. This addresses bug hbase-1502. bq. https://issues.apache.org/jira/browse/hbase-1502 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/ClusterStatus.java 26a8bef bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/HMsg.java 87beb00 bq.src/main/java/org/apache/hadoop/hbase/HRegionLocation.java bd353b8 bq.src/main/java/org/apache/hadoop/hbase/HServerAddress.java 7f8a472 bq.src/main/java/org/apache/hadoop/hbase/HServerInfo.java 0b5bd94 bq.src/main/java/org/apache/hadoop/hbase/HServerLoad.java 2372053 bq.src/main/java/org/apache/hadoop/hbase/LocalHBaseCluster.java 0d696ab bq.src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1da9742 bq.src/main/java/org/apache/hadoop/hbase/Server.java df396fa bq.src/main/java/org/apache/hadoop/hbase/ServerName.java PRE-CREATION bq.src/main/java/org/apache/hadoop/hbase/avro/AvroUtil.java d7a1e67 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/catalog/MetaEditor.java c2ee031 bq.src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 6e22cf5 bq.src/main/java/org/apache/hadoop/hbase/catalog/RootLocationEditor.java aee64c5 
bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnection.java 2bb4725 bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq. src/main/java/org/apache/hadoop/hbase/client/RetriesExhaustedWithDetailsException.java 6c62024 bq. src/main/java/org/apache/hadoop/hbase/coprocessor/BaseMasterObserver.java 8df6aa4 bq.
[jira] [Commented] (HBASE-3777) Redefine Identity Of HBase Configuration
[ https://issues.apache.org/jira/browse/HBASE-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025543#comment-13025543 ] jirapos...@reviews.apache.org commented on HBASE-3777: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/643/#review569 --- src/main/java/org/apache/hadoop/hbase/master/HMaster.java https://reviews.apache.org/r/643/#comment1213 Same comment as in HRS, I think this is creating a second connection for the master. src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java https://reviews.apache.org/r/643/#comment1212 IIUC, we are creating an additional connection here since CT will do a getConnection with the passed conf instead of using a connection that the RS already has. - Jean-Daniel On 2011-04-22 21:16:59, Karthick Sankarachary wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/643/ bq. --- bq. bq. (Updated 2011-04-22 21:16:59) bq. bq. bq. Review request for hbase and Ted Yu. bq. bq. bq. Summary bq. --- bq. bq. Judging from the javadoc in HConnectionManager, sharing connections across multiple clients going to the same cluster is supposedly a good thing. However, the fact that there is a one-to-one mapping between a configuration and connection instance, kind of works against that goal. Specifically, when you create HTable instances using a given Configuration instance and a copy thereof, we end up with two distinct HConnection instances under the covers. Is this really expected behavior, especially given that the configuration instance gets cloned a lot? bq. bq. Here, I'd like to play devil's advocate and propose that we deep-compare HBaseConfiguration instances, so that multiple HBaseConfiguration instances that have the same properties map to the same HConnection instance. 
In case one is concerned that a single HConnection is insufficient for sharing amongst clients, to quote the javadoc, then one should be able to mark a given HBaseConfiguration instance as being uniquely identifiable. bq. bq. bq. This addresses bug HBASE-3777. bq. https://issues.apache.org/jira/browse/HBASE-3777 bq. bq. bq. Diffs bq. - bq. bq.src/main/java/org/apache/hadoop/hbase/HConstants.java 5701639 bq.src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java be31179 bq.src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java afb666a bq.src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java c348f7a bq.src/main/java/org/apache/hadoop/hbase/client/HTable.java edacf56 bq.src/main/java/org/apache/hadoop/hbase/client/HTablePool.java 88827a8 bq.src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java 9e3f4d1 bq. src/main/java/org/apache/hadoop/hbase/client/replication/ReplicationAdmin.java d76e333 bq. src/main/java/org/apache/hadoop/hbase/mapreduce/replication/VerifyReplication.java ed88bfa bq.src/main/java/org/apache/hadoop/hbase/master/HMaster.java 79a48ba bq.src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java d0a1e11 bq. src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java 78c3b42 bq.src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 5da5e34 bq.src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java b624d28 bq.src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java 7f5b377 bq.src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java dc471c4 bq.src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java e25184e bq.src/test/java/org/apache/hadoop/hbase/catalog/TestMetaReaderEditor.java 60320a3 bq.src/test/java/org/apache/hadoop/hbase/client/TestHCM.java b01a2d2 bq.src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduce.java 624f4a8 bq.src/test/java/org/apache/hadoop/hbase/util/TestMergeTable.java 8992dbb bq. bq. 
Diff: https://reviews.apache.org/r/643/diff bq. bq. bq. Testing bq. --- bq. bq. mvn test bq. bq. bq. Thanks, bq. bq. Karthick bq. bq. Redefine Identity Of HBase Configuration Key: HBASE-3777 URL: https://issues.apache.org/jira/browse/HBASE-3777 Project: HBase Issue Type: Improvement Components: client, ipc Affects Versions: 0.90.2 Reporter: Karthick Sankarachary Assignee: Karthick Sankarachary Priority: Minor Fix For: 0.92.0
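The proposal above — map configurations with identical properties to the same connection, instead of keying the cache on object identity — can be sketched as follows. Types and names are illustrative stand-ins, not the HConnectionManager implementation:

```java
import java.util.*;

// Hedged sketch of the HBASE-3777 idea: key cached connections by the
// *contents* of the configuration rather than the instance itself, so a
// configuration and a copy of it share one connection.
public class ConnectionCache {
    static class FakeConnection {} // stand-in for HConnection

    private final Map<Map<String, String>, FakeConnection> cache = new HashMap<>();

    // A snapshot of the properties serves as the deep "identity".
    synchronized FakeConnection getConnection(Map<String, String> conf) {
        return cache.computeIfAbsent(new HashMap<>(conf), k -> new FakeConnection());
    }

    public static void main(String[] args) {
        ConnectionCache cc = new ConnectionCache();
        Map<String, String> conf = new HashMap<>();
        conf.put("hbase.zookeeper.quorum", "zk1");
        Map<String, String> copy = new HashMap<>(conf); // a distinct clone
        // Same properties -> same connection, even for distinct instances.
        System.out.println(cc.getConnection(conf) == cc.getConnection(copy)); // true
    }
}
```

With identity-based keying, the clone would have produced a second connection; deep comparison is what collapses the two.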
[jira] [Updated] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoushuaifeng updated HBASE-3821: - Attachment: HBase-3821 v1.txt I think there are several ways to fix it: 1. In the rollback handling, like this:
{code}
case CREATE_SPLIT_DIR:
+  this.parent.writestate.writesEnabled = true;
   cleanupSplitDir(fs, this.splitdir);
   break;
{code}
2. Catch the IOException after doclose or preflush. I think the first one is better. What do you think? And is there anything else that should be done? The patch implements the first way. NOT flushing memstore for region keep on printing for half an hour - Key: HBASE-3821 URL: https://issues.apache.org/jira/browse/HBASE-3821 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.90.1 Reporter: zhoushuaifeng Fix For: 0.90.3 Attachments: HBase-3821 v1.txt
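The effect of the proposed rollback fix — re-enabling writes when the split transaction is rolled back, so later flush requests don't loop on "NOT flushing memstore" forever — can be sketched like this. The field names mirror the snippet in the comment, but this is a stand-alone illustration, not the real HRegion/SplitTransaction code:

```java
// Hedged sketch: a split disables writes (and thus flushes) on the parent
// region; if the split rolls back, writesEnabled must be restored or every
// subsequent flush request is refused with "NOT flushing memstore".
public class SplitRollbackSketch {
    static class WriteState { volatile boolean writesEnabled = true; }

    final WriteState writestate = new WriteState();

    void startSplit() {
        writestate.writesEnabled = false; // close/preflush disables writes
    }

    void rollback() {
        // The proposed patch: restore writesEnabled in the CREATE_SPLIT_DIR
        // rollback step, before cleaning up the split directory.
        writestate.writesEnabled = true;
    }

    boolean canFlush() { return writestate.writesEnabled; }

    public static void main(String[] args) {
        SplitRollbackSketch region = new SplitRollbackSketch();
        region.startSplit();
        System.out.println(region.canFlush()); // false: flushes refused mid-split
        region.rollback();
        System.out.println(region.canFlush()); // true: flushes proceed again
    }
}
```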
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025557#comment-13025557 ]

zhoushuaifeng commented on HBASE-3821:
--------------------------------------

Yes, the failure to split is a transitory error; the region is able to flush its memstore successfully later.

Attachments: HBase-3821 v1.txt
[jira] [Commented] (HBASE-3752) Tool to replay moved-aside WAL logs
[ https://issues.apache.org/jira/browse/HBASE-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025593#comment-13025593 ]

jack levin commented on HBASE-3752:
-----------------------------------

Getting this error:

12:59:11 10.103.7.1 root@mtag11:~/tmp $ /usr/lib/hbase/bin/hbase org.jruby.Main walplayer.rb 208.94.1.252%3A60020.1303872726316
file:/usr/lib/hbase-JD/hbase-0.89.20100830/lib/jruby-complete-1.4.0.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared/builtin/javasupport/core_ext/object.rb:37:in `get_proxy_or_package_under_package': cannot load Java class org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException (NameError)
	from file:/usr/lib/hbase-JD/hbase-0.89.20100830/lib/jruby-complete-1.4.0.jar!/META-INF/jruby.home/lib/ruby/site_ruby/shared/builtin/javasupport/java.rb:51:in `method_missing'
	from walplayer.rb:39

Tool to replay moved-aside WAL logs
-----------------------------------

Key: HBASE-3752
URL: https://issues.apache.org/jira/browse/HBASE-3752
Project: HBase
Issue Type: Task
Reporter: stack
Priority: Critical
Attachments: walplayer.rb

We need this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
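A JRuby NameError like the one above usually means the JVM cannot find the named class on the classpath at all; one plausible cause is that the 0.89.20100830 build in use predates org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException. A minimal JVM-side check (an illustrative helper, not part of HBase or the walplayer script):

```java
// Illustrative helper: report whether a fully-qualified class name is loadable
// on the current classpath. Run with the same classpath the hbase launcher sets
// up to see whether the class the JRuby script needs actually exists there.
public class ClassPresenceCheck {
    static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException";
        System.out.println(name + " loadable: " + present(name));
    }
}
```

If the class is absent from that build's jar, the fix is to run the script against an HBase version that ships it rather than to change the script.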
[jira] [Commented] (HBASE-3821) NOT flushing memstore for region keep on printing for half an hour
[ https://issues.apache.org/jira/browse/HBASE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025594#comment-13025594 ]

ramkrishna.s.vasudevan commented on HBASE-3821:
-----------------------------------------------

I would like to add to zhoushuaifeng's analysis. Suppose cleanUpSplitDir() throws an IOException due to some DFS error, and we try to roll back. If we hit another DFS error during the rollback, that exception is not handled: we handle only RuntimeException, so the exception propagates up to the run method of CompactSplitThread, where again it is not handled. I think it is better to handle this so that at least the user comes to know about the exception.
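The hardening ramkrishna suggests can be sketched as catching IOException on the rollback path too, instead of only RuntimeException, so the failure is at least reported. Class and method names below are illustrative stand-ins, not HBase's exact SplitTransaction API:

```java
import java.io.IOException;

// Sketch of split-rollback error handling: the rollback path catches
// IOException as well as RuntimeException, so a DFS error during rollback
// is logged instead of escaping silently to CompactSplitThread.run().
public class SplitRollbackSketch {
    // Simulated split step that can fail with a DFS-style IOException.
    static void cleanUpSplitDir(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated DFS error");
    }

    // Simulated rollback that can itself fail the same way.
    static void rollback(boolean fail) throws IOException {
        if (fail) throw new IOException("simulated DFS error during rollback");
    }

    // Returns a status string instead of letting any exception escape the thread.
    static String runSplit(boolean splitFails, boolean rollbackFails) {
        try {
            cleanUpSplitDir(splitFails);
            return "split-ok";
        } catch (IOException splitError) {
            try {
                rollback(rollbackFails);
                return "rolled-back";
            } catch (IOException | RuntimeException rollbackError) {
                // Previously only RuntimeException was caught here, so an
                // IOException propagated unreported.
                System.err.println("Rollback failed: " + rollbackError.getMessage());
                return "rollback-failed";
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runSplit(true, true)); // rollback-failed, but now reported
    }
}
```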
[jira] [Updated] (HBASE-3744) createTable blocks until all regions are out of transition
[ https://issues.apache.org/jira/browse/HBASE-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-3744:
--------------------------

Attachment: 3744-addendum.patch

This patch simulates the semantics of waiting for region assignment.

createTable blocks until all regions are out of transition
----------------------------------------------------------

Key: HBASE-3744
URL: https://issues.apache.org/jira/browse/HBASE-3744
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.1
Reporter: Todd Lipcon
Assignee: Ted Yu
Priority: Critical
Fix For: 0.90.3
Attachments: 3744-addendum.patch, 3744-addendum.txt, 3744-v2.txt, 3744-v3.txt, 3744.txt, create_big_tables.rb, create_big_tables.rb, create_big_tables.rb

In HBASE-3305, the behavior of createTable was changed, which introduced this bug: createTable now blocks until all regions have been assigned, since it uses BulkStartupAssigner. BulkStartupAssigner.waitUntilDone calls assignmentManager.waitUntilNoRegionsInTransition, which waits across all regions, not just the regions of the table that has just been created. We saw an issue where one table had a region which was unable to be opened, so it was stuck in RegionsInTransition permanently (every open was failing). Since this was the case, waitUntilDone would always block indefinitely even though the newly created table had been assigned.
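The core of the bug above is the scope of the wait condition: cluster-wide "no regions in transition" versus "none of the new table's regions in transition". A minimal sketch of the difference, with a stand-in for the regions-in-transition tracker (names are illustrative, not HBase's AssignmentManager API):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: waiting only until the NEW table's regions leave the
// regions-in-transition set, rather than until the whole cluster has none.
public class CreateTableWaitSketch {
    // Stand-in for the master's regions-in-transition tracker.
    static class Transitions {
        final Set<String> inTransition = new HashSet<>();
    }

    // Old condition: done only when NO region of ANY table is in transition.
    static boolean clusterQuiet(Transitions t) {
        return t.inTransition.isEmpty();
    }

    // Scoped condition: done as soon as none of OUR regions are in transition.
    static boolean tableAssigned(Transitions t, Set<String> ourRegions) {
        for (String region : ourRegions) {
            if (t.inTransition.contains(region)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Transitions t = new Transitions();
        t.inTransition.add("othertable,stuck-region"); // permanently failing open
        Set<String> ours = new HashSet<>();
        ours.add("newtable,region-a");
        System.out.println(clusterQuiet(t));        // false: foreign region is stuck
        System.out.println(tableAssigned(t, ours)); // true: our regions are assigned
    }
}
```

With the scoped condition, a foreign region stuck in transition (the failure mode Todd describes) no longer blocks createTable indefinitely.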