[jira] [Commented] (HBASE-8039) Make HDFS replication number configurable for a column family

2014-05-21 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004836#comment-14004836
 ] 

Maryann Xue commented on HBASE-8039:


This was meant for use cases that want a smaller replication factor for column 
families that are less important but consume more space, for example, large 
image files.
So I assume how we resolve this issue depends on how we evaluate such use cases.

 Make HDFS replication number configurable for a column family
 -

 Key: HBASE-8039
 URL: https://issues.apache.org/jira/browse/HBASE-8039
 Project: HBase
  Issue Type: Improvement
  Components: HFile
Reporter: Maryann Xue
Priority: Minor
 Fix For: 0.99.0, 0.98.4


 To allow users to decide which column family's data is more important and 
 which is less important by specifying a replica number instead of using the 
 default replica number.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-27 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615013#comment-13615013
 ] 

Maryann Xue commented on HBASE-8024:


Modification of HStore#internalFlushCache() for our LOB use case:
{code}
  private Path internalFlushCacheToBlobStore(final SortedSet<KeyValue> set,
      final long logCacheFlushId,
      TimeRangeTracker snapshotTimeRangeTracker,
      AtomicLong flushedSize,
      MonitoredTask status)
      throws IOException {
    StoreFile.Writer writer;

    // Find the smallest read point across all the Scanners.
    long smallestReadPoint = region.getSmallestReadPoint();
    long flushed = 0;
    Path referenceFilePath = null;
    Path blobFilePath = null;
    // Don't flush if there are no entries.
    if (set.size() == 0) {
      return null;
    }
    Scan scan = new Scan();
    scan.setMaxVersions(scanInfo.getMaxVersions());
    // Use a store scanner to find which rows to flush.
    // Note that we need to retain deletes, hence
    // treat this as a minor compaction.
    InternalScanner scanner = new StoreScanner(this, scan, Collections
        .singletonList(new CollectionBackedScanner(set, this.comparator)),
        ScanType.MINOR_COMPACT, this.region.getSmallestReadPoint(),
        HConstants.OLDEST_TIMESTAMP);

    BlobStore blobStore =
        BlobStoreManager.getInstance().getBlobStore(getTableName(),
            family.getNameAsString());
    if (null == blobStore) {
      blobStore =
          BlobStoreManager.getInstance().createBlobStore(getTableName(), family);
    }

    StoreFile.Writer blobWriter = null;
    try {
      // TODO:  We can fail in the below block before we complete adding this
      // flush to list of store files.  Add cleanup of anything put on filesystem
      // if we fail.
      synchronized (flushLock) {
        status.setStatus("Flushing " + this + ": creating writer");

        int referenceKeyValueCount = set.size();
        int blobKeyValueCount = 0;

        // A. Write the map out to the disk
        writer = createWriterInTmp(referenceKeyValueCount);
        writer.setTimeRangeTracker(snapshotTimeRangeTracker);
        referenceFilePath = writer.getPath();

        Iterator<KeyValue> iter = set.iterator();

        while (null != iter && iter.hasNext()) {
          if (iter.next().getType() == KeyValue.Type.Put.getCode()) {
            blobKeyValueCount++;
          }
        }

        blobWriter = blobStore.createWriterInTmp(blobKeyValueCount,
            this.compression,
            region.getRegionInfo());
        blobFilePath = blobWriter.getPath();
        String targetPathName = dateFormatter.format(new Date());
        Path targetPath = new Path(blobStore.getHomePath(), targetPathName);

        String relativePath = targetPathName + Path.SEPARATOR +
            blobFilePath.getName();

        // Append the BLOB_STORE_VERSION before the relative path name
        byte[] referenceValue = Bytes.add(
            new byte[] { BlobStoreConstants.BLOB_STORE_VERSION },
            Bytes.toBytes(relativePath));

        try {
          List<KeyValue> kvs = new ArrayList<KeyValue>();
          boolean hasMore;
          do {
            hasMore = scanner.next(kvs);
            if (!kvs.isEmpty()) {
              for (KeyValue kv : kvs) {
                // If we know that this KV is going to be included always, then let us
                // set its memstoreTS to 0. This will help us save space when writing to disk.
                if (kv.getMemstoreTS() <= smallestReadPoint) {
                  // let us not change the original KV. It could be in the memstore
                  // changing its memstoreTS could affect other threads/scanners.
                  kv = kv.shallowCopy();
                  kv.setMemstoreTS(0);
                }

                if (kv.getType() == KeyValue.Type.Reference.getCode()) {
                  writer.append(kv);
                }
                else {

                  // append the original keyValue in the blob file.
                  blobWriter.append(kv);

                  // append reference KeyValue.
                  // The key is same, the value is the blobfile's filename
                  KeyValue reference = new KeyValue(kv.getBuffer(),
                      kv.getRowOffset(),
                      kv.getRowLength(),
                      kv.getBuffer(),
                      kv.getFamilyOffset(),
                      kv.getFamilyLength(),
                      kv.getBuffer(),
                      kv.getQualifierOffset(),
                      kv.getQualifierLength(),
                      kv.getTimestamp(),
                      KeyValue.Type.Reference,
                      referenceValue,
                      0,
                      referenceValue.length);


[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-27 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615015#comment-13615015
 ] 

Maryann Xue commented on HBASE-8024:


In the LOB use case: add an independent LOB writer for the real LOB data, and 
replace the original value of the KeyValue with the LOB file path.
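
To make the "replace the value with the LOB file path" step concrete, here is a minimal sketch of how such a reference value could be encoded and decoded. It follows the layout used in the flush code posted on this issue (one version byte followed by the relative path), but the class name LobReference, the VERSION value and the UTF-8 encoding are assumptions for illustration, not the actual patch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of HBase or of the attached patches. */
final class LobReference {
  /** Assumed format version; the real constant would live in BlobStoreConstants. */
  static final byte VERSION = 1;

  /** Encodes [version byte][UTF-8 relative path], the value stored in place of the LOB. */
  static byte[] encode(String relativePath) {
    byte[] path = relativePath.getBytes(StandardCharsets.UTF_8);
    byte[] value = new byte[path.length + 1];
    value[0] = VERSION;
    System.arraycopy(path, 0, value, 1, path.length);
    return value;
  }

  /** Recovers the relative path, rejecting values written with an unknown version. */
  static String decode(byte[] value) {
    if (value.length == 0 || value[0] != VERSION) {
      throw new IllegalArgumentException("not a LOB reference of version " + VERSION);
    }
    return new String(value, 1, value.length - 1, StandardCharsets.UTF_8);
  }
}
```

A reader that finds a Reference-type KeyValue would decode its value this way and then open the LOB file under the blob store's home path.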

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
 Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch


 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-21 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609857#comment-13609857
 ] 

Maryann Xue commented on HBASE-8024:


@Sergey, thank you for the ideas! These are very good suggestions. I will 
reorganize the relationship between flusher, flushrequest and store, ideally 
making one flusher per store with a different flushrequest each time.

The motivation is to enable our LOB implementation as a plug-in to HBase core. 
We already have customization on compactions; now, with custom flush, we can 
write LOB data in independent HFiles.
The requirement of our use case for customized flush is simple and only adds 
a few lines to internalFlushCache(), but I totally agree with you on putting 
something more flexible into this patch.

@Ted, thank you for the comments! I will clean up the documentation accordingly.

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
 Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch


 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.



[jira] [Updated] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-8024:
---

Attachment: HBASE-8024-trunk.patch

Changes: 
1. Remove the original internal class HStore$StoreFlusherImpl
2. Create class DefaultStoreFlusher which implements StoreFlusher
3. Move all implementation of flushCache from HStore to DefaultStoreFlusher
4. Add method createStoreFlusher in StoreEngine
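
The shape of these changes can be sketched roughly as follows, using simplified stand-in types rather than the real HBase classes. The names StoreFlusher, DefaultStoreFlusher and createStoreFlusher come from the list above; the method signature, the string-based "files" and the StoreEngineSketch class are assumptions made only for illustration and will differ from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Stand-in for the pluggable flush contract; the real signature may differ. */
interface StoreFlusher {
  /** Writes the snapshot somewhere and returns the names of the files produced. */
  List<String> flushCache(SortedSet<String> snapshot);
}

/** Default behaviour: everything in the snapshot goes into a single file. */
class DefaultStoreFlusher implements StoreFlusher {
  @Override
  public List<String> flushCache(SortedSet<String> snapshot) {
    List<String> files = new ArrayList<>();
    // Nothing to flush means no file is created.
    if (!snapshot.isEmpty()) {
      files.add("storefile-" + snapshot.size() + "-entries");
    }
    return files;
  }
}

/** Hypothetical factory hook, mirroring "createStoreFlusher in StoreEngine". */
class StoreEngineSketch {
  StoreFlusher createStoreFlusher() {
    return new DefaultStoreFlusher();
  }
}
```

A custom engine would override createStoreFlusher to return its own implementation, for example one that writes LOB data to a separate file.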

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue
 Attachments: HBASE-8024-trunk.patch


 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.



[jira] [Updated] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-8024:
---

Attachment: HBASE-8024.v2.patch

Yes, one file was missing from the patch. Sorry for the mistake.

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
 Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch


 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.



[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-20 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608593#comment-13608593
 ] 

Maryann Xue commented on HBASE-8024:


@sergey it was simply copy-pasted from the original inner class impl. So should 
I update the code comment along with this patch?

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
 Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch


 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.



[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-03-14 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602136#comment-13602136
 ] 

Maryann Xue commented on HBASE-7876:


@stack agree~

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7876-0.94V2.patch, HBASE-7876-trunk.patch


 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-12 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599815#comment-13599815
 ] 

Maryann Xue commented on HBASE-8024:


Andy, any planned date for 0.96? I will submit a patch soon anyway :)

 Make Store flush algorithm pluggable
 

 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.95.0, 0.96.0, 0.94.5
Reporter: Maryann Xue

 The idea is to make StoreFlusher an interface instead of an implementation 
 class, and have the original StoreFlusher as the default store flush impl.



[jira] [Commented] (HBASE-7949) Enable big content store in HBase

2013-03-11 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598642#comment-13598642
 ] 

Maryann Xue commented on HBASE-7949:


@chenning, as Enis has clarified, the actual data move does not happen at the 
split point. Instead, it happens in later compactions. And in the approach we 
proposed, the LOB family does not participate in splits or minor compactions at 
all.

@enis, the problem is not when the reads and writes happen; it is more the 
unnecessary I/O overhead in splitting. And if the data is seldom updated, why 
compact it (for split) anyway?

Yes, utilizing level compactions could be a good approach. Still, our approach 
has three advantages over level compaction: 
1. I/O overhead from split and minor compactions is completely eliminated; 
2. clean-up is only done, during major compactions, for those files that have 
reached a certain invalidation rate;
3. not every file reader is instantiated and kept in regionserver memory. 
Instead, we'll have an LRU cache for frequently read LOB files.
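
As an illustration of point 3, a bounded LRU cache can be sketched with the JDK's LinkedHashMap in access order. This is only a sketch: the class name LobReaderCache and the maxOpen parameter are hypothetical, and a real cache would also need to close evicted readers and be made thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps at most maxOpen readers, evicting the least recently accessed one. */
final class LobReaderCache<R> extends LinkedHashMap<String, R> {
  private final int maxOpen;

  LobReaderCache(int maxOpen) {
    // accessOrder = true makes iteration order least recently used first.
    super(16, 0.75f, true);
    this.maxOpen = maxOpen;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, R> eldest) {
    // Called after each insertion; a real cache would close the evicted reader here.
    return size() > maxOpen;
  }
}
```

With this shape, only readers for frequently read LOB files stay open, and the number kept in regionserver memory is capped by maxOpen.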

However, I suggest this issue not be committed into HBase trunk. Instead, we'd 
like to make the implementation a use case on top of HBase, and the only 
facility we need in HBase trunk is a pluggable flush process (HBASE-8024).

 Enable big content store in HBase
 -

 Key: HBASE-7949
 URL: https://issues.apache.org/jira/browse/HBASE-7949
 Project: HBase
  Issue Type: Brainstorming
Reporter: chenning
 Attachments: HBase_LOB.pdf


 Big content stored in HBase consumes a lot of system resources when a region 
 split or compaction happens.
 How can HBase be used to store big content along with some self-descriptive 
 metadata? 
 The general idea is to add a new type of column family whose content does not 
 participate in region split and compaction. An index (rowkey-location) is 
 introduced in this new column family, and split and compaction are only 
 applied to this index.



[jira] [Commented] (HBASE-8039) Make HDFS replication number configurable for a column family

2013-03-11 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598644#comment-13598644
 ] 

Maryann Xue commented on HBASE-8039:


Yes, Sergey, that would be a necessary part of the solution. But the other 
part is to pass the replication number down into the HFile writer.
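
A rough sketch of the lookup side of this, assuming a per-family setting that falls back to the default replica number when a family has none. The class FamilyReplication and its methods are hypothetical and not an HBase API; the resolved value is what would then be handed to whatever creates the file for the HFile writer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for per-column-family replication settings. */
final class FamilyReplication {
  private final short defaultReplication;
  private final Map<String, Short> perFamily = new HashMap<>();

  FamilyReplication(short defaultReplication) {
    this.defaultReplication = defaultReplication;
  }

  /** Records an explicit replica number for one column family. */
  void set(String family, short replication) {
    if (replication < 1) {
      throw new IllegalArgumentException("replication must be at least 1");
    }
    perFamily.put(family, replication);
  }

  /** Returns the family's replica number, or the default if none was set. */
  short resolve(String family) {
    Short configured = perFamily.get(family);
    return configured != null ? configured : defaultReplication;
  }
}
```

For example, a family holding large image files could be set to 2 while every other family keeps the default.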

 Make HDFS replication number configurable for a column family
 -

 Key: HBASE-8039
 URL: https://issues.apache.org/jira/browse/HBASE-8039
 Project: HBase
  Issue Type: Improvement
  Components: HFile
Affects Versions: 0.94.5
Reporter: Maryann Xue
Priority: Minor

 To allow users to decide which column family's data is more important and 
 which is less important by specifying a replica number instead of using the 
 default replica number.



[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-03-10 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598510#comment-13598510
 ] 

Maryann Xue commented on HBASE-7876:


@ramkrishna, you might be using the wrong patch file? We removed this test 
case since it won't be useful anyway.

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7876-0.94.patch, HBASE-7876-0.94V2.patch, 
 HBASE-7876-trunk.patch


 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-03-07 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595662#comment-13595662
 ] 

Maryann Xue commented on HBASE-7876:


Agree with clockfly. 
@ramakrishna, to me it makes no sense for the user to configure things so as to 
get such an exception for this reasonable and harmless operation. And as 
clockfly said, splitting an empty region with no midkey specified still behaves 
as before.

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7876-0.94.patch


 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Created] (HBASE-8024) Make Store flush algorithm pluggable

2013-03-07 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-8024:
--

 Summary: Make Store flush algorithm pluggable
 Key: HBASE-8024
 URL: https://issues.apache.org/jira/browse/HBASE-8024
 Project: HBase
  Issue Type: Sub-task
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue


The idea is to make StoreFlusher an interface instead of an implementation 
class, and have the original StoreFlusher as the default store flush impl.



[jira] [Commented] (HBASE-7949) Enable big content store in HBase

2013-03-07 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596880#comment-13596880
 ] 

Maryann Xue commented on HBASE-7949:


@Enis well, the constant reading and writing of the same set of large content 
data happens in two ways: compaction and split.
1. During compaction, the data is read from small files and written to a 
combined new large file.
2. During split, the data is read from the parent region's storefiles and 
written into the two daughter regions' storefiles.

To avoid the I/O overhead caused by 1 (compaction), we can disable minor 
compaction for this family, but this would lead to another big problem: bad 
get/scan performance. For a get operation, we need to check the bloom filter of 
each of the many storefiles to locate our record; and for a scan operation, we 
need to perform a seek in all these storefiles. The decline in Get throughput 
as the number of storefiles increases is shown in the slides.

To avoid the I/O overhead caused by 2 (split), we can pre-split regions for a 
table, but this cannot always be done for customer use cases.

The idea is that large content data is very probably loaded once and not 
frequently modified, so there is literally no need to move or merge the data 
all the time, as happens in normal region compactions and splits in order to 
maintain region independence and read efficiency.
So having a storage independent of HBase regions would make sense for such 
use cases, and meanwhile we leverage the major compaction process to do cleanup 
and merge at a reasonable frequency -- only perform a merge when a certain 
file has exceeded the configured threshold.
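
The threshold check described above might look roughly like this. It is a sketch under assumptions: the threshold is taken to be an invalidation rate (invalid cells divided by total cells), "exceeded" is read as strictly greater than, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-file bookkeeping for a LOB file. */
final class LobFileStats {
  final String name;
  final long totalCells;
  final long invalidCells;

  LobFileStats(String name, long totalCells, long invalidCells) {
    this.name = name;
    this.totalCells = totalCells;
    this.invalidCells = invalidCells;
  }

  /** Fraction of cells that were deleted or overwritten; 0 for an empty file. */
  double invalidationRate() {
    return totalCells == 0 ? 0.0 : (double) invalidCells / totalCells;
  }
}

final class LobCleanupSelector {
  /** Returns the names of files whose invalidation rate exceeds the threshold. */
  static List<String> select(List<LobFileStats> files, double threshold) {
    List<String> selected = new ArrayList<>();
    for (LobFileStats f : files) {
      if (f.invalidationRate() > threshold) {
        selected.add(f.name);
      }
    }
    return selected;
  }
}
```

Only the selected files would be rewritten during a major compaction; all other LOB files are left untouched.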

 Enable big content store in HBase
 -

 Key: HBASE-7949
 URL: https://issues.apache.org/jira/browse/HBASE-7949
 Project: HBase
  Issue Type: Brainstorming
Reporter: chenning
 Attachments: HBase_LOB.pdf


 Big content stored in HBase consumes a lot of system resources when a region 
 split or compaction happens.
 How can HBase be used to store big content along with some self-descriptive 
 metadata? 
 The general idea is to add a new type of column family whose content does not 
 participate in region split and compaction. An index (rowkey-location) is 
 introduced in this new column family, and split and compaction are only 
 applied to this index.



[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp

2013-03-07 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7891:
---

Attachment: HBASE-7891-trunk.patch

attach the trunk patch

 Add an index number prior to each table region in table.jsp
 ---

 Key: HBASE-7891
 URL: https://issues.apache.org/jira/browse/HBASE-7891
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7891-0.94.patch, HBASE-7891-trunk.patch


 Adding an index number for each table region in table.jsp would make it 
 easier to locate a region or to count regions.



[jira] [Commented] (HBASE-7891) Add an index number prior to each table region in table.jsp

2013-03-07 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596892#comment-13596892
 ] 

Maryann Xue commented on HBASE-7891:


@nick I understand your point. But one would soon realize the indexes are more 
for counting purposes than of any real meaning, as the indexes are always 
sequential in the page and the regions may move among different servers.

 Add an index number prior to each table region in table.jsp
 ---

 Key: HBASE-7891
 URL: https://issues.apache.org/jira/browse/HBASE-7891
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7891-0.94.patch, HBASE-7891-trunk.patch


 Adding an index number for each table region in table.jsp would make it 
 easier to locate a region or to count regions.



[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page

2013-03-07 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7890:
---

Attachment: HBASE-7890-trunk.patch

attach the trunk patch

 Add an index number for each region in the region list on the RegionServer 
 web page
 ---

 Key: HBASE-7890
 URL: https://issues.apache.org/jira/browse/HBASE-7890
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7890-0.94.patch, HBASE-7890-trunk.patch


 Adding an index number before each region would make it easier to locate a 
 region on the page.



[jira] [Created] (HBASE-8039) Make HDFS replication number configurable for a column family

2013-03-07 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-8039:
--

 Summary: Make HDFS replication number configurable for a column 
family
 Key: HBASE-8039
 URL: https://issues.apache.org/jira/browse/HBASE-8039
 Project: HBase
  Issue Type: Improvement
  Components: HFile
Affects Versions: 0.94.5
Reporter: Maryann Xue
Priority: Minor


To allow users to decide which column family's data is more important and which 
is less important by specifying a replica number instead of using the default 
replica number.



[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-03-07 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7876:
---

Attachment: HBASE-7876-trunk.patch
HBASE-7876-0.94V2.patch

update patch -- revert HBASE-6853

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7876-0.94.patch, HBASE-7876-0.94V2.patch, 
 HBASE-7876-trunk.patch


 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Commented] (HBASE-7949) Enable big content store in HBase

2013-03-07 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596929#comment-13596929
 ] 

Maryann Xue commented on HBASE-7949:


Yes, you've made a good point here. Flushes would happen more frequently and 
compactions for the metadata family will involve more small storefiles. 
However,
1. this approach best guarantees consistency.
2. several large content records get flushed into one file in one process, 
which means more efficient I/O usage.
3. metadata is very small compared to large content data. Moreover, one minor 
compaction can handle a bunch of small metadata storefiles.

 Enable big content store in HBase
 -

 Key: HBASE-7949
 URL: https://issues.apache.org/jira/browse/HBASE-7949
 Project: HBase
  Issue Type: Brainstorming
Reporter: chenning
 Attachments: HBase_LOB.pdf


 Big content stored in HBase consumes a lot of system resources when a region 
 split or compaction happens.
 How can HBase be used to store big content along with some self-descriptive 
 metadata? 
 The general idea is to add a new type of column family whose content does not 
 participate in region split and compaction. An index (rowkey-location) is 
 introduced in this new column family, and split and compaction are only 
 applied to this index.



[jira] [Updated] (HBASE-7949) Enable big content store in HBase

2013-03-06 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7949:
---

Attachment: HBase_LOB.pdf

At the recent HBase meetup, we gave an introduction to an implementation 
for storing large objects.
The idea is to store the real content on HDFS and let a customized major 
compaction for this family handle the management work for these large contents.
And we need a customizable flush() process for this approach.

 Enable big content store in HBase
 -

 Key: HBASE-7949
 URL: https://issues.apache.org/jira/browse/HBASE-7949
 Project: HBase
  Issue Type: Brainstorming
Reporter: chenning
 Attachments: HBase_LOB.pdf


 Big content stored in HBase consumes a lot of system resources when a region 
 split or compaction happens.
 How can HBase be used to store big content along with some self-descriptive 
 metadata? 
 The general idea is to add a new type of column family whose content does not 
 participate in region split and compaction. An index (rowkey-location) is 
 introduced in this new column family, and split and compaction are only 
 applied to this index.



[jira] [Created] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page

2013-02-20 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-7890:
--

 Summary: Add an index number for each region in the region list on 
the RegionServer web page
 Key: HBASE-7890
 URL: https://issues.apache.org/jira/browse/HBASE-7890
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial


Adding an index number before each region would make it easier to locate a region 
on the page.



[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page

2013-02-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7890:
---

Attachment: HBASE-7890-0.94.patch

add index column in the RegionServer web page

 Add an index number for each region in the region list on the RegionServer 
 web page
 ---

 Key: HBASE-7890
 URL: https://issues.apache.org/jira/browse/HBASE-7890
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7890-0.94.patch


 Adding an index number before each region would make it easier to locate a 
 region on the page.



[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page

2013-02-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7890:
---

Status: Patch Available  (was: Open)

 Add an index number for each region in the region list on the RegionServer 
 web page
 ---

 Key: HBASE-7890
 URL: https://issues.apache.org/jira/browse/HBASE-7890
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7890-0.94.patch


 Adding an index number before each region would make it easier to locate a 
 region on the page.



[jira] [Created] (HBASE-7891) Add an index number prior to each table region in table.jsp

2013-02-20 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-7891:
--

 Summary: Add an index number prior to each table region in 
table.jsp
 Key: HBASE-7891
 URL: https://issues.apache.org/jira/browse/HBASE-7891
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7891-0.94.patch

Adding an index number for each table region in table.jsp would make it easier 
to locate a region or to count regions.



[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp

2013-02-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7891:
---

Status: Patch Available  (was: Open)

 Add an index number prior to each table region in table.jsp
 ---

 Key: HBASE-7891
 URL: https://issues.apache.org/jira/browse/HBASE-7891
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7891-0.94.patch


 Adding an index number for each table region in table.jsp would make it 
 easier to locate a region or to count regions.



[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp

2013-02-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7891:
---

Attachment: HBASE-7891-0.94.patch

Add index column to table region list in table.jsp

 Add an index number prior to each table region in table.jsp
 ---

 Key: HBASE-7891
 URL: https://issues.apache.org/jira/browse/HBASE-7891
 Project: HBase
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Trivial
 Attachments: HBASE-7891-0.94.patch


 Adding an index number for each table region in table.jsp would make it 
 easier to locate a region or to count regions.



[jira] [Created] (HBASE-7892) FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0

2013-02-20 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-7892:
--

 Summary: FuzzyRowFilter would have wrong behaviors if user gives 
an arbitary byte for an unfixed position instead of byte 0
 Key: HBASE-7892
 URL: https://issues.apache.org/jira/browse/HBASE-7892
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.94.5, 0.96.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor


An actual case:
we want to match a?ex, so we give a?ex as the key bytes and 0100 as the 
meta bytes.
If we start with row = \0\0\0\0, the next hint turns out to be a?ex, 
while the correct hint is a\0ex.
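The idea of resetting unfixed positions can be sketched as follows. This is an illustration, not the attached patch: `FuzzyKeyNormalizer` and `zeroUnfixedPositions` are hypothetical names, and the sketch assumes that a meta byte of 1 marks an unfixed position, as the 0100 example suggests.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of HBase. */
public class FuzzyKeyNormalizer {

    /**
     * Returns a copy of key in which every unfixed position
     * (mask byte == 1, an assumption of this sketch) is set to byte 0,
     * so that a caller-supplied placeholder such as '?' cannot leak
     * into a computed hint.
     */
    public static byte[] zeroUnfixedPositions(byte[] key, byte[] mask) {
        if (key.length != mask.length) {
            throw new IllegalArgumentException("key and mask must have the same length");
        }
        byte[] normalized = Arrays.copyOf(key, key.length);
        for (int i = 0; i < mask.length; i++) {
            if (mask[i] == 1) {
                normalized[i] = 0;
            }
        }
        return normalized;
    }
}
```

With key bytes a?ex and mask 0100, the normalized key is a\0ex, which matches the hint the description calls correct.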



[jira] [Updated] (HBASE-7892) FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0

2013-02-20 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7892:
---

Attachment: HBASE-7892-0.94.patch

1. Initialize the unfixed positions with byte 0
2. Remove copying of the row to improve performance
3. Add corresponding test cases

 FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for 
 an unfixed position instead of byte 0
 

 Key: HBASE-7892
 URL: https://issues.apache.org/jira/browse/HBASE-7892
 Project: HBase
  Issue Type: Improvement
  Components: Filters
Affects Versions: 0.96.0, 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7892-0.94.patch


 An actual case:
 we want to match a?ex, so we give a?ex as the key bytes and 0100 as the 
 meta bytes.
 If we start with row = \0\0\0\0, the next hint turns out to be a?ex, 
 while the correct hint is a\0ex.



[jira] [Created] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.

2013-02-18 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-7874:
--

 Summary: Allow RegionServer to abort Put if it detects that the 
HBase Client got SocketTimeoutException and disconnected.
 Key: HBASE-7874
 URL: https://issues.apache.org/jira/browse/HBASE-7874
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor


Usually, when a regionserver cannot keep up with the put load from the 
client, it starts to block update requests from the client until the required 
resources have been reclaimed (i.e. the memstore has been flushed). 
But in more severe situations, the blocking time gets so long that the client 
gets a SocketTimeoutException and decides to retry, while in fact the original 
updates are still written into the memstore once they are unblocked. Even 
though the client uses something like an exponential backoff for retry 
intervals, this can still lead to a vicious circle that leaves the client with 
very low throughput.
I think we could add an option that lets the regionserver check whether the 
client has disconnected (just as we do in scan) after coming back from 
blocking, so that the regionserver has the same view as the client on whether 
the updates were successfully committed.
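The proposed check can be sketched like this. It is a simplified illustration under stated assumptions, not the regionserver code: `PutAfterBlockSketch`, `ClientDisconnectedException`, and the three callbacks are hypothetical stand-ins for the blocking wait, the connection check, and the memstore write.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration; these names are not HBase APIs. */
public class PutAfterBlockSketch {

    /** Signals that the caller went away while its update was blocked. */
    public static class ClientDisconnectedException extends IOException {
        public ClientDisconnectedException(String message) {
            super(message);
        }
    }

    /**
     * Waits until resources are available, then applies the update only if
     * the client is still connected. A client that timed out and retried
     * therefore never has its stale first attempt applied behind its back.
     */
    public static void applyAfterBlock(Runnable waitForResources,
                                       BooleanSupplier clientConnected,
                                       Runnable applyUpdate)
            throws ClientDisconnectedException {
        waitForResources.run(); // may block for a long time under heavy load
        if (!clientConnected.getAsBoolean()) {
            throw new ClientDisconnectedException(
                "client disconnected while the update was blocked; aborting");
        }
        applyUpdate.run();
    }
}
```

The check sits after the blocking wait on purpose: checking before the wait would miss exactly the timeouts that the wait itself causes.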



[jira] [Created] (HBASE-7876) Got exception when splitting a region that contains no storefile

2013-02-18 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-7876:
--

 Summary: Got exception when splitting a region that contains no 
storefile
 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue


We should allow a region to split successfully even if it does not yet have 
storefiles.



[jira] [Updated] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.

2013-02-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7874:
---

Attachment: HBASE-7874-0.94.patch

Add a check for a closed channel in HRegion#batchMutate()

 Allow RegionServer to abort Put if it detects that the HBase Client got 
 SocketTimeoutException and disconnected.
 --

 Key: HBASE-7874
 URL: https://issues.apache.org/jira/browse/HBASE-7874
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7874-0.94.patch


 Usually, when a regionserver cannot keep up with the put load from the 
 client, it starts to block update requests from the client until the 
 required resources have been reclaimed (i.e. the memstore has been flushed). 
 But in more severe situations, the blocking time gets so long that the client 
 gets a SocketTimeoutException and decides to retry, while in fact the 
 original updates are still written into the memstore once they are unblocked. 
 Even though the client uses something like an exponential backoff for retry 
 intervals, this can still lead to a vicious circle that leaves the client 
 with very low throughput.
 I think we could add an option that lets the regionserver check whether the 
 client has disconnected (just as we do in scan) after coming back from 
 blocking, so that the regionserver has the same view as the client on 
 whether the updates were successfully committed.



[jira] [Updated] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.

2013-02-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7874:
---

Status: Patch Available  (was: Open)

 Allow RegionServer to abort Put if it detects that the HBase Client got 
 SocketTimeoutException and disconnected.
 --

 Key: HBASE-7874
 URL: https://issues.apache.org/jira/browse/HBASE-7874
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7874-0.94.patch


 Usually, when a regionserver cannot keep up with the put load from the 
 client, it starts to block update requests from the client until the 
 required resources have been reclaimed (i.e. the memstore has been flushed). 
 But in more severe situations, the blocking time gets so long that the client 
 gets a SocketTimeoutException and decides to retry, while in fact the 
 original updates are still written into the memstore once they are unblocked. 
 Even though the client uses something like an exponential backoff for retry 
 intervals, this can still lead to a vicious circle that leaves the client 
 with very low throughput.
 I think we could add an option that lets the regionserver check whether the 
 client has disconnected (just as we do in scan) after coming back from 
 blocking, so that the regionserver has the same view as the client on 
 whether the updates were successfully committed.



[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-02-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7876:
---

Summary: Got exception when manually triggers a split on an empty region  
(was: Got exception when splitting a region that contains no storefile)

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue

 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-02-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7876:
---

Priority: Minor  (was: Major)

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor

 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region

2013-02-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-7876:
---

Attachment: HBASE-7876-0.94.patch

return if no storefile.
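A rough sketch of what "return if no storefile" could look like, assuming the split point is derived from the store files. This is not the attached patch: `EmptyRegionSplitSketch`, `StoreFileInfo`, and `chooseSplitPoint` are hypothetical names, and picking the middle key of the largest file is an assumption made only for the example.

```java
import java.util.List;

/** Hypothetical illustration; these names are not HBase APIs. */
public class EmptyRegionSplitSketch {

    /** Minimal stand-in for a store file: its size and its middle row key. */
    public static final class StoreFileInfo {
        public final long sizeBytes;
        public final String midKey;

        public StoreFileInfo(long sizeBytes, String midKey) {
            this.sizeBytes = sizeBytes;
            this.midKey = midKey;
        }
    }

    /**
     * Returns the middle key of the largest store file, or null when the
     * region has no store files. Returning early avoids an exception on an
     * empty region; the caller treats null as "nothing to split".
     */
    public static String chooseSplitPoint(List<StoreFileInfo> files) {
        if (files.isEmpty()) {
            return null; // empty region: return instead of failing
        }
        StoreFileInfo largest = files.get(0);
        for (StoreFileInfo file : files) {
            if (file.sizeBytes > largest.sizeBytes) {
                largest = file;
            }
        }
        return largest.midKey;
    }
}
```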

 Got exception when manually triggers a split on an empty region
 ---

 Key: HBASE-7876
 URL: https://issues.apache.org/jira/browse/HBASE-7876
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.5
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Minor
 Attachments: HBASE-7876-0.94.patch


 We should allow a region to split successfully even if it does not yet have 
 storefiles.



[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassign the same region

2012-11-14 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13497797#comment-13497797
 ] 

Maryann Xue commented on HBASE-5816:


Since this is not fully addressed in HBASE-6060, how about testing/reproducing 
it against Jimmy's fix?

 Balancer and ServerShutdownHandler concurrently reassign the same region
 

 Key: HBASE-5816
 URL: https://issues.apache.org/jira/browse/HBASE-5816
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: Maryann Xue
Assignee: ramkrishna.s.vasudevan
Priority: Critical
 Attachments: HBASE-5816.patch


 The first assign thread exits with success after updating the RegionState to 
 PENDING_OPEN, while the second assign follows immediately into assign and 
 fails the RegionState check in setOfflineInZooKeeper(). This causes the 
 master to abort.
 In the case below, the two concurrent assigns occurred when AM tried to 
 assign a region to a dying/dead RS while the ServerShutdownHandler 
 tried to assign the same region (from the region plan) at the same time.
 {code}
 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
 src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 (offlining)
 2012-04-17 05:44:57,648 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0) for region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
 2012-04-17 05:44:57,666 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
 (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
 2012-04-17 05:52:58,984 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=CLOSED, ts=1334612697672, 
 server=hadoop05.sh.intel.com,60020,1334544902186
 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x236b912e9b3000e Creating (or updating) unassigned node for 
 fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
 plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
  src=hadoop05.sh.intel.com,60020,1334544902186, 
 dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:52:59,096 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:19,159 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
 state=PENDING_OPEN, ts=1334613179096, 
 server=xmlqa-clv16.sh.intel.com,60020,1334612497253
 2012-04-17 05:54:59,033 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
 serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
 regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 12 millis timeout 
 while waiting for channel to be ready for read. ch : 
 java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
 remote=/10.239.47.87:60020]
 at 
 org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at 
 org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at 
 org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
 at 
 

[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-14 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: (was: HBASE-6299-v3.patch)

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Fix For: 0.96.0, 0.92.3, 0.94.3

 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has been triggered by the 
 region open of the previous RS, and the RegionState has already been removed 
 from regionsInTransition, HMaster considers the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; 
 retry=0
 java.net.SocketTimeoutException: Call to 

[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-14 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: HBASE-6299-v3.patch

@ramkrishna, I have updated the patch. I had misunderstood the exception 
handling in HBaseClient. Thank you for pointing this out!

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Fix For: 0.96.0, 0.92.3, 0.94.3

 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, 
 HBASE-6299-v3.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has been triggered by the 
 region open of the previous RS, and the RegionState has already been removed 
 from regionsInTransition, HMaster considers the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 

[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-13 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: HBASE-6299-v3.patch

Considering that a live RS would most likely get to the openRegion() request 
and process it eventually, it might be better simply to return on 
SocketTimeoutException, since SocketTimeoutException indicates an uncertain 
state in the assign process, with potential race conditions. This can happen 
if an RS is temporarily running out of IPC handlers, or if the RS's response 
is lost in transit.
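The behaviour suggested above can be sketched as follows. This is an illustration, not the contents of HBASE-6299-v3.patch: `AssignRetrySketch`, `RegionOpener`, and `Outcome` are hypothetical names, and server names are plain strings only to keep the example self-contained.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.List;

/** Hypothetical illustration; these names are not HBase APIs. */
public class AssignRetrySketch {

    /** Stand-in for the RPC that asks a regionserver to open a region. */
    public interface RegionOpener {
        void openRegion(String server) throws IOException;
    }

    public enum Outcome { OPEN_SENT, UNCERTAIN, FAILED }

    /**
     * Tries the candidate servers in order. A SocketTimeoutException means
     * the request may still be processed by that server, so the method
     * returns instead of trying another server. Any other IOException is
     * treated as a definite failure and the next candidate is tried.
     */
    public static Outcome assign(List<String> candidates, RegionOpener opener) {
        for (String server : candidates) {
            try {
                opener.openRegion(server);
                return Outcome.OPEN_SENT;
            } catch (SocketTimeoutException e) {
                // Uncertain state: a second assign here could race with the first.
                return Outcome.UNCERTAIN;
            } catch (IOException e) {
                // Definite failure for this server; fall through to the next one.
            }
        }
        return Outcome.FAILED;
    }
}
```

Returning on the timeout leaves the region in transition, so something else (for example a timeout monitor, which is an assumption of this note) has to resolve it if the RS never completes the open.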

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, 
 HBASE-6299-v3.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has been triggered by the 
 region open of the previous RS, and the RegionState has already been removed 
 from regionsInTransition, HMaster considers the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 

[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-13 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13454840#comment-13454840
 ] 

Maryann Xue commented on HBASE-6299:


updated the patch as HBASE-6299-v3.patch

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, 
 HBASE-6299-v3.patch



[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-13 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: (was: HBASE-6299-v3.patch)

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Fix For: 0.96.0, 0.92.3, 0.94.3

 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has been triggered by the 
 region open of the previous RS, and the RegionState has already been removed 
 from regionsInTransition, HMaster considers the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; 
 retry=0
 java.net.SocketTimeoutException: Call to 

[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-09-13 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: HBASE-6299-v3.patch

@Lars the original unwrap should not work.
@Ted please review the patch.
@ramkrishna How about we apply this fix first and then update the patch for 
HBASE-6438? As far as I can see, HBASE-6438 is about another problem, but its 
patch includes my old fix.

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Fix For: 0.96.0, 0.92.3, 0.94.3

 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, 
 HBASE-6299-v3.patch



[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-12 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13412618#comment-13412618
 ] 

Maryann Xue commented on HBASE-6299:


bq. I'm not sure what synchronize does. I suppose it prevents double assign
The interesting thing is that we check that RegionState is OFFLINE or CLOSED 
before setting OFFLINE in ZK, and abort if the check fails, while we allow any 
RegionState before setting RegionState OFFLINE. And since this synchronize on 
RegionState does not guard the whole process (the state change from PEND_OPEN 
to OPENED), double assignment is not prevented at all. There is some checking 
in setOfflineInZookeeper, but only when hijack=true.
So far I've seen two error cases with double assign:
1. HBASE-5816: The second assign comes in at almost the same time as the 
first assignment, but gets blocked by synchronized(state). After the first 
assignment succeeds with sendRegionOpen() and exits the synchronized block, the 
second assignment enters the block and calls setOfflineInZookeeper(), which 
fails the RegionState OFFLINE check and leads to master abort.
2. The second assignment kicks in after the first assignment has succeeded and 
deleted the ZK node, but before regionOnline() is called (which removes the 
region from AM.regionsInTransition and adds the region to AM.regions). The 
second assignment starts a normal assign process, setting RegionState OFFLINE, 
setting ZK OFFLINE, and calling sendRegionOpen() on the same dest RS. Then, 
when the first assignment calls AM.regionOnline(), the region gets removed from 
AM.regionsInTransition. This is a double assignment to the RS. If the RS 
chooses to cleanUpFailedOpen() as in 0.90, the region will be served nowhere 
and will not even exist in master's regionsInTransition; if the RS chooses to 
proceed with openRegion() as in trunk, master will get RS events OPENING, 
OPENED related to NO RegionState, as in HBASE-6300.
I can see we check whether the ZK node exists in setOfflineInZookeeper to 
prevent double assignment, but this check is only effective when hijack=true.
Is it possible to do something at an earlier stage to prevent double 
assignment, e.g. in forceRegionStateToOffline()?
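
Error case 2 can be replayed with a small single-threaded model. This is only 
a sketch under simplifying assumptions: DoubleAssignTimeline and its methods 
are hypothetical stand-ins for the AssignmentManager bookkeeping, the three 
states are a reduced subset, and ZooKeeper is not modelled at all. It shows 
the RS events from the second open arriving after the first assignment's 
regionOnline() has removed the entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified single-threaded model of error case 2; not HBase code. */
class DoubleAssignTimeline {
  enum State { OFFLINE, PENDING_OPEN, OPEN }

  /** Stand-in for AM.regionsInTransition, keyed by encoded region name. */
  private final Map<String, State> regionsInTransition = new HashMap<>();
  /** RS events that arrived for a region with no RegionState. */
  final List<String> orphanedEvents = new ArrayList<>();

  /** Models the unguarded path: any previous state is overwritten with OFFLINE. */
  void assign(String region) {
    regionsInTransition.put(region, State.OFFLINE);
    regionsInTransition.put(region, State.PENDING_OPEN); // after sendRegionOpen()
  }

  /** Models the master handling RS_ZK_REGION_OPENED (the ZK node is deleted here). */
  void handleOpened(String region) {
    regionsInTransition.put(region, State.OPEN);
  }

  /** Models AM.regionOnline(): the region leaves regionsInTransition. */
  void regionOnline(String region) {
    regionsInTransition.remove(region);
  }

  /** Models an RS event reaching the master for the given region. */
  void handleRsEvent(String region, String event) {
    if (!regionsInTransition.containsKey(region)) {
      orphanedEvents.add(event);
    }
  }

  static List<String> replayCase2() {
    DoubleAssignTimeline am = new DoubleAssignTimeline();
    String region = "b713fd655fa02395496c5a6e39ddf568";
    am.assign(region);        // first assignment
    am.handleOpened(region);  // first assignment succeeded, ZK node deleted
    am.assign(region);        // second assignment kicks in before regionOnline()
    am.regionOnline(region);  // first assignment finishes: entry removed
    am.handleRsEvent(region, "RS_ZK_REGION_OPENING"); // from the second open
    am.handleRsEvent(region, "RS_ZK_REGION_OPENED");
    return am.orphanedEvents;
  }

  public static void main(String[] args) {
    // prints [RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED]
    System.out.println(replayCase2());
  }
}
```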

bq. @Mary Is HBASE-5396 committed?
No... but explicitly calling assign() from HBaseAdmin can cause the same 
problem.

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299-v2.patch, HBASE-6299.patch



[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-11 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13411273#comment-13411273
 ] 

Maryann Xue commented on HBASE-6299:


Currently we don't check for concurrent double assignment, although it can 
happen quite easily after HBASE-5396.
{code}
RegionState state = addToRegionsInTransition(region, hijack);
synchronized (state) {
  assign(region, state, setOfflineInZK, forceNewPlan, hijack);
}
{code}
We now set RegionState OFFLINE in addToRegionsInTransition(), and set ZK node 
OFFLINE after we get into the critical section. Why don't we set these two 
OFFLINE together in addToRegionsInTransition() and after getting into the 
critical section check if RegionState is OFFLINE?

And with double assignment, we go directly to assign() without checking the 
current RegionState: addToRegionsInTransition() calls 
forceRegionStateToOffline(), and forceRegionStateToOffline() simply forces the 
RegionState to OFFLINE.
{code}
  RegionState state = this.regionsInTransition.get(encodedName);
  if (state == null) {
    state = new RegionState(region, RegionState.State.OFFLINE);
    this.regionsInTransition.put(encodedName, state);
  } else {
    // If we are reassigning the node do not force in-memory state to OFFLINE.
    // Based on the znode state we will decide if to change in-memory state to
    // OFFLINE or not. It will be done before setting znode to OFFLINE state.

    // We often get here with state == CLOSED because ClosedRegionHandler will
    // assign on its tail as part of the handling of a region close.
    if (!hijack) {
      LOG.debug("Forcing OFFLINE; was=" + state);
      state.update(RegionState.State.OFFLINE);
    }
  }
{code}
With this piece of code, we normally see logs like "Forcing OFFLINE; 
was=regionName state=CLOSED" during load balancing, but with double assignment 
we can see "Forcing OFFLINE; was=regionName state=OPEN". Should we ensure the 
state is CLOSED or OFFLINE before proceeding with the assignment?
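
A rough sketch of such a check, assuming a simplified model: 
OfflineGuardSketch, tryForceOffline() and the reduced State enum are 
hypothetical and not the AssignmentManager code, and the hijack path is left 
out. The point is only that the in-memory state is forced to OFFLINE when the 
region is new, OFFLINE or CLOSED, and the assignment is refused otherwise.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard on a simplified model; not the AssignmentManager code. */
class OfflineGuardSketch {
  enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED }

  /** Stand-in for regionsInTransition, keyed by encoded region name. */
  private final Map<String, State> regionsInTransition = new HashMap<>();

  /**
   * Forces the region OFFLINE only if it is new, OFFLINE, or CLOSED.
   * Returns false and changes nothing otherwise, so the caller can skip
   * the assignment instead of overwriting a live state.
   */
  synchronized boolean tryForceOffline(String encodedName) {
    State current = regionsInTransition.get(encodedName);
    if (current != null && current != State.OFFLINE && current != State.CLOSED) {
      return false;
    }
    regionsInTransition.put(encodedName, State.OFFLINE);
    return true;
  }

  /** Helper standing in for later state transitions driven by RS events. */
  synchronized void update(String encodedName, State state) {
    regionsInTransition.put(encodedName, state);
  }

  synchronized State get(String encodedName) {
    return regionsInTransition.get(encodedName);
  }

  public static void main(String[] args) {
    OfflineGuardSketch am = new OfflineGuardSketch();
    System.out.println(am.tryForceOffline("r1")); // new region: accepted
    am.update("r1", State.OPEN);
    System.out.println(am.tryForceOffline("r1")); // OPEN: refused
    am.update("r1", State.CLOSED);
    System.out.println(am.tryForceOffline("r1")); // CLOSED: accepted
  }
}
```

With a guard like this, the "Forcing OFFLINE; was=regionName state=OPEN" case 
would be rejected up front instead of being logged and carried on with.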

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299-v2.patch, HBASE-6299.patch



[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-05 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406900#comment-13406900
 ] 

Maryann Xue commented on HBASE-6299:


@stack but assign() checks RegionState OFFLINE at the beginning of each 
attempt, and not setting it OFFLINE might cause master to abort, as in 
HBASE-5816:
{code}
for (int i = 0; i < this.maximumAssignmentAttempts; i++) {
  int versionOfOfflineNode = -1;
  if (setOfflineInZK) {
    // get the version of the znode after setting it to OFFLINE.
    // versionOfOfflineNode will be -1 if the znode was not set to OFFLINE
    versionOfOfflineNode = setOfflineInZooKeeper(state, hijack);
{code}

{code}
  int setOfflineInZooKeeper(final RegionState state,
      boolean hijack) {
    // In case of reassignment the current state in memory need not be
    // OFFLINE. 
    if (!hijack && !state.isClosed() && !state.isOffline()) {
      String msg = "Unexpected state : " + state
          + " .. Cannot transit it to OFFLINE.";
      this.master.abort(msg, new IllegalStateException(msg));
      return -1;
    }
{code}

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299-v2.patch, HBASE-6299.patch



[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-05 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407153#comment-13407153
 ] 

Maryann Xue commented on HBASE-6299:


Yes, agree! 

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299-v2.patch, HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
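
The sequence above can be modelled with a small, self-contained Java sketch. This is a toy model, not HBase code: AssignRaceSketch, onNodeUpdate(), retryAfterTimeout() and the two maps are hypothetical stand-ins for the master's bookkeeping and the unassigned ZK nodes. It also assumes, as the description implies, that the retry does not re-register the region as in transition.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; names and structure are illustrative, not HBase's real classes.
class AssignRaceSketch {
    // Stand-in for the master's regionsInTransition map: region -> target server.
    final Map<String, String> regionsInTransition = new HashMap<>();
    // Stand-in for the unassigned ZK nodes: region -> current node state.
    final Map<String, String> unassignedNodes = new HashMap<>();

    // Steps 1-2: the master records a RegionState and creates the OFFLINE node.
    void startAssign(String region, String server) {
        regionsInTransition.put(region, server);
        unassignedNodes.put(region, "M_ZK_REGION_OFFLINE");
    }

    // Step 4 (assumed): the retry re-creates the ZK node for another server
    // but does not re-register the region as in transition.
    void retryAfterTimeout(String region) {
        unassignedNodes.put(region, "M_ZK_REGION_OFFLINE");
    }

    // A region server updates the node; returns true if the master acts on it.
    boolean onNodeUpdate(String region, String state) {
        unassignedNodes.put(region, state);
        if (!regionsInTransition.containsKey(region)) {
            return false; // step 5: no RegionState, so the update is ignored
        }
        if ("RS_ZK_REGION_OPENED".equals(state)) {
            regionsInTransition.remove(region); // models OpenedRegionHandler's effect
            unassignedNodes.remove(region);
        }
        return true;
    }

    // Step 6: unassign needs to create a fresh CLOSING node, which fails
    // while a stale node for the region still exists.
    boolean unassign(String region) {
        if (unassignedNodes.containsKey(region)) {
            return false;
        }
        unassignedNodes.put(region, "RS_ZK_REGION_CLOSING");
        return true;
    }

    public static void main(String[] args) {
        AssignRaceSketch master = new AssignRaceSketch();
        master.startAssign("regionX", "rs-1");
        master.onNodeUpdate("regionX", "RS_ZK_REGION_OPENING");
        master.onNodeUpdate("regionX", "RS_ZK_REGION_OPENED"); // first RS succeeds
        master.retryAfterTimeout("regionX");                   // RPC timed out anyway
        boolean handled = master.onNodeUpdate("regionX", "RS_ZK_REGION_OPENING");
        System.out.println("second OPENING handled: " + handled);
        System.out.println("later unassign succeeds: " + master.unassign("regionX"));
    }
}
```

In this model both printed values are false: the second OPENING update is dropped because the region is no longer in transition, and the stale node then blocks the later unassign.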
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; 
 retry=0
 java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on 

[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-05 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407204#comment-13407204
 ] 

Maryann Xue commented on HBASE-6299:


I'm also thinking of moving this setOfflineInZK logic into 
forceRegionStateToOffline(). What do you think?


[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-04 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: HBASE-6299-v2.patch

Make handling of RegionAlreadyInTransitionException work.


[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-04 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Status: Patch Available  (was: Open)


[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-04 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Status: Open  (was: Patch Available)


[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-04 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406862#comment-13406862
 ] 

Maryann Xue commented on HBASE-6299:


Agree, ramkrishna! You've made a good point here. My original idea was to 
return directly in the else branch and leave it to the TimeoutMonitor to 
assign this region if the RS did not open it. I changed to the current 
version to bring the assign retry forward. But given the region-in-transition 
problem you pointed out, the original return solution looks better.
{code}
  else {
+    // The destination region server is probably processing the region open, so it
+    // might be safer to try this region server again to avoid having two region
+    // servers open the same region.
+    LOG.error("Call openRegion() to " + plan.getDestination() +
+      " has timed out when trying to assign " + region.getRegionNameAsString() +
+      ".", t);
+    return;
+  }
{code}
And if we are considering removing the assign retry in HBASE-6060, problems 
like this one and the one in HBASE-5816 can be avoided.
I think triggering SSH (ServerShutdownHandler) on a SocketTimeout is a 
separate problem. There are several places in HMaster where we should consider 
whether to start SSH, but currently only RegionServerTracker starts it. Shall 
we open another JIRA entry to discuss this issue?


[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-02 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Attachment: HBASE-6299.patch

Add handling of SocketTimeoutException in assign():
1. Return if the region is already opened on this RS.
2. Otherwise, try assigning to the same RS again.
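
As a rough illustration of the two cases above, here is a minimal Java sketch. It is not the contents of HBASE-6299.patch: TimeoutAwareAssign, OpenRpc, the isOpenedOn check and the maxAttempts cap are all hypothetical names and assumptions introduced only to show the control flow.

```java
import java.net.SocketTimeoutException;
import java.util.function.BiPredicate;

// Illustrative sketch only; these types are hypothetical, not HBase's API.
class TimeoutAwareAssign {
    enum Outcome { OPEN_REQUESTED, ALREADY_OPENED, GAVE_UP }

    // Hypothetical stand-in for the master-to-RS openRegion() RPC.
    interface OpenRpc {
        void openRegion(String server, String region) throws SocketTimeoutException;
    }

    static Outcome assign(String region, String server, OpenRpc rpc,
                          BiPredicate<String, String> isOpenedOn, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                rpc.openRegion(server, region);
                return Outcome.OPEN_REQUESTED;
            } catch (SocketTimeoutException e) {
                // 1. The RS may have opened the region even though the ack was lost.
                if (isOpenedOn.test(server, region)) {
                    return Outcome.ALREADY_OPENED;
                }
                // 2. Otherwise loop and retry the same RS rather than picking another.
            }
        }
        return Outcome.GAVE_UP; // the attempt cap is an assumption of this sketch
    }

    public static void main(String[] args) {
        OpenRpc lostAck = (server, region) -> {
            throw new SocketTimeoutException("ack lost");
        };
        Outcome outcome = assign("regionX", "rs-1", lostAck, (s, r) -> true, 3);
        System.out.println(outcome);
    }
}
```

Retrying the same RS, rather than choosing a new one, is what keeps two region servers from opening the same region when only the acknowledgement was lost.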

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign for a second time, choosing another RS. 
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign 

[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-02 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6299:
---

Status: Patch Available  (was: Open)

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts to proceed, with success 
 eventually. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts the assignment a second time, choosing another RS.
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; 
 retry=0
 java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on socket 
 timeout exception: java.net.SocketTimeoutException: 

[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-02 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405459#comment-13405459
 ] 

Maryann Xue commented on HBASE-6299:


stack, thank you for pointing this out. I was thinking the innermost assign() 
would handle RegionAlreadyInTransitionException and return, in the code block 
as follows:
{code}
if (t instanceof RemoteException) {
  t = ((RemoteException) t).unwrapRemoteException();
  if (t instanceof RegionAlreadyInTransitionException) {
    String errorMsg = "Failed assignment in: " + plan.getDestination()
        + " due to " + t.getMessage();
    LOG.error(errorMsg, t);
    return;
  }
}
{code}
I just looked again at HRegionServer.openRegion(), and found that 
RegionAlreadyInTransitionException is wrapped as ServiceException:
{code}
  } catch (RegionAlreadyInTransitionException rie) {
    LOG.warn("Region is already in transition", rie);
    if (isBulkAssign) {
      builder.addOpeningState(RegionOpeningState.OPENED);
    } else {
      throw new ServiceException(rie);
    }
{code}
But I don't see why, in assign(), HMaster does not unwrap the RemoteException 
and then the ServiceException as well. And since 
RegionAlreadyInTransitionException is always wrapped, I don't see in what 
situation the first code block would be reached.

I might be missing something, or need a closer look.
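
To make the wrapping question concrete, here is a minimal sketch of peeling off nested exceptions by walking the cause chain. It assumes the inner exception is reachable through Throwable.getCause(). The two exception classes below are hypothetical stand-ins that only borrow the names from the discussion, and Unwrap.findCause() is an illustrative helper, not an HBase API. The sketch does not model how RemoteException.unwrapRemoteException() behaves.

```java
// Hypothetical stand-ins; the real classes live in HBase and protobuf.
class RegionAlreadyInTransitionException extends Exception {
  RegionAlreadyInTransitionException(String msg) { super(msg); }
}

class ServiceException extends Exception {
  ServiceException(Throwable cause) { super(cause); }
}

final class Unwrap {
  // Walk the cause chain and return the first throwable of the wanted type,
  // or null if the chain does not contain one.
  static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return type.cast(cur);
      }
    }
    return null;
  }
}
```

With a helper of this shape, the instanceof check in the first code block would no longer depend on how many layers the RS added.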

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts to proceed, with success 
 eventually. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts the assignment a second time, choosing another RS.
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 

[jira] [Commented] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected.

2012-07-02 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405465#comment-13405465
 ] 

Maryann Xue commented on HBASE-6300:


Apart from what happened in HBASE-6299, so far I see nothing that would cause 
this RegionState null warning. But if the code does get there, the state must 
be seriously inconsistent, I suppose: two region servers are hosting this 
region, and very likely the master's region info differs from META.

 Master should not ignore event RS_ZK_REGION_OPENED when regionState is null 
 or unexpected.
 --

 Key: HBASE-6300
 URL: https://issues.apache.org/jira/browse/HBASE-6300
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue

 When an RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most 
 probably proceed to update the region location in META. If the master ignores 
 the event, the region's location in HMaster becomes inconsistent with that in 
 META. Not deleting this ZK node would also make further region transitions 
 fail with the ZK exception "node already exists".
 So the master should either abort or fix this inconsistency.
 {code}
 case RS_ZK_REGION_OPENED:
   hri = checkIfInFailover(regionState, encodedName, regionName);
   if (hri != null) {
     regionState = new RegionState(hri, RegionState.State.OPEN,
         createTime, sn);
     regionsInTransition.put(encodedName, regionState);
     new OpenedRegionHandler(master, this, regionState.getRegion(),
         sn, expectedVersion).process();
     failoverProcessedRegions.put(encodedName, hri);
     break;
   }
   // Should see OPENED after OPENING but possible after PENDING_OPEN
   if (regionState == null ||
       (!regionState.isPendingOpen() && !regionState.isOpening())) {
     LOG.warn("Received OPENED for region " +
         prettyPrintedRegionName +
         " from server " + sn + " but region was in " +
         " the state " + regionState + " and not " +
         "in expected PENDING_OPEN or OPENING states");
     return;
   }
   // Handle OPENED by removing from transition and deleted zk node
   regionState.update(RegionState.State.OPEN, createTime, sn);
   this.executorService.submit(
       new OpenedRegionHandler(master, this, regionState.getRegion(),
           sn, expectedVersion));
   break;
 {code}
 Error logs:
 {code}
 2012-06-29 07:07:41,149 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,150 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:07:41,296 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,296 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:07:41,302 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,302 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-006,60020,1340890678078, 
 dest=swbss-hadoop-008,60020,1340891085175
 2012-06-29 07:08:38,872 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  (offlining)
 2012-06-29 07:08:47,875 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, 
 usedHeap=0, maxHeap=0) for 

[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-07-02 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405500#comment-13405500
 ] 

Maryann Xue commented on HBASE-6289:


@Jieshan, doable, I think. But currently CatalogTracker acts more like an HBase 
client, and talks only to ZooKeeper and region servers. I don't know whether 
this is its desired semantics.

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289-v2.patch, HBASE-6289-v2.patch, 
 HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
 {code}
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 {code}
 A few moments later, this RS encounters a DFS write problem and decides to 
 abort. The RS is soon restarted from the command line, and constantly 
 reports:
 {code}
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-02 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405507#comment-13405507
 ] 

Maryann Xue commented on HBASE-6299:


Thank you, Zhihong! Then I suppose the exception handling should be modified to:
{code}
  if (t instanceof RegionAlreadyInTransitionException) {
    String errorMsg = "Failed assignment in: " + plan.getDestination()
        + " due to " + t.getMessage();
    LOG.error(errorMsg, t);
    return;
  }
{code}

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts to proceed, with success 
 eventually. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts the assignment a second time, choosing another RS.
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 

[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-07-02 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405582#comment-13405582
 ] 

Maryann Xue commented on HBASE-6299:


It happened on a 0.90 cluster. I also checked the trunk code and assume the 
issue still exists there.

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6299.patch


 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts to proceed, with success 
 eventually. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts the assignment a second time, choosing another RS.
 5. But since the HMaster's OpenedRegionHandler has already been triggered by 
 the region open on the previous RS, and the RegionState has already been 
 removed from regionsInTransition, HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; 
 retry=0
 

[jira] [Created] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected (not in failover).

2012-07-01 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6300:
--

 Summary: Master should not ignore event RS_ZK_REGION_OPENED when 
regionState is null or unexpected (not in failover).
 Key: HBASE-6300
 URL: https://issues.apache.org/jira/browse/HBASE-6300
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue


When an RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most 
probably proceed to update the region location in META. If the master ignores 
the event, the region's location in HMaster becomes inconsistent with that in 
META. Not deleting this ZK node would also make further region transitions 
fail with the ZK exception "node already exists".
So the master should either abort or fix this inconsistency.
{code}
case RS_ZK_REGION_OPENED:
  hri = checkIfInFailover(regionState, encodedName, regionName);
  if (hri != null) {
    regionState = new RegionState(hri, RegionState.State.OPEN,
        createTime, sn);
    regionsInTransition.put(encodedName, regionState);
    new OpenedRegionHandler(master, this, regionState.getRegion(), sn,
        expectedVersion).process();
    failoverProcessedRegions.put(encodedName, hri);
    break;
  }
  // Should see OPENED after OPENING but possible after PENDING_OPEN
  if (regionState == null ||
      (!regionState.isPendingOpen() && !regionState.isOpening())) {
    LOG.warn("Received OPENED for region " +
        prettyPrintedRegionName +
        " from server " + sn + " but region was in " +
        " the state " + regionState + " and not " +
        "in expected PENDING_OPEN or OPENING states");
    return;
  }
  // Handle OPENED by removing from transition and deleted zk node
  regionState.update(RegionState.State.OPEN, createTime, sn);
  this.executorService.submit(
      new OpenedRegionHandler(master, this, regionState.getRegion(), sn,
          expectedVersion));
  break;
{code}
{code}
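
One way to picture the "abort or fix" choice, as a sketch only: the State and Action enums, decide(), and the abortOnInconsistency flag are hypothetical names introduced for illustration and are not part of AssignmentManager. A null state models a region that is missing from regionsInTransition.

```java
final class OpenedEventSketch {
  enum State { PENDING_OPEN, OPENING, OPEN }
  enum Action { PROCESS_OPENED, ABORT_MASTER, REPAIR_STATE }

  // Decide what to do with an RS_ZK_REGION_OPENED event outside failover.
  static Action decide(State regionState, boolean abortOnInconsistency) {
    if (regionState == State.PENDING_OPEN || regionState == State.OPENING) {
      // Expected path: hand the event to the opened-region handling.
      return Action.PROCESS_OPENED;
    }
    // Null or unexpected state: do not silently return, since the RS has
    // probably already updated META and the ZK node would be left behind.
    return abortOnInconsistency ? Action.ABORT_MASTER : Action.REPAIR_STATE;
  }
}
```

What "repair" would involve (for example, reconciling with META and deleting the ZK node) is left open here, as it is in the description above.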

Error logs:
{code}
2012-06-29 07:07:41,149 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=swbss-hadoop-164,60020,1340888346294, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,150 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server 
swbss-hadoop-164,60020,1340888346294 but region was in  the state null and not 
in expected PENDING_OPEN or OPENING states
2012-06-29 07:07:41,296 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=swbss-hadoop-164,60020,1340888346294, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,296 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server 
swbss-hadoop-164,60020,1340888346294 but region was in  the state null and not 
in expected PENDING_OPEN or OPENING states
2012-06-29 07:07:41,302 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENED, 
server=swbss-hadoop-164,60020,1340888346294, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received OPENED for region b713fd655fa02395496c5a6e39ddf568 from server 
swbss-hadoop-164,60020,1340888346294 but region was in  the state null and not 
in expected PENDING_OPEN or OPENING states
2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance 
hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
 src=swbss-hadoop-006,60020,1340890678078, 
dest=swbss-hadoop-008,60020,1340891085175
2012-06-29 07:08:38,872 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 (offlining)
2012-06-29 07:08:47,875 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Sent CLOSE to serverName=swbss-hadoop-006,60020,1340890678078, 
load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 state=PENDING_CLOSE, ts=1340926468331, server=null
2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been PENDING_CLOSE for too long, running forced unassign again on 
region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.

[jira] [Updated] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected.

2012-07-01 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6300:
---

Summary: Master should not ignore event RS_ZK_REGION_OPENED when 
regionState is null or unexpected.  (was: Master should not ignore event 
RS_ZK_REGION_OPENED when regionState is null or unexpected (not in failover).)

 Master should not ignore event RS_ZK_REGION_OPENED when regionState is null 
 or unexpected.
 --

 Key: HBASE-6300
 URL: https://issues.apache.org/jira/browse/HBASE-6300
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue

 When an RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most 
 probably proceed to update the region location in META. If the master ignores 
 this event, the region's location in HMaster becomes inconsistent with that in 
 META. Not deleting this ZK node would also make further region transitions 
 fail with the ZK exception "node already exists".
 So the master should either abort or fix this inconsistency.
 {code}
 case RS_ZK_REGION_OPENED:
   hri = checkIfInFailover(regionState, encodedName, regionName);
   if (hri != null) {
     regionState = new RegionState(hri, RegionState.State.OPEN,
         createTime, sn);
     regionsInTransition.put(encodedName, regionState);
     new OpenedRegionHandler(master, this, regionState.getRegion(),
         sn, expectedVersion).process();
     failoverProcessedRegions.put(encodedName, hri);
     break;
   }
   // Should see OPENED after OPENING but possible after PENDING_OPEN
   if (regionState == null ||
       (!regionState.isPendingOpen() && !regionState.isOpening())) {
     LOG.warn("Received OPENED for region " +
         prettyPrintedRegionName +
         " from server " + sn + " but region was in " +
         " the state " + regionState + " and not " +
         "in expected PENDING_OPEN or OPENING states");
     return;
   }
   // Handle OPENED by removing from transition and deleted zk node
   regionState.update(RegionState.State.OPEN, createTime, sn);
   this.executorService.submit(
       new OpenedRegionHandler(master, this, regionState.getRegion(),
           sn, expectedVersion));
   break;
 {code}
 Error logs:
 {code}
 2012-06-29 07:07:41,149 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,150 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:07:41,296 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,296 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:07:41,302 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:07:41,302 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 
 b713fd655fa02395496c5a6e39ddf568 from server 
 swbss-hadoop-164,60020,1340888346294 but region was in  the state null and 
 not in expected PENDING_OPEN or OPENING states
 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-006,60020,1340890678078, 
 dest=swbss-hadoop-008,60020,1340891085175
 2012-06-29 07:08:38,872 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  (offlining)
 2012-06-29 07:08:47,875 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, 
 usedHeap=0, maxHeap=0) for region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 2012-06-29 08:04:37,681 INFO 
 

[jira] [Created] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-06-30 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6299:
--

 Summary: RS starts region open while fails ack to 
HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a 
series of successive problems.
 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical


1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into 
regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives 
the open region request and starts processing it, eventually succeeding. 
However, due to network problems, HMaster fails to receive the response for the 
openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But HMaster's OpenedRegionHandler has already been triggered by the region 
open on the previous RS, and the RegionState has already been removed from 
regionsInTransition, so HMaster treats the unassigned ZK node 
RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because 
RS_ZK_REGION_CLOSING cannot be created.
{code}
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Using pre-existing plan for region 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
 
plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
 src=swbss-hadoop-004,60020,1340890123243, 
dest=swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 to swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=swbss-hadoop-006,60020,1340890678078, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENING, 
server=swbss-hadoop-006,60020,1340890678078, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_OPENED, 
server=swbss-hadoop-006,60020,1340890678078, 
region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,299 DEBUG 
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
event for 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region 
b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
2012-06-29 07:06:32,301 DEBUG 
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
opened the region 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Failed assignment of 
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
 to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on socket 
timeout exception: java.net.SocketTimeoutException: 12 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765 
remote=/172.16.0.6:60020]
at 
org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
at 

[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

2012-06-30 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404682#comment-13404682
 ] 

Maryann Xue commented on HBASE-6299:


I think a good option would be to check, when handling the RPC failure, whether 
the region has already been assigned successfully, so that there is no need to 
start another attempt.
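
A rough sketch of that check, as a self-contained model and not HBase code. AssignRetrySketch, markOnline() and shouldRetryElsewhere() are hypothetical names, and the map stands in for whatever the master uses to record that OPENED has already been handled for a region.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: records where the master believes each region is online,
// and decides what to do after the openRegion() RPC to a target server times out.
final class AssignRetrySketch {
    // encoded region name -> server name, filled in once OPENED has been handled
    private final Map<String, String> onlineRegions = new HashMap<>();

    void markOnline(String region, String server) {
        onlineRegions.put(region, server);
    }

    // Returns true if a second assignment attempt is needed after the RPC to
    // 'target' timed out; false if the region already opened on that server.
    boolean shouldRetryElsewhere(String region, String target) {
        return !target.equals(onlineRegions.get(region));
    }
}
```

In the scenario from the description, the OPENED event from the first RS is handled before the RPC times out, so shouldRetryElsewhere() would return false and the second attempt (and the stale ZK node it leaves behind) would be avoided.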

 RS starts region open while fails ack to HMaster.sendRegionOpen() causes 
 inconsistency in HMaster's region state and a series of successive problems.
 -

 Key: HBASE-6299
 URL: https://issues.apache.org/jira/browse/HBASE-6299
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical

 1. HMaster tries to assign a region to an RS.
 2. HMaster creates a RegionState for this region and puts it into 
 regionsInTransition.
 3. In the first assign attempt, HMaster calls RS.openRegion(). The RS 
 receives the open region request and starts processing it, eventually 
 succeeding. However, due to network problems, HMaster fails to receive the 
 response for the openRegion() call, and the call times out.
 4. HMaster attempts to assign a second time, choosing another RS.
 5. But HMaster's OpenedRegionHandler has already been triggered by the 
 region open on the previous RS, and the RegionState has already been removed 
 from regionsInTransition, so HMaster treats the unassigned ZK node 
 RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
 6. The unassigned ZK node stays, and a later unassign fails because 
 RS_ZK_REGION_CLOSING cannot be created.
 {code}
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
 region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.;
  
 plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.,
  src=swbss-hadoop-004,60020,1340890123243, 
 dest=swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to swbss-hadoop-006,60020,1340890678078
 2012-06-29 07:03:38,870 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:28,882 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,291 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, 
 region=b713fd655fa02395496c5a6e39ddf568
 2012-06-29 07:06:32,299 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, 
 regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Deleting existing unassigned node for 
 b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x2377fee2ae80007 Successfully deleted unassigned node for 
 region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
 2012-06-29 07:06:32,301 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has 
 opened the region 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  that was online on serverName=swbss-hadoop-006,60020,1340890678078, 
 load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
 2012-06-29 07:07:41,140 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
 CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
  to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, 
 regions=575, usedHeap=0, maxHeap=0), trying to 

[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-29 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6289:
---

Attachment: HBASE-6289-v2.patch

Updated the patch.

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289-v2.patch, HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
 {code}
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 {code}
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS then soon gets restarted from the command line, and constantly 
 reports:
 {code}
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-29 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404365#comment-13404365
 ] 

Maryann Xue commented on HBASE-6289:


@stack thanks for the explanation!
@Ted sorry for my carelessness.

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289-v2.patch, HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
 {code}
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 {code}
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS then soon gets restarted from the command line, and constantly 
 reports:
 {code}
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 {code}





[jira] [Created] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6289:
--

 Summary: ROOT region doesn't get re-assigned in 
ServerShutdownHandler if the RS is still working but only the RS's ZK node 
expires.
 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Priority: Critical


The ROOT RS has some network problem and its ZK node expires first, which kicks 
off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
re-assign ROOT. At that time, the RS is actually still working and passes the 
verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
skipped.
  private void verifyAndAssignRoot()
  throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. 
The RS then soon gets restarted from the command line, and constantly reports:
2012-06-27 23:13:08,627 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
Region is not online: -ROOT-,,0
2012-06-27 23:13:08,627 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
Region is not online: -ROOT-,,0
2012-06-27 23:13:08,630 DEBUG 
org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; 
Region is not online: -ROOT-,,0







[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6289:
---

Attachment: HBASE-6289.patch

Add excluded server in verifyRootRegionLocation().
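
The idea can be sketched as follows. This is a self-contained model under stated assumptions, not the attached patch: RootVerifySketch, verifyRootLocation() and shouldAssignRoot() are hypothetical names, and the boolean "responds" stands in for the existing liveness check done within the verification timeout.

```java
// Hypothetical model: ROOT's recorded location only counts as verified if it
// responds AND it is not the server whose ZK node has just expired.
final class RootVerifySketch {
    static boolean verifyRootLocation(String rootServer, boolean responds,
                                      String excludedServer) {
        if (rootServer == null || !responds) {
            return false;
        }
        // A server that is being shut down must not pass verification,
        // even if it still answers requests for the moment.
        return !rootServer.equals(excludedServer);
    }

    // Mirrors the shape of verifyAndAssignRoot(): re-assign only when
    // verification fails.
    static boolean shouldAssignRoot(String rootServer, boolean responds,
                                    String expiredServer) {
        return !verifyRootLocation(rootServer, responds, expiredServer);
    }
}
```

In the reported scenario the ROOT RS still responds, so without the exclusion the check passes and ROOT is never re-assigned; with the exclusion, shouldAssignRoot() returns true for the expired server.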

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS then soon gets restarted from the command line, and constantly 
 reports:
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0





[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6289:
---

Assignee: Maryann Xue
  Status: Patch Available  (was: Open)

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS then soon gets restarted from the command line, and constantly 
 reports:
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0





[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13403105#comment-13403105
 ] 

Maryann Xue commented on HBASE-6289:


@ramkrishna: Yes, I thought of this too, but I saw this comment before 
verifyAndAssignRoot(): "Before assign the ROOT region, ensure it haven't been 
assigned by other place". Not sure if this "ROOT assigned elsewhere" situation 
can actually occur, but we seem to have seen META assigned on several 
Region Servers at the same time when there was chaos going on in our lab's 
network. There can be only one single search path for any region (incl. META 
and ROOT), though, regardless of client cache. And this is the thing I don't 
understand: why do we try to treat ROOT differently?


 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS then soon gets restarted from the command line, and constantly 
 reports:
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13403106#comment-13403106
 ] 

Maryann Xue commented on HBASE-6289:


@ramkrishna: Yes, I thought of this too, but I saw this comment here before 
verifyAndAssignRoot(): "Before assign the ROOT region, ensure it haven't been 
assigned by other place". I am not sure whether ROOT can actually end up 
assigned elsewhere, but we do seem to have seen META assigned on several 
Region Servers at the same time when there was chaos going on in our lab's 
network. There can be only one single search path for any region (incl. META 
and ROOT), though, regardless of client cache. And this is what I don't 
understand: why do we treat ROOT differently?

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS is then soon restarted from the command line, and constantly 
 reports:
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0





[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

2012-06-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13403613#comment-13403613
 ] 

Maryann Xue commented on HBASE-6289:


@stack Thanks for the comments! If getRootServerLocation() returns null, 
verifyRootRegionLocation() will return false, so assignRoot() can be called. 
Thus verifyAndAssignRoot() returns successfully and there won't be a loop or 
retry here.
{code}
if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout, 
    this.serverName)) {
  this.services.getAssignmentManager().assignRoot();
}
{code}

I think ramkrishna was asking why we verify ROOT before trying to assign it, 
while we assign META directly. That's my question as well.
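
To make the difference between the two checks concrete, here is a small toy model. It is not HBase code: RootLocation, verify() and wouldAssignRoot() are hypothetical names, and it assumes the extra serverName argument means "a ROOT location on the expired server never counts as verified", which is a reading of the snippet above, not a statement about the actual patch.

```java
import java.util.Objects;

/** Toy model, not HBase code: all class and method names here are hypothetical. */
public class RootVerifySketch {
    /** Where ROOT is recorded as living, and whether that server still answers RPCs. */
    static final class RootLocation {
        final String serverName;   // null means no location is known
        final boolean answersRpc;

        RootLocation(String serverName, boolean answersRpc) {
            this.serverName = serverName;
            this.answersRpc = answersRpc;
        }
    }

    /** Original-style check: ROOT counts as verified if its recorded host still answers. */
    static boolean verify(RootLocation loc) {
        return loc.serverName != null && loc.answersRpc;
    }

    /** Assumed patched-style check: a location on the expired server is never verified. */
    static boolean verify(RootLocation loc, String expiredServer) {
        if (loc.serverName == null || Objects.equals(loc.serverName, expiredServer)) {
            return false;
        }
        return loc.answersRpc;
    }

    /** Returns true when the shutdown handler would go on to call assignRoot(). */
    static boolean wouldAssignRoot(RootLocation loc, String expiredServer, boolean patched) {
        return patched ? !verify(loc, expiredServer) : !verify(loc);
    }

    public static void main(String[] args) {
        // The RS lost its ZK node but still answers RPCs, as in the issue description.
        RootLocation loc = new RootLocation("rs1,60020,1", true);
        System.out.println("unpatched reassigns: " + wouldAssignRoot(loc, "rs1,60020,1", false));
        System.out.println("patched reassigns:   " + wouldAssignRoot(loc, "rs1,60020,1", true));
    }
}
```

In this model a null location also fails verification, which matches the point above that assignRoot() is then called without any loop or retry.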

 ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is 
 still working but only the RS's ZK node expires.
 --

 Key: HBASE-6289
 URL: https://issues.apache.org/jira/browse/HBASE-6289
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
 Attachments: HBASE-6289.patch


 The ROOT RS has some network problem and its ZK node expires first, which 
 kicks off the ServerShutdownHandler. It calls verifyAndAssignRoot() to try to 
 re-assign ROOT. At that time, the RS is actually still working and passes the 
 verifyRootRegionLocation() check, so re-assignment of the ROOT region is 
 skipped.
 {code}
   private void verifyAndAssignRoot()
   throws InterruptedException, IOException, KeeperException {
     long timeout = this.server.getConfiguration().
       getLong("hbase.catalog.verification.timeout", 1000);
     if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
       this.services.getAssignmentManager().assignRoot();
     }
   }
 {code}
 After a few moments, this RS encounters a DFS write problem and decides to 
 abort. The RS is then soon restarted from the command line, and constantly 
 reports:
 {code}
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,627 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,628 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 2012-06-27 23:13:08,630 DEBUG 
 org.apache.hadoop.hbase.regionserver.HRegionServer: 
 NotServingRegionException; Region is not online: -ROOT-,,0
 {code}





[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.

2012-06-17 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13393648#comment-13393648
 ] 

Maryann Xue commented on HBASE-6169:


Yes, you are right. Looks like this problem only exists with 0.90.

 When a RS aborts without finishing closing a region, this region will always 
 remain in transition.
 

 Key: HBASE-6169
 URL: https://issues.apache.org/jira/browse/HBASE-6169
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Maryann Xue

 When an RS gets a ZK error while trying to create a CLOSING node in the 
 process of closing a region, it aborts without completing the close.
 The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to 
 reassign this region because it is in PENDING_CLOSE state, while all regions 
 that originally belonged to the dead RS get removed from the regions map.
 TimeoutMonitor then endlessly tries to unassign this region with the LOG 
 message "Region has been PENDING_CLOSE for too long". The unassign returns 
 without doing anything, because this region does not exist in the regions map:
   public void unassign(HRegionInfo region, boolean force, ServerName dest) {
     // TODO: Method needs refactoring.  Ugly buried returns throughout.  Beware!
     LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
     synchronized (this.regions) {
       // Check if this region is currently assigned
       if (!regions.containsKey(region)) {
         LOG.debug("Attempted to unassign region " +
           region.getRegionNameAsString() + " but it is not " +
           "currently assigned anywhere");
         return;
       }
     }
     ...





[jira] [Updated] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.

2012-06-17 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6169:
---

Affects Version/s: (was: 0.94.0)

 When a RS aborts without finishing closing a region, this region will always 
 remain in transition.
 

 Key: HBASE-6169
 URL: https://issues.apache.org/jira/browse/HBASE-6169
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Maryann Xue

 When an RS gets a ZK error while trying to create a CLOSING node in the 
 process of closing a region, it aborts without completing the close.
 The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to 
 reassign this region because it is in PENDING_CLOSE state, while all regions 
 that originally belonged to the dead RS get removed from the regions map.
 TimeoutMonitor then endlessly tries to unassign this region with the LOG 
 message "Region has been PENDING_CLOSE for too long". The unassign returns 
 without doing anything, because this region does not exist in the regions map:
   public void unassign(HRegionInfo region, boolean force, ServerName dest) {
     // TODO: Method needs refactoring.  Ugly buried returns throughout.  Beware!
     LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
     synchronized (this.regions) {
       // Check if this region is currently assigned
       if (!regions.containsKey(region)) {
         LOG.debug("Attempted to unassign region " +
           region.getRegionNameAsString() + " but it is not " +
           "currently assigned anywhere");
         return;
       }
     }
     ...





[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.

2012-06-07 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290857#comment-13290857
 ] 

Maryann Xue commented on HBASE-6169:


@ramkrishna, we actually found this problem when disabling a table, against 
0.90. I suppose that with trunk this region would be cleared from RIT in 
ServerShutdownHandler. But I assume that during load balancing, since 
ServerShutdownHandler does nothing with PENDING_CLOSE or CLOSING regions, the 
above situation will be triggered by TimeoutMonitor.

 When a RS aborts without finishing closing a region, this region will always 
 remain in transition.
 

 Key: HBASE-6169
 URL: https://issues.apache.org/jira/browse/HBASE-6169
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue

 When an RS gets a ZK error while trying to create a CLOSING node in the 
 process of closing a region, it aborts without completing the close.
 The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to 
 reassign this region because it is in PENDING_CLOSE state, while all regions 
 that originally belonged to the dead RS get removed from the regions map.
 TimeoutMonitor then endlessly tries to unassign this region with the LOG 
 message "Region has been PENDING_CLOSE for too long". The unassign returns 
 without doing anything, because this region does not exist in the regions map:
   public void unassign(HRegionInfo region, boolean force, ServerName dest) {
     // TODO: Method needs refactoring.  Ugly buried returns throughout.  Beware!
     LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
     synchronized (this.regions) {
       // Check if this region is currently assigned
       if (!regions.containsKey(region)) {
         LOG.debug("Attempted to unassign region " +
           region.getRegionNameAsString() + " but it is not " +
           "currently assigned anywhere");
         return;
       }
     }
     ...





[jira] [Created] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.

2012-06-06 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6169:
--

 Summary: When a RS aborts without finishing closing a region, this 
region will always remain in transition.
 Key: HBASE-6169
 URL: https://issues.apache.org/jira/browse/HBASE-6169
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue


When an RS gets a ZK error while trying to create a CLOSING node in the process 
of closing a region, it aborts without completing the close.
The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to 
reassign this region because it is in PENDING_CLOSE state, while all regions 
that originally belonged to the dead RS get removed from the regions map.
TimeoutMonitor then endlessly tries to unassign this region with the LOG message 
"Region has been PENDING_CLOSE for too long". The unassign returns without 
doing anything, because this region does not exist in the regions map:
  public void unassign(HRegionInfo region, boolean force, ServerName dest) {
    // TODO: Method needs refactoring.  Ugly buried returns throughout.  Beware!
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");

    synchronized (this.regions) {
      // Check if this region is currently assigned
      if (!regions.containsKey(region)) {
        LOG.debug("Attempted to unassign region " +
          region.getRegionNameAsString() + " but it is not " +
          "currently assigned anywhere");
        return;
      }
    }
    ...





[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.

2012-06-06 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1329#comment-1329
 ] 

Maryann Xue commented on HBASE-6169:


I'm wondering if it is safe to call AM.assign(region) if we know this 
unassign request is coming from TimeoutMonitor, instead of just returning.
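
A rough sketch of that idea, as a toy model and not as a patch: UnassignSketch, Outcome, the fromTimeoutMonitor flag and the fallbackServer parameter are all hypothetical names invented for illustration, and regions and servers are plain strings. Whether this is actually safe in the real AssignmentManager is exactly the open question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the idea, not HBase code; every name here is hypothetical. */
public class UnassignSketch {
    enum Outcome { CLOSE_SENT, IGNORED, REASSIGNED }

    private final Map<String, String> regions = new HashMap<>(); // region -> hosting server
    private final Set<String> pendingClose = new HashSet<>();    // regions stuck in PENDING_CLOSE

    void markAssigned(String region, String server) {
        synchronized (regions) {
            regions.put(region, server);
        }
    }

    /** Models the dead-RS path: the region leaves the map but stays PENDING_CLOSE. */
    void serverDiedDuringClose(String region) {
        synchronized (regions) {
            regions.remove(region);
            pendingClose.add(region);
        }
    }

    Outcome unassign(String region, boolean fromTimeoutMonitor, String fallbackServer) {
        synchronized (regions) {
            if (!regions.containsKey(region)) {
                // Instead of always returning, reassign when the monitor is retrying
                // a region that is still marked PENDING_CLOSE.
                if (fromTimeoutMonitor && pendingClose.remove(region)) {
                    regions.put(region, fallbackServer);
                    return Outcome.REASSIGNED;
                }
                return Outcome.IGNORED;
            }
            pendingClose.add(region);
            return Outcome.CLOSE_SENT;
        }
    }

    String serverOf(String region) {
        synchronized (regions) {
            return regions.get(region);
        }
    }

    public static void main(String[] args) {
        UnassignSketch am = new UnassignSketch();
        am.markAssigned("r1", "rs1");
        am.serverDiedDuringClose("r1");
        System.out.println(am.unassign("r1", false, "rs2"));
        System.out.println(am.unassign("r1", true, "rs2"));
    }
}
```

The ordinary caller still gets the early return; only the retry coming from the monitor takes the reassign branch.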

 When a RS aborts without finishing closing a region, this region will always 
 remain in transition.
 

 Key: HBASE-6169
 URL: https://issues.apache.org/jira/browse/HBASE-6169
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue

 When an RS gets a ZK error while trying to create a CLOSING node in the 
 process of closing a region, it aborts without completing the close.
 The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to 
 reassign this region because it is in PENDING_CLOSE state, while all regions 
 that originally belonged to the dead RS get removed from the regions map.
 TimeoutMonitor then endlessly tries to unassign this region with the LOG 
 message "Region has been PENDING_CLOSE for too long". The unassign returns 
 without doing anything, because this region does not exist in the regions map:
   public void unassign(HRegionInfo region, boolean force, ServerName dest) {
     // TODO: Method needs refactoring.  Ugly buried returns throughout.  Beware!
     LOG.debug("Starting unassignment of region " +
       region.getRegionNameAsString() + " (offlining)");
     synchronized (this.regions) {
       // Check if this region is currently assigned
       if (!regions.containsKey(region)) {
         LOG.debug("Attempted to unassign region " +
           region.getRegionNameAsString() + " but it is not " +
           "currently assigned anywhere");
         return;
       }
     }
     ...





[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

2012-05-30 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6049:
---

Attachment: HBASE-6049-v3.patch

@stack, yes, there was a mistake. Updated the patch.

 Serializing List containing null elements will cause NullPointerException 
 in HbaseObjectWritable.writeObject()
 

 Key: HBASE-6049
 URL: https://issues.apache.org/jira/browse/HBASE-6049
 Project: HBase
  Issue Type: Bug
  Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6049-v2.patch, HBASE-6049-v3.patch, 
 HBASE-6049.patch


 One error case is in the Coprocessor AggregationClient: the median() 
 function handles an empty region and returns a List object whose first 
 element is null. An NPE then occurs in the RPC response stage and the 
 response never gets sent.





[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

2012-05-21 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6049:
---

Attachment: HBASE-6049-v2.patch

@Zhihong, I updated the patch with a modification to the test case. How does 
this look?

 Serializing List containing null elements will cause NullPointerException 
 in HbaseObjectWritable.writeObject()
 

 Key: HBASE-6049
 URL: https://issues.apache.org/jira/browse/HBASE-6049
 Project: HBase
  Issue Type: Bug
  Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6049-v2.patch, HBASE-6049.patch


 One error case is in the Coprocessor AggregationClient: the median() 
 function handles an empty region and returns a List object whose first 
 element is null. An NPE then occurs in the RPC response stage and the 
 response never gets sent.





[jira] [Created] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

2012-05-18 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6049:
--

 Summary: Serializing List containing null elements will cause 
NullPointerException in HbaseObjectWritable.writeObject()
 Key: HBASE-6049
 URL: https://issues.apache.org/jira/browse/HBASE-6049
 Project: HBase
  Issue Type: Bug
  Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue


One error case is in the Coprocessor AggregationClient: the median() function 
handles an empty region and returns a List object whose first element is 
null. An NPE then occurs in the RPC response stage and the response never gets 
sent.
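
For illustration only, one common way to make list serialization tolerate null elements is to write a presence flag before each element. The sketch below uses plain java.io streams and strings; it is not HbaseObjectWritable's wire format and not the attached patch, and NullSafeListSketch with its write() and read() methods is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy serializer, not HBase code: shows a null flag written per list element. */
public class NullSafeListSketch {
    /** Writes the list size, then for each element a null flag and, if present, the value. */
    static byte[] write(List<String> list) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(list.size());
            for (String element : list) {
                out.writeBoolean(element == null);
                if (element != null) {
                    out.writeUTF(element);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back what write() produced, restoring null elements in place. */
    static List<String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            List<String> list = new ArrayList<>(size);
            for (int i = 0; i < size; i++) {
                boolean isNull = in.readBoolean();
                list.add(isNull ? null : in.readUTF());
            }
            return list;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A list whose first element is null, as in the median() case described above.
        List<String> original = Arrays.asList(null, "42");
        System.out.println(read(write(original)).equals(original));
    }
}
```

Without the flag, the writer would have to dereference the element to find its type or value, which is where a null element turns into an NPE.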





[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

2012-05-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6049:
---

Attachment: HBASE-6049.patch

handle null values in a list in writeObject()

 Serializing List containing null elements will cause NullPointerException 
 in HbaseObjectWritable.writeObject()
 

 Key: HBASE-6049
 URL: https://issues.apache.org/jira/browse/HBASE-6049
 Project: HBase
  Issue Type: Bug
  Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6049.patch


 One error case is in the Coprocessor AggregationClient: the median() 
 function handles an empty region and returns a List object whose first 
 element is null. An NPE then occurs in the RPC response stage and the 
 response never gets sent.





[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

2012-05-18 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6049:
---

Status: Patch Available  (was: Open)

 Serializing List containing null elements will cause NullPointerException 
 in HbaseObjectWritable.writeObject()
 

 Key: HBASE-6049
 URL: https://issues.apache.org/jira/browse/HBASE-6049
 Project: HBase
  Issue Type: Bug
  Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6049.patch


 One error case is in the Coprocessor AggregationClient: the median() 
 function handles an empty region and returns a List object whose first 
 element is null. An NPE then occurs in the RPC response stage and the 
 response never gets sent.





[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().

2012-05-16 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6029:
---

Affects Version/s: 0.94.0

 HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
 

 Key: HBASE-6029
 URL: https://issues.apache.org/jira/browse/HBASE-6029
 Project: HBase
  Issue Type: Bug
  Components: hbck
Affects Versions: 0.94.0
Reporter: Maryann Xue







[jira] [Created] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().

2012-05-16 Thread Maryann Xue (JIRA)
Maryann Xue created HBASE-6029:
--

 Summary: HBCK doesn't recover Balance switch if exception occurs 
in onlineHbck().
 Key: HBASE-6029
 URL: https://issues.apache.org/jira/browse/HBASE-6029
 Project: HBase
  Issue Type: Bug
  Components: hbck
Reporter: Maryann Xue








[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().

2012-05-16 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6029:
---

Attachment: HBASE-6029.patch

Add a try-finally block to recover the balance switch.
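
The pattern, sketched outside of hbck: remember the previous switch value, turn the balancer off, and restore the value in a finally block so that an exception in the check cannot leave it off. FakeAdmin and runWithBalancerOff() are hypothetical stand-ins written for this example; the only assumption carried over is that setting the switch returns its previous value.

```java
/** Toy model, not HBaseFsck code; FakeAdmin is a hypothetical stand-in for the admin client. */
public class BalanceSwitchSketch {
    static final class FakeAdmin {
        private boolean balancerOn = true;

        /** Sets the switch and returns the value it had before the call. */
        boolean balanceSwitch(boolean on) {
            boolean previous = balancerOn;
            balancerOn = on;
            return previous;
        }

        boolean isBalancerOn() {
            return balancerOn;
        }
    }

    /** Runs the check with the balancer off and always restores the old setting. */
    static void runWithBalancerOff(FakeAdmin admin, Runnable onlineCheck) {
        boolean previous = admin.balanceSwitch(false);
        try {
            onlineCheck.run();
        } finally {
            // Executed on normal return and on exception alike.
            admin.balanceSwitch(previous);
        }
    }

    public static void main(String[] args) {
        FakeAdmin admin = new FakeAdmin();
        try {
            runWithBalancerOff(admin, () -> {
                throw new IllegalStateException("check failed");
            });
        } catch (IllegalStateException expected) {
            System.out.println("balancer on after failure: " + admin.isBalancerOn());
        }
    }
}
```

Restoring the saved value rather than hard-coding true also keeps a balancer that was already off from being switched on by the tool.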

 HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
 

 Key: HBASE-6029
 URL: https://issues.apache.org/jira/browse/HBASE-6029
 Project: HBase
  Issue Type: Bug
  Components: hbck
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6029.patch








[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().

2012-05-16 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-6029:
---

Status: Patch Available  (was: Open)

 HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
 

 Key: HBASE-6029
 URL: https://issues.apache.org/jira/browse/HBASE-6029
 Project: HBase
  Issue Type: Bug
  Components: hbck
Affects Versions: 0.94.0
Reporter: Maryann Xue
 Attachments: HBASE-6029.patch








[jira] [Updated] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager

2012-04-25 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-5829:
---

Attachment: HBASE-5829-trunk.patch
HBASE-5829-0.90.patch

Add corresponding operations to this.servers

 Inconsistency between the regions map and the servers map in 
 AssignmentManager
 --

 Key: HBASE-5829
 URL: https://issues.apache.org/jira/browse/HBASE-5829
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.92.1
Reporter: Maryann Xue
 Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch


 There are occurrences in AM where this.servers is not kept consistent with 
 this.regions. This might cause the balancer to offline a region from an RS 
 that already returned NotServingRegionException at a previous offline attempt.
 In AssignmentManager.unassign(HRegionInfo, boolean):
   try {
     // TODO: We should consider making this look more like it does for the
     // region open where we catch all throwables and never abort
     if (serverManager.sendRegionClose(server, state.getRegion(),
         versionOfClosingNode)) {
       LOG.debug("Sent CLOSE to " + server + " for region " +
         region.getRegionNameAsString());
       return;
     }
     // This never happens. Currently regionserver close always return true.
     LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
       region.getRegionNameAsString());
   } catch (NotServingRegionException nsre) {
     LOG.info("Server " + server + " returned " + nsre + " for " +
       region.getRegionNameAsString());
     // Presume that master has stale data.  Presume remote side just split.
     // Presume that the split message when it comes in will fix up the master's
     // in memory cluster state.
   } catch (Throwable t) {
     if (t instanceof RemoteException) {
       t = ((RemoteException)t).unwrapRemoteException();
       if (t instanceof NotServingRegionException) {
         if (checkIfRegionBelongsToDisabling(region)) {
           // Remove from the regionsinTransition map
           LOG.info("While trying to recover the table "
               + region.getTableNameAsString()
               + " to DISABLED state the region " + region
               + " was offlined but the table was in DISABLING state");
           synchronized (this.regionsInTransition) {
             this.regionsInTransition.remove(region.getEncodedName());
           }
           // Remove from the regionsMap
           synchronized (this.regions) {
             this.regions.remove(region);
           }
           deleteClosingOrClosedNode(region);
         }
       }
       // RS is already processing this region, only need to update the timestamp
       if (t instanceof RegionAlreadyInTransitionException) {
         LOG.debug("update " + state + " the timestamp.");
         state.update(state.getState());
       }
     }
 In AssignmentManager.assign(HRegionInfo, RegionState, boolean, boolean, 
 boolean):
   synchronized (this.regions) {
     this.regions.put(plan.getRegionInfo(), plan.getDestination());
   }





[jira] [Commented] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager

2012-04-25 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13261327#comment-13261327
 ] 

Maryann Xue commented on HBASE-5829:


@ for the second, I think we should guarantee that it is also added to the 
this.servers map.
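
As a sketch of what "kept consistent" could look like, the toy class below routes every change through two helpers that update both maps under one lock. It is not the AssignmentManager code or the attached patch: RegionMapsSketch, put(), remove(), regionsOn() and consistent() are hypothetical names, and regions and servers are plain strings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model, not HBase code: two views of the same assignment kept in step. */
public class RegionMapsSketch {
    private final Map<String, String> regions = new HashMap<>();      // region -> server
    private final Map<String, Set<String>> servers = new HashMap<>(); // server -> regions

    /** Records an assignment in both maps, dropping any earlier assignment first. */
    synchronized void put(String region, String server) {
        remove(region);
        regions.put(region, server);
        servers.computeIfAbsent(server, s -> new TreeSet<>()).add(region);
    }

    /** Removes a region from both maps. */
    synchronized void remove(String region) {
        String oldServer = regions.remove(region);
        if (oldServer != null) {
            Set<String> hosted = servers.get(oldServer);
            if (hosted != null) {
                hosted.remove(region);
            }
        }
    }

    /** Returns a copy of the regions recorded for a server. */
    synchronized Set<String> regionsOn(String server) {
        return new TreeSet<>(servers.getOrDefault(server, Collections.emptySet()));
    }

    /** True when each map says exactly what the other one says. */
    synchronized boolean consistent() {
        int hostedCount = 0;
        for (Set<String> hosted : servers.values()) {
            hostedCount += hosted.size();
        }
        if (hostedCount != regions.size()) {
            return false;
        }
        for (Map.Entry<String, String> entry : regions.entrySet()) {
            Set<String> hosted = servers.get(entry.getValue());
            if (hosted == null || !hosted.contains(entry.getKey())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RegionMapsSketch maps = new RegionMapsSketch();
        maps.put("r1", "rs1");
        maps.put("r1", "rs2"); // reassignment must also leave rs1's set
        System.out.println(maps.regionsOn("rs1") + " " + maps.regionsOn("rs2"));
    }
}
```

With the two maps updated in separate places, a remove on one side only is enough to leave a stale entry that the balancer can later act on.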

 Inconsistency between the regions map and the servers map in 
 AssignmentManager
 --

 Key: HBASE-5829
 URL: https://issues.apache.org/jira/browse/HBASE-5829
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6, 0.92.1
Reporter: Maryann Xue
 Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch


 There are occurrences in AM where this.servers is not kept consistent with 
 this.regions. This might cause balancer to offline a region from the RS that 
 already returned NotServingRegionException at a previous offline attempt.
 In AssignmentManager.unassign(HRegionInfo, boolean)
 try {
   // TODO: We should consider making this look more like it does for the
   // region open where we catch all throwables and never abort
   if (serverManager.sendRegionClose(server, state.getRegion(),
       versionOfClosingNode)) {
     LOG.debug("Sent CLOSE to " + server + " for region " +
       region.getRegionNameAsString());
     return;
   }
   // This never happens. Currently regionserver close always return true.
   LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
     region.getRegionNameAsString());
 } catch (NotServingRegionException nsre) {
   LOG.info("Server " + server + " returned " + nsre + " for " +
     region.getRegionNameAsString());
   // Presume that master has stale data.  Presume remote side just split.
   // Presume that the split message when it comes in will fix up the master's
   // in memory cluster state.
 } catch (Throwable t) {
   if (t instanceof RemoteException) {
     t = ((RemoteException)t).unwrapRemoteException();
     if (t instanceof NotServingRegionException) {
       if (checkIfRegionBelongsToDisabling(region)) {
         // Remove from the regionsinTransition map
         LOG.info("While trying to recover the table "
             + region.getTableNameAsString()
             + " to DISABLED state the region " + region
             + " was offlined but the table was in DISABLING state");
         synchronized (this.regionsInTransition) {
           this.regionsInTransition.remove(region.getEncodedName());
         }
         // Remove from the regionsMap
         synchronized (this.regions) {
           this.regions.remove(region);
         }
         deleteClosingOrClosedNode(region);
       }
     }
     // RS is already processing this region, only need to update the timestamp
     if (t instanceof RegionAlreadyInTransitionException) {
       LOG.debug("update " + state + " the timestamp.");
       state.update(state.getState());
     }
   }
 In AssignmentManager.assign(HRegionInfo, RegionState, boolean, boolean, boolean)
   synchronized (this.regions) {
     this.regions.put(plan.getRegionInfo(), plan.getDestination());
   }
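
 To make the failure mode in the unassign excerpt concrete, here is a small sketch under stated assumptions: regions is taken to map a region to its server and servers to map a server to its hosted regions, and the class RegionMapCheck, its methods, and the String keys are hypothetical, not HBase API. It shows a removal that touches both views and a check that lists entries left behind in servers.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helpers, not part of AssignmentManager.
class RegionMapCheck {
  // Remove a region from both views under one lock.
  // Returns the server that held it, or null if it was not assigned.
  static String removeRegion(Map<String, String> regions,
      Map<String, Set<String>> servers, String region) {
    synchronized (regions) {
      String server = regions.remove(region);
      if (server != null) {
        Set<String> hosted = servers.get(server);
        if (hosted != null) {
          hosted.remove(region);
        }
      }
      return server;
    }
  }

  // Regions still listed under some server although the regions map
  // no longer assigns them to that server.
  static Set<String> staleEntries(Map<String, String> regions,
      Map<String, Set<String>> servers) {
    Set<String> stale = new TreeSet<>();
    for (Map.Entry<String, Set<String>> e : servers.entrySet()) {
      for (String region : e.getValue()) {
        if (!e.getKey().equals(regions.get(region))) {
          stale.add(region);
        }
      }
    }
    return stale;
  }
}
```

 Calling regions.remove(region) alone, as in the excerpt, leaves the region in staleEntries; removeRegion does not.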





[jira] [Updated] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager

2012-04-25 Thread Maryann Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maryann Xue updated HBASE-5829:
---

Status: Patch Available  (was: Open)

 Inconsistency between the regions map and the servers map in 
 AssignmentManager
 --

 Key: HBASE-5829
 URL: https://issues.apache.org/jira/browse/HBASE-5829
 Project: HBase
 Issue Type: Bug
 Components: master
 Affects Versions: 0.92.1, 0.90.6
 Reporter: Maryann Xue
 Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch


