[jira] [Commented] (HBASE-8039) Make HDFS replication number configurable for a column family
[ https://issues.apache.org/jira/browse/HBASE-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004836#comment-14004836 ]

Maryann Xue commented on HBASE-8039:

This was meant for use cases that want a lower replication factor for column families that are less important but consume more space, for example large image files. So I assume how we resolve this issue depends on how we evaluate such use cases.

Make HDFS replication number configurable for a column family
    Key: HBASE-8039
    URL: https://issues.apache.org/jira/browse/HBASE-8039
    Project: HBase
    Issue Type: Improvement
    Components: HFile
    Reporter: Maryann Xue
    Priority: Minor
    Fix For: 0.99.0, 0.98.4

To allow users to decide which column family's data is more important and which is less important by specifying a replica number instead of using the default replica number.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615013#comment-13615013 ]

Maryann Xue commented on HBASE-8024:

Modification of HStore#internalFlushCache() for our LOB use case:

{code}
private Path internalFlushCacheToBlobStore(final SortedSet<KeyValue> set,
    final long logCacheFlushId, TimeRangeTracker snapshotTimeRangeTracker,
    AtomicLong flushedSize, MonitoredTask status) throws IOException {
  StoreFile.Writer writer;
  // Find the smallest read point across all the Scanners.
  long smallestReadPoint = region.getSmallestReadPoint();
  long flushed = 0;
  Path referenceFilePath = null;
  Path blobFilePath = null;
  // Don't flush if there are no entries.
  if (set.size() == 0) {
    return null;
  }
  Scan scan = new Scan();
  scan.setMaxVersions(scanInfo.getMaxVersions());
  // Use a store scanner to find which rows to flush.
  // Note that we need to retain deletes, hence
  // treat this as a minor compaction.
  InternalScanner scanner = new StoreScanner(this, scan,
      Collections.singletonList(new CollectionBackedScanner(set, this.comparator)),
      ScanType.MINOR_COMPACT, this.region.getSmallestReadPoint(),
      HConstants.OLDEST_TIMESTAMP);
  BlobStore blobStore = BlobStoreManager.getInstance().getBlobStore(getTableName(),
      family.getNameAsString());
  if (null == blobStore) {
    blobStore = BlobStoreManager.getInstance().createBlobStore(getTableName(), family);
  }
  StoreFile.Writer blobWriter = null;
  try {
    // TODO: We can fail in the below block before we complete adding this
    // flush to list of store files. Add cleanup of anything put on filesystem
    // if we fail.
    synchronized (flushLock) {
      status.setStatus("Flushing " + this + ": creating writer");
      int referenceKeyValueCount = set.size();
      int blobKeyValueCount = 0;
      // A. Write the map out to the disk
      writer = createWriterInTmp(referenceKeyValueCount);
      writer.setTimeRangeTracker(snapshotTimeRangeTracker);
      referenceFilePath = writer.getPath();
      Iterator<KeyValue> iter = set.iterator();
      while (null != iter && iter.hasNext()) {
        if (iter.next().getType() == KeyValue.Type.Put.getCode()) {
          blobKeyValueCount++;
        }
      }
      blobWriter = blobStore.createWriterInTmp(blobKeyValueCount, this.compression,
          region.getRegionInfo());
      blobFilePath = blobWriter.getPath();
      String targetPathName = dateFormatter.format(new Date());
      Path targetPath = new Path(blobStore.getHomePath(), targetPathName);
      String relativePath = targetPathName + Path.SEPARATOR + blobFilePath.getName();
      // Append the BLOB_STORE_VERSION before the relative path name
      byte[] referenceValue = Bytes.add(
          new byte[] { BlobStoreConstants.BLOB_STORE_VERSION },
          Bytes.toBytes(relativePath));
      try {
        List<KeyValue> kvs = new ArrayList<KeyValue>();
        boolean hasMore;
        do {
          hasMore = scanner.next(kvs);
          if (!kvs.isEmpty()) {
            for (KeyValue kv : kvs) {
              // If we know that this KV is going to be included always, then let us
              // set its memstoreTS to 0. This will help us save space when writing to disk.
              if (kv.getMemstoreTS() <= smallestReadPoint) {
                // let us not change the original KV. It could be in the memstore
                // changing its memstoreTS could affect other threads/scanners.
                kv = kv.shallowCopy();
                kv.setMemstoreTS(0);
              }
              if (kv.getType() == KeyValue.Type.Reference.getCode()) {
                writer.append(kv);
              } else {
                // append the original keyValue in the blob file.
                blobWriter.append(kv);
                // append reference KeyValue.
                // The key is same, the value is the blobfile's filename
                KeyValue reference = new KeyValue(kv.getBuffer(), kv.getRowOffset(),
                    kv.getRowLength(), kv.getBuffer(), kv.getFamilyOffset(),
                    kv.getFamilyLength(), kv.getBuffer(), kv.getQualifierOffset(),
                    kv.getQualifierLength(), kv.getTimestamp(), KeyValue.Type.Reference,
                    referenceValue, 0, referenceValue.length);
[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615015#comment-13615015 ]

Maryann Xue commented on HBASE-8024:

In the LOB use case: add an independent LOB writer for the real LOB data, and replace the original value of the KeyValue with the LOB file path.

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
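The flush code posted earlier in this thread builds the reference value by prepending a version byte to the relative path of the LOB file. A minimal, self-contained sketch of that encoding and its inverse might look like the following. The class name, the `VERSION` value, and the UTF-8 choice are assumptions for illustration; the actual `BlobStoreConstants.BLOB_STORE_VERSION` value is not given in the thread.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not taken from the HBase code base or the patch.
class LobReferenceSketch {
    // Hypothetical version marker; the real constant's value is unknown here.
    static final byte VERSION = 1;

    /** Builds the value stored in the reference KeyValue: [version byte][path bytes]. */
    static byte[] encode(String relativePath) {
        byte[] path = relativePath.getBytes(StandardCharsets.UTF_8);
        byte[] value = new byte[path.length + 1];
        value[0] = VERSION;
        System.arraycopy(path, 0, value, 1, path.length);
        return value;
    }

    /** Recovers the relative LOB file path from a reference value. */
    static String decode(byte[] referenceValue) {
        if (referenceValue.length == 0 || referenceValue[0] != VERSION) {
            throw new IllegalArgumentException("unknown reference format");
        }
        return new String(referenceValue, 1, referenceValue.length - 1,
            StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] value = encode("20130327/blobfile1");
        System.out.println(decode(value));
    }
}
```

Keeping a version byte in front of the path leaves room to change the reference layout later without rewriting existing reference files.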
[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13609857#comment-13609857 ]

Maryann Xue commented on HBASE-8024:

@Sergey, thank you for the ideas! These are very good suggestions. I will reorganize the relationship between flusher, flushrequest and store, ideally making one flusher per store with a different flushrequest each time.

The motivation is to enable our LOB implementation as a plug-in to HBase core. We already have customization on compactions; now, with custom flush, we can write LOB data in independent HFiles. The requirement of our use case for customized flush is simple: it only adds a few lines to internalFlushCache(). But I totally agree with you on having something more flexible in this patch.

@Ted, thank you for the comments! Will clean up the documentation accordingly.

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
[jira] [Updated] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-8024:
    Attachment: HBASE-8024-trunk.patch

Changes:
1. Remove the original internal class HStore$StoreFlusherImpl
2. Create class DefaultStoreFlusher which implements StoreFlusher
3. Move all implementation of flushCache from HStore to DefaultStoreFlusher
4. Add method createStoreFlusher in StoreEngine

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue
    Attachments: HBASE-8024-trunk.patch

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
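The shape of the change listed in this update (a StoreFlusher interface, a DefaultStoreFlusher, and a createStoreFlusher factory method on StoreEngine) can be sketched roughly as below. This is a simplified stand-alone model, not the patch itself: the method signatures, the use of a `SortedMap` as the memstore snapshot, and the string "file" names are all assumptions made so the example runs without HBase on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified model of a pluggable flush; signatures are hypothetical.
class PluggableFlushSketch {

    /** Strategy interface: turns a memstore snapshot into flushed "files". */
    interface StoreFlusher {
        List<String> flush(SortedMap<String, String> snapshot);
    }

    /** Default behaviour: write every entry into a single store file. */
    static class DefaultStoreFlusher implements StoreFlusher {
        @Override
        public List<String> flush(SortedMap<String, String> snapshot) {
            List<String> files = new ArrayList<>();
            if (snapshot.isEmpty()) {
                return files; // nothing to flush
            }
            files.add("storefile[" + snapshot.size() + " entries]");
            return files;
        }
    }

    /** Factory hook; a custom engine overrides this to plug in its own flusher. */
    static class StoreEngine {
        StoreFlusher createStoreFlusher() {
            return new DefaultStoreFlusher();
        }
    }

    /** The store only talks to the interface, never to a concrete flusher. */
    static List<String> flushWith(StoreEngine engine, SortedMap<String, String> snapshot) {
        return engine.createStoreFlusher().flush(snapshot);
    }

    public static void main(String[] args) {
        SortedMap<String, String> snapshot = new TreeMap<>();
        snapshot.put("row1", "v1");
        snapshot.put("row2", "v2");
        System.out.println(flushWith(new StoreEngine(), snapshot));
    }
}
```

A LOB-style engine would subclass `StoreEngine` and return a flusher that emits two files per flush (one for references, one for the large values) without the store having to know about it.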
[jira] [Updated] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-8024:
    Attachment: HBASE-8024.v2.patch

Yes, one file was missing from the patch. Sorry for the mistake.

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608593#comment-13608593 ]

Maryann Xue commented on HBASE-8024:

@sergey it was simply copy-pasted from the original inner class impl. So should I update the code comment along with this patch?

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Attachments: HBASE-8024-trunk.patch, HBASE-8024.v2.patch

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602136#comment-13602136 ]

Maryann Xue commented on HBASE-7876:

@stack agree~

Got exception when manually triggers a split on an empty region
    Key: HBASE-7876
    URL: https://issues.apache.org/jira/browse/HBASE-7876
    Project: HBase
    Issue Type: Bug
    Components: regionserver
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Minor
    Attachments: HBASE-7876-0.94V2.patch, HBASE-7876-trunk.patch

We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Commented] (HBASE-8024) Make Store flush algorithm pluggable
[ https://issues.apache.org/jira/browse/HBASE-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13599815#comment-13599815 ]

Maryann Xue commented on HBASE-8024:

Andy, any planned date for 0.96? I will submit a patch soon anyway :)

Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.95.0, 0.96.0, 0.94.5
    Reporter: Maryann Xue

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
[jira] [Commented] (HBASE-7949) Enable big content store in HBase
[ https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598642#comment-13598642 ]

Maryann Xue commented on HBASE-7949:

@chenning, as enis has clarified, the actual data move does not happen at the split point. Instead, it happens in later compactions. And in the approach we proposed, the LOB family does not participate in split or minor compactions at all.

@enis, the problem is not when the reads and writes happen; it is more the unnecessary I/O overhead in splitting. And if the data is seldom updated, why compact it (for split) anyway? Yes, utilizing level compactions could be a good approach. Still, our approach can have three advantages over level compaction:
1. I/O overhead from split and minor compactions is completely eliminated;
2. clean-up is only done, during major compactions, for those files that have reached a certain invalidation rate;
3. not every file reader is instantiated and kept in regionserver memory. Instead, we'll have an LRU cache for frequently read LOB files.

However, I suggest this issue not be committed into HBase trunk. Instead we'd like to make the implementation a use case over HBase. The only facility we need in HBase trunk is a pluggable flush process, HBASE-8024.

Enable big content store in HBase
    Key: HBASE-7949
    URL: https://issues.apache.org/jira/browse/HBASE-7949
    Project: HBase
    Issue Type: Brainstorming
    Reporter: chenning
    Attachments: HBase_LOB.pdf

Big content stored in HBase consumes a lot of system resources when a region split or compaction happens. How can HBase be used to store big content along with some self-descriptive metadata? The general idea is to add a new type of column family whose content doesn't participate in region split and compaction. An index (rowkey-location) is introduced in this new column family, and split and compaction are applied only to this index.
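The LRU cache for frequently read LOB files mentioned in the comment above could be modelled with the standard library alone. The sketch below is only an illustration of the eviction policy: the class name and capacity are assumptions, and a real reader cache would also need to close evicted file readers and handle concurrent access, neither of which is shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache keyed by LOB file name; not from the HBase code base.
class LobReaderCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LobReaderCacheSketch(int capacity) {
        // accessOrder = true makes iteration order least-recently-accessed first.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each put; evicts once over capacity. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LobReaderCacheSketch<String, String> cache = new LobReaderCacheSketch<>(2);
        cache.put("lobfile-a", "reader-a");
        cache.put("lobfile-b", "reader-b");
        cache.get("lobfile-a");             // touch a, so b becomes the eldest
        cache.put("lobfile-c", "reader-c"); // evicts b
        System.out.println(cache.keySet());
    }
}
```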
[jira] [Commented] (HBASE-8039) Make HDFS replication number configurable for a column family
[ https://issues.apache.org/jira/browse/HBASE-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598644#comment-13598644 ]

Maryann Xue commented on HBASE-8039:

Yes, Sergey, that would be a necessary part of the solution. But the other part is to pass the replication number down into the HFile writer.

Make HDFS replication number configurable for a column family
    Key: HBASE-8039
    URL: https://issues.apache.org/jira/browse/HBASE-8039
    Project: HBase
    Issue Type: Improvement
    Components: HFile
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Priority: Minor

To allow users to decide which column family's data is more important and which is less important by specifying a replica number instead of using the default replica number.
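One way to picture the "specify a replica number, else use the default" behaviour described in this issue is the small resolver below. It is a sketch under stated assumptions: the class, the map of per-family settings, and the rule that values below 1 fall back to the default are all hypothetical, and the default of 3 simply mirrors the usual HDFS `dfs.replication` default. In a real writer the resolved value would then have to reach the file-creation call; Hadoop's `FileSystem.create` has overloads that take a `short replication` argument.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only; not the HBase column family descriptor API.
class FamilyReplicationSketch {
    // Assumed cluster-wide default, matching the usual HDFS default of 3.
    static final short DEFAULT_REPLICATION = 3;

    /** Returns the family-specific replication if set and valid, else the default. */
    static short resolve(Map<String, Short> perFamily, String family) {
        Short configured = perFamily.get(family);
        if (configured == null || configured < 1) {
            return DEFAULT_REPLICATION;
        }
        return configured;
    }

    public static void main(String[] args) {
        Map<String, Short> perFamily = new HashMap<>();
        perFamily.put("images", (short) 2); // less important, space-hungry family
        System.out.println(resolve(perFamily, "images"));
        System.out.println(resolve(perFamily, "meta"));
    }
}
```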
[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13598510#comment-13598510 ]

Maryann Xue commented on HBASE-7876:

@ramkrishna, you might be using the wrong patch file? We removed this test case since it won't be useful anyway.

Got exception when manually triggers a split on an empty region
    Key: HBASE-7876
    URL: https://issues.apache.org/jira/browse/HBASE-7876
    Project: HBase
    Issue Type: Bug
    Components: regionserver
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Minor
    Attachments: HBASE-7876-0.94.patch, HBASE-7876-0.94V2.patch, HBASE-7876-trunk.patch

We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Commented] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595662#comment-13595662 ]

Maryann Xue commented on HBASE-7876:

Agree with clockfly. @ramakrishna, to me it makes no sense for the user to configure getting such an exception for this reasonable, harmless operation. And as clockfly said, splitting an empty region with no midkey specified still behaves as before.

Got exception when manually triggers a split on an empty region
    Key: HBASE-7876
    URL: https://issues.apache.org/jira/browse/HBASE-7876
    Project: HBase
    Issue Type: Bug
    Components: regionserver
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Minor
    Attachments: HBASE-7876-0.94.patch

We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Created] (HBASE-8024) Make Store flush algorithm pluggable
Maryann Xue created HBASE-8024:

    Summary: Make Store flush algorithm pluggable
    Key: HBASE-8024
    URL: https://issues.apache.org/jira/browse/HBASE-8024
    Project: HBase
    Issue Type: Sub-task
    Components: regionserver
    Affects Versions: 0.94.5
    Reporter: Maryann Xue

The idea is to make StoreFlusher an interface instead of an implementation class, and have the original StoreFlusher as the default store flush impl.
[jira] [Commented] (HBASE-7949) Enable big content store in HBase
[ https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596880#comment-13596880 ]

Maryann Xue commented on HBASE-7949:

@Enis well, the constant reading and writing of the same set of large content data happens in two ways: compaction and split.
1. During compaction, the data is read from small files and written to a new, combined large file.
2. During split, the data is read from the parent region's storefiles and written into the two daughter regions' storefiles.

To avoid the I/O overhead caused by 1 (compaction), we can disable minor compaction for this family, but this would lead to another big problem: bad get/scan performance. For a get operation, we need to check the bloom filter of each of too many storefiles to locate our record; for a scan operation, we need to perform a seek in all these storefiles. The decline in Get throughput as the number of storefiles increases is shown in the slides.

To avoid the I/O overhead caused by 2 (split), we can pre-split the regions of a table, but this cannot always be done for customer use cases.

The idea is that large content data is most likely loaded once and rarely modified, so there is no need to move or merge it all the time, as normal region compactions and splits do in order to maintain region independence and read efficiency. So having a storage independent of HBase regions would make sense for such use cases. Meanwhile we leverage the major compaction process to do cleanup and merge at a reasonable frequency: only perform a merge when a certain file has exceeded the configured threshold.

Enable big content store in HBase
    Key: HBASE-7949
    URL: https://issues.apache.org/jira/browse/HBASE-7949
    Project: HBase
    Issue Type: Brainstorming
    Reporter: chenning
    Attachments: HBase_LOB.pdf

Big content stored in HBase consumes a lot of system resources when a region split or compaction happens. How can HBase be used to store big content along with some self-descriptive metadata? The general idea is to add a new type of column family whose content doesn't participate in region split and compaction. An index (rowkey-location) is introduced in this new column family, and split and compaction are applied only to this index.
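The threshold rule described in the comment above (rewrite a LOB file during major compaction only once it has exceeded a configured threshold) might be sketched as follows. The thread does not say how the threshold is measured; this example assumes it is the fraction of invalidated records per file, and the `LobFile` type, its fields, and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative selection policy; not taken from the HBase_LOB.pdf design.
class LobCleanupSketch {

    /** Minimal bookkeeping for one LOB file (hypothetical fields). */
    static class LobFile {
        final String name;
        final long totalRecords;
        final long invalidRecords; // deleted or overwritten records

        LobFile(String name, long totalRecords, long invalidRecords) {
            this.name = name;
            this.totalRecords = totalRecords;
            this.invalidRecords = invalidRecords;
        }

        double invalidationRate() {
            return totalRecords == 0 ? 0.0 : (double) invalidRecords / totalRecords;
        }
    }

    /** Returns names of files whose invalidation rate exceeds the threshold. */
    static List<String> selectForRewrite(List<LobFile> files, double threshold) {
        List<String> selected = new ArrayList<>();
        for (LobFile file : files) {
            if (file.invalidationRate() > threshold) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LobFile> files = Arrays.asList(
            new LobFile("lob-1", 100, 10),
            new LobFile("lob-2", 100, 60));
        System.out.println(selectForRewrite(files, 0.5));
    }
}
```

Files below the threshold are left untouched, which is what keeps rarely modified content from being rewritten on every compaction.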
[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp
[ https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7891:
    Attachment: HBASE-7891-trunk.patch

attach the trunk patch

Add an index number prior to each table region in table.jsp
    Key: HBASE-7891
    URL: https://issues.apache.org/jira/browse/HBASE-7891
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial
    Attachments: HBASE-7891-0.94.patch, HBASE-7891-trunk.patch

Adding an index number for each table region in table.jsp would make it easier to locate a region or to count regions.
[jira] [Commented] (HBASE-7891) Add an index number prior to each table region in table.jsp
[ https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596892#comment-13596892 ]

Maryann Xue commented on HBASE-7891:

@nick I understand your point. But one would soon realize the indexes are for counting purposes and carry no real meaning, as they are always sequential on the page while the regions may move among different servers.

Add an index number prior to each table region in table.jsp
    Key: HBASE-7891
    URL: https://issues.apache.org/jira/browse/HBASE-7891
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial
    Attachments: HBASE-7891-0.94.patch, HBASE-7891-trunk.patch

Adding an index number for each table region in table.jsp would make it easier to locate a region or to count regions.
[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page
[ https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7890:
    Attachment: HBASE-7890-trunk.patch

attach the trunk patch

Add an index number for each region in the region list on the RegionServer web page
    Key: HBASE-7890
    URL: https://issues.apache.org/jira/browse/HBASE-7890
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial
    Attachments: HBASE-7890-0.94.patch, HBASE-7890-trunk.patch

Adding an index number before each region would make it easier to locate a region on the page.
[jira] [Created] (HBASE-8039) Make HDFS replication number configurable for a column family
Maryann Xue created HBASE-8039:

    Summary: Make HDFS replication number configurable for a column family
    Key: HBASE-8039
    URL: https://issues.apache.org/jira/browse/HBASE-8039
    Project: HBase
    Issue Type: Improvement
    Components: HFile
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Priority: Minor

To allow users to decide which column family's data is more important and which is less important by specifying a replica number instead of using the default replica number.
[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7876:
    Attachment: HBASE-7876-trunk.patch
                HBASE-7876-0.94V2.patch

update patch -- revert HBASE-6853

Got exception when manually triggers a split on an empty region
    Key: HBASE-7876
    URL: https://issues.apache.org/jira/browse/HBASE-7876
    Project: HBase
    Issue Type: Bug
    Components: regionserver
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Minor
    Attachments: HBASE-7876-0.94.patch, HBASE-7876-0.94V2.patch, HBASE-7876-trunk.patch

We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Commented] (HBASE-7949) Enable big content store in HBase
[ https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596929#comment-13596929 ]

Maryann Xue commented on HBASE-7949:

Yes, you've made a good point here. Flush would happen more frequently, and compactions for the metadata family will involve more small storefiles. However:
1. this approach best guarantees consistency.
2. several large content records get flushed into one file in one process, which means more efficient I/O usage.
3. metadata is very small compared to large content data. Moreover, one minor compaction can handle a bunch of small metadata storefiles.

Enable big content store in HBase
    Key: HBASE-7949
    URL: https://issues.apache.org/jira/browse/HBASE-7949
    Project: HBase
    Issue Type: Brainstorming
    Reporter: chenning
    Attachments: HBase_LOB.pdf

Big content stored in HBase consumes a lot of system resources when a region split or compaction happens. How can HBase be used to store big content along with some self-descriptive metadata? The general idea is to add a new type of column family whose content doesn't participate in region split and compaction. An index (rowkey-location) is introduced in this new column family, and split and compaction are applied only to this index.
[jira] [Updated] (HBASE-7949) Enable big content store in HBase
[ https://issues.apache.org/jira/browse/HBASE-7949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7949:
    Attachment: HBase_LOB.pdf

At the recent HBase meetup, we gave an introduction to an implementation for storing large objects. The idea is to store the real content on HDFS and let a customized major compaction for this family handle the management work for this large content. We need a customizable flush() process for this approach.

Enable big content store in HBase
    Key: HBASE-7949
    URL: https://issues.apache.org/jira/browse/HBASE-7949
    Project: HBase
    Issue Type: Brainstorming
    Reporter: chenning
    Attachments: HBase_LOB.pdf

Big content stored in HBase consumes a lot of system resources when a region split or compaction happens. How can HBase be used to store big content along with some self-descriptive metadata? The general idea is to add a new type of column family whose content doesn't participate in region split and compaction. An index (rowkey-location) is introduced in this new column family, and split and compaction are applied only to this index.
[jira] [Created] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page
Maryann Xue created HBASE-7890:

    Summary: Add an index number for each region in the region list on the RegionServer web page
    Key: HBASE-7890
    URL: https://issues.apache.org/jira/browse/HBASE-7890
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial

Adding an index number before each region would make it easier to locate a region on the page.
[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page
[ https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7890:
    Attachment: HBASE-7890-0.94.patch

add index column in the RegionServer web page

Add an index number for each region in the region list on the RegionServer web page
    Key: HBASE-7890
    URL: https://issues.apache.org/jira/browse/HBASE-7890
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial
    Attachments: HBASE-7890-0.94.patch

Adding an index number before each region would make it easier to locate a region on the page.
[jira] [Updated] (HBASE-7890) Add an index number for each region in the region list on the RegionServer web page
[ https://issues.apache.org/jira/browse/HBASE-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-7890:
    Status: Patch Available (was: Open)

Add an index number for each region in the region list on the RegionServer web page
    Key: HBASE-7890
    URL: https://issues.apache.org/jira/browse/HBASE-7890
    Project: HBase
    Issue Type: Improvement
    Components: UI
    Affects Versions: 0.94.5
    Reporter: Maryann Xue
    Assignee: Maryann Xue
    Priority: Trivial
    Attachments: HBASE-7890-0.94.patch

Adding an index number before each region would make it easier to locate a region on the page.
[jira] [Created] (HBASE-7891) Add an index number prior to each table region in table.jsp
Maryann Xue created HBASE-7891: -- Summary: Add an index number prior to each table region in table.jsp Key: HBASE-7891 URL: https://issues.apache.org/jira/browse/HBASE-7891 Project: HBase Issue Type: Improvement Components: UI Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Trivial Attachments: HBASE-7891-0.94.patch Adding an index number for each table region in table.jsp would make it easier to locate a region or to count regions.
[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp
[ https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7891: --- Status: Patch Available (was: Open) Add an index number prior to each table region in table.jsp --- Key: HBASE-7891 URL: https://issues.apache.org/jira/browse/HBASE-7891 Project: HBase Issue Type: Improvement Components: UI Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Trivial Attachments: HBASE-7891-0.94.patch Adding an index number for each table region in table.jsp would make it easier to locate a region or to count regions.
[jira] [Updated] (HBASE-7891) Add an index number prior to each table region in table.jsp
[ https://issues.apache.org/jira/browse/HBASE-7891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7891: --- Attachment: HBASE-7891-0.94.patch Add index column to table region list in table.jsp Add an index number prior to each table region in table.jsp --- Key: HBASE-7891 URL: https://issues.apache.org/jira/browse/HBASE-7891 Project: HBase Issue Type: Improvement Components: UI Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Trivial Attachments: HBASE-7891-0.94.patch Adding an index number for each table region in table.jsp would make it easier to locate a region or to count regions.
[jira] [Created] (HBASE-7892) FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0
Maryann Xue created HBASE-7892: -- Summary: FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0 Key: HBASE-7892 URL: https://issues.apache.org/jira/browse/HBASE-7892 Project: HBase Issue Type: Improvement Components: Filters Affects Versions: 0.94.5, 0.96.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor A concrete case: we want to match a?ex, so we pass a?ex as the key bytes and 0100 as the meta bytes. If we start with row = \0\0\0\0, the next hint turns out to be a?ex, whereas the correct hint is a\0ex.
[jira] [Updated] (HBASE-7892) FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0
[ https://issues.apache.org/jira/browse/HBASE-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7892: --- Attachment: HBASE-7892-0.94.patch 1. Initialize the unfixed positions with byte 0. 2. Remove the copying of the row to improve performance. 3. Add corresponding test cases. FuzzyRowFilter would have wrong behaviors if user gives an arbitary byte for an unfixed position instead of byte 0 Key: HBASE-7892 URL: https://issues.apache.org/jira/browse/HBASE-7892 Project: HBase Issue Type: Improvement Components: Filters Affects Versions: 0.96.0, 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor Attachments: HBASE-7892-0.94.patch A concrete case: we want to match a?ex, so we pass a?ex as the key bytes and 0100 as the meta bytes. If we start with row = \0\0\0\0, the next hint turns out to be a?ex, whereas the correct hint is a\0ex.
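The first item of the patch note (initializing the unfixed positions with byte 0) can be sketched outside of HBase roughly as follows. This is only an illustration and not the code of the attached patch: the class and method names are hypothetical, and it assumes the convention of the example above, where a meta byte of 0 marks a fixed position and 1 marks an unfixed one.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of HBase: normalizes a fuzzy row key. */
final class FuzzyKeyNormalizer {
    private FuzzyKeyNormalizer() {}

    /**
     * Returns a copy of fuzzyKey in which every unfixed position
     * (meta byte 1) is overwritten with byte 0. Fixed positions
     * (meta byte 0) are copied unchanged.
     */
    public static byte[] zeroUnfixedPositions(byte[] fuzzyKey, byte[] meta) {
        if (fuzzyKey.length != meta.length) {
            throw new IllegalArgumentException("key and meta bytes must have the same length");
        }
        byte[] normalized = Arrays.copyOf(fuzzyKey, fuzzyKey.length);
        for (int i = 0; i < meta.length; i++) {
            if (meta[i] == 1) {
                normalized[i] = 0; // whatever the user supplied here is ignored
            }
        }
        return normalized;
    }
}
```

With the key bytes a?ex and the meta bytes 0100, the normalized key is a\0ex, so a hint derived from it no longer depends on the arbitrary byte the user put in the unfixed position.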
[jira] [Created] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.
Maryann Xue created HBASE-7874: -- Summary: Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected. Key: HBASE-7874 URL: https://issues.apache.org/jira/browse/HBASE-7874 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor Usually, when a regionserver cannot keep up with the put load from the client, it starts to block update requests from the client until the required resources have been reclaimed (i.e. the memstore has been flushed). In more severe situations, however, the blocking lasts so long that the client gets a SocketTimeoutException and decides to retry, while the original updates are in fact written into the memstore later, once they are unblocked. Even though the client uses something like a binary exponential backoff for retry intervals, this can still lead to a vicious circle that leaves the client with very low throughput. I think we could add an option that lets the regionserver check whether the client has disconnected (just as we do in scan) after coming back from blocking, so that the regionserver has the same view as the client on whether the updates were successfully committed.
[jira] [Created] (HBASE-7876) Got exception when splitting a region that contains no storefile
Maryann Xue created HBASE-7876: -- Summary: Got exception when splitting a region that contains no storefile Key: HBASE-7876 URL: https://issues.apache.org/jira/browse/HBASE-7876 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Updated] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.
[ https://issues.apache.org/jira/browse/HBASE-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7874: --- Attachment: HBASE-7874-0.94.patch Add a check for a closed channel in HRegion#batchMutate(). Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected. -- Key: HBASE-7874 URL: https://issues.apache.org/jira/browse/HBASE-7874 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor Attachments: HBASE-7874-0.94.patch Usually, when a regionserver cannot keep up with the put load from the client, it starts to block update requests from the client until the required resources have been reclaimed (i.e. the memstore has been flushed). In more severe situations, however, the blocking lasts so long that the client gets a SocketTimeoutException and decides to retry, while the original updates are in fact written into the memstore later, once they are unblocked. Even though the client uses something like a binary exponential backoff for retry intervals, this can still lead to a vicious circle that leaves the client with very low throughput. I think we could add an option that lets the regionserver check whether the client has disconnected (just as we do in scan) after coming back from blocking, so that the regionserver has the same view as the client on whether the updates were successfully committed.
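The proposed behaviour (re-check the client connection after the blocking phase and abort instead of committing) might look roughly like the sketch below. It is not taken from HBASE-7874-0.94.patch: `AbortOnDisconnectSketch`, `ResourceGate` and `applyAfterBlocking` are hypothetical names, and the only real API assumed is `java.nio.channels.Channel#isOpen()`.

```java
import java.nio.channels.Channel;
import java.nio.channels.ClosedChannelException;

/** Hypothetical illustration of aborting an update whose client has gone away. */
final class AbortOnDisconnectSketch {
    private AbortOnDisconnectSketch() {}

    /** Stand-in for "block until the memstore has been flushed". */
    interface ResourceGate {
        void awaitResources() throws InterruptedException;
    }

    /**
     * Waits for resources, then applies the update only if the client
     * channel is still open. A client that timed out and disconnected
     * while we were blocked gets its update dropped, so both sides
     * agree that the update was not committed.
     */
    public static void applyAfterBlocking(Channel clientChannel, ResourceGate gate, Runnable update)
            throws ClosedChannelException, InterruptedException {
        gate.awaitResources();
        if (!clientChannel.isOpen()) {
            throw new ClosedChannelException();
        }
        update.run();
    }
}
```

The check is placed after the wait on purpose: testing the channel before blocking would not catch a client that gave up during the block, which is the case described above.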
[jira] [Updated] (HBASE-7874) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected.
[ https://issues.apache.org/jira/browse/HBASE-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7874: --- Status: Patch Available (was: Open) Allow RegionServer to abort Put if it detects that the HBase Client got SocketTimeoutException and disconnected. -- Key: HBASE-7874 URL: https://issues.apache.org/jira/browse/HBASE-7874 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor Attachments: HBASE-7874-0.94.patch Usually, when a regionserver cannot keep up with the put load from the client, it starts to block update requests from the client until the required resources have been reclaimed (i.e. the memstore has been flushed). In more severe situations, however, the blocking lasts so long that the client gets a SocketTimeoutException and decides to retry, while the original updates are in fact written into the memstore later, once they are unblocked. Even though the client uses something like a binary exponential backoff for retry intervals, this can still lead to a vicious circle that leaves the client with very low throughput. I think we could add an option that lets the regionserver check whether the client has disconnected (just as we do in scan) after coming back from blocking, so that the regionserver has the same view as the client on whether the updates were successfully committed.
[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7876: --- Summary: Got exception when manually triggers a split on an empty region (was: Got exception when splitting a region that contains no storefile) Got exception when manually triggers a split on an empty region --- Key: HBASE-7876 URL: https://issues.apache.org/jira/browse/HBASE-7876 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7876: --- Priority: Minor (was: Major) Got exception when manually triggers a split on an empty region --- Key: HBASE-7876 URL: https://issues.apache.org/jira/browse/HBASE-7876 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor We should allow a region to split successfully even if it does not yet have storefiles.
[jira] [Updated] (HBASE-7876) Got exception when manually triggers a split on an empty region
[ https://issues.apache.org/jira/browse/HBASE-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-7876: --- Attachment: HBASE-7876-0.94.patch Return if there is no storefile. Got exception when manually triggers a split on an empty region --- Key: HBASE-7876 URL: https://issues.apache.org/jira/browse/HBASE-7876 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.5 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Minor Attachments: HBASE-7876-0.94.patch We should allow a region to split successfully even if it does not yet have storefiles.
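The "return if no storefile" idea can be illustrated with a small standalone sketch. It is not the attached patch: `SplitPointSketch`, `StoreFileInfo` and `chooseSplitKey` are hypothetical, and the sketch assumes, purely for illustration, that a split key is taken from the middle key of the largest store file.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of guarding a split against an empty store. */
final class SplitPointSketch {
    private SplitPointSketch() {}

    /** Minimal stand-in for a store file: its size and its middle key. */
    static final class StoreFileInfo {
        final long sizeBytes;
        final String midKey;

        public StoreFileInfo(long sizeBytes, String midKey) {
            this.sizeBytes = sizeBytes;
            this.midKey = midKey;
        }
    }

    /**
     * Returns the middle key of the largest store file, or an empty
     * Optional when there is no store file at all, so that the caller
     * can return quietly instead of failing with an exception.
     */
    public static Optional<String> chooseSplitKey(List<StoreFileInfo> storeFiles) {
        if (storeFiles.isEmpty()) {
            return Optional.empty(); // empty region: nothing to split on
        }
        StoreFileInfo largest = storeFiles.get(0);
        for (StoreFileInfo file : storeFiles) {
            if (file.sizeBytes > largest.sizeBytes) {
                largest = file;
            }
        }
        return Optional.of(largest.midKey);
    }
}
```

A caller that handles a manual split request would treat the empty result as "no split point available" rather than as an error.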
[jira] [Commented] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassign the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497797#comment-13497797 ] Maryann Xue commented on HBASE-5816: Since this is not fully addressed in HBASE-6060, how about testing/reproducing it against Jimmy's fix? Balancer and ServerShutdownHandler concurrently reassign the same region Key: HBASE-5816 URL: https://issues.apache.org/jira/browse/HBASE-5816 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6 Reporter: Maryann Xue Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: HBASE-5816.patch The first assign thread exits successfully after updating the RegionState to PENDING_OPEN, while the second assign follows immediately and fails the RegionState check in setOfflineInZooKeeper(). This causes the master to abort. In the case below, the two concurrent assigns occurred when the AM tried to assign a region to a dying/dead RS while the ServerShutdownHandler independently tried to assign the same region (from the region plan).
{code}
2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. (offlining)
2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=CLOSED, ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186
2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253
2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253
2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020]
 at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
 at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
 at $Proxy7.openRegion(Unknown Source)
 at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
 at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
 at
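The race in HBASE-5816 is two threads entering assign for the same region. One generic way to make the losing thread back off instead of failing a later state check is an atomic compare-and-set on the region state. The sketch below only illustrates that general technique; it is not the fix attached to this issue or the one from HBASE-6060, and `AssignGuardSketch`, its `State` enum and `tryBeginAssign` are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical illustration: only one caller may start assigning a region. */
final class AssignGuardSketch {
    enum State { OFFLINE, PENDING_OPEN }

    private final ConcurrentMap<String, State> regionStates = new ConcurrentHashMap<>();

    /**
     * Atomically moves the region from OFFLINE to PENDING_OPEN.
     * Returns true for the single caller that wins the transition and
     * false for any concurrent caller, which should then skip the assign.
     */
    public boolean tryBeginAssign(String regionName) {
        regionStates.putIfAbsent(regionName, State.OFFLINE);
        return regionStates.replace(regionName, State.OFFLINE, State.PENDING_OPEN);
    }
}
```

In this sketch a second assign for the same region simply gets `false` back, which a caller could log and ignore rather than treating it as a fatal inconsistency.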
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Attachment: (was: HBASE-6299-v3.patch) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Fix For: 0.96.0, 0.92.3, 0.94.3 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch
1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response to the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code}
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node
2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED
2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED
2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301)
2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
java.net.SocketTimeoutException: Call to
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Attachment: HBASE-6299-v3.patch @ramkrishna, updated the patch. I had misunderstood the exception handling in HBaseClient. Thank you for pointing this out! RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Fix For: 0.96.0, 0.92.3, 0.94.3 Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, HBASE-6299-v3.patch
1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response to the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6299: --- Attachment: HBASE-6299-v3.patch Considering that a live RS would most likely get to the openRegion() request and process it eventually, it might be good to simply return on SocketTimeoutException, since SocketTimeoutException indicates an uncertain state in the assign process, with potential race conditions. This can happen if an RS is temporarily running out of IPC handlers, or if the RS's response is lost on the wire. RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems. - Key: HBASE-6299 URL: https://issues.apache.org/jira/browse/HBASE-6299 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue Priority: Critical Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, HBASE-6299-v3.patch
1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response to the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster considers the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
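The suggestion to simply return on SocketTimeoutException, rather than retrying the assign on another RS, can be sketched as below. This is a simplified illustration under stated assumptions, not the content of HBASE-6299-v3.patch: `AssignRetrySketch`, `RegionOpener` and `Outcome` are hypothetical, and the real assign path involves ZooKeeper state that is left out here.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.List;

/** Hypothetical illustration of treating a socket timeout as "uncertain". */
final class AssignRetrySketch {
    private AssignRetrySketch() {}

    enum Outcome { OPEN_SENT, UNCERTAIN, FAILED }

    /** Stand-in for the RPC that asks a region server to open a region. */
    interface RegionOpener {
        void openRegion(String server) throws IOException;
    }

    /**
     * Tries the candidate servers in order. A SocketTimeoutException
     * stops the loop immediately, because the region server may still
     * process the request; choosing another server at that point is
     * what leads to two servers opening the same region.
     */
    public static Outcome assign(RegionOpener opener, List<String> candidateServers) {
        for (String server : candidateServers) {
            try {
                opener.openRegion(server);
                return Outcome.OPEN_SENT;
            } catch (SocketTimeoutException e) {
                return Outcome.UNCERTAIN; // do not reassign elsewhere
            } catch (IOException e) {
                // Any other I/O failure: fall through and try the next candidate.
            }
        }
        return Outcome.FAILED;
    }
}
```

The `SocketTimeoutException` clause has to come before the `IOException` clause, since the former is a subclass of the latter.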
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454840#comment-13454840 ]

Maryann Xue commented on HBASE-6299:

Updated the patch as HBASE-6299-v3.patch.

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, HBASE-6299-v3.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
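The six steps above can be followed with a toy model. This is only a sketch, not HBase code: the class, the two maps and the method names are all hypothetical, and it assumes that the retry in step 4 rewrites the unassigned znode without putting a RegionState back into regionsInTransition.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the six steps; not HBase code, every name here is hypothetical. */
class StaleZnodeSketch {
    /** Stand-in for the master's in-memory regionsInTransition: region -> state. */
    final Map<String, String> regionsInTransition = new HashMap<>();
    /** Stand-in for the unassigned znodes in ZooKeeper: region -> event type. */
    final Map<String, String> unassignedZnodes = new HashMap<>();

    /** Steps 1-2, and the retry of step 4: the master writes the OFFLINE znode. */
    void masterBeginAttempt(String region, boolean firstAttempt) {
        if (firstAttempt) {
            regionsInTransition.put(region, "OFFLINE");
        }
        // Assumption of this model: a retry does not re-add the RegionState.
        unassignedZnodes.put(region, "M_ZK_REGION_OFFLINE");
    }

    /**
     * A region server updates the znode and the master reacts to the event.
     * Returns true if the master handled the event, false if it ignored it.
     */
    boolean rsTransition(String region, String event) {
        unassignedZnodes.put(region, event);
        if (!regionsInTransition.containsKey(region)) {
            return false; // step 5: no RegionState, so the event is ignored
        }
        if (event.equals("RS_ZK_REGION_OPENED")) {
            regionsInTransition.remove(region); // region is considered online
            unassignedZnodes.remove(region);    // unassigned node is deleted
        } else {
            regionsInTransition.put(region, event);
        }
        return true;
    }

    /** Step 6: an unassign can only start if no unassigned znode exists. */
    boolean masterUnassign(String region) {
        return unassignedZnodes.putIfAbsent(region, "RS_ZK_REGION_CLOSING") == null;
    }
}
```

In this model the first attempt completes normally, the OPENING event of the second attempt is ignored, and the leftover znode then blocks masterUnassign().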
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Attachment: (was: HBASE-6299-v3.patch)

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Fix For: 0.96.0, 0.92.3, 0.94.3
Attachments: HBASE-6299.patch, HBASE-6299-v2.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: Call to
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Attachment: HBASE-6299-v3.patch

@Lars the original unwrap should not work.
@Ted please review the patch.
@ramkrishna How about we apply this fix first and then update the patch for HBASE-6438? As far as I can see, HBASE-6438 is about another problem, but its patch includes my old fix.

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Fix For: 0.96.0, 0.92.3, 0.94.3
Attachments: HBASE-6299.patch, HBASE-6299-v2.patch, HBASE-6299-v3.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412618#comment-13412618 ]

Maryann Xue commented on HBASE-6299:

bq. I'm not sure what synchronize does. I suppose it prevents double assign

The interesting thing is that we check that the RegionState is OFFLINE or CLOSED before setting OFFLINE in ZK, and abort if the check fails, while we allow any RegionState before setting the RegionState to OFFLINE. And since this synchronization on RegionState does not guard the whole process (the state change from PEND_OPEN to OPENED), double assignment is not prevented at all. There is a check in setOfflineInZooKeeper(), but it only applies when hijack=true.

So far I've seen two error cases with double assign:

1. HBASE-5816: The second assign comes in at almost the same time as the first assignment, but gets blocked by synchronized(state). After the first assignment succeeds with sendRegionOpen() and exits the synchronized block, the second assignment enters the block and calls setOfflineInZooKeeper(), which fails the RegionState OFFLINE check and leads to master abort.
2. The second assignment kicks in after the first assignment has succeeded and deleted the ZK node, but before regionOnline() is called (which removes the region from AM.regionsInTransition and adds the region to AM.regions). The second assignment starts a normal assign process, setting RegionState OFFLINE, setting ZK OFFLINE, and calling sendRegionOpen() on the same dest RS. Then, when the first assignment calls AM.regionOnline(), this region gets removed from AM.regionsInTransition. This is a double assignment to the RS. If the RS chooses to cleanUpFailedOpen() as in 0.90, this region will be served nowhere and will not even exist in the master's regionsInTransition; if the RS chooses to proceed with openRegion() as in trunk, the master will get RS events OPENING and OPENED related to no RegionState, as in HBASE-6300.
I can see we check whether the ZK node exists in setOfflineInZooKeeper() to prevent double assignment, but this check is only effective when hijack=true. Is it possible to do something at an earlier stage to prevent double assignment, for example in forceRegionStateToOffline()?

bq. @Mary Is HBASE-5396 committed?

No... but explicitly calling assign() from HBaseAdmin can cause the same problem.

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING,
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411273#comment-13411273 ]

Maryann Xue commented on HBASE-6299:

Currently we don't check for concurrent double assignment, although it can happen quite easily after HBASE-5396.

{code}
RegionState state = addToRegionsInTransition(region, hijack);
synchronized (state) {
  assign(region, state, setOfflineInZK, forceNewPlan, hijack);
}
{code}

We now set RegionState OFFLINE in addToRegionsInTransition(), and set the ZK node OFFLINE after we get into the critical section. Why don't we set these two OFFLINE together in addToRegionsInTransition() and, after getting into the critical section, check whether RegionState is OFFLINE?

And with double assignment, we go straight ahead with the assignment without checking the current RegionState in addToRegionsInTransition(), which calls forceRegionStateToOffline(), and forceRegionStateToOffline() simply forces the RegionState to OFFLINE.

{code}
RegionState state = this.regionsInTransition.get(encodedName);
if (state == null) {
  state = new RegionState(region, RegionState.State.OFFLINE);
  this.regionsInTransition.put(encodedName, state);
} else {
  // If we are reassigning the node do not force in-memory state to OFFLINE.
  // Based on the znode state we will decide if to change in-memory state to
  // OFFLINE or not. It will be done before setting znode to OFFLINE state.
  // We often get here with state == CLOSED because ClosedRegionHandler will
  // assign on its tail as part of the handling of a region close.
  if (!hijack) {
    LOG.debug("Forcing OFFLINE; was=" + state);
    state.update(RegionState.State.OFFLINE);
  }
}
{code}

With this piece of code, we normally see logs like "Forcing OFFLINE; was=regionName state=CLOSED" during load balancing, but with double assignment we can see "Forcing OFFLINE; was=regionName state=OPEN". Should we ensure the state is CLOSED or OFFLINE before proceeding to assignment?
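The check asked about at the end of the comment above could look roughly like the sketch below. It is only an illustration of the idea, not the patch attached to this issue and not HBase API: the class, the State enum and tryForceOffline() are hypothetical names, and a plain map stands in for regionsInTransition.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical, simplified stand-in for the master's region-state tracking. */
class AssignGuardSketch {
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, CLOSED }

    /** Stand-in for regionsInTransition: encoded region name -> state. */
    private final ConcurrentMap<String, State> regionsInTransition =
        new ConcurrentHashMap<>();

    /** Records a state for a region, e.g. after an event has been handled. */
    void setState(String encodedName, State state) {
        regionsInTransition.put(encodedName, state);
    }

    State getState(String encodedName) {
        return regionsInTransition.get(encodedName);
    }

    /**
     * Moves the region to OFFLINE only if it is unknown, CLOSED or OFFLINE.
     * Returns true if the caller may go on with the assignment, false if the
     * region is in any other state and the request should be rejected.
     */
    boolean tryForceOffline(String encodedName) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so check and update cannot interleave.
        regionsInTransition.compute(encodedName, (name, current) -> {
            if (current == null || current == State.CLOSED || current == State.OFFLINE) {
                allowed[0] = true;
                return State.OFFLINE;
            }
            return current; // leave OPEN, OPENING, PENDING_OPEN untouched
        });
        return allowed[0];
    }
}
```

With such a guard, a second assign that arrives while the region is OPEN would be rejected instead of being forced to OFFLINE. It does not by itself stop two callers that both observe OFFLINE, so it would complement the synchronized block rather than replace it.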
RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.

{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING,
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406900#comment-13406900 ]

Maryann Xue commented on HBASE-6299:

@stack but assign() checks that the RegionState is OFFLINE at the beginning of each attempt, and not setting it OFFLINE might cause the master to abort, as in HBASE-5816:

{code}
for (int i = 0; i < this.maximumAssignmentAttempts; i++) {
  int versionOfOfflineNode = -1;
  if (setOfflineInZK) {
    // get the version of the znode after setting it to OFFLINE.
    // versionOfOfflineNode will be -1 if the znode was not set to OFFLINE
    versionOfOfflineNode = setOfflineInZooKeeper(state, hijack);
{code}

{code}
int setOfflineInZooKeeper(final RegionState state, boolean hijack) {
  // In case of reassignment the current state in memory need not be
  // OFFLINE.
  if (!hijack && !state.isClosed() && !state.isOffline()) {
    String msg = "Unexpected state : " + state + " .. Cannot transit it to OFFLINE.";
    this.master.abort(msg, new IllegalStateException(msg));
    return -1;
  }
{code}

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5.
But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.

{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for
CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407153#comment-13407153 ]

Maryann Xue commented on HBASE-6299:

Yes, agree!

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13407204#comment-13407204 ]

Maryann Xue commented on HBASE-6299:

I'm also thinking of moving this setOfflineInZK logic into forceRegionStateToOffline(). What do you think?

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299-v2.patch, HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open-region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response to the openRegion() call, and the call times out.
4. HMaster attempts to assign a second time, choosing another RS.
5. Because the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING written by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays in place, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
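The six steps in the issue description can be replayed with a toy model. This is only a sketch under simplifying assumptions: `AssignmentRaceSketch` and all of its methods are hypothetical names, not HBase classes, and ZooKeeper is reduced to an in-memory map. It shows why a transition reported after the RegionState has been removed leaves a stale unassigned node that blocks a later unassign.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the master-side bookkeeping; hypothetical, not HBase code. */
class AssignmentRaceSketch {
    // Regions the master currently tracks as "in transition".
    private final Set<String> regionsInTransition = new HashSet<>();
    // Stand-in for the unassigned ZK nodes: region -> node state.
    private final Map<String, String> unassignedZkNodes = new HashMap<>();

    /** Steps 1-2: master starts an assignment and records the RegionState. */
    void beginAssign(String region) {
        regionsInTransition.add(region);
        unassignedZkNodes.put(region, "M_ZK_REGION_OFFLINE");
    }

    /**
     * A region server updates the ZK node and the master reacts.
     * Returns false when the master ignores the event because it has
     * no RegionState for the region (step 5).
     */
    boolean onRegionServerTransition(String region, String newState) {
        unassignedZkNodes.put(region, newState); // the RS writes to ZK first
        if (!regionsInTransition.contains(region)) {
            return false; // ignored by the master; the node is left behind
        }
        if (newState.equals("RS_ZK_REGION_OPENED")) {
            regionsInTransition.remove(region); // OpenedRegionHandler cleanup
            unassignedZkNodes.remove(region);
        }
        return true;
    }

    /** Step 6: creating the CLOSING node fails if a node already exists. */
    boolean unassign(String region) {
        if (unassignedZkNodes.containsKey(region)) {
            return false;
        }
        unassignedZkNodes.put(region, "M_ZK_REGION_CLOSING");
        return true;
    }

    boolean hasStaleNode(String region) {
        return unassignedZkNodes.containsKey(region)
            && !regionsInTransition.contains(region);
    }
}
```

Replaying the scenario means calling `beginAssign`, letting the first server report OPENING and OPENED, and then letting a second server report OPENING: the last call is ignored and `unassign` subsequently fails.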
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Attachment: HBASE-6299-v2.patch

Make handling of RegionAlreadyInTransitionException work.
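One way to make a check for a wrapped RegionAlreadyInTransitionException robust is to walk the whole cause chain instead of testing only the outermost exception. The sketch below is an illustration, not the content of the attached patch: `CauseChainSketch`, `AlreadyInTransition`, and `ServiceWrapper` are hypothetical stand-ins, and it assumes the wrappers expose the inner exception through `getCause()`. Hadoop's RemoteException carries the remote class name rather than a cause, so real code would still call `unwrapRemoteException()` first.

```java
/** Hypothetical helper for finding a wrapped exception; not HBase code. */
class CauseChainSketch {
    /** Local stand-in for RegionAlreadyInTransitionException. */
    static class AlreadyInTransition extends Exception {
        AlreadyInTransition(String message) {
            super(message);
        }
    }

    /** Local stand-in for a wrapper such as ServiceException. */
    static class ServiceWrapper extends Exception {
        ServiceWrapper(Throwable cause) {
            super(cause);
        }
    }

    /**
     * Returns the first exception of the given type in the cause chain,
     * or null if none is found. The depth limit guards against cycles.
     */
    static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 16; cur = cur.getCause()) {
            if (type.isInstance(cur)) {
                return type.cast(cur);
            }
            depth++;
        }
        return null;
    }
}
```

With a helper like this, the caller can log and return when `findCause(t, AlreadyInTransition.class)` is non-null, regardless of how many wrappers sit around the original exception.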
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13406862#comment-13406862 ]

Maryann Xue commented on HBASE-6299:

Agreed, ramkrishna! You've made a good point here. My original idea was to return directly in the else branch and leave it to the TimeoutMonitor to assign this region if the RS did not open it. I changed to the current version in order to retry the assignment earlier. But given the region-in-transition problem you pointed out, the original return solution looks better.

{code}
else {
+  // The destination region server is probably processing the region open, so it
+  // might be safer to try this region server again to avoid having two region
+  // servers open the same region.
+  LOG.error("Call openRegion() to " + plan.getDestination() +
+      " has timed out when trying to assign " + region.getRegionNameAsString() +
+      ".", t);
+  return;
+ }
{code}

And if we are considering removing the assign retry in HBASE-6060, problems like this one and the one in HBASE-5816 can be avoided.

I think triggering SSH in the case of a SocketTimeout is a different problem. There are several places in HMaster where we should consider whether to start SSH, but currently only RegionServerTracker starts it. Shall we open another JIRA entry to discuss this issue?
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Attachment: HBASE-6299.patch

Add handling of SocketTimeoutException in assign():
1. Return if the region is already opened on this RS.
2. Otherwise, try assigning to the same RS again.
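The two rules described for this first patch can be written down as a small decision function. This is a hedged sketch, not the patch itself: `TimeoutHandlingSketch`, the `Action` enum, and the `regionsOnlineOnDestination` parameter are hypothetical, and how the master actually learns which regions are online on the destination server is left out.

```java
import java.net.SocketTimeoutException;
import java.util.Set;

/** Hypothetical decision logic for a failed openRegion() call; not HBase code. */
class TimeoutHandlingSketch {
    enum Action { DONE, RETRY_SAME_SERVER, RETRY_ELSEWHERE }

    /**
     * Decides what to do after openRegion() failed with exception t.
     * On a socket timeout the destination may still be opening the region,
     * so a different server is never chosen in that case.
     */
    static Action decide(Throwable t, String region,
                         Set<String> regionsOnlineOnDestination) {
        if (t instanceof SocketTimeoutException) {
            return regionsOnlineOnDestination.contains(region)
                ? Action.DONE               // rule 1: already opened on this RS
                : Action.RETRY_SAME_SERVER; // rule 2: retry the same RS
        }
        // Any other failure: fall back to picking another server.
        return Action.RETRY_ELSEWHERE;
    }
}
```

The point of keeping the retry on the same server is that two region servers are never asked to open the same region while the outcome of the first request is unknown.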
[jira] [Updated] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6299:

Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405459#comment-13405459 ]

Maryann Xue commented on HBASE-6299:

stack, thank you for pointing this out. I was thinking the innermost assign() would handle RegionAlreadyInTransitionException and return, in the following code block:
{code}
if (t instanceof RemoteException) {
  t = ((RemoteException) t).unwrapRemoteException();
  if (t instanceof RegionAlreadyInTransitionException) {
    String errorMsg = "Failed assignment in: " + plan.getDestination() + " due to " + t.getMessage();
    LOG.error(errorMsg, t);
    return;
  }
}
{code}
I just looked again at HRegionServer.openRegion() and found that RegionAlreadyInTransitionException is wrapped in a ServiceException:
{code}
} catch (RegionAlreadyInTransitionException rie) {
  LOG.warn("Region is already in transition", rie);
  if (isBulkAssign) {
    builder.addOpeningState(RegionOpeningState.OPENED);
  } else {
    throw new ServiceException(rie);
  }
{code}
But I don't see why HMaster, in assign(), does not unwrap the RemoteException and then the ServiceException as well. And since RegionAlreadyInTransitionException is always wrapped, I don't see in what situation the first code block would be reached. I might be missing something, or need a closer look?

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding.
However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign for a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code}
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29
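The wrapping question raised in this comment can be illustrated with a small sketch that is independent of HBase. It assumes the exception of interest may sit one or more levels down an ordinary Java cause chain, as when it is wrapped in a ServiceException. CauseChain and findCause are hypothetical names introduced here for illustration, and the sketch does not reproduce Hadoop's RemoteException.unwrapRemoteException().

```java
import java.util.Optional;

/** Hypothetical helper, not part of HBase: searches a cause chain for a given exception type. */
final class CauseChain {
    private CauseChain() {}

    /**
     * Walks t and its causes and returns the first throwable that is an
     * instance of the requested type, or an empty Optional if none is found.
     */
    static <T extends Throwable> Optional<T> findCause(Throwable t, Class<T> type) {
        // Bound the walk so that a cyclic cause chain cannot loop forever.
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (type.isInstance(cur)) {
                return Optional.of(type.cast(cur));
            }
        }
        return Optional.empty();
    }
}
```

With a helper of this shape, a caller could test for the wrapped exception type once, regardless of how many wrapper layers the RPC path adds, instead of checking only the outermost exception.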
[jira] [Commented] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected.
[ https://issues.apache.org/jira/browse/HBASE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405465#comment-13405465 ]

Maryann Xue commented on HBASE-6300:

Apart from what happened in HBASE-6299, so far I have seen nothing else that would cause this "RegionState null" warning. But if the master does end up in that branch, the cluster must be in a seriously inconsistent state: I suppose two region servers are holding this region, and very likely the master's region info differs from META.

Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected.

Key: HBASE-6300
URL: https://issues.apache.org/jira/browse/HBASE-6300
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue

When an RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most probably proceed to update the region location in META. If the master ignores the event, the region's location in HMaster becomes inconsistent with that in META. Not deleting this ZK node would also make further region transitions fail with the ZK exception "node already exists". So the master should either abort or fix this inconsistency.
{code}
case RS_ZK_REGION_OPENED:
  hri = checkIfInFailover(regionState, encodedName, regionName);
  if (hri != null) {
    regionState = new RegionState(hri, RegionState.State.OPEN, createTime, sn);
    regionsInTransition.put(encodedName, regionState);
    new OpenedRegionHandler(master, this, regionState.getRegion(), sn, expectedVersion).process();
    failoverProcessedRegions.put(encodedName, hri);
    break;
  }
  // Should see OPENED after OPENING but possible after PENDING_OPEN
  if (regionState == null ||
      (!regionState.isPendingOpen() && !regionState.isOpening())) {
    LOG.warn("Received OPENED for region " + prettyPrintedRegionName +
      " from server " + sn + " but region was in " +
      " the state " + regionState + " and not " +
      "in expected PENDING_OPEN or OPENING states");
    return;
  }
  // Handle OPENED by removing from transition and deleted zk node
  regionState.update(RegionState.State.OPEN, createTime, sn);
  this.executorService.submit(
    new OpenedRegionHandler(master, this, regionState.getRegion(), sn, expectedVersion));
  break;
{code}
Error logs:
{code}
2012-06-29 07:07:41,149 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,150 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states
2012-06-29 07:07:41,296 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,296 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states
2012-06-29 07:07:41,302
DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:07:41,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-006,60020,1340890678078, dest=swbss-hadoop-008,60020,1340891085175 2012-06-29 07:08:38,872 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. (offlining) 2012-06-29 07:08:47,875 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for
[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405500#comment-13405500 ]

Maryann Xue commented on HBASE-6289:

@Jieshan, doable, I think. But currently CatalogTracker acts more in an HBase client role and talks only to ZooKeeper and the region servers. I don't know whether that is its intended semantics.

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.

Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289-v2.patch, HBASE-6289-v2.patch, HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so the ROOT region is skipped for re-assignment.
{code}
private void verifyAndAssignRoot()
throws InterruptedException, IOException, KeeperException {
  long timeout = this.server.getConfiguration().
    getLong("hbase.catalog.verification.timeout", 1000);
  if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
    this.services.getAssignmentManager().assignRoot();
  }
}
{code}
After a few moments, this RS encounters a DFS write problem and decides to abort.
The RS is soon restarted from the command line, and then constantly reports:
{code}
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
{code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
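The gap described in this report (a liveness probe that succeeds against a server the master has already declared dead) can be sketched as a standalone decision function. This is only an illustration under stated assumptions: RootReassignPolicy, shouldReassignRoot, and the idea of comparing the recorded ROOT host with the expired server are hypothetical and are not taken from the attached patches.

```java
/** Hypothetical decision helper; names are illustrative and not taken from the HBase patches. */
final class RootReassignPolicy {
    private RootReassignPolicy() {}

    /**
     * @param locationVerified result of probing the recorded ROOT location
     * @param rootHost         server currently recorded as hosting ROOT (may be null)
     * @param expiredServer    server whose ZK node expired and is being processed as dead
     * @return true if ROOT should be re-assigned
     */
    static boolean shouldReassignRoot(boolean locationVerified, String rootHost, String expiredServer) {
        if (!locationVerified) {
            // Same outcome as the existing check: an unreachable ROOT is re-assigned.
            return true;
        }
        // Assumption: a ROOT host that is being shut down should not be trusted,
        // even if it still answers the probe.
        return rootHost != null && rootHost.equals(expiredServer);
    }
}
```

The second branch is the one the existing verifyAndAssignRoot() logic lacks: it treats a successful probe as sufficient, which is exactly the case the report says leaves ROOT unassigned once the RS finally aborts.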
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405507#comment-13405507 ]

Maryann Xue commented on HBASE-6299:

Thank you, Zhihong! Then I suppose the exception handling should be modified as follows:
{code}
if (t instanceof RegionAlreadyInTransitionException) {
  String errorMsg = "Failed assignment in: " + plan.getDestination() + " due to " + t.getMessage();
  LOG.error(errorMsg, t);
  return;
}
{code}

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign for a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405582#comment-13405582 ]

Maryann Xue commented on HBASE-6299:

It happened on a 0.90 cluster. I also checked the trunk code and assume the issue still exists there.

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.

Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6299.patch

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign for a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
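The six steps of the issue description can be replayed against a toy model of the master's bookkeeping. This is a minimal sketch, not HBase code: AssignRaceModel and its methods are hypothetical names, and the model only mirrors the rule stated in the description, namely that OPENING and OPENED events are applied when the region is tracked in regionsInTransition in an expected state and are ignored otherwise.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the master's regionsInTransition bookkeeping; not HBase code. */
final class AssignRaceModel {
    enum State { PENDING_OPEN, OPENING }

    private final Map<String, State> regionsInTransition = new HashMap<>();

    /** Step 2: the master records the region before asking an RS to open it. */
    void startAssign(String region) {
        regionsInTransition.put(region, State.PENDING_OPEN);
    }

    /** Handles an OPENING event; returns true if applied, false if ignored. */
    boolean onOpening(String region) {
        State s = regionsInTransition.get(region);
        if (s != State.PENDING_OPEN && s != State.OPENING) {
            return false; // state is null or unexpected: the event is dropped
        }
        regionsInTransition.put(region, State.OPENING);
        return true;
    }

    /** Handles an OPENED event; returns true if applied, false if ignored. */
    boolean onOpened(String region) {
        State s = regionsInTransition.get(region);
        if (s != State.PENDING_OPEN && s != State.OPENING) {
            return false;
        }
        // Mirrors the effect of the opened handler: the entry is removed once the open completes.
        regionsInTransition.remove(region);
        return true;
    }

    boolean isInTransition(String region) {
        return regionsInTransition.containsKey(region);
    }
}
```

Replaying the scenario: after startAssign, the first RS's OPENING and OPENED events are applied and the entry is removed; the second RS's OPENING and OPENED events (from the retry after the timeout) then find no entry and are ignored, which is the point at which the unassigned ZK node is left behind.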
[jira] [Created] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected (not in failover).
Maryann Xue created HBASE-6300:

Summary: Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected (not in failover).
Key: HBASE-6300
URL: https://issues.apache.org/jira/browse/HBASE-6300
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue

When an RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most probably proceed to update the region location in META. If the master ignores the event, the region's location in HMaster becomes inconsistent with that in META. Not deleting this ZK node would also make further region transitions fail with the ZK exception "node already exists". So the master should either abort or fix this inconsistency.
{code}
case RS_ZK_REGION_OPENED:
  hri = checkIfInFailover(regionState, encodedName, regionName);
  if (hri != null) {
    regionState = new RegionState(hri, RegionState.State.OPEN, createTime, sn);
    regionsInTransition.put(encodedName, regionState);
    new OpenedRegionHandler(master, this, regionState.getRegion(), sn, expectedVersion).process();
    failoverProcessedRegions.put(encodedName, hri);
    break;
  }
  // Should see OPENED after OPENING but possible after PENDING_OPEN
  if (regionState == null ||
      (!regionState.isPendingOpen() && !regionState.isOpening())) {
    LOG.warn("Received OPENED for region " + prettyPrintedRegionName +
      " from server " + sn + " but region was in " +
      " the state " + regionState + " and not " +
      "in expected PENDING_OPEN or OPENING states");
    return;
  }
  // Handle OPENED by removing from transition and deleted zk node
  regionState.update(RegionState.State.OPEN, createTime, sn);
  this.executorService.submit(
    new OpenedRegionHandler(master, this, regionState.getRegion(), sn, expectedVersion));
  break;
{code}
Error logs:
{code}
2012-06-29 07:07:41,149 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568
2012-06-29 07:07:41,150 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states 2012-06-29 07:07:41,296 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:07:41,296 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENING for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states 2012-06-29 07:07:41,302 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-164,60020,1340888346294, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:07:41,302 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region b713fd655fa02395496c5a6e39ddf568 from server swbss-hadoop-164,60020,1340888346294 but region was in the state null and not in expected PENDING_OPEN or OPENING states 2012-06-29 07:08:38,872 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-006,60020,1340890678078, dest=swbss-hadoop-008,60020,1340891085175 2012-06-29 07:08:38,872 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. (offlining) 2012-06-29 07:08:47,875 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. state=PENDING_CLOSE, ts=1340926468331, server=null 2012-06-29 08:04:37,681 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
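The request in this issue, that an unexpected OPENED event should trigger corrective action instead of being logged and dropped, can be sketched as a small decision function. This is an illustration only: OpenedEventPolicy, its State and Action enums, and the RECONCILE outcome are hypothetical names, and what reconciliation actually involves (aborting the master, or cleaning up the unassigned ZK node and deciding which server keeps the region) is left open, as it is in the description.

```java
/** Hypothetical policy sketch; not the AssignmentManager implementation. */
final class OpenedEventPolicy {
    /** Simplified stand-in for the master's view of a region in transition. */
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPEN, PENDING_CLOSE }

    /** What the master does with an RS_ZK_REGION_OPENED event. */
    enum Action { COMPLETE_OPEN, RECONCILE }

    private OpenedEventPolicy() {}

    /**
     * @param current the tracked state for the region, or null if the region
     *                is not in regionsInTransition at all
     */
    static Action onOpened(State current) {
        if (current == State.PENDING_OPEN || current == State.OPENING) {
            // Expected path: finish the open and delete the unassigned ZK node.
            return Action.COMPLETE_OPEN;
        }
        // Null or unexpected state: do not ignore the event, because the RS has
        // most probably already updated META and the ZK node would be left behind.
        return Action.RECONCILE;
    }
}
```

The only difference from the quoted handler is the second branch: where the existing code returns after a warning, this sketch returns an explicit outcome that the caller must act on.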
[jira] [Updated] (HBASE-6300) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected.
[ https://issues.apache.org/jira/browse/HBASE-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6300: --- Summary: Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected. (was: Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected (not in failover).) Master should not ignore event RS_ZK_REGION_OPENED when regionState is null or unexpected. -- Key: HBASE-6300 URL: https://issues.apache.org/jira/browse/HBASE-6300 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6, 0.94.0 Reporter: Maryann Xue Assignee: Maryann Xue When RS updates an unassigned ZK node to RS_ZK_REGION_OPENED, it will most probably proceed to update the region location in META. This would cause inconsistency between the region's location in HMaster and that in META. Not deleting this ZK node would also make further region transitions fail with ZK exception node already exists. So the master should either abort or fix this inconsistency. 
[jira] [Created] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
Maryann Xue created HBASE-6299:

Summary: RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign for a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has already been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
{code}
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078
2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.
to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: Call to /172.16.0.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.16.0.2:53765 remote=/172.16.0.6:60020] at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) at
[jira] [Commented] (HBASE-6299) RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
[ https://issues.apache.org/jira/browse/HBASE-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404682#comment-13404682 ]

Maryann Xue commented on HBASE-6299:

I think a good option would be to check, when handling the RPC failure, whether the region has already been assigned successfully, so that there is no need to start another attempt.

RS starts region open while fails ack to HMaster.sendRegionOpen() causes inconsistency in HMaster's region state and a series of successive problems.
Key: HBASE-6299
URL: https://issues.apache.org/jira/browse/HBASE-6299
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical

1. HMaster tries to assign a region to an RS.
2. HMaster creates a RegionState for this region and puts it into regionsInTransition.
3. In the first assign attempt, HMaster calls RS.openRegion(). The RS receives the open region request and starts processing it, eventually succeeding. However, due to network problems, HMaster fails to receive the response for the openRegion() call, and the call times out.
4. HMaster attempts to assign for a second time, choosing another RS.
5. But since the HMaster's OpenedRegionHandler has been triggered by the region open on the previous RS, and the RegionState has already been removed from regionsInTransition, HMaster treats the unassigned ZK node RS_ZK_REGION_OPENING updated by the second attempt as invalid and ignores it.
6. The unassigned ZK node stays, and a later unassign fails because RS_ZK_REGION_CLOSING cannot be created.
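The check suggested in the comment can be sketched with a toy model of the master's bookkeeping. This is only an illustration under stated assumptions, not HBase code: the class AssignRetrySketch, its fields, and the method shouldRetryAfterRpcFailure() are hypothetical names, and the real AssignmentManager tracks far more state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the master's bookkeeping; none of this is HBase code. */
final class AssignRetrySketch {
    // Regions the master currently tracks as "in transition".
    private final Set<String> regionsInTransition = new HashSet<>();
    // Region name -> server name, for regions the master considers online.
    private final Map<String, String> onlineRegions = new HashMap<>();

    /** Steps 1-2: the master starts an assignment and records the region as in transition. */
    void startAssign(String region) {
        regionsInTransition.add(region);
    }

    /** Models the OPENED handling: the open succeeded on the RS even if the RPC reply was lost. */
    void handleOpened(String region, String server) {
        regionsInTransition.remove(region);
        onlineRegions.put(region, server);
    }

    boolean inTransition(String region) {
        return regionsInTransition.contains(region);
    }

    /** Proposed check (hypothetical name): retry only if the region is not already online. */
    boolean shouldRetryAfterRpcFailure(String region) {
        return !onlineRegions.containsKey(region);
    }

    public static void main(String[] args) {
        AssignRetrySketch master = new AssignRetrySketch();
        master.startAssign("region-a");
        // RPC timed out and no OPENED event has been seen yet: a retry is reasonable.
        System.out.println(master.shouldRetryAfterRpcFailure("region-a"));
        // The OPENED event arrives from the first RS before the timeout is handled.
        master.handleOpened("region-a", "rs-1");
        // Now the timeout handler can skip the second attempt.
        System.out.println(master.shouldRetryAfterRpcFailure("region-a"));
    }
}
```

In this model the second assign attempt of step 4 is skipped once the OPENED event has been processed, so no stale RS_ZK_REGION_OPENING node would be left behind by a second RS.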
{code} 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568.; plan=hri=CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568., src=swbss-hadoop-004,60020,1340890123243, dest=swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to swbss-hadoop-006,60020,1340890678078 2012-06-29 07:03:38,870 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=swbss-hadoop-002:6, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:28,882 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,291 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=swbss-hadoop-006,60020,1340890678078, region=b713fd655fa02395496c5a6e39ddf568 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. 
from serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301); deleting unassigned node 2012-06-29 07:06:32,299 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Deleting existing unassigned node for b713fd655fa02395496c5a6e39ddf568 that is in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x2377fee2ae80007 Successfully deleted unassigned node for region b713fd655fa02395496c5a6e39ddf568 in expected state RS_ZK_REGION_OPENED 2012-06-29 07:06:32,301 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: The master has opened the region CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. that was online on serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=518945, regions=575, usedHeap=15282, maxHeap=31301) 2012-06-29 07:07:41,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of CDR_STATS_TRAFFIC,13184390567|20120508|17||2|3|913,1337256975556.b713fd655fa02395496c5a6e39ddf568. to serverName=swbss-hadoop-006,60020,1340890678078, load=(requests=0, regions=575, usedHeap=0, maxHeap=0), trying to
[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6289:

Attachment: HBASE-6289-v2.patch

Updated the patch.

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289-v2.patch, HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

{code}
private void verifyAndAssignRoot()
    throws InterruptedException, IOException, KeeperException {
  long timeout = this.server.getConfiguration().
    getLong("hbase.catalog.verification.timeout", 1000);
  if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
    this.services.getAssignmentManager().assignRoot();
  }
}
{code}

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

{code}
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
{code}

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404365#comment-13404365 ]

Maryann Xue commented on HBASE-6289:

@stack Thanks for the explanation! @Ted Sorry for my carelessness.

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289-v2.patch, HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

{code}
private void verifyAndAssignRoot()
    throws InterruptedException, IOException, KeeperException {
  long timeout = this.server.getConfiguration().
    getLong("hbase.catalog.verification.timeout", 1000);
  if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
    this.services.getAssignmentManager().assignRoot();
  }
}
{code}

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

{code}
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
{code}
[jira] [Created] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Maryann Xue created HBASE-6289:

Summary: ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Priority: Critical

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

  private void verifyAndAssignRoot()
      throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6289:

Attachment: HBASE-6289.patch

Add excluded server in verifyRootRegionLocation().

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Priority: Critical
Attachments: HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

  private void verifyAndAssignRoot()
      throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
[jira] [Updated] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6289:

Assignee: Maryann Xue
Status: Patch Available (was: Open)

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.94.0, 0.90.6
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

  private void verifyAndAssignRoot()
      throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403105#comment-13403105 ]

Maryann Xue commented on HBASE-6289:

@ramkrishna: Yes, I thought of this too. But I saw this comment before verifyAndAssignRoot(): "Before assign the ROOT region, ensure it haven't been assigned by other place." I am not sure whether this "ROOT assigned elsewhere" situation can actually occur, but we seem to have seen META assigned on several Region Servers at the same time when there was chaos in our lab's network. There can be only one single search path for any region (incl. META and ROOT), though, regardless of client cache. And this is what I don't understand: why do we treat ROOT differently?

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

  private void verifyAndAssignRoot()
      throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403106#comment-13403106 ]

Maryann Xue commented on HBASE-6289:

@ramkrishna: Yes, I thought of this too. But I saw this comment here before verifyAndAssignRoot(): "Before assign the ROOT region, ensure it haven't been assigned by other place." I am not sure whether this "ROOT assigned elsewhere" situation can actually occur, but we seem to have seen META assigned on several Region Servers at the same time when there was chaos in our lab's network. There can be only one single search path for any region (incl. META and ROOT), though, regardless of client cache. And this is what I don't understand: why do we treat ROOT differently?

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

  private void verifyAndAssignRoot()
      throws InterruptedException, IOException, KeeperException {
    long timeout = this.server.getConfiguration().
      getLong("hbase.catalog.verification.timeout", 1000);
    if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
      this.services.getAssignmentManager().assignRoot();
    }
  }

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
  2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
[jira] [Commented] (HBASE-6289) ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
[ https://issues.apache.org/jira/browse/HBASE-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403613#comment-13403613 ]

Maryann Xue commented on HBASE-6289:

@stack Thanks for the comments! If getRootServerLocation() returns null, verifyRootRegionLocation() will return false, so assignRoot() can be called. Thus, verifyAndAssignRoot() returns with success and there won't be a loop or retry here.

{code}
if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout, this.serverName)) {
  this.services.getAssignmentManager().assignRoot();
}
{code}

I think ramkrishna was asking why we verify ROOT before trying to assign it, while we assign META directly. That's my question as well.

ROOT region doesn't get re-assigned in ServerShutdownHandler if the RS is still working but only the RS's ZK node expires.
Key: HBASE-6289
URL: https://issues.apache.org/jira/browse/HBASE-6289
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
Assignee: Maryann Xue
Priority: Critical
Attachments: HBASE-6289.patch

The RS hosting ROOT has a network problem and its ZK node expires first, which kicks off the ServerShutdownHandler. The handler calls verifyAndAssignRoot() to try to re-assign ROOT. At that time, the RS is actually still working and passes the verifyRootRegionLocation() check, so re-assignment of the ROOT region is skipped.

{code}
private void verifyAndAssignRoot()
    throws InterruptedException, IOException, KeeperException {
  long timeout = this.server.getConfiguration().
    getLong("hbase.catalog.verification.timeout", 1000);
  if (!this.server.getCatalogTracker().verifyRootRegionLocation(timeout)) {
    this.services.getAssignmentManager().assignRoot();
  }
}
{code}

After a few moments, this RS encounters a DFS write problem and decides to abort. The RS is then restarted from the command line and constantly reports:

{code}
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,627 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,628 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
2012-06-27 23:13:08,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: -ROOT-,,0
{code}
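The effect of passing the expired server's name into the verification, as in the two-argument call discussed in this thread, can be sketched with a toy decision function. This is one possible reading of the idea, not the attached patch: the class RootVerifySketch, the three-parameter signature, and the boolean standing in for the remote verification call are all assumptions made for illustration.

```java
/** Toy model of the ROOT verification decision; names and signatures are hypothetical. */
final class RootVerifySketch {
    /**
     * @param rootLocation server recorded as hosting ROOT, or null if unknown
     * @param responds     whether that server answered the verification call in time
     * @param excluded     the server whose shutdown is being handled
     * @return true if ROOT can be left where it is
     */
    static boolean verifyRootRegionLocation(String rootLocation, boolean responds, String excluded) {
        if (rootLocation == null) {
            return false; // no known location: ROOT must be assigned
        }
        if (rootLocation.equals(excluded)) {
            return false; // ROOT sits on the expired server: do not trust it, even if it still answers
        }
        return responds;
    }

    /** Mirrors the shape of verifyAndAssignRoot(): assign only when verification fails. */
    static boolean shouldAssignRoot(String rootLocation, boolean responds, String excluded) {
        return !verifyRootRegionLocation(rootLocation, responds, excluded);
    }

    public static void main(String[] args) {
        // ROOT is on the expired server, which still answers RPCs: re-assign anyway.
        System.out.println(shouldAssignRoot("rs-1", true, "rs-1"));
        // ROOT is on a different, healthy server: leave it alone.
        System.out.println(shouldAssignRoot("rs-2", true, "rs-1"));
        // No known ROOT location: assign.
        System.out.println(shouldAssignRoot(null, false, "rs-1"));
    }
}
```

The first case is the one from the issue description: without the excluded-server argument, a still-responsive RS whose ZK node has expired passes the check and ROOT is never re-assigned.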
[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.
[ https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13393648#comment-13393648 ]

Maryann Xue commented on HBASE-6169:

Yes, you are right. It looks like this problem only exists in 0.90.

When a RS aborts without finishing closing a region, this region will always remain in transition.
Key: HBASE-6169
URL: https://issues.apache.org/jira/browse/HBASE-6169
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Maryann Xue

When an RS gets a ZK error while trying to create a CLOSING node in the process of closing a region, it aborts without finishing closing the region. The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to reassign this region because it is in PENDING_CLOSE state, while all regions that originally belonged to the dead RS are removed from the regions map. TimeoutMonitor then endlessly tries to unassign this region, logging the message "Region has been PENDING_CLOSE for too long". The unassign returns without doing anything, because this region does not exist in the regions map:

  public void unassign(HRegionInfo region, boolean force, ServerName dest) {
    // TODO: Method needs refactoring. Ugly buried returns throughout. Beware!
    LOG.debug("Starting unassignment of region " + region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {
      // Check if this region is currently assigned
      if (!regions.containsKey(region)) {
        LOG.debug("Attempted to unassign region " + region.getRegionNameAsString() + " but it is not " + "currently assigned anywhere");
        return;
      }
    }
    ...
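The stuck state described above can be reproduced in a toy model. This is a sketch under stated assumptions rather than HBase code: PendingCloseSketch and its methods are hypothetical, region names are plain strings, and the real unassign() does much more than the early return modeled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Toy model of a region stuck in PENDING_CLOSE; none of this is HBase code. */
final class PendingCloseSketch {
    // Regions the master believes are assigned somewhere (stands in for the regions map).
    final Set<String> regions = new HashSet<>();
    // Regions in transition that are waiting for a close to finish.
    final Set<String> pendingClose = new HashSet<>();

    /** The master asks an RS to close the region; the region enters PENDING_CLOSE. */
    void beginClose(String region) {
        pendingClose.add(region);
    }

    /** Models the shutdown handling: drops the dead RS's regions but leaves PENDING_CLOSE untouched. */
    void serverShutdown(Set<String> regionsOnDeadServer) {
        regions.removeAll(regionsOnDeadServer);
    }

    /** Models the early return quoted above; returns false when nothing was done. */
    boolean unassign(String region) {
        if (!regions.contains(region)) {
            return false; // "not currently assigned anywhere"
        }
        regions.remove(region);
        pendingClose.remove(region);
        return true;
    }

    /** One TimeoutMonitor pass: retries unassign for every pending region, returns how many remain. */
    int timeoutMonitorPass() {
        for (String region : new ArrayList<>(pendingClose)) {
            unassign(region);
        }
        return pendingClose.size();
    }

    public static void main(String[] args) {
        PendingCloseSketch master = new PendingCloseSketch();
        master.regions.add("region-a");
        master.beginClose("region-a");
        // The RS aborts before finishing the close; its regions are dropped from the map.
        master.serverShutdown(Collections.singleton("region-a"));
        // Every later pass finds the region still pending and can do nothing about it.
        System.out.println(master.timeoutMonitorPass());
        System.out.println(master.timeoutMonitorPass());
    }
}
```

In this model the region never leaves the pending set, because the only code path that would clear it is guarded by the membership check that the shutdown handling has already made fail.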
[jira] [Updated] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.
[ https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6169:

Affects Version/s: (was: 0.94.0)

When a RS aborts without finishing closing a region, this region will always remain in transition.
Key: HBASE-6169
URL: https://issues.apache.org/jira/browse/HBASE-6169
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Maryann Xue

When an RS gets a ZK error while trying to create a CLOSING node in the process of closing a region, it aborts without finishing closing the region. The RS is then discovered dead by HMaster. ServerShutdownHandler does not try to reassign this region because it is in PENDING_CLOSE state, while all regions that originally belonged to the dead RS are removed from the regions map. TimeoutMonitor then endlessly tries to unassign this region, logging the message "Region has been PENDING_CLOSE for too long". The unassign returns without doing anything, because this region does not exist in the regions map:

  public void unassign(HRegionInfo region, boolean force, ServerName dest) {
    // TODO: Method needs refactoring. Ugly buried returns throughout. Beware!
    LOG.debug("Starting unassignment of region " + region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {
      // Check if this region is currently assigned
      if (!regions.containsKey(region)) {
        LOG.debug("Attempted to unassign region " + region.getRegionNameAsString() + " but it is not " + "currently assigned anywhere");
        return;
      }
    }
    ...
[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.
[ https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13290857#comment-13290857 ]

Maryann Xue commented on HBASE-6169:

@ramkrishna, we actually found this problem when disabling a table, against 0.90. I suppose that with trunk this region would be cleared from RIT in ServerShutdownHandler. But I assume that in load balancing, since ServerShutdownHandler does nothing with PENDING_CLOSE or CLOSING regions, the situation described in this issue would still be triggered by TimeoutMonitor.

When a RS aborts without finishing closing a region, this region will always remain in transition.

Key: HBASE-6169
URL: https://issues.apache.org/jira/browse/HBASE-6169
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
[jira] [Created] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.
Maryann Xue created HBASE-6169: -- Summary: When a RS aborts without finishing closing a region, this region will always remain in transition. Key: HBASE-6169 URL: https://issues.apache.org/jira/browse/HBASE-6169 Project: HBase Issue Type: Bug Affects Versions: 0.94.0, 0.90.6 Reporter: Maryann Xue
[jira] [Commented] (HBASE-6169) When a RS aborts without finishing closing a region, this region will always remain in transition.
[ https://issues.apache.org/jira/browse/HBASE-6169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1329#comment-1329 ]

Maryann Xue commented on HBASE-6169:

I'm wondering whether it is safe to call AM.assign(region) instead of just returning, if we know this unassign request is coming from TimeoutMonitor.

When a RS aborts without finishing closing a region, this region will always remain in transition.

Key: HBASE-6169
URL: https://issues.apache.org/jira/browse/HBASE-6169
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.6, 0.94.0
Reporter: Maryann Xue
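The idea raised in this comment could look roughly like the following toy model. This is only a sketch of the question being asked, not a proposed patch and not HBase code: the class TimeoutReassignSketch, the fromTimeoutMonitor flag, and the fallbackServer parameter are all hypothetical names. Whether such a reassignment is actually safe in the real AssignmentManager is exactly what the comment leaves open.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; fromTimeoutMonitor and fallbackServer are hypothetical.
class TimeoutReassignSketch {
    // region name -> hosting server
    final Map<String, String> regions = new HashMap<>();
    // region name -> transition state
    final Map<String, String> regionsInTransition = new HashMap<>();

    void unassign(String region, boolean fromTimeoutMonitor, String fallbackServer) {
        synchronized (regions) {
            if (!regions.containsKey(region)) {
                // Instead of returning silently, a timeout-driven request
                // falls back to assigning the region somewhere.
                if (fromTimeoutMonitor) {
                    assign(region, fallbackServer);
                }
                return;
            }
        }
        regionsInTransition.put(region, "PENDING_CLOSE");
    }

    void assign(String region, String server) {
        synchronized (regions) { // reentrant when called from unassign
            regions.put(region, server);
        }
        regionsInTransition.remove(region);
    }
}
```

In this model, a region that is stuck in PENDING_CLOSE and absent from the regions map leaves the in-transition set on the first timeout-driven retry; requests from any other caller keep the original early return.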
[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()
[ https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6049:

Attachment: HBASE-6049-v3.patch

@stack, yes, there was a mistake. Updated the patch.

Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

Key: HBASE-6049
URL: https://issues.apache.org/jira/browse/HBASE-6049
Project: HBase
Issue Type: Bug
Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
Attachments: HBASE-6049-v2.patch, HBASE-6049-v3.patch, HBASE-6049.patch

One error case is in the coprocessor AggregationClient: the median() function handles an empty region and returns a List object whose first element is null. The NPE occurs in the RPC response stage, and the response never gets sent.
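The attached patches are not reproduced in this message, so the following is only a generic sketch of one common way to serialize a list that may contain nulls: write a presence flag before each element. The class NullSafeListIo and its wire format are assumptions for illustration and are unrelated to the actual HbaseObjectWritable encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the HBASE-6049 patch and not HBase's wire format.
class NullSafeListIo {
    // Writes the size, then for each element a boolean "present" flag
    // followed by the value only when it is non-null.
    static byte[] write(List<String> list) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(list.size());
            for (String s : list) {
                out.writeBoolean(s != null);
                if (s != null) {
                    out.writeUTF(s);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the format produced by write(), restoring nulls in place.
    static List<String> read(byte[] data) {
        List<String> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                result.add(in.readBoolean() ? in.readUTF() : null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With a flag per element, a list whose first element is null (as in the median() case described above) round-trips without dereferencing the null.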
[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()
[ https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-6049:

Attachment: HBASE-6049-v2.patch

@Zhihong, I updated the patch with a modification to the test case. How does this look?

Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()

Key: HBASE-6049
URL: https://issues.apache.org/jira/browse/HBASE-6049
Project: HBase
Issue Type: Bug
Components: io
Affects Versions: 0.94.0
Reporter: Maryann Xue
Attachments: HBASE-6049-v2.patch, HBASE-6049.patch
[jira] [Created] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()
Maryann Xue created HBASE-6049: -- Summary: Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject() Key: HBASE-6049 URL: https://issues.apache.org/jira/browse/HBASE-6049 Project: HBase Issue Type: Bug Components: io Affects Versions: 0.94.0 Reporter: Maryann Xue
[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()
[ https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6049: --- Attachment: HBASE-6049.patch handle null values in a list in writeObject() Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject() Key: HBASE-6049 URL: https://issues.apache.org/jira/browse/HBASE-6049 Project: HBase Issue Type: Bug Components: io Affects Versions: 0.94.0 Reporter: Maryann Xue Attachments: HBASE-6049.patch
[jira] [Updated] (HBASE-6049) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject()
[ https://issues.apache.org/jira/browse/HBASE-6049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6049: --- Status: Patch Available (was: Open) Serializing List containing null elements will cause NullPointerException in HbaseObjectWritable.writeObject() Key: HBASE-6049 URL: https://issues.apache.org/jira/browse/HBASE-6049 Project: HBase Issue Type: Bug Components: io Affects Versions: 0.94.0 Reporter: Maryann Xue Attachments: HBASE-6049.patch
[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
[ https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6029: --- Affects Version/s: 0.94.0 HBCK doesn't recover Balance switch if exception occurs in onlineHbck(). Key: HBASE-6029 URL: https://issues.apache.org/jira/browse/HBASE-6029 Project: HBase Issue Type: Bug Components: hbck Affects Versions: 0.94.0 Reporter: Maryann Xue
[jira] [Created] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
Maryann Xue created HBASE-6029: -- Summary: HBCK doesn't recover Balance switch if exception occurs in onlineHbck(). Key: HBASE-6029 URL: https://issues.apache.org/jira/browse/HBASE-6029 Project: HBase Issue Type: Bug Components: hbck Reporter: Maryann Xue
[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
[ https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6029: --- Attachment: HBASE-6029.patch add try-finally block to recover balance switch. HBCK doesn't recover Balance switch if exception occurs in onlineHbck(). Key: HBASE-6029 URL: https://issues.apache.org/jira/browse/HBASE-6029 Project: HBase Issue Type: Bug Components: hbck Affects Versions: 0.94.0 Reporter: Maryann Xue Attachments: HBASE-6029.patch
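The patch itself is not shown in this message. As a rough illustration of the try-finally approach it mentions, the sketch below restores a switch to its previous value even when the guarded work throws. The class BalancerSwitchSketch and its methods are hypothetical stand-ins, not the hbck or HBaseAdmin API.

```java
// Hypothetical stand-in for a cluster switch; not the hbck or HBaseAdmin API.
class BalancerSwitchSketch {
    private boolean balancerOn = true;

    // Sets the switch and returns its previous value.
    boolean setBalancer(boolean on) {
        boolean previous = balancerOn;
        balancerOn = on;
        return previous;
    }

    boolean isBalancerOn() {
        return balancerOn;
    }

    // Turns the balancer off for the duration of the check and restores the
    // previous value in finally, so an exception cannot leave it disabled.
    void runWithBalancerOff(Runnable check) {
        boolean previous = setBalancer(false);
        try {
            check.run();
        } finally {
            setBalancer(previous);
        }
    }
}
```

Restoring the saved previous value, rather than unconditionally turning the switch on, also keeps the balancer off if an operator had already disabled it before the check ran.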
[jira] [Updated] (HBASE-6029) HBCK doesn't recover Balance switch if exception occurs in onlineHbck().
[ https://issues.apache.org/jira/browse/HBASE-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-6029: --- Status: Patch Available (was: Open) HBCK doesn't recover Balance switch if exception occurs in onlineHbck(). Key: HBASE-6029 URL: https://issues.apache.org/jira/browse/HBASE-6029 Project: HBase Issue Type: Bug Components: hbck Affects Versions: 0.94.0 Reporter: Maryann Xue Attachments: HBASE-6029.patch
[jira] [Updated] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maryann Xue updated HBASE-5829:

Attachment: HBASE-5829-trunk.patch
            HBASE-5829-0.90.patch

Add corresponding operations to this.servers

Inconsistency between the regions map and the servers map in AssignmentManager

Key: HBASE-5829
URL: https://issues.apache.org/jira/browse/HBASE-5829
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.92.1
Reporter: Maryann Xue
Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch

There are occurrences in AM where this.servers is not kept consistent with this.regions. This might cause balancer to offline a region from the RS that already returned NotServingRegionException at a previous offline attempt.

In AssignmentManager.unassign(HRegionInfo, boolean)

    try {
      // TODO: We should consider making this look more like it does for the
      // region open where we catch all throwables and never abort
      if (serverManager.sendRegionClose(server, state.getRegion(),
          versionOfClosingNode)) {
        LOG.debug("Sent CLOSE to " + server + " for region " +
          region.getRegionNameAsString());
        return;
      }
      // This never happens. Currently regionserver close always return true.
      LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
        region.getRegionNameAsString());
    } catch (NotServingRegionException nsre) {
      LOG.info("Server " + server + " returned " + nsre + " for " +
        region.getRegionNameAsString());
      // Presume that master has stale data. Presume remote side just split.
      // Presume that the split message when it comes in will fix up the master's
      // in memory cluster state.
    } catch (Throwable t) {
      if (t instanceof RemoteException) {
        t = ((RemoteException)t).unwrapRemoteException();
        if (t instanceof NotServingRegionException) {
          if (checkIfRegionBelongsToDisabling(region)) {
            // Remove from the regionsinTransition map
            LOG.info("While trying to recover the table "
              + region.getTableNameAsString()
              + " to DISABLED state the region " + region
              + " was offlined but the table was in DISABLING state");
            synchronized (this.regionsInTransition) {
              this.regionsInTransition.remove(region.getEncodedName());
            }
            // Remove from the regionsMap
            synchronized (this.regions) {
              this.regions.remove(region);
            }
            deleteClosingOrClosedNode(region);
          }
        }
        // RS is already processing this region, only need to update the timestamp
        if (t instanceof RegionAlreadyInTransitionException) {
          LOG.debug("update " + state + " the timestamp.");
          state.update(state.getState());
        }
      }

In AssignmentManager.assign(HRegionInfo, RegionState, boolean, boolean, boolean)

    synchronized (this.regions) {
      this.regions.put(plan.getRegionInfo(), plan.getDestination());
    }
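One way to avoid this kind of drift is to route every update through helpers that touch both maps together. The sketch below is a generic illustration under that assumption, not the attached patch: the class RegionServerMaps and its string-keyed maps are hypothetical simplifications of this.regions and this.servers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simplification of the two AssignmentManager maps; not HBase code.
class RegionServerMaps {
    // region -> hosting server (stands in for this.regions)
    private final Map<String, String> regions = new HashMap<>();
    // server -> regions it hosts (stands in for this.servers)
    private final Map<String, Set<String>> servers = new HashMap<>();

    // Records an assignment in both maps, first clearing any old location.
    synchronized void put(String region, String server) {
        remove(region);
        regions.put(region, server);
        servers.computeIfAbsent(server, s -> new HashSet<>()).add(region);
    }

    // Removes a region from both maps in one step.
    synchronized void remove(String region) {
        String server = regions.remove(region);
        if (server != null) {
            Set<String> hosted = servers.get(server);
            if (hosted != null) {
                hosted.remove(region);
            }
        }
    }

    synchronized String serverOf(String region) {
        return regions.get(region);
    }

    // Returns a copy so callers cannot modify the internal set.
    synchronized Set<String> regionsOn(String server) {
        return new HashSet<>(servers.getOrDefault(server, new HashSet<>()));
    }
}
```

Because no caller can update one map without the other, a server-side view built from regionsOn() cannot list a region that the region-side view has already dropped, which is the stale state the report says can mislead the balancer.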
[jira] [Commented] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13261327#comment-13261327 ]

Maryann Xue commented on HBASE-5829:

@ For the second, I think we should guarantee that it is also added to the this.servers map.

Inconsistency between the regions map and the servers map in AssignmentManager

Key: HBASE-5829
URL: https://issues.apache.org/jira/browse/HBASE-5829
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.6, 0.92.1
Reporter: Maryann Xue
Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch
[jira] [Updated] (HBASE-5829) Inconsistency between the regions map and the servers map in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maryann Xue updated HBASE-5829: --- Status: Patch Available (was: Open) Inconsistency between the regions map and the servers map in AssignmentManager -- Key: HBASE-5829 URL: https://issues.apache.org/jira/browse/HBASE-5829 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.1, 0.90.6 Reporter: Maryann Xue Attachments: HBASE-5829-0.90.patch, HBASE-5829-trunk.patch