[jira] [Commented] (HBASE-12465) HBase master start fails due to incorrect file creations
[ https://issues.apache.org/jira/browse/HBASE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585314#comment-14585314 ]

Sudarshan Kadambi commented on HBASE-12465:
-------------------------------------------

This one was an unsecured cluster running 0.96.

HBase master start fails due to incorrect file creations
Key: HBASE-12465
URL: https://issues.apache.org/jira/browse/HBASE-12465
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.96.0
Environment: Ubuntu
Reporter: Biju Nair
Assignee: Alicia Ying Shu
Labels: hbase, hbase-bulkload

- Start of the HBase master fails due to the following error found in the log:

2014-11-11 20:25:58,860 WARN org.apache.hadoop.hbase.backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs:///hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_ on try #1
org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase, access=WRITE, inode=/hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_:devuser:supergroup:-rwxr-xr-x

- All the files the hbase master was complaining about were created under a user's user-id instead of the hbase user, leaving the master without the permissions it needs to act on them.
- This looks like the result of a bulk load done using the LoadIncrementalHFiles program.
- HBASE-12052 is another scenario similar to this one.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
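The failure above boils down to an ownership mismatch: bulk-loaded HFiles ended up owned by the submitting user rather than the hbase service user. As a rough, self-contained illustration of the kind of pre-restart sanity check an operator might script (this is plain JDK code against a local filesystem, not HBase or HDFS APIs; the directory and user names are invented for the example):

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.UserPrincipal;
import java.util.ArrayList;
import java.util.List;

public class OwnershipCheck {
    // Returns the files directly under `dir` whose owner is not `expectedUser`.
    // In the JIRA's scenario, files under /hbase/.tmp owned by anyone other
    // than the hbase user are exactly the ones the master cannot archive.
    public static List<Path> filesNotOwnedBy(Path dir, String expectedUser) throws IOException {
        List<Path> offenders = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                UserPrincipal owner = Files.getOwner(f);
                // endsWith rather than equals so a domain-qualified principal
                // (e.g. on Windows) still matches a bare user name.
                if (!owner.getName().endsWith(expectedUser)) {
                    offenders.add(f);
                }
            }
        }
        return offenders;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hfiles");
        Files.createFile(dir.resolve("block1"));
        String me = System.getProperty("user.name");
        // Everything we just created is owned by the current user, so scanning
        // for files NOT owned by us should find nothing.
        System.out.println(filesNotOwnedBy(dir, me).size()); // prints 0
    }
}
```

On a real cluster the equivalent scan (and the chown fix) would go through the HDFS client rather than java.nio.file; this sketch only shows the shape of the check.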
[jira] [Commented] (HBASE-12465) HBase master start fails due to incorrect file creations
[ https://issues.apache.org/jira/browse/HBASE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576457#comment-14576457 ]

Sudarshan Kadambi commented on HBASE-12465:
-------------------------------------------

Jeffrey: We ran into this issue on one of our clusters last week. Looking at your JIRA updates, I couldn't tell whether you were able to figure out a fix. Would sharing our logs be of any help here? Thanks!
[jira] [Created] (HBASE-12070) Add an option to hbck to fix ZK inconsistencies
Sudarshan Kadambi created HBASE-12070:
-------------------------------------

Summary: Add an option to hbck to fix ZK inconsistencies
Key: HBASE-12070
URL: https://issues.apache.org/jira/browse/HBASE-12070
Project: HBase
Issue Type: Bug
Reporter: Sudarshan Kadambi

If the HMaster bounces in the middle of table creation, we could be left in a state where a znode exists for the table but hasn't percolated into META or to HDFS. We've run into this a couple of times on our clusters. Once a table is in this state, the only fix is to rm the znode using the zookeeper-client. Doing this manually looks a bit error-prone. Could an option be added to hbck to catch and fix such inconsistencies?

A more general issue I'd like comment on is whether it makes sense for the HMaster to maintain its own write-ahead log. The idea would be that on a bounce, the master would discover it was in the middle of creating a table and either roll back or complete that operation. An issue we observed recently was that a table that was in DISABLING state before a bounce was not in that state after. A write-ahead log to persist table state changes seems useful. Now, all of this state could be in ZK instead of the WAL - it doesn't matter where it gets persisted as long as it does.
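The write-ahead-log idea in the report can be sketched very simply: record an intent before starting a multi-step operation, record completion after it finishes, and on restart treat any unmatched intent as an operation to roll back or complete. The following is a toy, file-backed rendition of that recovery pattern (all names are invented here; this is not how the HMaster actually persists state):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class TableOpIntentLog {
    private final Path log;

    public TableOpIntentLog(Path log) { this.log = log; }

    // Record intent before starting a multi-step operation (e.g. table creation).
    public void begin(String op) throws IOException {
        Files.write(log, List.of("BEGIN " + op),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Record completion once every step of the operation has finished.
    public void commit(String op) throws IOException {
        Files.write(log, List.of("COMMIT " + op),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // On restart: any BEGIN without a matching COMMIT was interrupted
    // mid-flight and must be rolled back or completed by recovery.
    public List<String> incompleteOps() throws IOException {
        Set<String> open = new LinkedHashSet<>();
        if (Files.exists(log)) {
            for (String line : Files.readAllLines(log)) {
                if (line.startsWith("BEGIN ")) open.add(line.substring(6));
                else if (line.startsWith("COMMIT ")) open.remove(line.substring(7));
            }
        }
        return new ArrayList<>(open);
    }

    public static void main(String[] args) throws IOException {
        TableOpIntentLog wal = new TableOpIntentLog(
                Files.createTempDirectory("master").resolve("intent.log"));
        wal.begin("create t1");              // interrupted: never committed
        wal.begin("create t2");
        wal.commit("create t2");             // completed normally
        System.out.println(wal.incompleteOps()); // prints [create t1]
    }
}
```

As the report notes, the same record could just as well live in ZK; the point is only that an unmatched intent record makes the DISABLING-state-lost-on-bounce class of problem detectable at startup.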
[jira] [Created] (HBASE-12071) Separate out thread pool for Master - RegionServer communication
Sudarshan Kadambi created HBASE-12071:
-------------------------------------

Summary: Separate out thread pool for Master - RegionServer communication
Key: HBASE-12071
URL: https://issues.apache.org/jira/browse/HBASE-12071
Project: HBase
Issue Type: Bug
Reporter: Sudarshan Kadambi

Over in HBASE-12028, there is a discussion about the case of a RegionServer still being alive despite all its handler threads being dead. One outcome of this is that the Master is left hanging on the RS for completion of various operations - such as region un-assignment when a table is disabled. Does it make sense to create a separate thread pool for communication between the Master and the RS? This would address not just the case of the RPC handler threads terminating, but also long-running queries or co-processor executions holding up master operations.
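To make the proposal concrete, here is a minimal sketch in plain java.util.concurrent (not actual HBase code; the pool names and the "unassign" task are invented) of keeping control-plane work on its own pool, so that a saturated data-plane pool cannot block it:

```java
import java.util.concurrent.*;

public class SplitPools {
    // Data-plane pool: serves client reads/writes; can be fully saturated
    // by long-running queries or coprocessor executions.
    static final ExecutorService dataHandlers = Executors.newFixedThreadPool(2);
    // Control-plane pool: reserved for Master -> RS operations such as
    // region unassignment, so they proceed even when dataHandlers is busy.
    static final ExecutorService masterHandlers = Executors.newFixedThreadPool(1);

    public static void main(String[] args) throws Exception {
        // Saturate the data-plane pool with long-running "queries".
        for (int i = 0; i < 2; i++) {
            dataHandlers.submit(() -> { Thread.sleep(5_000); return null; });
        }
        // A control operation on its own pool still completes immediately;
        // on a shared pool, this get() would time out instead.
        Future<String> unassign = masterHandlers.submit(() -> "region unassigned");
        System.out.println(unassign.get(1, TimeUnit.SECONDS)); // prints region unassigned
        dataHandlers.shutdownNow();
        masterHandlers.shutdown();
    }
}
```

The isolation is the whole point: the control pool's capacity is never consumed by data-path work, which is exactly the failure mode described in HBASE-12028 when the two share handler threads.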
[jira] [Created] (HBASE-12028) Abort the RegionServer when one of its handler threads dies
Sudarshan Kadambi created HBASE-12028:
-------------------------------------

Summary: Abort the RegionServer when one of its handler threads dies
Key: HBASE-12028
URL: https://issues.apache.org/jira/browse/HBASE-12028
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: Sudarshan Kadambi

Over in HBASE-11813, a user identified an issue wherein all the RPC handler threads would exit with StackOverflow errors due to an unchecked recursion-terminating condition. Our clusters demonstrated the same trace. While the patch posted for HBASE-11813 got our clusters to be merry again, the breakdown surfaced some larger issues.

When the RegionServer had all its RPC handler threads dead, it continued to have regions assigned to it. Clearly, it wouldn't be able to serve reads and writes on those regions. A second issue was that when a user tried to disable or drop a table, the master would try to communicate with the regionserver for region unassignment. Since the same handler threads seem to be used for master - RS communication as well, the master ended up hanging on the RS indefinitely. Eventually, the master stopped responding to all table meta-operations.

A handler thread should never exit, and if it does, it seems like the more prudent thing to do would be for the RS to abort. This way, at least recovery can be undertaken and the regions could be reassigned elsewhere. I also think that the master-RS communication should get its own exclusive threadpool, but I'll wait until this issue has been sufficiently discussed before opening an issue ticket for that.
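One way to get the "abort on handler death" behavior the report asks for, sketched with stock Java threading (the class and flag names are hypothetical; this is not the actual HBase abort path): install an UncaughtExceptionHandler on every handler thread that flips an abort flag the server's main loop checks, so a dying handler is never silently lost.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class AbortOnHandlerDeath {
    static final AtomicBoolean aborted = new AtomicBoolean(false);
    static final CountDownLatch died = new CountDownLatch(1);

    // Factory for handler threads: every handler gets the same death watch.
    static Thread newHandler(Runnable work) {
        Thread t = new Thread(work, "rpc-handler");
        // If the handler dies from an uncaught Throwable (including Errors
        // like StackOverflowError), request a server abort instead of letting
        // the process limp along with regions it can no longer serve.
        t.setUncaughtExceptionHandler((thread, err) -> {
            aborted.set(true);
            died.countDown();
        });
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        newHandler(() -> { throw new StackOverflowError("simulated"); }).start();
        died.await();
        System.out.println("abort requested: " + aborted.get()); // prints abort requested: true
    }
}
```

Aborting hands the regions to the normal recovery machinery, which is exactly the "at least recovery can be undertaken" outcome the report prefers over a half-alive RS.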
[jira] [Commented] (HBASE-12028) Abort the RegionServer when one of its handler threads dies
[ https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140767#comment-14140767 ]

Sudarshan Kadambi commented on HBASE-12028:
-------------------------------------------

It's also the case that if the handler thread exited because of some peculiarity in a given region (I'm still unclear about the root cause for HBASE-11813), moving that region off by aborting the RS could end up taking down the entire cluster rather than keeping the problem localized to a single RS.
[jira] [Commented] (HBASE-8894) Forward port compressed l2 cache from 0.89fb
[ https://issues.apache.org/jira/browse/HBASE-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913015#comment-13913015 ]

Sudarshan Kadambi commented on HBASE-8894:
------------------------------------------

Liang - Are you still running performance tests to see if storing compressed blocks in the L2 cache has any benefits? What are the next steps for integrating this into the mainline code?

Forward port compressed l2 cache from 0.89fb
Key: HBASE-8894
URL: https://issues.apache.org/jira/browse/HBASE-8894
Project: HBase
Issue Type: New Feature
Reporter: stack
Assignee: Liang Xie
Priority: Critical
Attachments: HBASE-8894-0.94-v1.txt, HBASE-8894-0.94-v2.txt

Forward port Alex's improvement on hbase-7407 from the 0.89-fb branch:

{code}
r1492797 | liyin | 2013-06-13 11:18:20 -0700 (Thu, 13 Jun 2013) | 43 lines

[master] Implements a secondary compressed cache (L2 cache)

Author: avf

Summary:
This revision implements a compressed and encoded second-level cache with off-heap
(and optionally on-heap) storage and a bucket-allocator based on HBASE-7404.

BucketCache from HBASE-7404 is extensively modified to:

* Only handle byte arrays (i.e., no more serialization/deserialization within)
* Remove persistence support for the time being
* Keep an index of hfilename to blocks for efficient eviction on close

A new interface (L2Cache) is introduced in order to separate it from the current
implementation. The L2 cache is then integrated into the classes that handle
reading from and writing to HFiles to allow cache-on-write as well as
cache-on-read. Metrics for the L2 cache are integrated into RegionServerMetrics
much in the same fashion as metrics for the existing BlockCache.

Additionally, the CacheConfig class is refactored to configure the L2 cache,
replace multiple constructors with a Builder, and replace static methods
for instantiating the caches with abstract factories (with singleton
implementations for both the existing LruBlockCache and the newly introduced
BucketCache-based L2 cache).

Test Plan:
1) Additional unit tests
2) Stress test on a single devserver
3) Test on a single node in a shadow cluster
4) Test on a whole shadow cluster

Revert Plan:

Reviewers: liyintang, aaiyer, rshroff, manukranthk, adela

Reviewed By: liyintang

CC: gqchen, hbase-eng@

Differential Revision: https://phabricator.fb.com/D837264

Task ID: 2325295

r1492340 | liyin | 2013-06-12 11:36:03 -0700 (Wed, 12 Jun 2013) | 21 lines
{code}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
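The commit message quoted above describes an L2 cache that deals only in byte arrays and keeps an hfilename-to-blocks index so all of a file's blocks can be evicted cheaply on close. A toy, in-memory rendition of that contract (the class and method names here are invented for illustration; the real L2Cache interface lives in the 0.89-fb branch and the attached patches):

```java
import java.util.HashMap;
import java.util.Map;

public class ToyL2Cache {
    // Keyed first by hfile name, then by block offset. Values are raw bytes:
    // per the commit message, no serialization/deserialization happens inside.
    private final Map<String, Map<Long, byte[]>> blocksByFile = new HashMap<>();

    public void cacheBlock(String hfileName, long offset, byte[] block) {
        blocksByFile.computeIfAbsent(hfileName, k -> new HashMap<>())
                    .put(offset, block);
    }

    public byte[] getBlock(String hfileName, long offset) {
        Map<Long, byte[]> blocks = blocksByFile.get(hfileName);
        return blocks == null ? null : blocks.get(offset);
    }

    // The per-hfilename index is what makes "evict everything for this file
    // on close" a single map removal instead of a scan of the whole cache.
    public int evictBlocksByHfileName(String hfileName) {
        Map<Long, byte[]> evicted = blocksByFile.remove(hfileName);
        return evicted == null ? 0 : evicted.size();
    }

    public static void main(String[] args) {
        ToyL2Cache cache = new ToyL2Cache();
        cache.cacheBlock("hfile-A", 0L, new byte[]{1, 2, 3});
        cache.cacheBlock("hfile-A", 64L, new byte[]{4});
        System.out.println(cache.evictBlocksByHfileName("hfile-A")); // prints 2
    }
}
```

The real implementation adds what this sketch omits: off-heap bucket-allocated storage, compression and encoding of the cached bytes, and metrics, but the byte-array-only surface and the eviction index are the parts the commit message calls out.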