[jira] [Commented] (HBASE-12465) HBase master start fails due to incorrect file creations

2015-06-14 Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585314#comment-14585314
 ] 

Sudarshan Kadambi commented on HBASE-12465:
---

This one was an unsecured cluster running 0.96.

 HBase master start fails due to incorrect file creations
 

 Key: HBASE-12465
 URL: https://issues.apache.org/jira/browse/HBASE-12465
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.96.0
 Environment: Ubuntu
Reporter: Biju Nair
Assignee: Alicia Ying Shu
  Labels: hbase, hbase-bulkload

 - Startup of the HBase master fails due to the following error found in the log:
 2014-11-11 20:25:58,860 WARN org.apache.hadoop.hbase.backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, file:hdfs:///hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_ on try #1
 org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase,access=WRITE,inode=/hbase/.tmp/data/default/tbl/00820520f5cb7839395e83f40c8d97c2/e/52bf9eee7a27460c8d9e2a26fa43c918_SeqId_282271246_:devuser:supergroup:-rwxr-xr-x
 - All the files the HBase master was complaining about were created under the loading user's user-id instead of the hbase user, leaving the master without the permissions it needs to act on them.
 - Looks like this was due to a bulk load done with the LoadIncrementalHFiles program (see the sketch below).
 - HBASE-12052 is another scenario similar to this one.
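
 One possible workaround, sketched under assumptions (the staging path, user, and group below are hypothetical, and fs.setOwner requires HDFS superuser privileges): chown the staged HFiles to the hbase user before running LoadIncrementalHFiles, so the master can later archive them. This is not the fix from this issue's eventual patch, just an illustration of the ownership problem.
 {code}
 // Hedged workaround sketch, not this issue's fix: recursively chown the
 // staged HFiles to the hbase user before the bulk load. Requires HDFS
 // superuser privileges; path and owner/group names are hypothetical.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class ChownStagedHFiles {
   static void chownRecursive(FileSystem fs, Path p, String user, String group)
       throws Exception {
     fs.setOwner(p, user, group);
     if (fs.isDirectory(p)) {
       for (FileStatus child : fs.listStatus(p)) {
         chownRecursive(fs, child.getPath(), user, group);
       }
     }
   }

   public static void main(String[] args) throws Exception {
     FileSystem fs = FileSystem.get(new Configuration());
     // Hypothetical staging directory produced by the bulk-load job.
     chownRecursive(fs, new Path("/user/devuser/bulkload-staging"),
         "hbase", "supergroup");
   }
 }
 {code}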





[jira] [Commented] (HBASE-12465) HBase master start fails due to incorrect file creations

2015-06-07 Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576457#comment-14576457
 ] 

Sudarshan Kadambi commented on HBASE-12465:
---

Jeffrey: We ran into this issue on one of our clusters last week. Looking at 
your JIRA updates, I couldn't make out whether you were able to figure out a 
fix. Would sharing our logs be of any help here? Thanks!






[jira] [Created] (HBASE-12070) Add an option to hbck to fix ZK inconsistencies

2014-09-23 Sudarshan Kadambi (JIRA)
Sudarshan Kadambi created HBASE-12070:
-

 Summary: Add an option to hbck to fix ZK inconsistencies
 Key: HBASE-12070
 URL: https://issues.apache.org/jira/browse/HBASE-12070
 Project: HBase
  Issue Type: Bug
Reporter: Sudarshan Kadambi


If the HMaster bounces in the middle of table creation, we could be left in a 
state where a znode exists for the table but hasn't percolated into META or 
HDFS. We've run into this a couple of times on our clusters. Once a table is 
in this state, the only fix is to rm the znode using the zookeeper client. 
Doing this manually looks a bit error-prone. Could an option be added to hbck 
to catch and fix such inconsistencies?
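
For concreteness, a hedged sketch of the manual fix such an hbck option might automate: deleting the orphan table znode. The quorum address and znode path are assumptions (0.96-era masters track table state under /hbase/table/<tableName>); adjust for the actual deployment.
{code}
// Hedged sketch of the manual fix an hbck option might automate: delete the
// orphan table znode. Quorum address and znode path are assumptions.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class RemoveOrphanTableZnode {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
      @Override public void process(WatchedEvent event) { /* no-op watcher */ }
    });
    try {
      String znode = "/hbase/table/mytable"; // hypothetical orphan table
      if (zk.exists(znode, false) != null) {
        zk.delete(znode, -1); // version -1 matches any znode version
      }
    } finally {
      zk.close();
    }
  }
}
{code}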

A more general issue I'd like comment on is whether it makes sense for the 
HMaster to maintain its own write-ahead log. The idea would be that on a 
bounce, the master would discover it was in the middle of creating a table and 
either roll back or complete that operation. An issue we observed recently was 
that a table that was in DISABLING state before a bounce was not in that state 
afterwards. A write-ahead log to persist table state changes seems useful. 
Now, all of this state could live in ZK instead of a WAL - it doesn't matter 
where it gets persisted, as long as it does.





[jira] [Created] (HBASE-12071) Separate out thread pool for Master - RegionServer communication

2014-09-23 Sudarshan Kadambi (JIRA)
Sudarshan Kadambi created HBASE-12071:
-

 Summary: Separate out thread pool for Master - RegionServer 
communication
 Key: HBASE-12071
 URL: https://issues.apache.org/jira/browse/HBASE-12071
 Project: HBase
  Issue Type: Bug
Reporter: Sudarshan Kadambi


Over in HBASE-12028, there is a discussion of the case of a RegionServer 
still being alive despite all its handler threads being dead. One outcome of 
this is that the Master is left hanging on the RS for completion of various 
operations - such as region unassignment when a table is disabled. Does it 
make sense to create a separate thread pool for communication between the 
Master and the RS? This would address not just the case of the RPC handler 
threads terminating, but also long-running queries or coprocessor executions 
holding up master operations.
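
To make the proposal concrete, here is a hedged java.util.concurrent sketch of the idea - not an actual HBase patch; the pool size, timeout, and method names are hypothetical:
{code}
// Sketch: master-initiated calls get their own bounded pool and a bounded
// wait, so dead or busy client RPC handlers cannot hang the master forever.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class MasterToRsPool {
  // Hypothetical pool reserved for master -> RS operations, sized
  // independently of the client handler count.
  private final ExecutorService masterOps = Executors.newFixedThreadPool(4);

  /** Run a master-initiated call (e.g. region unassignment) with a deadline. */
  public void runMasterOp(Runnable op) throws Exception {
    Future<?> f = masterOps.submit(op);
    try {
      f.get(60, TimeUnit.SECONDS); // bounded wait instead of hanging forever
    } catch (TimeoutException e) {
      f.cancel(true); // give up; let the master retry or expire the RS
    }
  }
}
{code}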





[jira] [Created] (HBASE-12028) Abort the RegionServer when one of its handler threads dies

2014-09-19 Sudarshan Kadambi (JIRA)
Sudarshan Kadambi created HBASE-12028:
-

 Summary: Abort the RegionServer when one of its handler threads dies
 Key: HBASE-12028
 URL: https://issues.apache.org/jira/browse/HBASE-12028
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Sudarshan Kadambi


Over in HBASE-11813, a user identified an issue wherein all the RPC handler 
threads would exit with StackOverflowErrors due to an unchecked 
recursion-terminating condition. Our clusters demonstrated the same trace. 
While the patch posted for HBASE-11813 got our clusters to be merry again, the 
breakdown surfaced some larger issues.

When all of the RegionServer's RPC handler threads were dead, it continued to 
have regions assigned to it. Clearly, it wouldn't be able to serve reads and 
writes on those regions. A second issue was that when a user tried to disable 
or drop a table, the master would try to communicate with the RegionServer 
for region unassignment. Since the same handler threads seem to be used for 
master-RS communication as well, the master ended up hanging on the RS 
indefinitely. Eventually, the master stopped responding to all table 
meta-operations.

A handler thread should never exit, and if it does, it seems like the more 
prudent thing to do would be for the RS to abort. This way, at least recovery 
can be undertaken and the regions can be reassigned elsewhere. I also think 
that the master-RS communication should get its own exclusive thread pool, but 
I'll wait until this issue has been sufficiently discussed before opening an 
issue ticket for that.
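
To illustrate the proposed abort-on-death behavior, a hedged sketch (not a committed fix; abort() here is a hypothetical stand-in for the RegionServer's real abort path):
{code}
// Sketch: install an UncaughtExceptionHandler on each RPC handler thread so
// that if one dies (e.g. from a StackOverflowError), the whole server aborts
// and recovery can reassign its regions elsewhere.
public class AbortOnHandlerDeath {
  static void abort(String why, Throwable cause) {
    System.err.println("Aborting region server: " + why);
    cause.printStackTrace();
    Runtime.getRuntime().halt(1); // fail fast rather than limp along
  }

  public static Thread newHandlerThread(Runnable handlerLoop, int id) {
    Thread t = new Thread(handlerLoop, "RpcHandler-" + id);
    t.setDaemon(true);
    t.setUncaughtExceptionHandler(
        (thread, err) -> abort("handler " + thread.getName() + " died", err));
    return t;
  }
}
{code}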





[jira] [Commented] (HBASE-12028) Abort the RegionServer when one of its handler threads dies

2014-09-19 Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140767#comment-14140767
 ] 

Sudarshan Kadambi commented on HBASE-12028:
---

It's also the case that if the handler thread exited because of peculiarities 
in a given region (I'm still unclear about the root cause of HBASE-11813), 
moving that region elsewhere by aborting the RS could end up taking down the 
entire cluster rather than keeping the problem localized to a single RS.






[jira] [Commented] (HBASE-8894) Forward port compressed l2 cache from 0.89fb

2014-02-26 Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913015#comment-13913015
 ] 

Sudarshan Kadambi commented on HBASE-8894:
--

Liang - Are you still running performance tests to see if storing compressed 
blocks in the L2 cache has any benefits? What are the next steps for 
integrating this into the mainline code?

 Forward port compressed l2 cache from 0.89fb
 

 Key: HBASE-8894
 URL: https://issues.apache.org/jira/browse/HBASE-8894
 Project: HBase
  Issue Type: New Feature
Reporter: stack
Assignee: Liang Xie
Priority: Critical
 Attachments: HBASE-8894-0.94-v1.txt, HBASE-8894-0.94-v2.txt


 Forward port Alex's improvement on HBASE-7407 from the 0.89-fb branch:
 {code}
 r1492797 | liyin | 2013-06-13 11:18:20 -0700 (Thu, 13 Jun 2013) | 43 lines

 [master] Implements a secondary compressed cache (L2 cache)

 Author: avf

 Summary:
 This revision implements a compressed and encoded second-level cache with off-heap (and optionally on-heap) storage and a bucket-allocator based on HBASE-7404.

 BucketCache from HBASE-7404 is extensively modified to:

 * Only handle byte arrays (i.e., no more serialization/deserialization within)
 * Remove persistence support for the time being
 * Keep an index of hfilename to blocks for efficient eviction on close

 A new interface (L2Cache) is introduced in order to separate it from the current implementation. The L2 cache is then integrated into the classes that handle reading from and writing to HFiles to allow cache-on-write as well as cache-on-read. Metrics for the L2 cache are integrated into RegionServerMetrics much in the same fashion as metrics for the existing (L1) BlockCache.

 Additionally, the CacheConfig class is refactored to configure the L2 cache, replace multiple constructors with a Builder, and replace static methods for instantiating the caches with abstract factories (with singleton implementations for both the existing LruBlockCache and the newly introduced BucketCache-based L2 cache).

 Test Plan:
 1) Additional unit tests
 2) Stress test on a single devserver
 3) Test on a single node in a shadow cluster
 4) Test on a whole shadow cluster

 Revert Plan:

 Reviewers: liyintang, aaiyer, rshroff, manukranthk, adela

 Reviewed By: liyintang

 CC: gqchen, hbase-eng@

 Differential Revision: https://phabricator.fb.com/D837264

 Task ID: 2325295

 r1492340 | liyin | 2013-06-12 11:36:03 -0700 (Wed, 12 Jun 2013) | 21 lines
 {code}
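
 Reading the summary above, the L2Cache interface it describes might look roughly like the following hedged sketch; the method and parameter names are guesses, not the 0.89-fb or ported code:
 {code}
 // Hedged reconstruction of the interface shape described above: byte-array
 // only blocks (callers compress/encode), keyed so all blocks of an hfile
 // can be evicted when it is closed. Names are assumptions.
 public interface L2Cache {
   /** Cache the already-compressed/encoded bytes of one block. */
   void cacheRawBlock(String hfileName, long offset, byte[] block);

   /** @return raw block bytes, or null on a miss; the caller decodes. */
   byte[] getRawBlock(String hfileName, long offset);

   /** Evict all cached blocks of an hfile, e.g. when it is closed. */
   int evictBlocksByHfileName(String hfileName);
 }
 {code}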


