[jira] [Resolved] (HBASE-22154) Facing issue with HA of HBase

2019-04-02 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-22154.

Resolution: Not A Problem

> Facing issue with HA of HBase
> -
>
> Key: HBASE-22154
> URL: https://issues.apache.org/jira/browse/HBASE-22154
> Project: HBase
>  Issue Type: Test
>Reporter: James
>Priority: Critical
>  Labels: /hbase-1.2.6.1
>
> Hi Team,
> I have set up an HA Hadoop cluster and done the same for HBase.
> When my active NameNode goes down, the standby NameNode becomes the active 
> NameNode. However, at the same time my backup HBase master does not become 
> the active HMaster (the active HMaster and RegionServer go down).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-22047) LeaseException in Scan should be retried

2019-03-13 Thread Allan Yang (JIRA)
Allan Yang created HBASE-22047:
--

 Summary: LeaseException in Scan should be retried
 Key: HBASE-22047
 URL: https://issues.apache.org/jira/browse/HBASE-22047
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.1.3, 2.0.4, 2.2.0
Reporter: Allan Yang


We should retry on LeaseException just as we do for other exceptions such as 
OutOfOrderScannerNextException and UnknownScannerException.
Code in ClientScanner:
{code:java}
if ((cause != null && cause instanceof NotServingRegionException) ||
    (cause != null && cause instanceof RegionServerStoppedException) ||
    e instanceof OutOfOrderScannerNextException ||
    e instanceof UnknownScannerException ||
    e instanceof ScannerResetException) {
  // Pass. It is easier writing the if loop test as list of what is allowed rather than
  // as a list of what is not allowed... so if in here, it means we do not throw.
  if (retriesLeft <= 0) {
    throw e; // no more retries
  }
{code}
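
For context, a minimal sketch of the kind of change being proposed here (not the 
committed patch): extend the same fragment so LeaseException is also treated as 
retryable.
{code:java}
// Sketch only: treat LeaseException as retryable alongside the existing cases.
if ((cause != null && cause instanceof NotServingRegionException) ||
    (cause != null && cause instanceof RegionServerStoppedException) ||
    e instanceof OutOfOrderScannerNextException ||
    e instanceof UnknownScannerException ||
    e instanceof ScannerResetException ||
    e instanceof LeaseException) {   // newly treated as retryable
  if (retriesLeft <= 0) {
    throw e; // no more retries
  }
  // fall through to the existing reset-and-retry logic
}
{code}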



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-22043) HMaster Went down

2019-03-12 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-22043.

Resolution: Not A Problem

> HMaster Went down
> -
>
> Key: HBASE-22043
> URL: https://issues.apache.org/jira/browse/HBASE-22043
> Project: HBase
>  Issue Type: Bug
>  Components: Admin
>Reporter: James
>Priority: Critical
>
> HMaster went down
> /hbase/WALs/regionserver80-XXXsplitting is non empty': Directory is 
> not empty



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21962) Filters do not work in ThriftTable

2019-02-26 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21962:
--

 Summary: Filters do not work in ThriftTable
 Key: HBASE-21962
 URL: https://issues.apache.org/jira/browse/HBASE-21962
 Project: HBase
  Issue Type: Sub-task
  Components: Thrift
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.2.0


Filters in ThriftTable are not working; this issue is to fix that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21809) Add retry thrift client for ThriftTable/Admin

2019-01-29 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21809:
--

 Summary: Add retry thrift client for ThriftTable/Admin
 Key: HBASE-21809
 URL: https://issues.apache.org/jira/browse/HBASE-21809
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.2.0


This adds a retry client so ThriftTable/Admin can handle exceptions like connection 
loss. It is only available for the HTTP thrift client. For clients using TSocket, it 
is not as easy to implement a retry client; that may come later.
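
Conceptually, a retry client wraps each thrift call and re-establishes the HTTP 
connection when it fails. A minimal, generic sketch under that assumption (hypothetical 
class and interface names, not the actual HBase implementation):
{code:java}
// Hypothetical helper, for illustration only; not the actual HBase retry client.
public final class RetryingThriftCaller {
  public interface ThriftCall<T> {
    T call() throws Exception;
  }

  public static <T> T callWithRetries(ThriftCall<T> op, int maxAttempts, long pauseMs)
      throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return op.call();                // e.g. a thrift get/put over HTTP
      } catch (Exception e) {            // e.g. connection loss on the transport
        last = e;
        Thread.sleep(pauseMs);           // back off, then rebuild the connection and retry
      }
    }
    throw last;
  }
}
{code}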



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor

2019-01-22 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21754:
--

 Summary: ReportRegionStateTransitionRequest should be executed in 
priority executor
 Key: HBASE-21754
 URL: https://issues.apache.org/jira/browse/HBASE-21754
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.4, 2.1.2
Reporter: Allan Yang
Assignee: Allan Yang


Currently, ReportRegionStateTransitionRequest is executed in the default handlers; only 
reports for system-table regions are executed in the priority handlers. That is because 
the master has only two kinds of handlers, default and priority (the replication handler 
is for replication specifically). If the transition reports for all regions were 
executed in the priority handlers, there would be a deadlock: other regions' transition 
reports could take all the handlers and then need to update meta, while the meta region 
could not report online since all the handlers are taken (this is addressed in the 
comments of MasterAnnotationReadingPriorityFunction).

But there is another deadlock case: a user's DDL requests (or other sync ops like 
moveRegion) can take over all the default handlers, making region transition reports 
impossible, so those sync ops can't complete either. A simple UT provided in the patch 
shows this case.

To resolve this problem, I added a new metaTransitionExecutor that executes the meta 
region's transition reports only, while all the other regions' reports are executed in 
the priority handlers, separating them from the user's requests.
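
A rough sketch of the dispatch idea (hypothetical names; the real change lives in the 
master's RPC scheduling code): meta-region reports go to a dedicated executor, all other 
reports go to the priority executor.
{code:java}
import java.util.concurrent.ExecutorService;

// Hypothetical routing sketch, for illustration only.
public final class TransitionReportRouter {
  private final ExecutorService metaTransitionExecutor;
  private final ExecutorService priorityExecutor;

  public TransitionReportRouter(ExecutorService metaExec, ExecutorService prioExec) {
    this.metaTransitionExecutor = metaExec;
    this.priorityExecutor = prioExec;
  }

  public void dispatch(Runnable report, boolean isMetaRegion) {
    if (isMetaRegion) {
      metaTransitionExecutor.execute(report);  // meta reports never wait behind other reports
    } else {
      priorityExecutor.execute(report);        // other reports skip the default (user) handlers
    }
  }
}
{code}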



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-14223) Meta WALs are not cleared if meta region was closed and RS aborts

2019-01-21 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-14223.

Resolution: Fixed

> Meta WALs are not cleared if meta region was closed and RS aborts
> -
>
> Key: HBASE-14223
> URL: https://issues.apache.org/jira/browse/HBASE-14223
> Project: HBase
>  Issue Type: Bug
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>Priority: Major
> Fix For: 3.0.0, 1.5.0, 2.2.0
>
> Attachments: HBASE-14223logs, hbase-14223_v0.patch, 
> hbase-14223_v1-branch-1.patch, hbase-14223_v2-branch-1.patch, 
> hbase-14223_v3-branch-1.patch, hbase-14223_v3-branch-1.patch, 
> hbase-14223_v3-master.patch
>
>
> When an RS opens meta, and later closes it, the WAL (FSHLog) is not closed. 
> The last WAL file just sits there in the RS WAL directory. If RS stops 
> gracefully, the WAL file for meta is deleted. Otherwise if RS aborts, WAL for 
> meta is not cleaned. It is also not split (which is correct) since master 
> determines that the RS no longer hosts meta at the time of RS abort. 
> From a cluster after running ITBLL with CM, I see a lot of {{-splitting}} 
> directories left uncleaned: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls 
> /apps/hbase/data/WALs
> Found 31 items
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 01:14 
> /apps/hbase/data/WALs/hregion-58203265
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 07:54 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433489308745-splitting
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 09:28 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433494382959-splitting
> drwxr-xr-x   - hbase hadoop  0 2015-06-05 10:01 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-1.openstacklocal,16020,1433498252205-splitting
> ...
> {code}
> The directories contain WALs from meta: 
> {code}
> [root@os-enis-dal-test-jun-4-7 cluster-os]# sudo -u hdfs hadoop fs -ls 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting
> Found 2 items
> -rw-r--r--   3 hbase hadoop 201608 2015-06-05 03:15 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
> -rw-r--r--   3 hbase hadoop  44420 2015-06-05 04:36 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> The RS hosted the meta region for some time: 
> {code}
> 2015-06-05 03:14:28,692 INFO  [PostOpenDeployTasks:1588230740] 
> zookeeper.MetaTableLocator: Setting hbase:meta region location in ZooKeeper 
> as os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285
> ...
> 2015-06-05 03:15:17,302 INFO  
> [RS_CLOSE_META-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Closed 
> hbase:meta,,1.1588230740
> {code}
> In between, a WAL is created: 
> {code}
> 2015-06-05 03:15:11,707 INFO  
> [RS_OPEN_META-os-enis-dal-test-jun-4-5:16020-0-MetaLogRoller] wal.FSHLog: 
> Rolled WAL 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433470511501.meta
>  with entries=385, filesize=196.88 KB; new WAL 
> /apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285..meta.1433474111645.meta
> {code}
> When CM killed the region server later master did not see these WAL files: 
> {code}
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:46,075 
> INFO  [MASTER_SERVER_OPERATIONS-os-enis-dal-test-jun-4-3:16000-0] 
> master.SplitLogManager: started splitting 2 logs in 
> [hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting]
>  for [os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285]
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:47,300 
> INFO  [main-EventThread] wal.WALSplitter: Archived processed log 
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-5.openstacklocal,16020,1433466904285-splitting/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
>  to 
> hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-5.openstacklocal%2C16020%2C1433466904285.default.1433475074436
> ./hbase-hbase-master-os-enis-dal-test-jun-4-3.log:2015-06-05 03:36:50,497 
> INFO  [main-EventThread] wal.WALSplitter: Archived processed 

[jira] [Created] (HBASE-21751) WAL create fails during region open may cause region assign forever fail

2019-01-21 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21751:
--

 Summary: WAL create fails during region open may cause region 
assign forever fail
 Key: HBASE-21751
 URL: https://issues.apache.org/jira/browse/HBASE-21751
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.4, 2.1.2
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 2.2.0, 2.1.3, 2.0.5


When the first region opens on the RS, WALFactory will create a WAL file, but if the 
WAL creation fails, in some cases HDFS will leave an empty file in the dir (e.g. the 
disk is full, or the file is created successfully but block allocation fails). We have 
a check in AbstractFSWAL that throws an error if a WAL belonging to the same factory 
already exists. Thus, the region can never be opened on this RS later.
{code:java}
2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
java.io.IOException: Target WAL already exists within directory 
hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
at 
org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:834)
{code}
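
One possible mitigation, sketched under the assumption that a zero-length leftover WAL 
file is safe to remove (this is not necessarily the committed fix): clean up an empty 
pre-existing WAL instead of failing the region open forever.
{code:java}
// Sketch only: tolerate a leftover zero-length WAL file instead of failing forever.
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class WalPreCheck {
  static void checkLeftoverWal(FileSystem fs, Path walPath) throws IOException {
    if (fs.exists(walPath)) {
      if (fs.getFileStatus(walPath).getLen() == 0) {
        fs.delete(walPath, false);   // empty leftover from a failed create: safe to drop
      } else {
        throw new IOException("Target WAL already exists within directory "
            + walPath.getParent());
      }
    }
  }
}
{code}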



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21652) Refactor ThriftServer making thrift2 server inherited from thrift1 server

2019-01-09 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-21652:


> Refactor ThriftServer making thrift2 server inherited from thrift1 server
> -
>
> Key: HBASE-21652
> URL: https://issues.apache.org/jira/browse/HBASE-21652
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21652.addendum.patch, HBASE-21652.branch-2.patch, 
> HBASE-21652.patch, HBASE-21652.v2.patch, HBASE-21652.v3.patch, 
> HBASE-21652.v4.patch, HBASE-21652.v5.patch, HBASE-21652.v6.patch, 
> HBASE-21652.v7.patch
>
>
> Apart from the different protocol, the thrift2 server should not differ much 
> from the thrift1 server. So this refactors the thrift server, making the thrift2 
> server inherit from the thrift1 server and getting rid of a lot of duplicated code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21661) Provide Thrift2 implementation of Table/Admin

2018-12-29 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21661:
--

 Summary: Provide Thrift2 implementation of Table/Admin
 Key: HBASE-21661
 URL: https://issues.apache.org/jira/browse/HBASE-21661
 Project: HBase
  Issue Type: Sub-task
 Environment: Provide a Thrift2 implementation of Table/Admin, making it easier 
for Java users to use the thrift client (some environments that cannot expose ZK 
or the RS servers directly require the thrift or REST protocol even from Java). 
Another example of this is RemoteHTable and RemoteAdmin, which are REST 
connectors.
Reporter: Allan Yang
Assignee: Allan Yang
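
Roughly, the goal is to let Java clients obtain a standard Table backed by thrift 
instead of the RPC client. A hypothetical usage sketch (the thrift connection class 
name below is a placeholder, not the real API):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

public final class ThriftTableUsage {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Placeholder class name: point the connection implementation at a thrift-backed client.
    conf.set("hbase.client.connection.impl", "org.example.ThriftConnection");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      // Same Table API as the normal client, but calls go through the thrift server.
      table.get(new Get("row1".getBytes()));
    }
  }
}
{code}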






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21652) Refactor ThriftServer making thrift2 server to support both thrift1 and thrift2 protocol

2018-12-27 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21652:
--

 Summary: Refactor ThriftServer making thrift2 server to support 
both thrift1 and thrift2 protocol
 Key: HBASE-21652
 URL: https://issues.apache.org/jira/browse/HBASE-21652
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang


Apart from the different protocol, the thrift2 server should not differ much from the 
thrift1 server. So this refactors the thrift server, making the thrift2 server inherit 
from the thrift1 server, getting rid of a lot of duplicated code, and letting the 
thrift2 server serve the thrift1 protocol at the same time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21650) Add DDL operation and some other miscellaneous to thrift2

2018-12-26 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21650:
--

 Summary: Add DDL operation and some other miscellaneous to thrift2
 Key: HBASE-21650
 URL: https://issues.apache.org/jira/browse/HBASE-21650
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.2.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21649) Complete Thrift2 to supersede Thrift1

2018-12-26 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21649:
--

 Summary: Complete Thrift2 to supersede Thrift1
 Key: HBASE-21649
 URL: https://issues.apache.org/jira/browse/HBASE-21649
 Project: HBase
  Issue Type: Umbrella
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.2.0


Thrift1 and Thrift2 have coexisted in our project for a very long time. Functionality 
is more complete in thrift1, but its interface design is bad for adding new features 
(we have get(), getVer(), getVerTs(), getRowWithColumns() and many other methods for a 
single get request, which is bad). Thrift2 has a cleaner interface and structure 
definition, making it easier for our users to use. But it has not been updated for a 
long time, and the lack of DDL methods is a major weakness.

I think we should complete Thrift2 so that it supersedes Thrift1, making Thrift2 the 
standard multi-language definition. This is an umbrella issue to make that happen. 
The plan would be:
1. Complete the DDL interface of thrift2.
2. Make the thrift2 server able to handle thrift1 requests, so users don't have to 
choose which thrift server to start.
3. Deprecate thrift1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21392) HTable can still write data after calling the close method.

2018-11-28 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-21392:


> HTable can still write data after calling the close method.
> ---
>
> Key: HBASE-21392
> URL: https://issues.apache.org/jira/browse/HBASE-21392
> Project: HBase
>  Issue Type: Improvement
>  Components: Client
>Affects Versions: 1.2.0, 2.1.0, 2.0.0
> Environment: HBase 1.2.0
>Reporter: lixiaobao
>Assignee: lixiaobao
>Priority: Major
> Attachments: HBASE-21392.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> HTable can still write data after calling the close method.
>  
> {code:java}
> val conn = ConnectionFactory.createConnection(conf)
> var table = conn.getTable(TableName.valueOf(tableName))
> val put = new Put(rowKey.getBytes())
> put.addColumn("cf".getBytes(), columnField.getBytes(), endTimeLong, 
> Bytes.toBytes(line.getLong(8)))
> table.put(put)
> //call table close() method
> table.close()
> //put again
> val put1 = new Put(rowKey4.getBytes())
> put1.addColumn("cf".getBytes(), columnField.getBytes(), endTimeLong, 
> Bytes.toBytes(line.getLong(8)))
> table.put(put1)
> {code}
>  
> After calling the close method, we can still write data into HBase. I think this 
> does not match the close semantics.
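
For reference, a minimal sketch of the intended close semantics (a hypothetical 
wrapper, not the actual HTable patch): reject writes once close() has been called.
{code:java}
// Hypothetical sketch of the intended behavior; not the actual HTable patch.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

final class ClosedCheckingTable {
  private final Table delegate;
  private volatile boolean closed = false;

  ClosedCheckingTable(Table delegate) {
    this.delegate = delegate;
  }

  void put(Put put) throws IOException {
    if (closed) {
      // Reject writes after close() instead of silently accepting them.
      throw new IOException("Table already closed: " + delegate.getName());
    }
    delegate.put(put);
  }

  void close() throws IOException {
    closed = true;
    delegate.close();
  }
}
{code}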



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21469) Re-visit post* hooks in DDL operations

2018-11-11 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21469:
--

 Summary: Re-visit post* hooks in DDL operations
 Key: HBASE-21469
 URL: https://issues.apache.org/jira/browse/HBASE-21469
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.2, 2.1.1
Reporter: Allan Yang
Assignee: Allan Yang


I had some discussion in HBASE-19953, starting from 
[here|https://issues.apache.org/jira/browse/HBASE-19953?focusedCommentId=16673126=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16673126]
In HBASE-19953, [~elserj] wanted to make sure that the post* hooks are called only 
when the procedures finish. But it accidentally turned the modify table and truncate 
table requests into sync calls, which makes client RPCs time out easily on big tables.
We should re-visit those post* hooks in DDL operations, because they are not 
consistent now:
For DDLs other than modify table and truncate table, although the call will wait on 
the latch, the latch is actually released just after the prepare state, so we still 
call the post* hooks before the operation finishes.
For modify table and truncate table, the latch is only released after the whole 
procedure finishes, so the effort works there (but can cause RPC timeouts).
I think these latches are designed for compatibility with 1.x clients. Take 
ModifyTable for example: in 1.x we use admin.getAlterStatus() to check the alter 
status, but in 2.x this method is deprecated and returns inaccurate results, so we 
have to make the 1.x client wait synchronously.
And for the semantics of the post* hooks in 1.x, we call them after the corresponding 
DDL request returns, but the DDL request may not actually be finished either, since 
we don't wait for region assignment.

So, here, we need to discuss the semantics of the post* hooks in DDL operations and 
make them consistent across all DDL operations. Do we really need to make sure these 
hooks are called only after the operation finishes? What's more, we already have the 
postCompleted* hooks for that need.
  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-11 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-21423.

Resolution: Fixed

Opened HBASE-21468 for the addendum, close this one

> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch, 
> HBASE-21423.branch-2.0.addendum.patch
>
>
> We have a higher priority for meta table procedures, but only at the queue level. 
> There is a case where the meta table is closed and an AssignProcedure (or RTSP 
> in branch-2+) is waiting to be executed, but at the same time all the worker 
> threads are executing procedures that need to write to the meta table; then 
> all the workers get stuck retrying their meta writes, and no worker will take 
> the AP for meta.
> Though we have a mechanism that detects a stuck executor and adds more 
> 'KeepAlive' workers to the pool to resolve it, by then the executor has 
> already been stuck for a long time.
> This is a real case I encountered in ITBLL.
> So I added an 'urgent worker' to the ProcedureExecutor which only takes meta 
> procedures (other workers can take meta procedures too), which resolves this 
> kind of stuck state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21468) separate workers for meta table is not working

2018-11-11 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21468:
--

 Summary: separate workers for meta table is not working
 Key: HBASE-21468
 URL: https://issues.apache.org/jira/browse/HBASE-21468
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.2, 2.1.1
Reporter: Allan Yang
Assignee: Allan Yang


This is an addendum for HBASE-21423; since HBASE-21423 is already closed, the QA won't 
be triggered there.
It was my mistake that the separate workers for the meta table are not working: when 
polling from the queue, the onlyUrgent flag is not passed in.
And for some UTs that require only one worker thread, the urgent workers should be set 
to 0 to ensure there is only one worker at a time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-21423:


> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We have a higher priority for meta table procedures, but only at the queue level. 
> There is a case where the meta table is closed and an AssignProcedure (or RTSP 
> in branch-2+) is waiting to be executed, but at the same time all the worker 
> threads are executing procedures that need to write to the meta table; then 
> all the workers get stuck retrying their meta writes, and no worker will take 
> the AP for meta.
> Though we have a mechanism that detects a stuck executor and adds more 
> 'KeepAlive' workers to the pool to resolve it, by then the executor has 
> already been stuck for a long time.
> This is a real case I encountered in ITBLL.
> So I added an 'urgent worker' to the ProcedureExecutor which only takes meta 
> procedures (other workers can take meta procedures too), which resolves this 
> kind of stuck state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21423) Procedures for meta table/region should be able to executed in separate workers

2018-11-01 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21423:
--

 Summary: Procedures for meta table/region should be able to 
executed in separate workers 
 Key: HBASE-21423
 URL: https://issues.apache.org/jira/browse/HBASE-21423
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.2, 2.1.1
Reporter: Allan Yang
Assignee: Allan Yang


We have a higher priority for meta table procedures, but only at the queue level. 
There is a case where the meta table is closed and an AssignProcedure (or RTSP in 
branch-2+) is waiting to be executed, but at the same time all the worker threads are 
executing procedures that need to write to the meta table; then all the workers get 
stuck retrying their meta writes, and no worker will take the AP for meta.
Though we have a mechanism that detects a stuck executor and adds more 'KeepAlive' 
workers to the pool to resolve it, by then the executor has already been stuck for a 
long time.
This is a real case I encountered in ITBLL.
So I added an 'urgent worker' to the ProcedureExecutor which only takes meta 
procedures (other workers can take meta procedures too), which resolves this kind of 
stuck state.
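
A rough sketch of the urgent-worker idea (hypothetical names; the real change is 
inside ProcedureExecutor and MasterProcedureScheduler): a dedicated worker polls only 
meta procedures, while regular workers poll everything.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a worker pool with one "urgent" worker reserved for meta procedures.
final class UrgentWorkerSketch {
  // Two logical queues: meta-only procedures and everything else.
  private final BlockingQueue<Runnable> metaQueue = new LinkedBlockingQueue<>();
  private final BlockingQueue<Runnable> generalQueue = new LinkedBlockingQueue<>();

  Runnable pollForWorker(boolean onlyUrgent) throws InterruptedException {
    if (onlyUrgent) {
      // The urgent worker never blocks behind general procedures.
      return metaQueue.take();
    }
    // Regular workers prefer meta work if available, otherwise take general work.
    Runnable meta = metaQueue.poll();
    return meta != null ? meta : generalQueue.take();
  }
}
{code}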



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21421) Do not kill RS if reportOnlineRegions fails

2018-11-01 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21421:
--

 Summary: Do not kill RS if reportOnlineRegions fails
 Key: HBASE-21421
 URL: https://issues.apache.org/jira/browse/HBASE-21421
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.2, 2.1.1
Reporter: Allan Yang
Assignee: Allan Yang


In the periodic regionServerReport call from the RS to the master, we check 
master.getAssignmentManager().reportOnlineRegions() to make sure the RS does not have 
a state different from the master's. If the RS holds a region which the master thinks 
should be on another RS, the master will kill the RS.

But the regionServerReport could be lagging (due to the network or something else), so 
it can't represent the current state of the RegionServer. Besides, when onlining a 
region we call reportRegionStateTransition and retry forever until it is successfully 
reported to the master, so we can count on the reportRegionStateTransition calls.

I have encountered cases where regions were closed on the RS and 
reportRegionStateTransition reached the master successfully. But later, a lagging 
regionServerReport told the master the region was online on that RS (which was not 
true at that moment; the call may have been generated some time ago and delayed by the 
network somehow). The master then thought the region should be on another RS and 
killed the RS, which it should not do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21395) Abort split/merge procedure if there is a table procedure of the same table going on

2018-10-26 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21395:
--

 Summary: Abort split/merge procedure if there is a table procedure 
of the same table going on
 Key: HBASE-21395
 URL: https://issues.apache.org/jira/browse/HBASE-21395
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


In my ITBLL runs, I often see that if a split/merge procedure and a table procedure 
(like ModifyTableProcedure) happen at the same time, race conditions between these two 
kinds of procedures cause serious problems, e.g. the split/merged parent is brought 
back online by the table procedure, or the split/merged region makes the whole table 
procedure roll back.
Talked with [~Apache9] offline today: this kind of problem was solved in branch-2+, 
since there is a fence so that only one RTSP can run against a single region at the 
same time.
To keep out of this mess in branch-2.0 and branch-2.1, I added a simple safety fence 
in the split/merge procedure: if there is a table procedure going on against the same 
table, the split/merge procedure aborts. Aborting the split/merge procedure at the 
beginning of its execution is no big deal, compared with the mess it would otherwise 
cause...




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21364) Procedure holds the lock should put to front of the queue after restart

2018-10-24 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-21364.

   Resolution: Fixed
Fix Version/s: 2.2.0
   3.0.0

> Procedure holds the lock should put to front of the queue after restart
> ---
>
> Key: HBASE-21364
> URL: https://issues.apache.org/jira/browse/HBASE-21364
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.3
>
> Attachments: HBASE-21364.branch-2.0.001.patch, 
> HBASE-21364.branch-2.0.002.patch
>
>
> After restoring procedures from the procedure WALs, we put the runnable 
> procedures back into the queue to execute. The order was not a problem before 
> HBASE-20846, since the first one to execute would acquire the lock itself. But 
> since HBASE-20846 the locks are restored as well. If we execute a procedure 
> without the lock before a procedure with the lock in the same queue, there is 
> a race condition where we may not be able to execute any procedure in the 
> same queue at all.
> The race condition is:
> 1. A procedure that needs to take the table's exclusive lock is put into the 
> table's queue, but the table's shared lock is held by a region procedure. 
> Since no one takes the exclusive lock, the queue is put onto the run queue to 
> execute. But soon, the worker thread sees that the procedure can't execute 
> because it doesn't hold the lock, so it stops executing and removes the queue 
> from the run queue.
> 2. At the same time, the region procedure which holds the table's shared lock 
> and the region's exclusive lock is put into the table's queue. But since the 
> queue was already added to the run queue, it won't be added again.
> 3. Because of 1, the table's queue is removed from the run queue.
> 4. Then no one will put the table's queue back, and no worker will execute 
> the procedures inside it.
> A test case in the patch shows how.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21384) Procedure with holdlock=false should not be restored lock when restarts

2018-10-24 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21384:
--

 Summary: Procedure with holdlock=false should not be restored lock 
when restarts 
 Key: HBASE-21384
 URL: https://issues.apache.org/jira/browse/HBASE-21384
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang


Yet another stuck case similar to HBASE-21364.
The case is:
1. A ModifyProcedure spawned a ReopenTableProcedure, and since its holdLock=false, it 
released the lock.
2. The ReopenTableProcedure spawned several MoveRegionProcedures; it also has 
holdLock=false, but just after it stored the child procedures to the WAL and began to 
release the lock, the master was killed.
3. When restarting, the ReopenTableProcedure's lock was restored (since it held the 
lock before), which is not right, since it is in the WAITING state now and its 
holdLock=false.
4. After the restart, the MoveRegionProcedure could execute since its parent had the 
lock, but when it spawned an AssignProcedure, the AssignProcedure couldn't execute 
anymore, since its parent didn't hold the lock but its 'grandparent', the 
ReopenTableProcedure, did.
5. Restarting the master again doesn't help, because we restore the lock for the 
ReopenTableProcedure every time.

Two fixes:
1. We should not restore the lock if the procedure doesn't hold the lock and is in the 
WAITING state.
2. Procedures that don't hold the lock but whose parent does should also be put at the 
front of the queue, as an addendum to HBASE-21364.

Discussion:
Should we check the locks of all ancestors, not only the parent? As addressed in the 
comments of the patch, after fixing the issue above, checking the parent is enough.
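
A fully hypothetical sketch of fix 1 (names do not match the real Procedure API), just 
to make the restore-time check concrete:
{code:java}
// Fully hypothetical sketch of fix 1; names do not match the real Procedure API.
enum ProcState { RUNNABLE, WAITING, SUCCESS, ROLLEDBACK }

final class ProcInfo {
  boolean holdLock;            // the procedure's holdLock flag
  boolean hadLockBeforeCrash;  // whether the WAL says it held the lock
  ProcState state;
}

final class LockRestoreCheck {
  static boolean shouldRestoreLock(ProcInfo p) {
    // Finished procedures never need their lock back (see HBASE-21050).
    if (p.state == ProcState.SUCCESS || p.state == ProcState.ROLLEDBACK) {
      return false;
    }
    // A WAITING procedure with holdLock=false had already released its lock before
    // the crash, so restoring it would block its children forever.
    if (!p.holdLock && p.state == ProcState.WAITING) {
      return false;
    }
    return p.hadLockBeforeCrash;
  }
}
{code}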




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-10-23 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21376:
--

 Summary: Add some verbose log to MasterProcedureScheduler
 Key: HBASE-21376
 URL: https://issues.apache.org/jira/browse/HBASE-21376
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang


As discussed in HBASE-21364, we divided the patch there into two parts. The critical 
one has already been committed via HBASE-21364 to branch-2.0 and branch-2.1, but I 
also added some useful logs which need to be committed to all branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21364) Procedure holds the lock should put to front of the queue after restart

2018-10-23 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21364:
--

 Summary: Procedure holds the lock should put to front of the queue 
after restart
 Key: HBASE-21364
 URL: https://issues.apache.org/jira/browse/HBASE-21364
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


After restoring procedures from the procedure WALs, we put the runnable procedures 
back into the queue to execute. The order was not a problem before HBASE-20846, since 
the first one to execute would acquire the lock itself. But since HBASE-20846 the 
locks are restored as well. If we execute a procedure without the lock before a 
procedure with the lock in the same queue, there is a race condition where we may not 
be able to execute any procedure in the same queue at all.
The race condition is:
1. A procedure that needs to take the table's exclusive lock is put into the table's 
queue, but the table's shared lock is held by a region procedure. Since no one takes 
the exclusive lock, the queue is put onto the run queue to execute. But soon, the 
worker thread sees that the procedure can't execute because it doesn't hold the lock, 
so it stops executing and removes the queue from the run queue.
2. At the same time, the region procedure which holds the table's shared lock and the 
region's exclusive lock is put into the table's queue. But since the queue was already 
added to the run queue, it won't be added again.
3. Because of 1, the table's queue is removed from the run queue.
4. Then no one will put the table's queue back, and no worker will execute the 
procedures inside it.
A test case in the patch shows how.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21357) RS should abort if OOM in Reader thread

2018-10-22 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21357:
--

 Summary: RS should abort if OOM in Reader thread
 Key: HBASE-21357
 URL: https://issues.apache.org/jira/browse/HBASE-21357
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.8
Reporter: Allan Yang
Assignee: Allan Yang


It is a bit strange: we abort the RS on OOM in the Listener thread, the Responder 
thread and the CallRunner threads, but not in the Reader threads... 
We should abort the RS if OOM happens in a Reader thread, too. If we don't, the reader 
thread exits because of the OOM and its selector closes. Later connections assigned to 
this reader will be ignored:
{quote}
try {
  if (key.isValid()) {
if (key.isAcceptable())
  doAccept(key);
  }
} catch (IOException ignored) {
  if (LOG.isTraceEnabled()) LOG.trace("ignored", ignored);
}
{quote}
This leaves the client's (or the Master's and other RSs') calls waiting until a 
SocketTimeout.
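
A minimal sketch of the proposed behavior (hypothetical surrounding class; the real 
change belongs in the RpcServer Reader's run loop): catch OutOfMemoryError and abort 
the server instead of letting the thread die silently.
{code:java}
// Hypothetical sketch; the real change belongs in RpcServer's Reader thread.
final class ReaderLoopSketch {
  interface Aborter {
    void abort(String reason, Throwable cause);
  }

  void runReaderLoop(Runnable doOneRead, Aborter server) {
    try {
      while (true) {
        doOneRead.run();                 // select + read RPC requests
      }
    } catch (OutOfMemoryError oom) {
      // Mirror what Listener/Responder/CallRunner already do: abort the RS
      // rather than leaving a dead reader whose connections silently hang.
      server.abort("OutOfMemoryError in RPC reader thread", oom);
    }
  }
}
{code}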



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21354) Procedure may be deleted improperly during master restarts resulting in

2018-10-20 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21354:
--

 Summary: Procedure may be deleted improperly during master 
restarts resulting in 
 Key: HBASE-21354
 URL: https://issues.apache.org/jira/browse/HBASE-21354
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21292) IdLock.getLockEntry() may hang if interrupted

2018-10-11 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21292:
--

 Summary: IdLock.getLockEntry() may hang if interrupted
 Key: HBASE-21292
 URL: https://issues.apache.org/jira/browse/HBASE-21292
 Project: HBase
  Issue Type: Bug
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 1.4.9, 2.0.2, 2.1.0


This is a rare case found by a colleague of mine which really happened in our 
production environment.
A thread may hang (or enter an infinite loop) when trying to call 
IdLock.getLockEntry(). Here is the case:
1. Thread1 owned the IdLock, while Thread2 (the only waiter) was waiting for it.
2. Thread1 called releaseLockEntry, which set IdLock.locked = false, but since Thread2 
was waiting, it did not call map.remove(entry.id).
3. While Thread1 was calling releaseLockEntry, Thread2 was interrupted. So no one 
removed this IdLock entry from the map.
4. If another thread then tries to call getLockEntry for this id, it ends up in an 
infinite loop, since (existing = map.putIfAbsent(entry.id, entry)) != null and 
existing.locked == false.

It is hard to write a UT for this since it is a very rare race condition.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-10 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21288:
--

 Summary: HostingServer in UnassignProcedure is not accurate
 Key: HBASE-21288
 URL: https://issues.apache.org/jira/browse/HBASE-21288
 Project: HBase
  Issue Type: Sub-task
  Components: amv2, Balancer
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


We had a case where a region showed status OPEN on an already-dead server in the meta 
table (it is hard to trace how this happened), meaning the region was actually not 
online. But the balancer came along and scheduled a MoveRegionProcedure for this 
region, which created a mess:
The balancer 'thought' this region was on the server with the same address (but a 
different startcode). So it scheduled an MRP from this online server to another, but 
the UnassignProcedure dispatched the unassign call to the dead server according to the 
region state, found the server dead, and scheduled an SCP for the dead server. But 
since the UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt 
it.
So, in the end, the SCP can't finish since the UnassignProcedure holds the region's 
lock, and the UnassignProcedure can't finish since no one wakes it; thus it is stuck.

Here is the log. Notice that the server of the UnassignProcedure is 
'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was dispatched 
to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'.

{code}
2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
assignment.RegionTransitionProcedure(230): Remote call failed 
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584

//Then a SCP was scheduled
2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
server not online
2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): 
Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; 
ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964, splitWal=true, meta=false

//The SCP did not interrupt the UnassignProcedure but schedule new 
AssignProcedure for this region
2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): 
Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, 
hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964, splitWal=true, meta=false
2018-10-10 14:34:50,054 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1691): 
Initialized subprocedures=[{pid=15, ppid=14, 
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, 
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, 
{pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
{code}


Here I also added a safety fence in the balancer: if such regions are found, balancing 
is skipped to be safe. It should do no harm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21253) Backport HBASE-21244 Skip persistence when retrying for assignment related procedures to branch-2.0 and branch-2.1

2018-09-28 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21253:
--

 Summary: Backport HBASE-21244 Skip persistence when retrying for 
assignment related procedures to branch-2.0 and branch-2.1 
 Key: HBASE-21253
 URL: https://issues.apache.org/jira/browse/HBASE-21253
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


See HBASE-21244



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21237) Use CompatRemoteProcedureResolver to dispatch open/close region requests to RS

2018-09-26 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21237:
--

 Summary: Use CompatRemoteProcedureResolver to dispatch open/close 
region requests to RS
 Key: HBASE-21237
 URL: https://issues.apache.org/jira/browse/HBASE-21237
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


As discussed in HBASE-21217, in branch-2.0 and branch-2.1 we should use 
CompatRemoteProcedureResolver instead of ExecuteProceduresRemoteCall to dispatch 
region open/close requests to the RS. ExecuteProceduresRemoteCall groups all the 
open/close operations into one call and executes them sequentially on the target RS; 
if one operation fails, all the operations are marked as failed. Actually, some of the 
operations (like open region) are already executing in the open-region handler thread, 
but the master thinks these operations failed and reassigns the regions to another RS. 
So when the previous RS reports to the master that the region is online, the master 
kills that RS, since it has already assigned the region to another RS.
For branch-2.2+, HBASE-21217 will fix this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21228) Memory leak since AbstractFSWAL caches Thread object and never clean later

2018-09-25 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21228:
--

 Summary: Memory leak since AbstractFSWAL caches Thread object and 
never clean later
 Key: HBASE-21228
 URL: https://issues.apache.org/jira/browse/HBASE-21228
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.7, 2.0.2, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


In AbstractFSWAL (FSHLog in branch-1), we have a map that caches threads and 
SyncFutures.
{code}
/**
 * Map of {@link SyncFuture}s keyed by Handler objects. Used so we reuse SyncFutures.
 *
 * TODO: Reuse FSWALEntry's rather than create them anew each time as we do SyncFutures here.
 *
 * TODO: Add a FSWalEntry and SyncFuture as thread locals on handlers rather than have them get
 * them from this Map?
 */
private final ConcurrentMap<Thread, SyncFuture> syncFuturesByHandler;
{code}

A colleague of mine found a memory leak caused by this map.

Every thread that writes to the WAL is cached in this map, and no one cleans the 
threads out of the map even after a thread is dead.

In one of our customers' clusters, we noticed that even though there were no requests, 
the heap of the RS was almost full and a CMS GC was triggered every second.
We dumped the heap and found more than 30 thousand threads in the Terminated state, 
all cached in the map above. Everything referenced by these threads was leaked. Most 
of the threads were:
1. PostOpenDeployTasksThread, which writes the open-region mark in the WAL.
2. hconnection-0x1f838e31-shared--pool threads, which are used to write index short 
circuit (Phoenix); the WAL is written and synced in these threads.
3. Index writer threads (Phoenix), which are referenced by RegionEnvironment, then by 
HRegion, and finally by PostOpenDeployTasksThread.

We should turn this map into a thread-local, and let the JVM GC the terminated threads 
for us.
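
A minimal sketch of that direction (not the committed patch), using a placeholder in 
place of HBase's SyncFuture:
{code:java}
// Sketch only: a ThreadLocal cache instead of ConcurrentMap<Thread, SyncFuture>.
// SyncFutureStub stands in for HBase's SyncFuture; it is a placeholder here.
final class SyncFutureCacheSketch {
  static final class SyncFutureStub { /* reusable per-handler sync state */ }

  // Each handler thread gets its own cached instance; when the thread terminates,
  // the entry becomes unreachable and the JVM can collect it.
  private final ThreadLocal<SyncFutureStub> cachedSyncFutures =
      ThreadLocal.withInitial(SyncFutureStub::new);

  SyncFutureStub getSyncFuture() {
    return cachedSyncFutures.get();
  }
}
{code}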




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21212) Wrong flush time when update flush metric

2018-09-20 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21212:
--

 Summary: Wrong flush time when update flush metric
 Key: HBASE-21212
 URL: https://issues.apache.org/jira/browse/HBASE-21212
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.2, 2.1.0, 3.0.0
Reporter: Allan Yang
Assignee: Allan Yang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21085) Adding getter methods to some private fields in ProcedureV2 module

2018-08-21 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21085:
--

 Summary: Adding getter methods to some private fields in 
ProcedureV2 module 
 Key: HBASE-21085
 URL: https://issues.apache.org/jira/browse/HBASE-21085
 Project: HBase
  Issue Type: Sub-task
Reporter: Allan Yang
Assignee: Allan Yang


Many fields in the ProcedureV2 module are private; adding getter methods to them makes 
them more transparent.
Some classes in the ProcedureV2 module are private too; this makes them public.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21083) Introduce a mechanism to bypass the execution of a stuck procedure

2018-08-21 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21083:
--

 Summary: Introduce a mechanism to bypass the execution of a stuck 
procedure
 Key: HBASE-21083
 URL: https://issues.apache.org/jira/browse/HBASE-21083
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Discussed offline with [~stack] and [~Apache9]. We all agreed that we need to 
introduce a mechanism to 'force complete' a stuck procedure, so that AMv2 can continue 
running.
We still have some unrevealed bugs hiding in our AMv2 and ProcedureV2 system, and we 
need a way to interfere with stuck procedures before HBCK2 can work. This is crucial 
for a production-ready system.

For now, we have few ways to interfere with running procedures. Aborting them is not a 
good choice, since some procedures are not abortable, and some procedures may have 
overridden the abort() method so that they ignore the abort request.

So, here, I will introduce a mechanism to bypass the execution of a stuck procedure.
Basically, I added a field called 'bypass' to the Procedure class. If we set this 
field to true, all the logic in execute/rollback is skipped, letting this procedure 
and its ancestors complete normally and release their lock resources at the end.

Notice that bypassing a procedure may leave the cluster in an intermediate state, e.g. 
a region not assigned, or some HDFS files left behind.
Operators need to know the side effects of bypassing and recover the inconsistent 
state of the cluster themselves, for example by issuing new procedures to assign the 
regions.

A patch will be uploaded and a review board will be opened. For now, only APIs in 
ProcedureExecutor are provided. If everything is fine, I will add it to the master 
service and add a shell command to bypass a procedure. Or maybe we can use dynamically 
compiled JSPs to execute those APIs, as mentioned in HBASE-20679.
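
A fully hypothetical sketch of the bypass idea (names do not match the real 
ProcedureV2 API): when the flag is set, the executor skips the procedure's own logic 
and just completes it so its locks are released.
{code:java}
// Fully hypothetical sketch of the bypass mechanism; not the actual ProcedureV2 API.
final class BypassSketch {
  interface Proc {
    boolean isBypassed();            // the proposed 'bypass' flag
    Object[] execute() throws Exception;
    void markFinished();             // complete normally, releasing its locks
  }

  static void execOne(Proc proc) throws Exception {
    if (proc.isBypassed()) {
      // Skip execute()/rollback() entirely; the procedure (and later its parents)
      // completes as if it had run, so the scheduler can release its locks.
      proc.markFinished();
      return;
    }
    proc.execute();
  }
}
{code}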




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21051) Possible NPE if ModifyTable and region split happen at the same time

2018-08-14 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21051:
--

 Summary: Possible NPE if ModifyTable and region split happen at 
the same time
 Key: HBASE-21051
 URL: https://issues.apache.org/jira/browse/HBASE-21051
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Similar to HBASE-20921: the ModifyTable procedure and the reopen procedure don't hold 
the lock, so other procedures like split/merge can execute at the same time.

1. A split happened during ModifyTable; as you can see from the log, the split was 
nearly complete.
{code}
2018-08-05 01:28:31,339 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1659): 
Finished subprocedure(s) of pid=772, 
state=RUNNABLE:SPLIT_TABLE_REGION_POST_OPERATION, hasLock=true; 
SplitTableRegionProce
dure table=IntegrationTestBigLinkedList, 
parent=357a7a6a62c76bc2d7ab30a6cc812637, 
daughterA=b13e5d155b65a5f752f3adda78fcfb6a, 
daughterB=5be3aadcee68d91c3d1e464865550246; resume parent processing.
2018-08-05 01:28:31,345 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1296): 
Finished pid=795, ppid=772, state=SUCCESS, hasLock=false; AssignProcedure 
table=IntegrationTestBigLinkedList, region=b13e5
d155b65a5f752f3adda78fcfb6a, target=e010125048016.bja,60020,1533402809226 in 
5.0280sec
{code}

2. The reopen procedure began to reopen the region by moving it
{code}
2018-08-05 01:28:31,389 INFO  [PEWorker-11] 
procedure.MasterProcedureScheduler(631): pid=781, ppid=774, 
state=RUNNABLE:MOVE_REGION_UNASSIGN, hasLock=false; MoveRegionProcedure 
hri=357a7a6a62c76bc2d7ab3
0a6cc812637, source=e010125048016.bja,60020,1533402809226, 
destination=e010125048016.bja,60020,1533402809226 checking lock on 
357a7a6a62c76bc2d7ab30a6cc812637
2018-08-05 01:28:31,390 INFO  [PEWorker-3] procedure2.ProcedureExecutor(1296): 
Finished pid=772, state=SUCCESS, hasLock=false; SplitTableRegionProcedure 
table=IntegrationTestBigLinkedList, parent=357a7
a6a62c76bc2d7ab30a6cc812637, daughterA=b13e5d155b65a5f752f3adda78fcfb6a, 
daughterB=5be3aadcee68d91c3d1e464865550246 in 21.9050sec
2018-08-05 01:28:31,518 INFO  [PEWorker-11] procedure2.ProcedureExecutor(1533): 
Initialized subprocedures=[{pid=797, ppid=781, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedur
e table=IntegrationTestBigLinkedList, region=357a7a6a62c76bc2d7ab30a6cc812637, 
server=e010125048016.bja,60020,1533402809226}]
2018-08-05 01:28:31,530 INFO  [PEWorker-15] 
procedure.MasterProcedureScheduler(631): pid=797, ppid=781, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedure 
table=IntegrationTest
BigLinkedList, region=357a7a6a62c76bc2d7ab30a6cc812637, 
server=e010125048016.bja,60020,1533402809226 checking lock on 
357a7a6a62c76bc2d7ab30a6cc812637
{code}

3. The MoveRegionProcedure fails since the region does not exist any more (due to the 
split)
{code}
2018-08-05 01:28:31,543 ERROR [PEWorker-15] procedure2.ProcedureExecutor(1517): 
CODE-BUG: Uncaught runtime exception: pid=797, ppid=781, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; Unassig
nProcedure table=IntegrationTestBigLinkedList, 
region=357a7a6a62c76bc2d7ab30a6cc812637, 
server=e010125048016.bja,60020,1533402809226
java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at 
org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1097)
at 
org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1125)
at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1455)
at 
org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:204)
at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:349)
at 
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:101)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1498)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1278)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1785)
{code}

We need to think about this case and find a proper solution for it; otherwise, issues 
like this one and HBASE-20921 will keep coming.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21050) Exclusive lock may be held by a SUCCESS state procedure forever

2018-08-14 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21050:
--

 Summary: Exclusive lock may be held by a SUCCESS state procedure 
forever
 Key: HBASE-21050
 URL: https://issues.apache.org/jira/browse/HBASE-21050
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


After HBASE-20846, we restore lock info for procedures. But there is a case where the 
lock can be held by an already-successful procedure. Since the procedure won't execute 
again, the lock will be held by it forever.

1. All children of pid=1208 had finished, but before procedure 1208 woke up, the 
master was killed
{code}
2018-08-05 02:20:14,465 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1659): 
Finished subprocedure(s) of pid=1208, ppid=1206, state=RUNNABLE, hasLock=true; 
MoveRegionProcedure hri=c2a23a735f16df57299
dba6fd4599f2f, source=e010125050127.bja,60020,1533403109034, 
destination=e010125050127.bja,60020,1533403109034; resume parent processing.

2018-08-05 02:20:14,466 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1296): 
Finished pid=1232, ppid=1208, state=SUCCESS, hasLock=false; AssignProcedure 
table=IntegrationTestBigLinkedList, region=c2a
23a735f16df57299dba6fd4599f2f, target=e010125050127.bja,60020,1533403109034 in 
1.5060sec
{code}

2. The master restarts; since procedure 1208 held the lock before the restart, the 
lock was restored for it
{code}
2018-08-05 02:20:30,803 DEBUG [Thread-15] procedure2.ProcedureExecutor(456): 
Loading pid=1208, ppid=1206, state=SUCCESS, hasLock=false; MoveRegionProcedure 
hri=c2a23a735f16df57299dba6fd4599f2f, source=
e010125050127.bja,60020,1533403109034, 
destination=e010125050127.bja,60020,1533403109034

2018-08-05 02:20:30,818 DEBUG [Thread-15] procedure2.Procedure(898): pid=1208, 
ppid=1206, state=SUCCESS, hasLock=false; MoveRegionProcedure 
hri=c2a23a735f16df57299dba6fd4599f2f, source=e010125050127.bj
a,60020,1533403109034, destination=e010125050127.bja,60020,1533403109034 held 
the lock before restarting, call acquireLock to restore it.

2018-08-05 02:20:30,818 INFO  [Thread-15] 
procedure.MasterProcedureScheduler(631): pid=1208, ppid=1206, state=SUCCESS, 
hasLock=false; MoveRegionProcedure hri=c2a23a735f16df57299dba6fd4599f2f, 
source=e0
10125050127.bja,60020,1533403109034, 
destination=e010125050127.bja,60020,1533403109034 checking lock on 
c2a23a735f16df57299dba6fd4599f2f
{code}

3. Since procedure 1208 is already in the SUCCESS state, it will not execute 
again, so the lock will be held by it forever

We need to check the state of the procedure before restoring locks: if the 
procedure is already finished (successful or rolled back), we do not need to 
acquire the lock for it.
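
A minimal sketch of that check, using a simplified procedure model rather than 
the real classes (the state enum, fields and restoreLocks hook below are 
stand-ins for the actual procedure framework, not its API):
{code:java}
import java.util.List;

public class LockRestoreSketch {
  enum State { RUNNABLE, WAITING, SUCCESS, ROLLEDBACK, FAILED }

  // Simplified stand-in for a procedure loaded from the WAL on master restart.
  static class Procedure {
    final long procId;
    final State state;
    final boolean heldLockBeforeRestart;
    Procedure(long procId, State state, boolean heldLockBeforeRestart) {
      this.procId = procId;
      this.state = state;
      this.heldLockBeforeRestart = heldLockBeforeRestart;
    }
    boolean isFinished() {
      // A finished procedure (success or rolled back) will never run again,
      // so restoring its lock would keep the lock held forever.
      return state == State.SUCCESS || state == State.ROLLEDBACK;
    }
    void acquireLock() {
      System.out.println("pid=" + procId + " restoring lock");
    }
  }

  // Restore locks only for procedures that still have work to do.
  static void restoreLocks(List<Procedure> loaded) {
    for (Procedure proc : loaded) {
      if (proc.heldLockBeforeRestart && !proc.isFinished()) {
        proc.acquireLock();
      }
    }
  }

  public static void main(String[] args) {
    restoreLocks(List.of(
        new Procedure(1208, State.SUCCESS, true),    // skipped: already SUCCESS
        new Procedure(1300, State.RUNNABLE, true))); // lock restored
  }
}
{code}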



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21041) Memstore's heap size will be decreased to minus zero after flush

2018-08-13 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21041:
--

 Summary: Memstore's heap size will be decreased to minus zero 
after flush
 Key: HBASE-21041
 URL: https://issues.apache.org/jira/browse/HBASE-21041
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


When creating an active mutable segment (MutableSegment) in the memstore, the 
MutableSegment's deep overhead (208 bytes) is added to its own heap size, but 
not to the region's memstore heap size. The same goes for the immutable 
segment (CSLMImmutableSegment) that the mutable segment is later turned into 
(an additional 8 bytes). So after one flush, the memstore's heap size will be 
decreased to -216 bytes, and this negative number accumulates after every 
flush. CompactingMemstore has this problem too.

We need to record the overhead of CSLMImmutableSegment and MutableSegment in 
the corresponding region's memstore size.

For CellArrayImmutableSegment, CellChunkImmutableSegment and 
CompositeImmutableSegment, it is not necessary to do so, because inside 
CompactingMemstore the overheads are already taken care of when transferring a 
CSLMImmutableSegment into them.
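
A simplified sketch of the bookkeeping idea (the overhead constants are taken 
from the description above; the sizing accumulator is an illustrative 
stand-in, not the real HBase accounting class):
{code:java}
public class SegmentOverheadSketch {
  // Illustrative deep-overhead constants taken from the description above.
  static final long MUTABLE_SEGMENT_OVERHEAD = 208L;
  static final long CSLM_IMMUTABLE_SEGMENT_EXTRA_OVERHEAD = 8L;

  // Simplified stand-in for the per-region memstore accounting.
  static class RegionMemStoreSizing {
    long dataSize;
    long heapSize;
    void incMemStoreSize(long data, long heap) {
      dataSize += data;
      heapSize += heap;
    }
  }

  public static void main(String[] args) {
    RegionMemStoreSizing regionSizing = new RegionMemStoreSizing();

    // When a new active MutableSegment is created, its deep overhead must be
    // added to the region's accounting as well, not only to the segment itself.
    regionSizing.incMemStoreSize(0, MUTABLE_SEGMENT_OVERHEAD);

    // When the active segment is turned into a CSLMImmutableSegment before a
    // flush, the additional overhead also has to be recorded on the region.
    regionSizing.incMemStoreSize(0, CSLM_IMMUTABLE_SEGMENT_EXTRA_OVERHEAD);

    // The flush then subtracts the full segment heap size (data + overhead),
    // and the region's heap size comes back to zero instead of -216.
    regionSizing.incMemStoreSize(0, -(MUTABLE_SEGMENT_OVERHEAD
        + CSLM_IMMUTABLE_SEGMENT_EXTRA_OVERHEAD));
    System.out.println("heapSize after flush = " + regionSizing.heapSize);
  }
}
{code}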



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21035:
--

 Summary: Meta Table should be able to online even if all 
procedures are lost
 Key: HBASE-21035
 URL: https://issues.apache.org/jira/browse/HBASE-21035
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


After HBASE-20708, we changed the way the master initializes after it starts. 
It only checks the WAL dirs and compares them with the RS nodes in ZooKeeper to 
decide which servers need to be expired. For servers whose WAL dir ends with 
'SPLITTING', we assume that there will be an SCP for them.

But if the server holding the meta region crashed before the master restart, 
and all the procedure WALs are lost (due to a bug, manual deletion, whatever), 
the newly restarted master will be stuck while initializing, since nobody will 
bring the meta region online.

Although it is an abnormal case, I think that no matter what happens, we need 
to bring the meta region online. Otherwise we are sitting ducks and nothing can 
be done.
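
A rough sketch of the intended recovery check during master init (the types 
below are simplified stand-ins; where exactly such a hook would live in the 
startup path is an open question):
{code:java}
public class MetaRecoverySketch {
  // Simplified view of what the master knows at init time.
  interface ClusterView {
    boolean isMetaRegionOnline();
    boolean hasProcedureCoveringMeta(); // e.g. an SCP or assign for hbase:meta
    void scheduleMetaAssign();          // submit a new assign for hbase:meta
  }

  // If nothing will ever bring meta online (no SCP, no assign, not online),
  // schedule an assign ourselves instead of waiting forever during init.
  static void ensureMetaOnline(ClusterView cluster) {
    if (!cluster.isMetaRegionOnline() && !cluster.hasProcedureCoveringMeta()) {
      cluster.scheduleMetaAssign();
    }
  }

  public static void main(String[] args) {
    ensureMetaOnline(new ClusterView() {
      public boolean isMetaRegionOnline() { return false; }
      public boolean hasProcedureCoveringMeta() { return false; } // proc WALs lost
      public void scheduleMetaAssign() { System.out.println("assign hbase:meta"); }
    });
  }
}
{code}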



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20976) SCP can be scheduled multiple times for the same RS

2018-08-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-20976:


> SCP can be scheduled multiple times for the same RS
> ---
>
> Key: HBASE-20976
> URL: https://issues.apache.org/jira/browse/HBASE-20976
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.1
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.2
>
> Attachments: HBASE-20976.branch-2.0.001.patch
>
>
> SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed and an SCP was submitted for it
> 2. Before this SCP finished, the master crashed
> 3. The new master scans the meta table and finds that some regions are still 
> open on a dead server
> 4. The new master submits an SCP for the dead server again
> The two SCPs for the same RS can even execute concurrently without 
> HBASE-20846…
> A test case to reproduce this issue and a fix are provided in the patch.
> Another case in which an SCP might be scheduled multiple times for the same RS 
> (with HBASE-20708):
> 1. An RS crashed and an SCP was submitted for it
> 2. A new RS on the same host started, and the old RS's ServerName was removed 
> from DeadServer.deadServers
> 3. After the SCP passed the Handle_RIT state, an UnassignProcedure needed to 
> send a close-region operation to the crashed RS
> 4. The UnassignProcedure's dispatch failed with a 'NoServerDispatchException'
> 5. The master began to expire the RS, but found it neither online nor in the 
> deadServer list, so an SCP was submitted for the same RS again
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21031) Memory leak if replay edits failed during region opening

2018-08-09 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21031:
--

 Summary: Memory leak if replay edits failed during region opening
 Key: HBASE-21031
 URL: https://issues.apache.org/jira/browse/HBASE-21031
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Due to HBASE-21029, when replaying edits with a lot of identical cells, the 
memstore won't flush, and an exception is thrown once all heap space is used:
{code}
2018-08-06 15:52:27,590 ERROR 
[RS_OPEN_REGION-regionserver/hb-bp10cw4ejoy0a2f3f-009:16020-2] 
handler.OpenRegionHandler(302): Failed open of 
region=hbase_test,dffa78,1531227033378.cbf9a2daf3aaa0c7e931e9c9a7b53f41., 
starting to roll back the global memstore size.
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at 
org.apache.hadoop.hbase.regionserver.OnheapChunk.allocateDataBuffer(OnheapChunk.java:41)
at org.apache.hadoop.hbase.regionserver.Chunk.init(Chunk.java:104)
at 
org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:226)
at 
org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:180)
at 
org.apache.hadoop.hbase.regionserver.ChunkCreator.getChunk(ChunkCreator.java:163)
at 
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.getOrMakeChunk(MemStoreLABImpl.java:273)
at 
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.copyCellInto(MemStoreLABImpl.java:148)
at 
org.apache.hadoop.hbase.regionserver.MemStoreLABImpl.copyCellInto(MemStoreLABImpl.java:111)
at 
org.apache.hadoop.hbase.regionserver.Segment.maybeCloneWithAllocator(Segment.java:178)
at 
org.apache.hadoop.hbase.regionserver.AbstractMemStore.maybeCloneWithAllocator(AbstractMemStore.java:287)
at 
org.apache.hadoop.hbase.regionserver.AbstractMemStore.add(AbstractMemStore.java:107)
at org.apache.hadoop.hbase.regionserver.HStore.add(HStore.java:706)
at 
org.apache.hadoop.hbase.regionserver.HRegion.restoreEdit(HRegion.java:5494)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEdits(HRegion.java:4608)
at 
org.apache.hadoop.hbase.regionserver.HRegion.replayRecoveredEditsIfAny(HRegion.java:4404)
{code}
After this exception, the memstore was not rolled back, and since MSLAB is 
used, none of the allocated chunks will ever be released. That memory is 
leaked forever...

We need to roll back the memstore memory if opening the region fails (for now, 
only the global memstore size is decreased after a failure).

Another problem is that we use replayEditsPerRegion in RegionServerAccounting 
to record how much memory is used during replaying, and we decrease the global 
memstore size by that amount if the replay fails. This is not right: during 
replaying we may also flush the memstore, so the size kept in the 
replayEditsPerRegion map is not accurate at all!
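
A minimal sketch of the rollback idea with try/finally, tracking the region's 
own current memstore size instead of a separate replay counter (the accounting 
types here are simplified stand-ins, not the real HBase API):
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class ReplayRollbackSketch {
  // Global accounting shared by all regions on the RS (simplified).
  static final AtomicLong GLOBAL_MEMSTORE_SIZE = new AtomicLong();

  // Per-region size actually sitting in the memstore right now. Tracking the
  // region's own current size (instead of a separate "replayed bytes" counter)
  // stays correct even if the memstore is flushed in the middle of the replay.
  static class Region {
    final AtomicLong memstoreSize = new AtomicLong();
    void replayOneEdit(long bytes) {
      memstoreSize.addAndGet(bytes);
      GLOBAL_MEMSTORE_SIZE.addAndGet(bytes);
    }
    void dropMemStoreContents() {
      // Roll back both the region's and the global accounting; the MSLAB
      // chunks of the dropped segments can then be reclaimed as well.
      long remaining = memstoreSize.getAndSet(0);
      GLOBAL_MEMSTORE_SIZE.addAndGet(-remaining);
    }
  }

  static void openRegion(Region region, long[] editSizes) {
    try {
      for (long size : editSizes) {
        region.replayOneEdit(size); // may fail, e.g. with an OutOfMemoryError
      }
    } catch (RuntimeException | Error t) {
      // On any failure, drop what this region accumulated so nothing leaks.
      region.dropMemStoreContents();
      throw t;
    }
  }
}
{code}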



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21029) Miscount of memstore's heap/offheap size if same cell was put

2018-08-08 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21029:
--

 Summary: Miscount of memstore's heap/offheap size if same cell was 
put
 Key: HBASE-21029
 URL: https://issues.apache.org/jira/browse/HBASE-21029
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


We are now using memstore.heapSize() + memstore.offheapSize() to decide whether 
a flush is needed. But if the same cell is put into the memstore again, only the 
memstore's dataSize is increased; the heap/offheap size is not. Actually, if 
MSLAB is used, the heap/offheap usage increases no matter whether the cell is 
added to the cell set or not. IIRC, the memstore's heap/offheap size should 
always be bigger than its data size. We introduced heap/offheap size besides 
data size to reflect the memory footprint more precisely.
{code}
// If there's already a same cell in the CellSet and we are using MSLAB, we 
must count in the
// MSLAB allocation size as well, or else there will be memory leak 
(occupied heap size larger
// than the counted number)
if (succ || mslabUsed) {
  cellSize = getCellLength(cellToAdd);
}
// heap/offheap size is changed only if the cell is truly added in the 
cellSet
long heapSize = heapSizeChange(cellToAdd, succ);
long offHeapSize = offHeapSizeChange(cellToAdd, succ);
{code}
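
A simplified sketch of the sizing rule implied above — count the cell size 
whenever MSLAB copied it into a chunk, even if the cell was not added to the 
cell set (cellLength/heapSizeChange below are stand-ins, not the real Segment 
methods, and the real overhead math is more involved):
{code:java}
public class MslabSizingSketch {
  // Stand-in for the per-cell length (key + value + metadata), simplified.
  static long cellLength(byte[] key, byte[] value) {
    return key.length + value.length;
  }

  /**
   * Heap size delta for an add. If the cell was truly added to the cell set,
   * count it. If it was a duplicate but MSLAB already copied it into a chunk,
   * the chunk space is occupied anyway, so it must be counted as well;
   * otherwise the heap/offheap size drifts below the real memory footprint.
   */
  static long heapSizeChange(byte[] key, byte[] value,
      boolean addedToCellSet, boolean mslabUsed) {
    if (addedToCellSet || mslabUsed) {
      return cellLength(key, value);
    }
    return 0L;
  }

  public static void main(String[] args) {
    byte[] k = "row1".getBytes();
    byte[] v = "value".getBytes();
    // Duplicate put with MSLAB enabled: still consumes chunk space.
    System.out.println(heapSizeChange(k, v, false, true));
  }
}
{code}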



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20976) SCP can be scheduled multiple times for the same RS

2018-08-07 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-20976.

Resolution: Invalid

> SCP can be scheduled multiple times for the same RS
> ---
>
> Key: HBASE-20976
> URL: https://issues.apache.org/jira/browse/HBASE-20976
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.1
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.2
>
> Attachments: HBASE-20976.branch-2.0.001.patch
>
>
> SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed and an SCP was submitted for it
> 2. Before this SCP finished, the master crashed
> 3. The new master scans the meta table and finds that some regions are still 
> open on a dead server
> 4. The new master submits an SCP for the dead server again
> The two SCPs for the same RS can even execute concurrently without 
> HBASE-20846…
> A test case to reproduce this issue and a fix are provided in the patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21003) Fix the flaky TestSplitOrMergeStatus

2018-08-03 Thread Allan Yang (JIRA)
Allan Yang created HBASE-21003:
--

 Summary: Fix the flaky TestSplitOrMergeStatus
 Key: HBASE-21003
 URL: https://issues.apache.org/jira/browse/HBASE-21003
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


TestSplitOrMergeStatus.testSplitSwitch() is flaky because:
{code}
//Set the split switch to false
boolean[] results = admin.setSplitOrMergeEnabled(false, false, 
MasterSwitchType.SPLIT);
..
//Split the region
admin.split(t.getName());
int count = admin.getTableRegions(tableName).size();
assertTrue(originalCount == count);
//Set the split switch to true. Actually, the last split procedure may not
//have started yet on the master.
//So, after setting the switch to true, the last split operation may
//succeed, which is not
//expected
results = admin.setSplitOrMergeEnabled(true, false, MasterSwitchType.SPLIT);
assertEquals(1, results.length);
assertFalse(results[0]);
//Since the last split succeeded, splitting the region again will end up with a
//DoNotRetryRegionException here
admin.split(t.getName());
{code}

{code}
org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
3f16a57c583e6ecf044c5b7de2e97121 is not OPEN; 
regionState={3f16a57c583e6ecf044c5b7de2e97121 state=SPLITTING, 
ts=1533239385789, server=asf911.gq1.ygridcore.net,60061,1533239369899}
 at 
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:191)
 at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.(SplitTableRegionProcedure.java:112)
 at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:756)
 at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1722)
 at 
org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:131)
 at org.apache.hadoop.hbase.master.HMaster.splitRegion(HMaster.java:1714)
 at 
org.apache.hadoop.hbase.master.MasterRpcServices.splitRegion(MasterRpcServices.java:797)
 at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
 at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20990) One operation in procedure batch throws an exception will cause all RegionTransitionProcedures receive the same exception

2018-07-31 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20990:
--

 Summary: One operation in procedure batch throws an exception will 
cause all RegionTransitionProcedures receive the same exception
 Key: HBASE-20990
 URL: https://issues.apache.org/jira/browse/HBASE-20990
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


In AMv2, we batch open/close region operations and call the RS with the 
executeProcedures API. But in this API, if one of the region operations throws 
an exception, all the operations in the batch will receive the same exception, 
even though some of the operations in the batch are executing normally on the 
RS.
I think we should catch exceptions per operation and call remoteCallFailed or 
remoteCallCompleted on each RegionTransitionProcedure accordingly. 
Otherwise, there will be some very strange behavior, such as this one:
{code}
2018-07-18 02:56:18,506 WARN  [RSProcedureDispatcher-pool3-t1] 
assignment.RegionTransitionProcedure(226): Remote call failed 
e010125048016.bja,60020,1531848989401; pid=8362, ppid=8272, state=RUNNABLE:R
EGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, 
region=0beb8ea4e2f239fc082be7cefede1427, 
target=e010125048016.bja,60020,1531848989401; rit=OPENING, 
location=e010125048016
.bja,60020,1531848989401; exception=NotServingRegionException
{code}
The AssignProcedure failed with a NotServingRegionException, what??? It is very 
strange: actually, the AssignProcedure succeeded on the RS; another CloseRegion 
operation that failed in the same batch caused the exception.
To correct this, we need to modify the response of the executeProcedures API, 
i.e. the ExecuteProceduresResponse proto, to return info (status, exceptions) 
per operation.
This issue alone won't cause much trouble, so there is no hurry to change the 
behavior here, but we do need to take this into account when we restructure 
AMv2.
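
A simplified sketch of per-operation error handling on the RS side (the 
RemoteOperation interface and the result map model the per-operation info that 
the proto change would need to carry; they are illustrative stand-ins, not the 
real AMv2 classes):
{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerOperationResultSketch {
  // Stand-in for one open/close region request inside an executeProcedures batch.
  interface RemoteOperation {
    String regionName();
    void execute() throws Exception;
  }

  /**
   * Execute every operation in the batch independently and record a result per
   * operation, instead of letting one failure poison the whole batch.
   */
  static Map<String, Exception> executeBatch(List<RemoteOperation> batch) {
    Map<String, Exception> resultPerOperation = new LinkedHashMap<>();
    for (RemoteOperation op : batch) {
      try {
        op.execute();
        resultPerOperation.put(op.regionName(), null); // success
      } catch (Exception e) {
        resultPerOperation.put(op.regionName(), e);    // only this op failed
      }
    }
    return resultPerOperation;
  }
}
{code}
On the master side, remoteCallCompleted or remoteCallFailed would then be 
invoked per RegionTransitionProcedure based on its own entry, instead of 
failing the whole batch.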



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-20976) SCP can be scheduled multiple times for the same RS

2018-07-30 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-20976:


> SCP can be scheduled multiple times for the same RS
> ---
>
> Key: HBASE-20976
> URL: https://issues.apache.org/jira/browse/HBASE-20976
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.1
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-20976.branch-2.0.001.patch
>
>
> SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed and an SCP was submitted for it
> 2. Before this SCP finished, the master crashed
> 3. The new master scans the meta table and finds that some regions are still 
> open on a dead server
> 4. The new master submits an SCP for the dead server again
> The two SCPs for the same RS can even execute concurrently without 
> HBASE-20846…
> A test case to reproduce this issue and a fix are provided in the patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20976) SCP can be scheduled multiple times for the same RS

2018-07-30 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20976:
--

 Summary: SCP can be scheduled multiple times for the same RS
 Key: HBASE-20976
 URL: https://issues.apache.org/jira/browse/HBASE-20976
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


SCP can be scheduled multiple times for the same RS:
1. An RS crashed and an SCP was submitted for it
2. Before this SCP finished, the master crashed
3. The new master scans the meta table and finds that some regions are still 
open on a dead server
4. The new master submits an SCP for the dead server again
The two SCPs for the same RS can even execute concurrently without 
HBASE-20846…

A test case to reproduce this issue and a fix are provided in the patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20975) Lock may not be taken while rolling back procedure

2018-07-30 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20975:
--

 Summary: Lock may not be taken while rolling back procedure
 Key: HBASE-20975
 URL: https://issues.apache.org/jira/browse/HBASE-20975
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Found this one when investigating HBASE-20921, too.

Here is some code from executeRollback in ProcedureExecutor.java:
{code}
boolean reuseLock = false;
while (stackTail --> 0) {
  final Procedure proc = subprocStack.get(stackTail);

  LockState lockState;
  //If reuseLock, then don't acquire the lock
  if (!reuseLock && (lockState = acquireLock(proc)) != 
LockState.LOCK_ACQUIRED) {
return lockState;
  }

  lockState = executeRollback(proc);
  boolean abortRollback = lockState != LockState.LOCK_ACQUIRED;
  abortRollback |= !isRunning() || !store.isRunning();

  //If the next procedure in the stack is the current one, then reuseLock = 
true
  reuseLock = stackTail > 0 && (subprocStack.get(stackTail - 1) == proc) && 
!abortRollback;
  //If reuseLock, don't releaseLock
  if (!reuseLock) {
releaseLock(proc, false);
  }

  if (abortRollback) {
return lockState;
  }

  subprocStack.remove(stackTail);

  if (proc.isYieldAfterExecutionStep(getEnvironment())) {
return LockState.LOCK_YIELD_WAIT;
  }

  //But, here, lock is released no matter reuseLock is true or false
  if (proc != rootProc) {
execCompletionCleanup(proc);
  }
}
{code}

As you can see from my comments in the code above, reuseLock can cause the 
procedure to execute (roll back) without holding the lock. Though I haven't 
found any bugs introduced by this issue, it is indeed a potential bug that 
needs to be fixed.
I think we can just remove the reuseLock logic and acquire and release the lock 
every time.
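
A self-contained sketch of that behaviour — acquire and release the lock around 
every rollback step, with no reuse (the Procedure/lock types are simplified 
stand-ins for the real ProcedureExecutor machinery, and the abort/yield cases 
are omitted):
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

public class RollbackLockingSketch {
  // Simplified stand-in for a procedure and its lock.
  static class Procedure {
    final ReentrantLock lock = new ReentrantLock();
    void rollbackStep() { /* one rollback step */ }
  }

  /**
   * Rollback loop without the reuseLock shortcut: the lock is acquired before
   * and released after every single step, so no step can ever run unlocked,
   * at the cost of a few extra acquire/release calls.
   */
  static void executeRollback(Deque<Procedure> subprocStack) {
    while (!subprocStack.isEmpty()) {
      Procedure proc = subprocStack.peekLast();
      proc.lock.lock();
      try {
        proc.rollbackStep();
      } finally {
        proc.lock.unlock();     // always released, never reused across steps
      }
      subprocStack.removeLast();
    }
  }

  public static void main(String[] args) {
    Deque<Procedure> stack = new ArrayDeque<>();
    stack.add(new Procedure());
    stack.add(new Procedure());
    executeRollback(stack);
  }
}
{code}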



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20973) ArrayIndexOutOfBoundsException when rolling back procedure

2018-07-29 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20973:
--

 Summary: ArrayIndexOutOfBoundsException when rolling back procedure
 Key: HBASE-20973
 URL: https://issues.apache.org/jira/browse/HBASE-20973
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.0.1, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Found this one while investigating HBASE-20921. After the root procedure 
(ModifyTableProcedure in this case) rolled back, an 
ArrayIndexOutOfBoundsException was thrown:
{code}
2018-07-18 01:39:10,241 ERROR [PEWorker-8] procedure2.ProcedureExecutor(159): 
CODE-BUG: Uncaught runtime exception for pid=5973, 
state=FAILED:MODIFY_TABLE_REOPEN_ALL_REGIONS, exception=java.lang.NullPo
interException via CODE-BUG: Uncaught runtime exception: pid=5974, ppid=5973, 
state=RUNNABLE:REOPEN_TABLE_REGIONS_CONFIRM_REOPENED; 
ReopenTableRegionsProcedure table=IntegrationTestBigLinkedList:java.l
ang.NullPointerException; ModifyTableProcedure 
table=IntegrationTestBigLinkedList
java.lang.UnsupportedOperationException: unhandled 
state=MODIFY_TABLE_REOPEN_ALL_REGIONS
at 
org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.rollbackState(ModifyTableProcedure.java:147)
at 
org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.rollbackState(ModifyTableProcedure.java:50)
at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
2018-07-18 01:39:10,243 WARN  [PEWorker-8] procedure2.ProcedureExecutor(1756): 
Worker terminating UNNATURALLY null
java.lang.ArrayIndexOutOfBoundsException: 1
at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.updateState(ProcedureStoreTracker.java:405)
at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.delete(ProcedureStoreTracker.java:178)
at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:513)
at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:505)
at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:741)
at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:691)
at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.delete(WALProcedureStore.java:603)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1387)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
{code}

This is a very serious condition: after this exception was thrown, the 
exclusive lock held by the ModifyTableProcedure was never released, and all 
procedures against this table were blocked until the master restarted. Since 
the lock info for the procedure was not restored after the restart, the other 
procedures could run again; it is quite embarrassing that a bug saved us... 
(that bug will be fixed in HBASE-20846)

I tried to reproduce this one using the test case in HBASE-20921, but I just 
can't reproduce it.
An easy way to resolve this is to add a try/catch, making sure that no matter 
what happens, the table's exclusive lock is always released.
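
A minimal sketch of the guard (a plain ReentrantLock stands in for the table's 
exclusive lock, and rollbackStep for the rollback logic that may throw):
{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class AlwaysReleaseLockSketch {
  // Placeholder for the table's exclusive lock held by the rolling-back procedure.
  private final ReentrantLock tableExclusiveLock = new ReentrantLock();

  void rollbackWithGuard(Runnable rollbackStep) {
    tableExclusiveLock.lock();
    try {
      // The rollback itself may throw unexpected runtime exceptions
      // (e.g. the ArrayIndexOutOfBoundsException above).
      rollbackStep.run();
    } finally {
      // No matter what happens, the exclusive lock must be released so other
      // procedures against the same table are not blocked forever.
      tableExclusiveLock.unlock();
    }
  }
}
{code}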



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20921) Possible NPE in ReopenTableRegionsProcedure

2018-07-23 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20921:
--

 Summary: Possible NPE in ReopenTableRegionsProcedure
 Key: HBASE-20921
 URL: https://issues.apache.org/jira/browse/HBASE-20921
 Project: HBase
  Issue Type: Sub-task
  Components: amv2
Affects Versions: 2.1.0, 3.0.0, 2.0.2
Reporter: Allan Yang
Assignee: Allan Yang


After HBASE-20752, we issue a ReopenTableRegionsProcedure in 
ModifyTableProcedure to ensure all regions are reopened.
But ModifyTableProcedure and ReopenTableRegionsProcedure do not hold the lock 
(why?), so there is a chance that while ModifyTableProcedure is executing, a 
merge/split procedure is executed at the same time.
So when ReopenTableRegionsProcedure reaches the 
"REOPEN_TABLE_REGIONS_CONFIRM_REOPENED" state, some of the persisted regions to 
check no longer exist, and thus an NPE is thrown:
{code}
2018-07-18 01:38:57,528 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1246): 
Finished pid=6110, state=SUCCESS; MergeTableRegionsProcedure 
table=IntegrationTestBigLinkedList, regions=[845d286231eb01b7
1aeaa17b0e30058d, 4a46ab0918c99cada72d5336ad83a828], forcibly=false in 
10.8610sec
2018-07-18 01:38:57,530 ERROR [PEWorker-8] procedure2.ProcedureExecutor(1478): 
CODE-BUG: Uncaught runtime exception: pid=5974, ppid=5973, 
state=RUNNABLE:REOPEN_TABLE_REGIONS_CONFIRM_REOPENED; ReopenTab
leRegionsProcedure table=IntegrationTestBigLinkedList
java.lang.NullPointerException
at 
org.apache.hadoop.hbase.master.assignment.RegionStates.checkReopened(RegionStates.java:651)
at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.hadoop.hbase.master.procedure.ReopenTableRegionsProcedure.executeFromState(ReopenTableRegionsProcedure.java:102)
at 
org.apache.hadoop.hbase.master.procedure.ReopenTableRegionsProcedure.executeFromState(ReopenTableRegionsProcedure.java:45)
at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1453)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
{code}

I think we need to refresh the table's region list at the 
"REOPEN_TABLE_REGIONS_CONFIRM_REOPENED" state. For regions that have been 
merged or split, we do not need to check them, since we can be sure that they 
were opened after we made the change to the table descriptor.
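
A simplified sketch of that check (plain maps keyed by encoded region name 
stand in for RegionStates, and the seqNum comparison is illustrative, not the 
real checkReopened logic):
{code:java}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConfirmReopenedSketch {
  /**
   * persistedRegions: encoded region name -> openSeqNum recorded when the
   * reopen was requested. currentRegionsOfTable: the table's region list as it
   * is now, with the current openSeqNum.
   */
  static List<String> regionsStillToReopen(Map<String, Long> persistedRegions,
      Map<String, Long> currentRegionsOfTable) {
    return persistedRegions.entrySet().stream()
        // Regions that were merged or split away no longer exist in the
        // current list; they were opened after the table change, so skip them
        // instead of dereferencing a missing state node (the NPE above).
        .filter(e -> currentRegionsOfTable.containsKey(e.getKey()))
        // Keep only regions whose openSeqNum has not advanced yet, i.e. the
        // ones that still have to be reopened and re-checked.
        .filter(e -> currentRegionsOfTable.get(e.getKey()) <= e.getValue())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
{code}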



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-20864) RS was killed due to master thought the region should be on a already dead server

2018-07-19 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang resolved HBASE-20864.

   Resolution: Resolved
Fix Version/s: (was: 2.0.2)

HBASE-20792 solved this issue

> RS was killed due to master thought the region should be on a already dead 
> server
> -
>
> Key: HBASE-20864
> URL: https://issues.apache.org/jira/browse/HBASE-20864
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.0.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: log.zip
>
>
> When I was running ITBLL with our internal 2.0.0 version (with 2.0.1 
> backported and with two other issues: HBASE-20706, HBASE-20752), I found two 
> of my RSes killed by the master since the master had a different region state 
> from those RSes. It is very strange that the master thought these regions 
> should be on an already dead server. There might be a serious bug, but I 
> haven't found it yet. Here is the process:
> 1. e010125048153.bja,60020,1531137365840 crashed, and clearly 
> 4423e4182457c5b573729be4682cc3a3 was assigned to 
> e010125049164.bja,60020,1531136465378 during ServerCrashProcedure
> {code:java}
> 2018-07-09 20:03:32,443 INFO  [PEWorker-10] procedure.ServerCrashProcedure: 
> Start pid=2303, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false
> 2018-07-09 20:03:39,220 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=294,queue=24,port=6] 
> assignment.RegionTransitionProcedure: Received report OPENED seqId=16021, 
> pid=2305, ppid=2303, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> AssignProcedure table=IntegrationTestBigLinkedList, 
> region=4423e4182457c5b573729be4682cc3a3; rit=OPENING, 
> location=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:03:39,220 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, 
> regionState=OPEN, openSeqNum=16021, 
> regionLocation=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:03:43,190 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished pid=2303, state=SUCCESS; ServerCrashProcedure 
> server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false in 
> 10.7490sec
> {code}
> 2. A modify table happened later, and 4423e4182457c5b573729be4682cc3a3 was 
> reopened on e010125049164.bja,60020,1531136465378
> {code:java}
> 2018-07-09 20:04:39,929 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=295,queue=25,port=6] 
> assignment.RegionTransitionProcedure: Received report OPENED seqId=16024, 
> pid=2351, ppid=2314, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> AssignProcedure table=IntegrationTestBigLinkedList, 
> region=4423e4182457c5b573729be4682cc3a3, 
> target=e010125049164.bja,60020,1531136465378; rit=OPENING, 
> location=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:04:40,554 INFO  [PEWorker-6] assignment.RegionStateStore: 
> pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, 
> regionState=OPEN, openSeqNum=16024, 
> regionLocation=e010125049164.bja,60020,1531136465378
> {code}
> 3. The active master was killed and the backup master took over, but when 
> loading the meta entry, it clearly showed 4423e4182457c5b573729be4682cc3a3 on 
> the previous dead server e010125048153.bja,60020,1531137365840. That is very 
> very strange!!!
> {code:java}
> 2018-07-09 20:06:17,985 INFO  [master/e010125048016:6] 
> assignment.RegionStateStore: Load hbase:meta entry 
> region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, 
> lastHost=e010125049164.bja,60020,1531136465378, 
> regionLocation=e010125048153.bja,60020,1531137365840, openSeqNum=16024
> {code}
> 4. the rs was killed
> {code:java}
> 2018-07-09 20:06:20,265 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=297,queue=27,port=6] 
> assignment.AssignmentManager: Killing e010125049164.bja,60020,1531136465378: 
> rit=OPEN, location=e010125048153.bja,60020,1531137365840, 
> table=IntegrationTestBigLinkedList, 
> region=4423e4182457c5b573729be4682cc3a3reported OPEN on 
> server=e010125049164.bja,60020,1531136465378 but state has otherwise.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20903) backport HBASE-20792 "info:servername and info:sn inconsistent for OPEN region" to branch-2.0

2018-07-17 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20903:
--

 Summary: backport HBASE-20792 "info:servername and info:sn 
inconsistent for OPEN region" to branch-2.0
 Key: HBASE-20903
 URL: https://issues.apache.org/jira/browse/HBASE-20903
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 2.0.2


As discussed in HBASE-20864, this is a very serious bug which can cause RSes 
to be killed or data to be lost. It should be backported to branch-2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20893) Data loss if splitting region while ServerCrashProcedure executing

2018-07-16 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20893:
--

 Summary: Data loss if splitting region while ServerCrashProcedure 
executing
 Key: HBASE-20893
 URL: https://issues.apache.org/jira/browse/HBASE-20893
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 2.0.1, 3.0.0, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


Similar case as HBASE-20878.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20878) Data loss if merging regions while ServerCrashProcedure executing

2018-07-12 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20878:
--

 Summary: Data loss if merging regions while ServerCrashProcedure 
executing
 Key: HBASE-20878
 URL: https://issues.apache.org/jira/browse/HBASE-20878
 Project: HBase
  Issue Type: Bug
  Components: amv2
Affects Versions: 2.0.1, 3.0.0, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


In MergeTableRegionsProcedure, we close the regions to merge using 
UnassignProcedures. But if the RS these regions are on crashes, a 
ServerCrashProcedure will execute at the same time. The UnassignProcedures will 
be blocked until all logs are split. But since these regions are closed for 
merging, the regions won't be opened again, so the recovered.edits in the 
region dirs won't be replayed and data will be lost.
I provided a test to reproduce this case. I seriously suspect the split region 
procedure also has this kind of problem. I will check later.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20870) Wrong HBase root dir in ITBLL's Search Tool

2018-07-11 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20870:
--

 Summary: Wrong HBase root dir in ITBLL's Search Tool
 Key: HBASE-20870
 URL: https://issues.apache.org/jira/browse/HBASE-20870
 Project: HBase
  Issue Type: Bug
  Components: integration tests
Affects Versions: 2.0.1, 3.0.0, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


When using IntegrationTestBigLinkedList's Search tool, it always fails since 
it tries to read WALs from the wrong HBase root dir. It turned out that when 
initializing IntegrationTestingUtility in IntegrationTestBigLinkedList, its 
super class HBaseTestingUtility changes hbase.rootdir to a random local dir. 
That is not wrong in itself, since HBaseTestingUtility is mostly used by the 
minicluster, but for IntegrationTests that run on distributed clusters we 
should change it back.
 Here is the error info:
{code:java}
2018-07-11 16:35:49,679 DEBUG [main] hbase.HBaseCommonTestingUtility: Setting 
hbase.rootdir to 
/home/hadoop/target/test-data/deb67611-2737-4696-abe9-32a7783df7bb
2018-07-11 16:35:50,736 ERROR [main] util.AbstractHBaseTool: Error running 
command-line tool java.io.FileNotFoundException: File 
file:/home/hadoop/target/test-data/deb67611-2737-4696-abe9-32a7783df7bb/WALs 
does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:431)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
{code}
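
A small sketch of the direction of the fix (hbase.rootdir is the standard 
property name; where exactly the original value is captured and restored inside 
the tool is an assumption):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class RestoreRootDirSketch {
  /**
   * HBaseTestingUtility overrides hbase.rootdir with a random local test dir,
   * which only makes sense for the minicluster. For a run against a real
   * distributed cluster, put the original value back before reading WALs.
   */
  static void restoreClusterRootDir(Configuration conf, String originalRootDir) {
    if (originalRootDir != null) {
      conf.set("hbase.rootdir", originalRootDir);
    }
  }
}
{code}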



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20867) RS may got killed while master restarts

2018-07-10 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20867:
--

 Summary: RS may got killed while master restarts
 Key: HBASE-20867
 URL: https://issues.apache.org/jira/browse/HBASE-20867
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 3.0.0, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang


If the master is dispatching an RPC call to an RS while aborting, a connection 
exception may be thrown by the RPC layer (an IOException with a "Connection 
closed" message in this case). The RSProcedureDispatcher will regard it as an 
un-retryable exception and pass it to UnassignProcedure.remoteCallFailed, which 
will expire the RS.
Actually, the RS is perfectly healthy; only the master is restarting.
I think we should handle those kinds of connection exceptions in 
RSProcedureDispatcher and retry the RPC call.
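
A rough sketch of classifying the failure on the dispatcher side (plain JDK 
exception types; a real change would also need a bounded retry/backoff policy):
{code:java}
import java.io.IOException;
import java.net.ConnectException;

public class DispatcherRetrySketch {
  /**
   * Decide whether a remote call failure should be retried instead of being
   * treated as "the RS is dead". A connection-level error observed while the
   * master itself is aborting/restarting does not mean the RS is unhealthy.
   */
  static boolean isRetryableConnectionError(IOException e) {
    if (e instanceof ConnectException) {
      return true;
    }
    String msg = e.getMessage();
    return msg != null && msg.contains("Connection closed");
  }

  static void onRemoteCallFailed(IOException e, Runnable retryCall,
      Runnable expireRegionServer) {
    if (isRetryableConnectionError(e)) {
      retryCall.run();            // re-dispatch the procedures later
    } else {
      expireRegionServer.run();   // genuine failure: let an SCP take over
    }
  }
}
{code}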



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20864) RS was killed due to master thought the region should be on a already dead server

2018-07-10 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20864:
--

 Summary: RS was killed due to master thought the region should be 
on a already dead server
 Key: HBASE-20864
 URL: https://issues.apache.org/jira/browse/HBASE-20864
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Allan Yang


When I was running ITBLL with our internal 2.0.0 version (with 2.0.1 backported 
and with two other issues: HBASE-20706, HBASE-20752), I found two of my RSes 
killed by the master since the master had a different region state from those 
RSes. It is very strange that the master thought these regions should be on an 
already dead server. There might be a serious bug, but I haven't found it yet. 
Here is the process:


 1. e010125048153.bja,60020,1531137365840 crashed, and clearly 
4423e4182457c5b573729be4682cc3a3 was assigned to 
e010125049164.bja,60020,1531136465378 during ServerCrashProcedure
{code:java}
2018-07-09 20:03:32,443 INFO  [PEWorker-10] procedure.ServerCrashProcedure: 
Start pid=2303, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=e010125048153.bja,60020,1531137365840, splitWa
l=true, meta=false
2018-07-09 20:03:39,220 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=294,queue=24,port=6] 
assignment.RegionTransitionProcedure: Received report OPENED seqId=16021, 
pid=2305, ppid=2303, state=RUNNABLE
:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3; 
rit=OPENING, location=e010125049164.bja,60020,1531136465378
2018-07-09 20:03:39,220 INFO  [PEWorker-13] assignment.RegionStateStore: 
pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, 
regionState=OPEN, openSeqNum=16021, regionLocation=e010125049
164.bja,60020,1531136465378
2018-07-09 20:03:43,190 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished pid=2303, state=SUCCESS; ServerCrashProcedure 
server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false
in 10.7490sec
{code}
2. A modify table happened later, and 4423e4182457c5b573729be4682cc3a3 was 
reopened on e010125049164.bja,60020,1531136465378
{code:java}
2018-07-09 20:04:39,929 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=295,queue=25,port=6] 
assignment.RegionTransitionProcedure: Received report OPENED seqId=16024, 
pid=2351, ppid=2314, state=RUNNABLE
:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3, 
target=e010125049164.bja,60020,1531136465378; rit=OPENING, location=e0101250491
64.bja,60020,1531136465378
2018-07-09 20:04:40,554 INFO  [PEWorker-6] assignment.RegionStateStore: 
pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, 
regionState=OPEN, openSeqNum=16024, regionLocation=e0101250491
64.bja,60020,1531136465378
{code}
3. The active master was killed and the backup master took over, but when 
loading the meta entry, it clearly showed 4423e4182457c5b573729be4682cc3a3 on 
the previous dead server e010125048153.bja,60020,1531137365840. That is very 
very strange!!!
{code:java}
2018-07-09 20:06:17,985 INFO  [master/e010125048016:6] 
assignment.RegionStateStore: Load hbase:meta entry 
region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, 
lastHost=e010125049164.bja,60020
,1531136465378, regionLocation=e010125048153.bja,60020,1531137365840, 
openSeqNum=16024
{code}
4. the rs was killed
{code:java}
2018-07-09 20:06:20,265 WARN  
[RpcServer.default.FPBQ.Fifo.handler=297,queue=27,port=6] 
assignment.AssignmentManager: Killing e010125049164.bja,60020,1531136465378: 
rit=OPEN, location=e010125048153
.bja,60020,1531137365840, table=IntegrationTestBigLinkedList, 
region=4423e4182457c5b573729be4682cc3a3reported OPEN on 
server=e010125049164.bja,60020,1531136465378 but state has otherwise.
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20860) Merged region's RIT state may not be cleaned after master restart

2018-07-09 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20860:
--

 Summary: Merged region's RIT state may not be cleaned after master 
restart
 Key: HBASE-20860
 URL: https://issues.apache.org/jira/browse/HBASE-20860
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.1, 3.0.0, 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.1.0, 2.0.2


In MergeTableRegionsProcedure, we issue UnassignProcedures to offline the 
regions to merge. But if we restart the master just after 
MergeTableRegionsProcedure has finished these two UnassignProcedures and before 
it can delete their meta entries, the new master will find that these two 
regions are CLOSED but no procedures are attached to them. They will be 
regarded as RIT regions and nobody will clean up the RIT state for them later.
A quick way to resolve this stuck situation in a production env is to restart 
the master again, since the meta entries are deleted in 
MergeTableRegionsProcedure. Here, I offer a fix for this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20854) Wrong retires in RpcRetryingCaller's log message

2018-07-06 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20854:
--

 Summary: Wrong retires in RpcRetryingCaller's log message
 Key: HBASE-20854
 URL: https://issues.apache.org/jira/browse/HBASE-20854
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.1.0, 2.0.2


Just a small bug fix: in the error log message in RpcRetryingCallerImpl, the 
tries number is passed to both tries and retries, causing a bit of confusion.

{code}
2018-07-05 21:04:46,343 INFO [Thread-20] 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=6, 
retries=6, started=4174 ms ago, cancelled=false, 
msg=org.apache.hadoop.hbase.exce
ptions.RegionOpeningException: Region 
IntegrationTestBigLinkedList,\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFE,1530795739116.0cfd339596648348ac13d979150eb2bf.
 is opening on e010125049164.bja,60020,1530795698451
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20846) Table's shared lock is not hold by sub-procedures after master restart

2018-07-04 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20846:
--

 Summary: Table's shared lock is not hold by sub-procedures after 
master restart
 Key: HBASE-20846
 URL: https://issues.apache.org/jira/browse/HBASE-20846
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0, 2.1.0, 2.0.2


Found this one when investigating a ModifyTableProcedure that got stuck while 
there was a MoveRegionProcedure going on after a master restart.
Though this issue can be solved by HBASE-20752, I discovered something else.
Before a MoveRegionProcedure can execute, it will hold the table's shared lock. 
So, when an UnassignProcedure is spawned, it will not check the table's shared 
lock, since it is sure that its parent (MoveRegionProcedure) has acquired the 
table's lock:
{code:java}
// If there is parent procedure, it would have already taken xlock, so no need 
to take
  // shared lock here. Otherwise, take shared lock.
  if (!procedure.hasParent()
  && waitTableQueueSharedLock(procedure, table) == null) {
  return true;
  }
{code}

But that is not the case when the master is restarted. The child procedure 
(UnassignProcedure) will be executed first after the restart. Though it has a 
parent (MoveRegionProcedure), apparently the parent does not hold the table's 
lock.
So it begins to execute without holding the table's shared lock, and a 
ModifyTableProcedure can acquire the table's exclusive lock and execute at the 
same time, which is not possible if the master is not restarted.
This would cause procedures to get stuck before HBASE-20752. Since HBASE-20752 
is fixed, I wrote a simple UT to reproduce this case.

I think we don't have to check the parent for the table's shared lock. It is a 
shared lock, right? I think we can acquire it every time we need it.
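
A minimal self-contained sketch of why taking the shared lock unconditionally 
is safe (a ReentrantReadWriteLock stands in for the scheduler's table lock; 
this is not the actual scheduler code):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedLockSketch {
  // Stand-in for the table lock in the master procedure scheduler.
  static final ReentrantReadWriteLock TABLE_LOCK = new ReentrantReadWriteLock();

  /**
   * Take the table's shared (read) lock unconditionally, whether or not the
   * procedure has a parent. A shared lock can be held by many holders at once
   * and is re-entrant, so taking it again is cheap and safe, while skipping it
   * based on the parent breaks after a master restart, when the parent's lock
   * was not restored.
   */
  static void runUnassignStep(Runnable step) {
    TABLE_LOCK.readLock().lock();
    try {
      step.run(); // a ModifyTableProcedure needing the write lock must now wait
    } finally {
      TABLE_LOCK.readLock().unlock();
    }
  }
}
{code}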



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20727) Persist FlushedSequenceId to speed up WAL split after cluster restart

2018-06-13 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20727:
--

 Summary: Persist FlushedSequenceId to speed up WAL split after 
cluster restart
 Key: HBASE-20727
 URL: https://issues.apache.org/jira/browse/HBASE-20727
 Project: HBase
  Issue Type: New Feature
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0


We use flushedSequenceIdByRegion and storeFlushedSequenceIdsByRegion in 
ServerManager to record the latest flushed seqids of regions and stores, so 
during log split we can use the seqids stored in those maps to filter out the 
edits which do not need to be replayed. But those maps are not persisted: after 
a cluster restart or master restart, the flushed-seqid info is all lost.
Here I offer a way to persist that info to HDFS, so that even after a master 
restart we can still use it to filter WAL edits and thus speed up replay.
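
A minimal sketch of the persistence idea using plain Hadoop FileSystem calls 
and a simple length-prefixed format (the file name, layout and trigger are 
illustrative choices, not the actual implementation):
{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: periodically dump region -> last flushed seqId to a file
// under the HBase root dir so a restarted master can reload it and still
// filter already-flushed edits during WAL split.
public class FlushedSeqIdPersister {
  private final FileSystem fs;
  private final Path file;

  public FlushedSeqIdPersister(Configuration conf, Path rootDir) throws IOException {
    this.fs = rootDir.getFileSystem(conf);
    this.file = new Path(rootDir, ".flushed-seqids"); // hypothetical file name
  }

  public void persist(Map<String, Long> flushedSeqIdByRegion) throws IOException {
    Path tmp = new Path(file.getParent(), file.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      out.writeInt(flushedSeqIdByRegion.size());
      for (Map.Entry<String, Long> e : flushedSeqIdByRegion.entrySet()) {
        out.writeUTF(e.getKey());    // encoded region name
        out.writeLong(e.getValue()); // last flushed sequence id
      }
    }
    fs.rename(tmp, file); // write-then-rename so readers never see a torn file
  }

  public void load(Map<String, Long> target) throws IOException {
    if (!fs.exists(file)) {
      return; // nothing persisted yet, fall back to replaying everything
    }
    try (FSDataInputStream in = fs.open(file)) {
      int n = in.readInt();
      for (int i = 0; i < n; i++) {
        target.put(in.readUTF(), in.readLong());
      }
    }
  }
}
{code}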



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20679) Add the ability to compile JSP dynamically in Jetty

2018-06-04 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20679:
--

 Summary: Add the ability to compile JSP dynamically in Jetty
 Key: HBASE-20679
 URL: https://issues.apache.org/jira/browse/HBASE-20679
 Project: HBase
  Issue Type: New Feature
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 3.0.0


As discussed in HBASE-20617, adding the ability to dynamically compile JSP 
enables us to do hot fixes. 
 For example, several days ago in our testing HBase-2.0 cluster, the procedure 
WALs were corrupted for some unknown reason. After restarting the cluster, some 
procedures (AssignProcedure for example) were corrupted and couldn't be 
replayed, so some regions were stuck in RIT forever. We couldn't use HBCK since 
it doesn't support AssignmentV2 yet. As a matter of fact, the namespace region 
was not online, so the master was not initialized, and we couldn't even use 
shell commands like assign/move. But we wrote a JSP and fixed this issue 
easily. The JSP file looks like this:
{code:java}
<%
  String action = request.getParameter("action");
  HMaster master = (HMaster)getServletContext().getAttribute(HMaster.MASTER);
  List<RegionInfo> offlineRegionsToAssign = new ArrayList<>();
  List<RegionStates.RegionStateNode> regionRITs = master.getAssignmentManager()
  .getRegionStates().getRegionsInTransition();
  for (RegionStates.RegionStateNode regionStateNode :  regionRITs) {
// if regionStateNode don't have a procedure attached, but meta state shows
// this region is in RIT, that means the previous procedure may be corrupted
// we need to create a new assignProcedure to assign them
if (!regionStateNode.isInTransition()) {
  offlineRegionsToAssign.add(regionStateNode.getRegionInfo());
  out.println("RIT region:" + regionStateNode);
}
  }
  // Assign offline regions. Uses round-robin.
  if ("fix".equals(action) && offlineRegionsToAssign.size() > 0) {

master.getMasterProcedureExecutor().submitProcedures(master.getAssignmentManager().
createRoundRobinAssignProcedures(offlineRegionsToAssign));
  } else {
out.println("use ?action=fix to fix RIT regions");
  }
%>
{code}
The above is only one example of what we can do if we have the ability to 
compile JSP dynamically. We think it is very useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20611) UnsupportedOperationException may thrown when calling getCallQueueInfo()

2018-05-21 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20611:
--

 Summary: UnsupportedOperationException may thrown when calling 
getCallQueueInfo()
 Key: HBASE-20611
 URL: https://issues.apache.org/jira/browse/HBASE-20611
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Allan Yang


HBASE-16290 added a new feature to dump queue info; the method 
getCallQueueInfo() needs to iterate the queue to get the elements in it. 
But apart from Java's LinkedBlockingQueue, the other queue implementations 
like BoundedPriorityBlockingQueue and AdaptiveLifoCoDelCallQueue don't 
implement the iterator() method. If those queues are used, an 
UnsupportedOperationException will be thrown.
This can easily be reproduced by the UT testCallQueueInfo after adding 
conf.set("hbase.ipc.server.callqueue.type", "deadline"):
{code}
java.lang.UnsupportedOperationException
at 
org.apache.hadoop.hbase.util.BoundedPriorityBlockingQueue.iterator(BoundedPriorityBlockingQueue.java:285)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor.getCallQueueCountsSummary(RpcExecutor.java:166)
at 
org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.getCallQueueInfo(SimpleRpcScheduler.java:241)
at 
org.apache.hadoop.hbase.ipc.TestSimpleRpcScheduler.testCallQueueInfo(TestSimpleRpcScheduler.java:164)
{code}
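
A defensive sketch on the summarizing side; the real fix could instead be to 
implement iterator() in the affected queues, but guarding the iteration is the 
simplest way to illustrate the problem:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class CallQueueSummarySketch {
  /**
   * Summarize the elements of a call queue by class name. Some queue
   * implementations (e.g. priority/CoDel style queues) may not support
   * iterator(), so fall back to a size-only summary instead of propagating an
   * UnsupportedOperationException to the caller.
   */
  static Map<String, Long> summarize(Queue<?> callQueue) {
    Map<String, Long> countsByType = new HashMap<>();
    try {
      for (Object call : callQueue) {
        countsByType.merge(call.getClass().getSimpleName(), 1L, Long::sum);
      }
    } catch (UnsupportedOperationException e) {
      countsByType.clear();
      countsByType.put("(iteration unsupported, size only)", (long) callQueue.size());
    }
    return countsByType;
  }
}
{code}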



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-20601) And multiPut support and other miscellaneous to pe

2018-05-18 Thread Allan Yang (JIRA)
Allan Yang created HBASE-20601:
--

 Summary: And multiPut support and other miscellaneous to pe
 Key: HBASE-20601
 URL: https://issues.apache.org/jira/browse/HBASE-20601
 Project: HBase
  Issue Type: Bug
  Components: tooling
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
 Fix For: 2.1.0


Add some useful stuff and some refinements to the PE tool:
1. Add multiPut support
Though we have BufferedMutator, sometimes we need to benchmark batch puts of a 
certain size.
Set --multiPut=number to enable batch put (meanwhile, --autoflush needs to be 
set to false).

2. Add connection number support
Before, there was only one parameter to control the connections used by 
threads: oneCon=true means all threads use one connection, false means each 
thread has its own connection.
When the thread number is high and oneCon=false, we noticed a high context 
switch frequency on the machine PE runs on, disturbing the benchmark results 
(each connection has its own netty worker threads, 2*CPU IIRC). 
So a new parameter conNum is added to PE; setting --conNum=2 means all threads 
will share 2 connections.

3. Add avg RT and avg TPS/QPS statistics for all threads
Useful when we want to measure the total throughput of the cluster.

4. Delete some redundant code
Now RandomWriteTest inherits from SequentialWrite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-18233) We shouldn't wait for readlock in doMiniBatchMutation in case of deadlock

2017-06-18 Thread Allan Yang (JIRA)
Allan Yang created HBASE-18233:
--

 Summary: We shouldn't wait for readlock in doMiniBatchMutation in 
case of deadlock
 Key: HBASE-18233
 URL: https://issues.apache.org/jira/browse/HBASE-18233
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.2.7
Reporter: Allan Yang
Assignee: Allan Yang


Please refer to the discussion in HBASE-18144:
https://issues.apache.org/jira/browse/HBASE-18144?focusedCommentId=16051701=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16051701




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (HBASE-18168) NoSuchElementException when rolling the log

2017-06-06 Thread Allan Yang (JIRA)
Allan Yang created HBASE-18168:
--

 Summary: NoSuchElementException when rolling the log
 Key: HBASE-18168
 URL: https://issues.apache.org/jira/browse/HBASE-18168
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.1.11
Reporter: Allan Yang
Assignee: Allan Yang


Today, one of our servers aborted with the following log:
{code}
2017-06-06 05:38:47,142 ERROR [regionserver/.logRoller] 
regionserver.LogRoller: Log rolling failed
java.util.NoSuchElementException
at 
java.util.concurrent.ConcurrentSkipListMap$Iter.advance(ConcurrentSkipListMap.java:2224)
at 
java.util.concurrent.ConcurrentSkipListMap$ValueIterator.next(ConcurrentSkipListMap.java:2253)
at java.util.Collections.min(Collections.java:628)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.findEligibleMemstoresToFlush(FSHLog.java:861)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.findRegionsToForceFlush(FSHLog.java:886)
at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:728)
at 
org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:137)
at java.lang.Thread.run(Thread.java:756)
2017-06-06 05:38:47,142 FATAL [regionserver/.logRoller] 
regionserver.HRegionServer: ABORTING region server : Log rolling failed
java.util.NoSuchElementException
..
{code}

The code is here: 
{code}
private byte[][] findEligibleMemstoresToFlush(Map<byte[], Long> 
regionsSequenceNums) {
List<byte[]> regionsToFlush = null;
// Keeping the old behavior of iterating unflushedSeqNums under 
oldestSeqNumsLock.
synchronized (regionSequenceIdLock) {
  for (Map.Entry<byte[], Long> e: regionsSequenceNums.entrySet()) {
ConcurrentMap<byte[], Long> m =
this.oldestUnflushedStoreSequenceIds.get(e.getKey());
if (m == null) {
  continue;
}
long unFlushedVal = Collections.min(m.values()); //The exception is 
thrown here
..
{code}
The map 'm' being empty is the only reason I can think of for the 
NoSuchElementException being thrown. I then looked up all the code related to 
updating 'oldestUnflushedStoreSequenceIds'. Every update to 
'oldestUnflushedStoreSequenceIds' is guarded by synchronization on 
'regionSequenceIdLock' except here:
{code}
private ConcurrentMap<byte[], Long> 
getOrCreateOldestUnflushedStoreSequenceIdsOfRegion(
  byte[] encodedRegionName) {
..
oldestUnflushedStoreSequenceIdsOfRegion =
new ConcurrentSkipListMap<byte[], Long>(Bytes.BYTES_COMPARATOR);
ConcurrentMap<byte[], Long> alreadyPut =
oldestUnflushedStoreSequenceIds.putIfAbsent(encodedRegionName,
  oldestUnflushedStoreSequenceIdsOfRegion); // Here, a empty map may 
put to 'oldestUnflushedStoreSequenceIds' with no synchronization
return alreadyPut == null ? oldestUnflushedStoreSequenceIdsOfRegion : 
alreadyPut;
  }
{code}

It should be a very rare bug, but it can lead to a server abort. It only 
exists in branch-1.1.
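
A minimal sketch of a defensive fix on the reader side (skip empty maps before 
calling Collections.min); the writer side could also be hardened by publishing 
the per-region map only after its first seqid is inserted. The types and the 
final comparison below are simplified, not the exact FSHLog logic:
{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentMap;

public class FindEligibleMemstoresSketch {
  // Simplified version of findEligibleMemstoresToFlush: regionsSequenceNums
  // holds the candidate regions, oldestUnflushedStoreSequenceIds the
  // per-region map of per-store oldest unflushed seqids.
  static List<byte[]> findEligible(Map<byte[], Long> regionsSequenceNums,
      ConcurrentMap<byte[], ConcurrentMap<byte[], Long>> oldestUnflushedStoreSequenceIds) {
    List<byte[]> regionsToFlush = new ArrayList<>();
    for (Map.Entry<byte[], Long> e : regionsSequenceNums.entrySet()) {
      ConcurrentMap<byte[], Long> m = oldestUnflushedStoreSequenceIds.get(e.getKey());
      // The per-region map can be observed empty because it is published
      // before any store seqid is put into it; Collections.min on an empty
      // collection throws NoSuchElementException, so skip it explicitly.
      if (m == null || m.isEmpty()) {
        continue;
      }
      long unFlushedVal = Collections.min(m.values());
      // Flush candidates: regions whose oldest unflushed edit is still behind
      // the given sequence number (comparison simplified for illustration).
      if (unFlushedVal <= e.getValue()) {
        regionsToFlush.add(e.getKey());
      }
    }
    return regionsToFlush;
  }
}
{code}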



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-18156) Provide a tool to show cache summary

2017-06-04 Thread Allan Yang (JIRA)
Allan Yang created HBASE-18156:
--

 Summary: Provide a tool to show cache summary
 Key: HBASE-18156
 URL: https://issues.apache.org/jira/browse/HBASE-18156
 Project: HBase
  Issue Type: New Feature
Affects Versions: 2.0.0, 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


HBASE-17757 is already committed, but since there is no easy way to show the 
size distribution of cached blocks, it is hard to decide which unified size 
should be used. 
Here I provide a tool to show the details of the size distribution of cached 
blocks. This tool is heavily used in our production environment. It is a JSP 
page that summarizes the cache details like this:
{code}
BlockCache type:org.apache.hadoop.hbase.io.hfile.LruBlockCache
LruBlockCache

Total size:28.40 GB

Current size:22.49 GB

MetaBlock size:1.56 GB

Free size:5.91 GB

Block count:152684

Size distribution summary:

BlockCacheSizeDistributionSummary [0 B<=blocksize<4 KB, blocks=833, 
heapSize=1.19 MB]

BlockCacheSizeDistributionSummary [4 KB<=blocksize<8 KB, blocks=65, 
heapSize=310.83 KB]

BlockCacheSizeDistributionSummary [8 KB<=blocksize<12 KB, blocks=175, 
heapSize=1.46 MB]

BlockCacheSizeDistributionSummary [12 KB<=blocksize<16 KB, blocks=18, 
heapSize=267.43 KB]

BlockCacheSizeDistributionSummary [16 KB<=blocksize<20 KB, blocks=512, 
heapSize=8.30 MB]

BlockCacheSizeDistributionSummary [20 KB<=blocksize<24 KB, blocks=22, 
heapSize=499.66 KB]

BlockCacheSizeDistributionSummary [24 KB<=blocksize<28 KB, blocks=24, 
heapSize=632.59 KB]

BlockCacheSizeDistributionSummary [28 KB<=blocksize<32 KB, blocks=34, 
heapSize=1.02 MB]

BlockCacheSizeDistributionSummary [32 KB<=blocksize<36 KB, blocks=31, 
heapSize=1.02 MB]

BlockCacheSizeDistributionSummary [36 KB<=blocksize<40 KB, blocks=22, 
heapSize=838.58 KB]

BlockCacheSizeDistributionSummary [40 KB<=blocksize<44 KB, blocks=28, 
heapSize=1.15 MB]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-18132) Low replication should be checked in period in case of datanode rolling upgrade

2017-05-30 Thread Allan Yang (JIRA)
Allan Yang created HBASE-18132:
--

 Summary: Low replication should be checked in period in case of 
datanode rolling upgrade
 Key: HBASE-18132
 URL: https://issues.apache.org/jira/browse/HBASE-18132
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.1.10, 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


For now, we only check the low replication of WALs when there is a sync operation 
(HBASE-2234), rolling the log if the WAL has fewer replicas than configured. But if 
the WAL has very few writes, or no writes at all, low replication will not be 
detected and thus no log will be rolled. 
That is a problem when rolling-upgrading datanodes: all replicas of a WAL with no 
writes will be restarted, leaving the WAL file in an abnormal state, and later 
attempts to open this file will always fail.
I am bringing up a patch to check the low replication of WALs at a configured 
period. When rolling-upgrading datanodes, as long as the restart interval between 
two nodes is longer than the low-replication check period, the WAL will be closed 
and rolled normally. A UT in the patch shows everything.
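
As a rough sketch of the idea (not the actual patch; the WalFacade interface and its method names are invented for illustration), the periodic check could look like this:
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LowReplicationChecker {
  /** Roll the WAL if it has fewer replicas than configured, even when it gets no writes. */
  public static void schedule(ScheduledExecutorService pool, WalFacade wal, long periodMs) {
    pool.scheduleAtFixedRate(() -> {
      if (wal.currentReplicaCount() < wal.configuredReplicaCount()) {
        wal.requestLogRoll();
      }
    }, periodMs, periodMs, TimeUnit.MILLISECONDS);
  }

  /** Stand-in for the real WAL methods; names are illustrative only. */
  public interface WalFacade {
    int currentReplicaCount();
    int configuredReplicaCount();
    void requestLogRoll();
  }

  public static void main(String[] args) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    // schedule(pool, myWal, 60_000); // e.g. check once a minute
    pool.shutdown();
  }
}
{code}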



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-18058) Zookeeper retry sleep time should have a up limit

2017-05-16 Thread Allan Yang (JIRA)
Allan Yang created HBASE-18058:
--

 Summary: Zookeeper retry sleep time should have a up limit
 Key: HBASE-18058
 URL: https://issues.apache.org/jira/browse/HBASE-18058
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0, 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


Now, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows exponentially, 
but it doesn't have any upper limit. This directly leads to a very long recovery 
time after ZooKeeper goes down for a while and comes back.
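
A minimal sketch of a capped exponential backoff, assuming an illustrative 60-second ceiling rather than whatever limit the patch finally chooses:
{code}
public class CappedBackoff {
  private static final long BASE_SLEEP_MS = 1000;   // illustrative base sleep
  private static final long MAX_SLEEP_MS  = 60_000; // illustrative upper limit

  /** Exponential backoff with an upper bound, so retry N never sleeps longer than MAX_SLEEP_MS. */
  public static long sleepTimeMs(int retryCount) {
    long sleep = BASE_SLEEP_MS * (1L << Math.min(retryCount, 30)); // cap the shift to avoid overflow
    return Math.min(sleep, MAX_SLEEP_MS);
  }

  public static void main(String[] args) {
    for (int i = 0; i < 10; i++) {
      System.out.println("retry " + i + " -> " + sleepTimeMs(i) + " ms");
    }
  }
}
{code}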



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17969) Balance by table using SimpleLoadBalancer could end up imbalance

2017-04-27 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17969:
--

 Summary: Balance by table using SimpleLoadBalancer could end up 
imbalance
 Key: HBASE-17969
 URL: https://issues.apache.org/jira/browse/HBASE-17969
 Project: HBase
  Issue Type: Improvement
Affects Versions: 1.1.10
Reporter: Allan Yang
Assignee: Allan Yang


This really happened in our production env.
Here is an example:
Say we have three RSs named r1, r2, r3. A table named table1 with 3 regions is 
distributed on these RSs like this:
r1 1
r2 1
r3 1
Each RS has one region, which means table1 is balanced, so the balancer will not run.

If the region on r3 splits, it becomes:
r1 1
r2 1
r3 2
For table1, on average each RS should have min=1, max=2 regions. So it is still 
balanced and the balancer will not run.

Then a region on r3 splits again, and the distribution becomes:
r1 1
r2 1
r3 3
On average, each RS should have min=1, max=2 regions. So the balancer will run.
r1 and r2 already have min=1 regions, so the balancer won't do any operation on them.
But r3, with 3 regions, exceeds max=2, so the balancer will remove one region from r3 
and choose one RS from r1, r2 to move it to.
But r1 and r2 have the same load, so the balancer will always choose r1, since 
servername r1 < r2 (alphabetical order, sorted by ServerAndLoad's compareTo method). 
That is OK for table1 itself. But if every table in the cluster has a similar 
situation to table1, then the load in the cluster will always be like r1 > r2 > r3.  
So the solution here is: when every RS has reached min regions (min = total regions / 
servers) but there are still regions to move, shuffle the regionservers before 
moving.
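
A minimal standalone sketch of the proposed tweak (not the SimpleLoadBalancer code itself): once every server already holds at least min regions, shuffle the candidate servers so the leftover regions are not always handed to the alphabetically smallest one.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShuffleServersExample {
  /** Pick destination servers for leftover regions once every server already holds min regions. */
  public static List<String> pickDestinations(List<String> servers, int leftoverRegions) {
    List<String> candidates = new ArrayList<>(servers);
    Collections.shuffle(candidates); // break the alphabetical tie instead of always picking r1
    List<String> destinations = new ArrayList<>();
    for (int i = 0; i < leftoverRegions; i++) {
      destinations.add(candidates.get(i % candidates.size()));
    }
    return destinations;
  }

  public static void main(String[] args) {
    // r3 holds one region too many; the receiver is now chosen at random between r1 and r2.
    System.out.println(pickDestinations(Arrays.asList("r1", "r2"), 1));
  }
}
{code}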



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17808) FastPath for RWQueueRpcExecutor

2017-03-20 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17808:
--

 Summary: FastPath for RWQueueRpcExecutor
 Key: HBASE-17808
 URL: https://issues.apache.org/jira/browse/HBASE-17808
 Project: HBase
  Issue Type: Improvement
  Components: rpc
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang


FastPath for the FIFO rpc scheduler was introduced in HBASE-16023, but it is not 
implemented for the RW queues. In this issue, I use FastPathBalancedQueueRpcExecutor 
in the RW queues, so anyone who wants to isolate their read/write requests can also 
benefit from the fastpath.
I haven't tested the performance yet. But since I haven't changed any of the core 
implementation of FastPathBalancedQueueRpcExecutor, it should have the same 
performance as in HBASE-16023.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17757) Unify blocksize after encoding to decrease memory fragment

2017-03-07 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17757:
--

 Summary: Unify blocksize after encoding to decrease memory 
fragment 
 Key: HBASE-17757
 URL: https://issues.apache.org/jira/browse/HBASE-17757
 Project: HBase
  Issue Type: New Feature
Reporter: Allan Yang
Assignee: Allan Yang


Usually, we store encoded (uncompressed) blocks in the BlockCache/BucketCache. 
Though we have configured the block size, the block size varies after encoding. 
Varied block sizes cause memory fragmentation, which eventually results in more 
full GCs. In order to relieve the memory fragmentation, this issue adjusts the 
encoded block to a unified size.
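
A minimal sketch of the idea, assuming a hypothetical 4 KB boundary; this only illustrates rounding an encoded block up to a unified size and is not the committed implementation:
{code}
public class BlockSizeUnifier {
  private static final int UNIFIED_STEP = 4 * 1024; // illustrative 4 KB boundary

  /** Round an encoded block size up to the next unified boundary to reduce fragmentation. */
  public static int unifiedSize(int encodedSize) {
    return ((encodedSize + UNIFIED_STEP - 1) / UNIFIED_STEP) * UNIFIED_STEP;
  }

  /** Copy the encoded block into a buffer of the unified size. */
  public static byte[] padToUnifiedSize(byte[] encodedBlock) {
    byte[] padded = new byte[unifiedSize(encodedBlock.length)];
    System.arraycopy(encodedBlock, 0, padded, 0, encodedBlock.length);
    return padded; // the cache now only sees buffers of a few distinct sizes
  }
}
{code}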



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working

2017-03-01 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17718:
--

 Summary: Difference between RS's servername and its ephemeral node 
cause SSH stop working
 Key: HBASE-17718
 URL: https://issues.apache.org/jira/browse/HBASE-17718
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.1.8, 1.2.4, 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang



After HBASE-9593, the RS puts up an ephemeral node in ZK before reporting for duty. 
But if the hosts config (/etc/hosts) is different between the master and the RS, the 
RS's serverName can be different from the one stored in the ephemeral zk node. The 
email mentioned in HBASE-13753 
(http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
 is exactly what happened in our production env. 

But what the email didn't point out is that the difference between the serverName on 
the RS and the one in the zk node can cause SSH to stop working, as we can see from 
the code in {{RegionServerTracker}}:
{code}
  @Override
  public void nodeDeleted(String path) {
if (path.startsWith(watcher.rsZNode)) {
  String serverName = ZKUtil.getNodeName(path);
  LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
serverName + "]");
  ServerName sn = ServerName.parseServerName(serverName);
  if (!serverManager.isServerOnline(sn)) {
LOG.warn(serverName.toString() + " is not online or isn't known to the 
master."+
 "The latter could be caused by a DNS misconfiguration.");
return;
  }
  remove(sn);
  this.serverManager.expireServer(sn);
}
  }
{code}
The server will not be processed by SSH/ServerCrashProcedure. The regions on this 
server will not be assigned again until a master restart or failover.
I know HBASE-9593 was meant to fix the issue where an RS reports for duty and 
crashes before it can put up a zk node. That is a very rare case. But the issue I 
mentioned can happen more often (due to DNS, config, etc.) and has a more severe 
consequence.

So here I offer some solutions to discuss:
1. Revert HBASE-9593 from all branches; Andrew Purtell has already reverted it in 
branch-0.98.
2. Abort the RS if the master returns a different name; otherwise SSH can't work 
properly.
3. The master accepts whatever servername the RS reports and does not change it.

 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17673) Monitored RPC Handler not show in the WebUI

2017-02-20 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17673:
--

 Summary: Monitored RPC Handler not show in the WebUI
 Key: HBASE-17673
 URL: https://issues.apache.org/jira/browse/HBASE-17673
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.1.8, 1.2.4, 2.0.0, 3.0.0
Reporter: Allan Yang
Assignee: Allan Yang
Priority: Minor


This issue has been fixed once in HBASE-14674, but I noticed that almost all RSs in 
our production environment still have this problem. The strange thing is that newly 
started servers seem not to be affected. After digging for a while, I realized that 
the {{CircularFifoBuffer}} introduced by HBASE-10312 is the root cause. An RPC 
handler's monitoredTask is only created once; if the server is flooded with tasks, 
the RPC monitoredTask can be purged by the CircularFifoBuffer and is then never 
visible in the WebUI.
So my solution is to keep a separate list for the RPC monitoredTasks. It is OK to do 
so since the number of RPC handlers is fixed; it won't increase or decrease during 
the lifetime of the server.
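
A minimal sketch of the idea (class and method names are invented for illustration, this is not the actual TaskMonitor code): keep the RPC handler tasks in their own unbounded list, and let only the general tasks go through the bounded, evicting buffer.
{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TaskRegistry {
  // RPC handler tasks: fixed in number for the lifetime of the server, so a plain list is safe.
  private final List<String> rpcTasks = new ArrayList<>();
  // Other tasks: bounded buffer that evicts the oldest entry when full (like CircularFifoBuffer).
  private final Deque<String> generalTasks = new ArrayDeque<>();
  private final int maxGeneralTasks;

  public TaskRegistry(int maxGeneralTasks) {
    this.maxGeneralTasks = maxGeneralTasks;
  }

  public synchronized void registerRpcTask(String task) {
    rpcTasks.add(task); // never evicted, so it stays visible in the WebUI
  }

  public synchronized void registerGeneralTask(String task) {
    if (generalTasks.size() == maxGeneralTasks) {
      generalTasks.removeFirst(); // drop the oldest, as the circular buffer does
    }
    generalTasks.addLast(task);
  }
}
{code}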



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (HBASE-17506) started mvcc transaction is not completed in branch-1

2017-01-21 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17506:
--

 Summary: started mvcc transaction is not completed in branch-1
 Key: HBASE-17506
 URL: https://issues.apache.org/jira/browse/HBASE-17506
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


In {{doMiniBatchMutation}}, if we are in replay and the nonce of the mutation is 
different, we append it as a separate WAL entry. But after HBASE-14465, we start an 
mvcc transaction in the ring buffer's append thread. So every time we append a WAL 
entry we start an mvcc transaction, but we never complete this transaction anywhere. 
This can block other transactions of this region.
{code}
// txid should always increase, so having the one from the last call is ok.
// we use HLogKey here instead of WALKey directly to support legacy 
coprocessors.
walKey = new 
ReplayHLogKey(this.getRegionInfo().getEncodedNameAsBytes(),
  this.htableDescriptor.getTableName(), now, m.getClusterIds(),
  currentNonceGroup, currentNonce, mvcc);
txid = this.wal.append(this.htableDescriptor,  
this.getRegionInfo(),  walKey,
  walEdit, true);
walEdit = new WALEdit(cellCount, isInReplay);
walKey = null;
{code}

Looking at the master branch, there is no such problem. It has a method named 
{{appendCurrentNonces}}:
{code}
private void appendCurrentNonces(final Mutation mutation, final boolean replay,
    final WALEdit walEdit, final long now, final long currentNonceGroup,
    final long currentNonce) throws IOException {
  if (walEdit.isEmpty()) return;
  if (!replay) throw new IOException("Multiple nonces per batch and not in replay");
  WALKey walKey = new WALKey(this.getRegionInfo().getEncodedNameAsBytes(),
      this.htableDescriptor.getTableName(), now, mutation.getClusterIds(),
      currentNonceGroup, currentNonce, mvcc, this.getReplicationScope());
  this.wal.append(this.getRegionInfo(), walKey, walEdit, true);
  // Complete the mvcc transaction started down in append else it will block others
  this.mvcc.complete(walKey.getWriteEntry());
}
{code}

Yes, the easiest way to fix branch-1 is to complete the writeEntry as the master 
branch does. But is it really fine to do this?

1. Question 1:
Completing the mvcc transaction before waiting for the sync will create a 
disturbance of data visibility.

2. Question 2:
In what circumstance will there be different nonces and nonce groups in a single 
batch? Nonces are used in append/increment, but in {{batchMutate}} we treat them 
differently and append one WAL entry for each of them. So I think no test can reach 
this code path; that may be why no one has found this bug (please tell me if I'm 
wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17482) mvcc mechanism failed when using mvccPreAssign

2017-01-17 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17482:
--

 Summary: mvcc mechanism failed when using mvccPreAssign
 Key: HBASE-17482
 URL: https://issues.apache.org/jira/browse/HBASE-17482
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
Priority: Critical


If mvccPreAssign and ASYNC_WAL are used, then cells may be committed to the memstore 
before the append thread can stamp a seqid on them. The unit test shows everything.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17475) Stack overflow in AsyncProcess if retry too much

2017-01-16 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17475:
--

 Summary: Stack overflow in AsyncProcess if retry too much
 Key: HBASE-17475
 URL: https://issues.apache.org/jira/browse/HBASE-17475
 Project: HBase
  Issue Type: Bug
  Components: API
Affects Versions: 2.0.0, 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


In AsyncProcess, we resubmit the retry task in the same thread
{code}
  // run all the runnables
  for (Runnable runnable : runnables) {
if ((--actionsRemaining == 0) && reuseThread) {
  runnable.run();
} else {
  try {
pool.submit(runnable);
  } 
  ..
{code}

But if we retry too many times, a stack overflow will soon occur. This is very 
common in clusters with Phoenix: Phoenix needs to write to the index table in the 
normal write path, and retries can cause a stack overflow exception.

{noformat}
"htable-pool19-t2" #582 daemon prio=5 os_prio=0 tid=0x02687800 
nid=0x4a96 waiting on condition [0x7fe3f6301000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1174)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveMultiAction(AsyncProcess.java:1321)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1200(AsyncProcess.java:575)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:729)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1181)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveMultiAction(AsyncProcess.java:1321)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1200(AsyncProcess.java:575)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:729)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1181)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveMultiAction(AsyncProcess.java:1321)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1200(AsyncProcess.java:575)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:729)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1181)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveMultiAction(AsyncProcess.java:1321)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1200(AsyncProcess.java:575)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:729)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1181)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveMultiAction(AsyncProcess.java:1321)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1200(AsyncProcess.java:575)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:729)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
at 
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1181)
at ...
{noformat}
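
One possible mitigation, sketched here only as an illustration and not as the actual fix: hand every retry back to the pool instead of re-running it on the calling thread, so the stack depth stays constant no matter how many retries happen.
{code}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

public class RetrySubmitter {
  /** Submit retries to the pool instead of calling runnable.run() in-line. */
  public static void submitAll(ExecutorService pool, List<Runnable> runnables) {
    for (Runnable runnable : runnables) {
      try {
        pool.submit(runnable);   // each retry starts with a fresh stack
      } catch (RejectedExecutionException e) {
        runnable.run();          // last resort, e.g. when the pool is shutting down
      }
    }
  }
}
{code}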

[jira] [Created] (HBASE-17471) Region Seqid will be out of order in WAL if using mvccPreAssign

2017-01-15 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17471:
--

 Summary: Region Seqid will be out of order in WAL if using 
mvccPreAssign
 Key: HBASE-17471
 URL: https://issues.apache.org/jira/browse/HBASE-17471
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 2.0.0, 1.4.0
Reporter: Allan Yang
Assignee: Allan Yang


mvccPreAssign was brought in by HBASE-16698, and it truly improved write 
performance, especially in the ASYNC_WAL scenario. But mvccPreAssign is only used in 
{{doMiniBatchMutate}}, not in the Increment/Append path. If Increment/Append and 
batch put are used against the same region in parallel, then the seqids of that 
region may not be monotonically increasing in the WAL, since one write path acquires 
the mvcc/seqid before append, and the other acquires it in the append/sync consumer 
thread.

The out-of-order situation can easily be reproduced by a simple UT, which is 
attached. I modified the code to assert on the disorder:
{code}
if(this.highestSequenceIds.containsKey(encodedRegionName)) {
  assert highestSequenceIds.get(encodedRegionName) < sequenceid;
}
{code}


I'd like to say that if we allow disorder in WALs, then this is not an issue. 

But as far as I know, if {{highestSequenceIds}} is not properly set, some WALs may 
not be archived to oldWALs correctly.

What I haven't figured out yet is whether disorder in the WAL can cause data loss 
when recovering from a disaster. If so, then it is a big problem that needs to be 
fixed.

I have fixed this problem in our custom 1.1.x branch. My solution is to use 
mvccPreAssign everywhere and make it un-configurable, since mvccPreAssign is indeed 
a better way than assigning the seqid in the ring buffer thread while keeping 
handlers waiting for it.

If anyone thinks this is doable, I will port it to branch-1 and the master branch 
and upload it. 


 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17319) Truncate table with preserve after split may cause truncate fail

2016-12-14 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17319:
--

 Summary: Truncate table with preserve after split may cause 
truncate fail
 Key: HBASE-17319
 URL: https://issues.apache.org/jira/browse/HBASE-17319
 Project: HBase
  Issue Type: Bug
  Components: Admin
Affects Versions: 1.2.4, 1.1.7
Reporter: Allan Yang
Assignee: Allan Yang


In TruncateTableProcedure, when getting the table's regions from meta to recreate 
new regions, split parents are not excluded, so the new regions can end up with the 
same start key and the same region dir:
{noformat}
2016-12-14 20:15:22,231 WARN  [RegionOpenAndInitThread-writetest-1] 
regionserver.HRegionFileSystem: Trying to create a region that already exists 
on disk: 
hdfs://hbasedev1/zhengyan-hbase11-func2/.tmp/data/default/writetest/9b2c8d1539cd92661703ceb8a4d518a1
{noformat} 
The TruncateTableProcedure will retry forever and never succeed.
An attached unit test shows everything.
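
A minimal sketch of the kind of filter needed when collecting the regions to recreate; the surrounding helper is hypothetical, only the HRegionInfo isSplitParent()/isOffline() accessors are taken as given:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HRegionInfo;

public class RegionFilter {
  /** Keep only regions that should be recreated: skip split parents and offline regions. */
  public static List<HRegionInfo> excludeSplitParents(List<HRegionInfo> regionsFromMeta) {
    List<HRegionInfo> result = new ArrayList<>();
    for (HRegionInfo hri : regionsFromMeta) {
      if (hri.isSplitParent() || hri.isOffline()) {
        continue; // a split parent shares its start key with a daughter; recreating it collides on disk
      }
      result.add(hri);
    }
    return result;
  }
}
{code}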



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17275) Assign timeout cause region unassign forever

2016-12-07 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17275:
--

 Summary: Assign timeout cause region unassign forever
 Key: HBASE-17275
 URL: https://issues.apache.org/jira/browse/HBASE-17275
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 1.1.7, 1.2.3
Reporter: Allan Yang
Assignee: Allan Yang


This is a real case that happened in my test cluster.
I had more than 8000 regions to assign when I restarted the cluster, but I only 
started one regionserver. That means the master needed to assign these 8000 regions 
to a single server (I know it is not right, but just for testing).

The RS received the open region RPC and began to open regions. But due to the huge 
number of regions, the master timed out the RPC call after 1 minute (though some 
regions had actually already been opened), as you can see from log 1.
{noformat}
1. 2016-11-22 10:17:32,285 INFO  [example.org:30001.activeMasterManager] 
master.AssignmentManager: Unable to communicate with 
example.org,30003,1479780976834 in order to assign regions,
java.io.IOException: Call to /example.org:30003 failed on local exception: 
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, 
operationTimeout=6 expired.
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290)
at 
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177)
at 
org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775)
at 
org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876)
at 
org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646)
at 
org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493)
at 
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711)
at java.lang.Thread.run(Thread.java:756)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, 
waitTime=60001, operationTimeout=6 expired.
at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246)
... 14 more  
{noformat}
For the region 7e9aee32eb98a6fc9d503b99fc5f9615 (like many others), after the 
timeout the master used a pool to re-assign them, as in 2:
{noformat}
2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: 
Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 
state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834}  
{noformat}

But this region was actually opened on the RS; (maybe) due to the huge pressure, the 
OPENED zk event was received by the master very late, as you can tell from 3 ("which 
is more than 15 seconds late"):
{noformat}
3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] 
master.AssignmentManager: Handling RS_ZK_REGION_OPENED, 
server=example.org,30003,1479780976834, 
region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, 
current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
ts=1479780992078, server=example.org,30003,1479780976834}
{noformat}

In the meantime, the master still tried to re-assign this region in another thread. 
The master first closed this region to guard against a multi-assign, then changed 
the state of this region from PENDING_OPEN > OFFLINE > PENDING_OPEN. Its RIT node in 
zk was also transitioned to OFFLINE, as in 4, 5, 6, 7:
{noformat}
4. 2016-11-22 10:17:32,321 DEBUG [AM.-pool1-t26] master.AssignmentManager: Sent 
CLOSE to example.org,30003,1479780976834 for region 
test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615.
5. 2016-11-22 10:17:32,461 INFO  [AM.-pool1-t26] master.RegionStates: 
Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, 
ts=1479781052344, server=example.org,30003,1479780976834} to 
{7e9aee32eb98a6fc9d503b99fc5f9615 state=OFFLINE, ts=1479781052461, 
server=example.org,30003,1479780976834}
6. 2016-11-22 10:17:32,469 DEBUG ...
{noformat}

[jira] [Created] (HBASE-17264) Process RIT with offline state will always fail to open in the first time

2016-12-06 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17264:
--

 Summary: Process RIT with offline state will always fail to open 
in the first time
 Key: HBASE-17264
 URL: https://issues.apache.org/jira/browse/HBASE-17264
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 1.1.7
Reporter: Allan Yang
Assignee: Allan Yang
 Attachments: HBASE-17264-branch1.1.patch

In AssignmentManager#processRegionsInTransition, when handling a region in 
M_ZK_REGION_OFFLINE state, we use a handler to reassign the region. But when calling 
assign, we pass a flag telling it not to set the zk node:
{code}
case M_ZK_REGION_OFFLINE:
// Insert in RIT and resend to the regionserver
regionStates.updateRegionState(rt, State.PENDING_OPEN);
final RegionState rsOffline = regionStates.getRegionState(regionInfo);
this.executorService.submit(
  new EventHandler(server, EventType.M_MASTER_RECOVERY) {
@Override
public void process() throws IOException {
  ReentrantLock lock = 
locker.acquireLock(regionInfo.getEncodedName());
  try {
RegionPlan plan = new RegionPlan(regionInfo, null, sn);
addPlan(encodedName, plan);
assign(rsOffline, false, false);  // we decide not to setOfflineInZK
  } finally {
lock.unlock();
  }
}
  });
break;
{code}
But when setOfflineInZK is false, we pass a zk node version of -1 to the 
regionserver, meaning the zk node does not exist. Yet the offline zk node actually 
does exist, with a different version, so the RegionServer reports a failure to open 
because of this.
This situation truly happened in our test environment. Though the master receives 
the FAILED_OPEN zk event and retries later, due to another bug (I will open another 
jira later) the region remains in closed state forever.

The master assigns the region in RIT:
{noformat}
2016-11-23 17:11:46,842 INFO  [example.org:30001.activeMasterManager] 
master.AssignmentManager: Processing 57513956a7b671f4e8da1598c2e2970e in state: 
M_ZK_REGION_OFFLINE
2016-11-23 17:11:46,842 INFO  [example.org:30001.activeMasterManager] 
master.RegionStates: Transition {57513956a7b671f4e8da1598c2e2970e 
state=OFFLINE, ts=1479892306738, server=example.org,30003,1475893095003} to 
{57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306842, 
server=example.org,30003,1479780976834}
2016-11-23 17:11:46,842 INFO  [example.org:30001.activeMasterManager] 
master.AssignmentManager: Processed region 57513956a7b671f4e8da1598c2e2970e in 
state M_ZK_REGION_OFFLINE, on server: example.org,30003,1479780976834
2016-11-23 17:11:46,843 INFO  [MASTER_SERVER_OPERATIONS-example.org:30001-0] 
master.AssignmentManager: Assigning 
test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e. to 
example.org,30003,1479780976834
{noformat}

The RegionServer received the open region request and created an OpenRegionHandler 
to open the region, only to find that the RIT node's version was not what it 
expected. The RS transitions the RIT ZK node to failed open in the end:
{noformat}
2016-11-23 17:11:46,860 WARN  [RS_OPEN_REGION-example.org:30003-1] 
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to 
OPENING for region=57513956a7b671f4e8da1598c2e2970e
2016-11-23 17:11:46,861 WARN  [RS_OPEN_REGION-example.org:30003-1] 
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for 
encodedName=57513956a7b671f4e8da1598c2e2970e
2016-11-23 17:11:46,860 WARN  [RS_OPEN_REGION-example.org:30003-1] 
zookeeper.ZKAssign: regionserver:30003-0x15810b5f633015f, 
quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, 
baseZNode=/test-hbase11-func2 Attempt to transition the unassigned node for 
57513956a7b671f4e8da1598c2e2970e from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed but was version 3 not the 
expected version -1
{noformat}

The master received this zk event and began to handle RS_ZK_REGION_FAILED_OPEN:
{noformat}
2016-11-23 17:11:46,944 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: 
Handling RS_ZK_REGION_FAILED_OPEN, server=example.org,30003,1479780976834, 
region=57513956a7b671f4e8da1598c2e2970e, 
current_state={57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, 
ts=1479892306843, server=example.org,30003,1479780976834}
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17265) Region left unassigned in master failover when failed open

2016-12-06 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17265:
--

 Summary: Region left unassigned in master failover when failed open
 Key: HBASE-17265
 URL: https://issues.apache.org/jira/browse/HBASE-17265
 Project: HBase
  Issue Type: Bug
  Components: Region Assignment
Affects Versions: 1.1.7
Reporter: Allan Yang
Assignee: Allan Yang
 Attachments: HBASE-17265-branch1.patch

This problem is very similar to HBASE-13330. It is also a result of 
ServerShutdownHandler and AssignmentManager each 'thinking' the region will be 
assigned by the other, leaving the region unassigned.

But HBASE-13330 only dealt with RS_ZK_REGION_FAILED_OPEN in 
{{processRegionInTransition}}.  

A failed region open may also happen after {{processRegionInTransition}}. In my 
case, when the master failed over, it assigned all RIT regions, but some failed to 
open (due to HBASE-17264). The AssignmentManager received the zk event and skipped 
assigning the region (it was opened on a failed server before and was already in RIT 
before the master failover). The SSH also skipped assigning it because it was in RIT 
on another RS.

The master received a zk event of RS_ZK_REGION_FAILED_OPEN and began to handle it:
{noformat}
2016-11-23 17:11:46,944 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: 
Handling RS_ZK_REGION_FAILED_OPEN, server=example.org,30003,1479780976834, 
region=57513956a7b671f4e8da1598c2e2970e, 
current_state={57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, 
ts=1479892306843, server=example.org,30003,1479780976834}
2016-11-23 17:11:46,944 INFO  [AM.ZK.Worker-pool2-t1] master.RegionStates: 
Transition {57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, 
ts=1479892306843, server=example.org,30003,1479780976834} to 
{57513956a7b671f4e8da1598c2e2970e state=CLOSED, ts=1479892306944, 
server=example.org,30003,1479780976834}
2016-11-23 17:11:46,945 WARN  [AM.ZK.Worker-pool2-t1] master.RegionStates: 
57513956a7b671f4e8da1598c2e2970e moved to CLOSED on 
example.org,30003,1479780976834, expected example.org,30003,1475893095003
2016-11-23 17:11:46,950 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: 
Found an existing plan for 
test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e. destination server 
is example.org,30003,1479780976834 accepted as a dest server = false
2016-11-23 17:11:47,012 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: 
No previous transition plan found (or ignoring an existing plan) for 
test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e.; generated random 
plan=hri=test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e., src=, 
dest=11.239.21.235,30003,1479781410131; 2 (online=3) available servers, 
forceNewPlan=true
2016-11-23 17:11:47,014 DEBUG [AM.ZK.Worker-pool2-t1] 
handler.ClosedRegionHandler: Handling CLOSED event for 
57513956a7b671f4e8da1598c2e2970e
2016-11-23 17:11:47,015 WARN  [AM.ZK.Worker-pool2-t1] master.RegionStates: 
57513956a7b671f4e8da1598c2e2970e moved to CLOSED on 
example.org,30003,1479780976834, expected example.org,30003,1475893095003
{noformat}

The AssignmentManager skipped assigning it because the region was on a failed server:
{noformat}
2016-11-23 17:11:47,017 INFO  [AM.ZK.Worker-pool2-t1] master.AssignmentManager: 
Skip assigning test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e., it's 
host example.org,30003,1475893095003 is dead but not processed yet
{noformat}

SSH also skipped it because it was in RIT on another server:
{noformat}
2016-11-23 17:12:17,850 INFO  [MASTER_SERVER_OPERATIONS-example.org:30001-0] 
master.RegionStates: Transitioning {57513956a7b671f4e8da1598c2e2970e 
state=CLOSED, ts=1479892307015, server=example.org,30003,1479780976834} will be 
handled by SSH for example.org,30003,1475893095003
2016-11-23 17:12:17,910 INFO  [MASTER_SERVER_OPERATIONS-example.org:30001-0] 
handler.ServerShutdownHandler: Skip assigning region in transition on other 
server{57513956a7b671f4e8da1598c2e2970e state=CLOSED, ts=1479892307015, 
server=example.org,30003,1479780976834}
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-17113) finding middle key in HFileV2 is always wrong and can cause IndexOutOfBoundsException

2016-11-16 Thread Allan Yang (JIRA)
Allan Yang created HBASE-17113:
--

 Summary: finding middle key in HFileV2 is always wrong and can 
cause IndexOutOfBoundsException 
 Key: HBASE-17113
 URL: https://issues.apache.org/jira/browse/HBASE-17113
 Project: HBase
  Issue Type: Bug
  Components: HFile
Affects Versions: 1.2.4, 0.98.23, 1.1.7, 0.94.17, 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang


When we want to split a region, we need to get the middle rowkey from the biggest 
store file. 

Here is the code from HFileBlockIndex.midkey() which helps us find an approximate 
middle key:
{code}
// Caching, using pread, assuming this is not a compaction.
HFileBlock midLeafBlock = cachingBlockReader.readBlock(
midLeafBlockOffset, midLeafBlockOnDiskSize, true, true, false, true,
BlockType.LEAF_INDEX, null);

ByteBuffer b = midLeafBlock.getBufferWithoutHeader();
int numDataBlocks = b.getInt();
int keyRelOffset = b.getInt(Bytes.SIZEOF_INT * (midKeyEntry + 1));
int keyLen = b.getInt(Bytes.SIZEOF_INT * (midKeyEntry + 2)) -
keyRelOffset - SECONDARY_INDEX_ENTRY_OVERHEAD;
int keyOffset = Bytes.SIZEOF_INT * (numDataBlocks + 2) + keyRelOffset
+ SECONDARY_INDEX_ENTRY_OVERHEAD;
targetMidKey = ByteBufferUtils.toBytes(b, keyOffset, keyLen);
{code}
Each entry of a non-root block index contains three objects:
1. Offset of the block referenced by this entry in the file (long)
2. On-disk size of the referenced block (int)
3. RowKey 

But when we calculate the keyLen from the entry, we forget to take away the 12-byte 
overhead (items 1 and 2 above, SECONDARY_INDEX_ENTRY_OVERHEAD in the code). So 
keyLen is always 12 bytes bigger than the real rowkey length.
Every time we read the rowkey from the entry, we read 12 extra bytes from the next 
entry. 
No exception is thrown unless the middle key is in the last entry of the non-root 
block index, which causes an IndexOutOfBoundsException. That is exactly what 
HBASE-16097 is suffering from.
{code}
2016-11-16 05:27:31,991 ERROR [MemStoreFlusher.1] regionserver.MemStoreFlusher: 
Cache flusher failed for entry [flush region hitsdb,\x14\x03\x83\x1AX\x1A\x9A 
\x00\x00\x07\x00\x00\x07\x00\x00\x09\x00\x00\x09\x00\x01\x9F\x00F\xE3\x00\x00\x0A\x00\x01~\x00\x00\x08\x00\x5C\x09\x00\x03\x11\x00\xEF\x99,1478311873096.79d3f7f285396b6896f3229e2bcac7af.]
java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:532)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
at 
org.apache.hadoop.hbase.util.ByteBufferUtils.toBytes(ByteBufferUtils.java:490)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.midkey(HFileBlockIndex.java:349)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.midkey(HFileReaderV2.java:529)
at 
org.apache.hadoop.hbase.regionserver.StoreFile$Reader.midkey(StoreFile.java:1527)
at 
org.apache.hadoop.hbase.regionserver.StoreFile.getFileSplitPoint(StoreFile.java:684)
at 
org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getSplitPoint(DefaultStoreFileManager.java:126)
at 
org.apache.hadoop.hbase.regionserver.HStore.getSplitPoint(HStore.java:1976)
at 
org.apache.hadoop.hbase.regionserver.RegionSplitPolicy.getSplitPoint(RegionSplitPolicy.java:82)
at 
org.apache.hadoop.hbase.regionserver.HRegion.checkSplit(HRegion.java:7614)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:521)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:75)
at 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
at java.lang.Thread.run(Thread.java:756)

{code}

It is quite a serious bug. It may have existed ever since HFileV2 was invented, but 
no one has found it! Since this bug ONLY happens when finding a middle key, and 
since we compare rowkeys from the left side, adding 12 more bytes to the right side 
is totally OK, so nobody noticed!
It didn't even throw an IndexOutOfBoundsException before HBASE-12297, since 
{{Arrays.copyOfRange}} was used, which checks the limit to ensure the length won't 
run past the end of the array.
But now {{ByteBufferUtils.toBytes}} is used, and an IndexOutOfBoundsException will 
be thrown. 

It happened in our production environment. Because of this bug, a region that can't 
be split just grows bigger and bigger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16856) Exception message in SyncRunner.run() should print currentSequence but syncFutureSequence

2016-10-17 Thread Allan Yang (JIRA)
Allan Yang created HBASE-16856:
--

 Summary: Exception message in SyncRunner.run() should print 
currentSequence but syncFutureSequence
 Key: HBASE-16856
 URL: https://issues.apache.org/jira/browse/HBASE-16856
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 1.1.7, 1.2.2, 2.0.0
Reporter: Allan Yang
Assignee: Allan Yang
Priority: Minor


A very small bug: a typo in an exception message:
{code}
if (syncFutureSequence > currentSequence) {
  throw new IllegalStateException("currentSequence=" + 
syncFutureSequence
  + ", syncFutureSequence=" + syncFutureSequence);
}
{code}
It should print currentSequence and syncFutureSequence, but it prints 
syncFutureSequence twice.
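
For reference, the corrected message simply swaps in currentSequence (a one-line sketch of the fix):
{code}
if (syncFutureSequence > currentSequence) {
  throw new IllegalStateException("currentSequence=" + currentSequence
      + ", syncFutureSequence=" + syncFutureSequence);
}
{code}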



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16816) HMaster.move() should throw exception if region to move is not online

2016-10-12 Thread Allan Yang (JIRA)
Allan Yang created HBASE-16816:
--

 Summary: HMaster.move() should throw exception if region to move 
is not online
 Key: HBASE-16816
 URL: https://issues.apache.org/jira/browse/HBASE-16816
 Project: HBase
  Issue Type: Bug
  Components: Admin
Affects Versions: 1.1.2
Reporter: Allan Yang
Assignee: Allan Yang
Priority: Minor


The move region function in HMaster only checks whether the region to move exists:
{code}
if (regionState == null) {
  throw new UnknownRegionException(Bytes.toStringBinary(encodedRegionName));
}

{code}

It does not throw anything if the region is split or in transition, i.e., not 
movable, so the caller has no way to know that the move region operation failed.
This is a problem for "region_move.rb": it only gives up moving a region if an 
exception is thrown. Otherwise, it waits until a timeout and retries. Without an 
exception, it has no idea the region is not movable.
{code}
begin
  admin.move(Bytes.toBytes(r.getEncodedName()), Bytes.toBytes(newServer))
rescue java.lang.reflect.UndeclaredThrowableException,
org.apache.hadoop.hbase.UnknownRegionException => e
  $LOG.info("Exception moving "  + r.getEncodedName() +
"; split/moved? Continuing: " + e)
  return
end
 # Wait till its up on new server before moving on
maxWaitInSeconds = admin.getConfiguration.getInt("hbase.move.wait.max", 60)
maxWait = Time.now + maxWaitInSeconds
while Time.now < maxWait
  same = isSameServer(admin, r, original)
  break unless same
  sleep 0.1
end
  end
{code}






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16572) Sync method in RecoverableZooKeeper failed to pass callback fucntion in

2016-09-07 Thread Allan Yang (JIRA)
Allan Yang created HBASE-16572:
--

 Summary: Sync method in RecoverableZooKeeper failed to pass 
callback fucntion in
 Key: HBASE-16572
 URL: https://issues.apache.org/jira/browse/HBASE-16572
 Project: HBase
  Issue Type: Bug
  Components: Zookeeper
Affects Versions: 1.1.4, 2.0.0
Reporter: Allan Yang
Priority: Minor
 Fix For: 2.0.0


{code:java}
  public void sync(String path, AsyncCallback.VoidCallback cb, Object ctx) 
throws KeeperException {
checkZk().sync(path, null, null); //callback function cb is not passed in
  }
{code}
It is obvious that the callback method is not passed in. Since the sync operation in 
ZooKeeper is an 'async' operation, we need a callback method to notify the caller 
that the 'sync' operation has finished.
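
The fix amounts to forwarding the arguments; a minimal sketch:
{code}
public void sync(String path, AsyncCallback.VoidCallback cb, Object ctx)
    throws KeeperException {
  checkZk().sync(path, cb, ctx); // pass the callback and context through instead of dropping them
}
{code}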



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16283) Batch Append/Increment will always fail if set ReturnResults to false

2016-07-25 Thread Allan Yang (JIRA)
Allan Yang created HBASE-16283:
--

 Summary: Batch Append/Increment will always fail if set 
ReturnResults to false
 Key: HBASE-16283
 URL: https://issues.apache.org/jira/browse/HBASE-16283
 Project: HBase
  Issue Type: Bug
  Components: API
Affects Versions: 1.2.2, 1.1.5, 2.0.0
Reporter: Allan Yang
Priority: Minor
 Fix For: 2.0.0


If you set Append/Increment's ReturnResults attribute to false and batch the 
appends/increments to the server, the batch operation will always fail.
The reason is that, since ReturnResults is set to false, append/increment returns 
null instead of a Result object. But in ResponseConverter#getResults, there is some 
check code:
{code}
if (requestRegionActionCount != responseRegionActionResultCount) {
  throw new IllegalStateException("Request mutation count=" + 
requestRegionActionCount +
  " does not match response mutation result count=" + 
responseRegionActionResultCount);
}
{code}
That means if the result count does not match the request mutation count, the 
request fails.
The solution is simple: instead of returning a null result, return an empty result 
when ReturnResults is set to false.
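
A sketch of the server-side idea (the variable names are placeholders, not the actual ResponseConverter/RSRpcServices code): always hand back a Result object so the per-action result count matches the request.
{code}
// 'returnResults' mirrors the mutation's ReturnResults attribute; 'cells' is whatever
// the append/increment produced. Returning an empty Result instead of null keeps the
// response result count equal to the request mutation count.
Result toClient = returnResults ? Result.create(cells) : Result.create(new Cell[0]);
{code}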



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-16238) It's useless to catch SESSIONEXPIRED exception and retry in RecoverableZooKeeper

2016-07-16 Thread Allan Yang (JIRA)
Allan Yang created HBASE-16238:
--

 Summary: It's useless to catch SESSIONEXPIRED exception and retry 
in RecoverableZooKeeper
 Key: HBASE-16238
 URL: https://issues.apache.org/jira/browse/HBASE-16238
 Project: HBase
  Issue Type: Bug
  Components: Zookeeper
Reporter: Allan Yang
Priority: Minor


After HBASE-5549, the SESSIONEXPIRED exception is caught and retried along with 
other zookeeper exceptions like ConnectionLoss. But it is useless to retry when a 
session expiration happens, since the retry will never succeed. There is a config 
called "zookeeper.recovery.retry" to control the retry count; in our case, we set 
this config to a very big number like "9". When a session expiration happens, the 
regionserver should kill itself, but because of the retrying, the regionserver's 
threads get stuck trying to reconnect to zookeeper and the server never properly 
shuts down.

{code}
public Stat exists(String path, boolean watch)
  throws KeeperException, InterruptedException {
TraceScope traceScope = null;
try {
  traceScope = Trace.startSpan("RecoverableZookeeper.exists");
  RetryCounter retryCounter = retryCounterFactory.create();
  while (true) {
try {
  return checkZk().exists(path, watch);
} catch (KeeperException e) {
  switch (e.code()) {
case CONNECTIONLOSS:
case SESSIONEXPIRED: //we shouldn't catch this
case OPERATIONTIMEOUT:
  retryOrThrow(retryCounter, e, "exists");
  break;

default:
  throw e;
  }
}
retryCounter.sleepUntilNextRetry();
  }
} finally {
  if (traceScope != null) traceScope.close();
}
  }
{code}
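
A sketch of the proposed change: drop SESSIONEXPIRED from the retried cases so it is rethrown immediately and the regionserver can abort properly.
{code}
} catch (KeeperException e) {
  switch (e.code()) {
    case CONNECTIONLOSS:
    case OPERATIONTIMEOUT:
      retryOrThrow(retryCounter, e, "exists");
      break;

    case SESSIONEXPIRED: // a retry can never revive an expired session
    default:
      throw e;
  }
}
{code}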



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-15474) Exception in HConnectionImplementation's constructor cause Zookeeper connnections leak

2016-03-19 Thread Allan Yang (JIRA)
Allan Yang created HBASE-15474:
--

 Summary: Exception in HConnectionImplementation's constructor 
cause Zookeeper connnections leak 
 Key: HBASE-15474
 URL: https://issues.apache.org/jira/browse/HBASE-15474
 Project: HBase
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Allan Yang
Assignee: Allan Yang


HConnectionImplementation creates a ZooKeeperKeepAliveConnection during 
construction, but if the constructor throws an exception, the zookeeper connection 
is not properly closed. 
{code}
HConnectionImplementation(Configuration conf, boolean managed,
ExecutorService pool, User user) throws IOException {
  this(conf);
  this.user = user;
  this.batchPool = pool;
  this.managed = managed;
  this.registry = setupRegistry();
  retrieveClusterId(); // here the zookeeper connection is created
  this.rpcClient = RpcClientFactory.createClient(this.conf, this.clusterId);
  this.rpcControllerFactory = RpcControllerFactory.instantiate(conf);
  // In our case, the exception happens here, so the zookeeper connection is never closed
  ..
{code}
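
A sketch of one way to plug the leak, assuming the connection's own close() releases the keep-alive zookeeper watcher; this is an illustration, not the committed patch:
{code}
HConnectionImplementation(Configuration conf, boolean managed,
    ExecutorService pool, User user) throws IOException {
  this(conf);
  this.user = user;
  this.batchPool = pool;
  this.managed = managed;
  this.registry = setupRegistry();
  try {
    retrieveClusterId(); // the ZooKeeperKeepAliveConnection is created in here
    this.rpcClient = RpcClientFactory.createClient(this.conf, this.clusterId);
    this.rpcControllerFactory = RpcControllerFactory.instantiate(conf);
    // ...
  } catch (IOException | RuntimeException e) {
    close(); // release the zookeeper connection before propagating the failure
    throw e;
  }
}
{code}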



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)