Re: ABORTING region server and following HBase cluster "crash"

2018-10-02 Thread Batyrshin Alexander
Still observing cascading region server restarts.

Our Phoenix version is 4.14-HBase-1.4 at commit 
https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
 




On prod022 we got this:

Oct 02 03:24:03 prod022 hbase[160534]: 2018-10-02 03:24:03,678 WARN  
[hconnection-0x4a616d85-shared--pool8-t10050] client.AsyncProcess: #21, 
table=KM_IDX1, attempt=1/1 failed=2ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
KM_IDX1,\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1537400041091.9fdc7d07edce09b08b8d2750b24961b8.
 is not online on prod015,60020,1538417657739
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Oct 02 03:24:03 prod022 hbase[160534]:  on prod015,60020,1538417657739, 
tracking started Tue Oct 02 03:24:03 MSK 2018; not retrying 2 - final failure
Oct 02 03:24:03 prod022 hbase[160534]: 2018-10-02 03:24:03,695 INFO  
[RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP 
for KM_IDX1 due to an exception while writing updates. 
indexState=PENDING_DISABLE
Oct 02 03:24:03 prod022 hbase[160534]: 
org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  
disableIndexOnFailure=true, Failed to write to multiple index tables: [KM_IDX1]
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:236)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
Oct 02 03:24:03 prod022 hbase[160534]: at 

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

2018-10-02 Thread Josh Elser
HBase will invalidate the location of a Region on seeing certain 
exceptions (including NotServingRegionException). After it sees the 
exception you have copied below, it should re-fetch the location of the 
Region.


If HBase keeps trying to access a Region on a RS that isn't hosting it, 
either hbase:meta is wrong or the HBase client has a bug.


However, to the point here, if that region was split successfully, 
clients should not be reading from that region anymore -- they would 
read from the daughters of that split region.
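
If you want to see which location the client actually resolves for one of those 
failing row keys, one thing you could try is asking a RegionLocator directly; 
passing reload=true forces a fresh lookup against hbase:meta instead of using the 
cached entry. A minimal sketch against the 1.x client API -- the table name and 
row-key prefix below are just copied from your logs, adjust to taste:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CheckRegionLocation {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Row-key prefix taken from the NotServingRegionException in the logs
            byte[] row = Bytes.toBytesBinary("\\x0B\\x46200020qC8kovh");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 RegionLocator locator =
                     conn.getRegionLocator(TableName.valueOf("IDX_MARK_O"))) {
                // Location as the client currently believes it (may be cached)
                HRegionLocation cached = locator.getRegionLocation(row);
                // Location re-read from hbase:meta, bypassing the client cache
                HRegionLocation fresh = locator.getRegionLocation(row, true);
                System.out.println("cached: " + cached);
                System.out.println("fresh : " + fresh);
            }
        }
    }

If the "fresh" lookup still comes back with the old parent region after the split, 
hbase:meta itself is stale; if only the cached one does, it's the client-side 
location cache.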


On 10/2/18 2:34 PM, Batyrshin Alexander wrote:
We tried branch 4.14-HBase-1.4 at commit 
https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8


Is there any way to invalidate the meta-cache when an index region splits? 
Maybe there is some option to set a maximum time-to-live for the cache?


We are seeing this on the region servers:

At 09:34 region *96c3ede1c40c98959e60bd6fc0e07269* split on prod019

Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO 
  [regionserver/prod019/10.0.0.19:60020-splits-1538462079117] 
regionserver.SplitRequest: Region split, hbase:meta updated, and report 
to master. Parent=IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x
01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*., 
new regions: 
IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b., 
IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-5602
2B7D-2F90-4AA5-8125-4F4001B5BE0D-0_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-0_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x46186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b.. 
Split took 0sec


Fail at 11:51 on prod018

Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN 
  [hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: 
#164, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: 
Region IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. 
is not online on prod019,60020,1538417663874


Fail at 13:38 on prod005

Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN 
  [hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: 
#53, table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: 
Region IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.*96c3ede1c40c98959e60bd6fc0e07269*. 
is not online on prod019,60020,1538417663874


On 27 Sep 2018, at 01:04, Ankit Singhal wrote:


You might be hitting PHOENIX-4785; you can apply the 
patch on top of 4.14 and see if it fixes your problem.


Regards,
Ankit Singhal

On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62...@gmail.com> wrote:


Any advice? Anything that helps?
I can reproduce the problem and capture more logs if needed.


On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62...@gmail.com> wrote:

Looks like the lock goes away 30 minutes after the index region split.
So I can assume that this issue comes from the cache configured
by this option: *phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs*
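
If that is the cache responsible, I guess the window could be narrowed by
lowering that TTL in hbase-site.xml on the region servers. Something like the
following (the 60000 ms value is only an example, I have not checked what a
safe setting would be):

    <property>
      <name>phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs</name>
      <value>60000</value>
    </property>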




On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62...@gmail.com> wrote:

And here is how this split looks in the Master logs:

Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888
INFO  [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition
{3e44b85ddf407da831dbb9a871496986 state=OPEN,
ts=1537304859509, server=prod013,60020,1537304282885} to
{3e44b85ddf407da831dbb9a871496986 state=SPLITTING,
ts=1537461904888, server=prod
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
{3e44b85ddf407da831dbb9a871496986
state=SPLITTING, ts=1537461905340,
server=prod013,60020,1537304282885} to
{3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340,
server=pro
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340
INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined
3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341
INFO  [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition
{33cba925c7acb347ac3f5e70e839c3cb
state=SPLITTING_NEW, ts=1537461905340,
server=prod013,60020,1537304282885} to
{33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341,

Re: Table dead lock: ERROR 1120 (XCL20): Writes to table blocked until index can be updated

2018-10-02 Thread Batyrshin Alexander
We tried branch 4.14-HBase-1.4 at commit 
https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
 


Is there any way to invalidate the meta-cache when an index region splits? 
Maybe there is some option to set a maximum time-to-live for the cache?
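
To illustrate what I mean by invalidating it by hand: as far as I can tell the 
HBase 1.x client still exposes the deprecated HConnection.clearRegionCache() 
methods, so a plain client can at least drop its own cached locations for the 
index table. A rough sketch, assuming the default connection implementation still 
implements HConnection (it seems to in stock 1.4); of course, if the stale 
locations live inside the region servers' own index-writer connections, this only 
helps clients, not the coprocessor path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.HConnection;

    public class DropCachedLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                // The default 1.x connection also implements the deprecated
                // HConnection interface, which exposes clearRegionCache().
                if (conn instanceof HConnection) {
                    ((HConnection) conn)
                        .clearRegionCache(TableName.valueOf("IDX_MARK_O"));
                }
            }
        }
    }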

We are seeing this on the region servers:

At 09:34 region 96c3ede1c40c98959e60bd6fc0e07269 split on prod019

Oct 02 09:34:39 prod019 hbase[152127]: 2018-10-02 09:34:39,719 INFO  
[regionserver/prod019/10.0.0.19:60020-splits-1538462079117] 
regionserver.SplitRequest: Region split, hbase:meta updated, and report to 
master. Parent=IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x
01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269.,
 new regions: 
IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1538462079161.80fc2516619d8665789b0c5a2bca8a8b.,
 IDX_MARK_O,\x0BON_SCHFDOPPR_2AL-5602
2B7D-2F90-4AA5-8125-4F4001B5BE0D-0_2AL-C0D76C01-EE7E-496B-BCD6-F6488956F75A-0_20180228_7E372181-F23D-4EBE-9CAD-5F5218C9798I\x46186195_5.UHQ=\x00\x02\x80\x00\x01a\xD3\xEA@\x80\x00\x00\x00\x00,1538462079161.24b6675d9e51067a21e58f294a9f816b..
 Split took 0sec

Fail at 11:51 on prod018

Oct 02 11:51:13 prod018 hbase[108476]: 2018-10-02 11:51:13,752 WARN  
[hconnection-0x4131af19-shared--pool24-t26652] client.AsyncProcess: #164, 
table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269.
 is not online on prod019,60020,1538417663874

Fail at 13:38 on prod005

Oct 02 13:38:06 prod005 hbase[197079]: 2018-10-02 13:38:06,040 WARN  
[hconnection-0x5e744e65-shared--pool8-t31214] client.AsyncProcess: #53, 
table=IDX_MARK_O, attempt=1/1 failed=11ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
IDX_MARK_O,\x0B\x46200020qC8kovh\x00\x01\x80\x00\x01e\x89\x8B\x99@\x00\x00\x00\x00,1537400033958.96c3ede1c40c98959e60bd6fc0e07269.
 is not online on prod019,60020,1538417663874

> On 27 Sep 2018, at 01:04, Ankit Singhal wrote:
> 
> You might be hitting PHOENIX-4785; you can apply the patch 
> on top of 4.14 and see if it fixes your problem.
> 
> Regards,
> Ankit Singhal
> 
> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
> Any advice? Anything that helps?
> I can reproduce the problem and capture more logs if needed.
> 
>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62...@gmail.com> wrote:
>> 
>> Looks like the lock goes away 30 minutes after the index region split.
>> So I can assume that this issue comes from the cache configured by this 
>> option: phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>> 
>> 
>> 
>>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>> 
>>> And here is how this split looks in the Master logs:
>>> 
>>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO  
>>> [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition 
>>> {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, 
>>> server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 
>>> state=SPLITTING, ts=1537461904888, server=prod
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  
>>> [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition 
>>> {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, 
>>> server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 
>>> state=SPLIT, ts=1537461905340, server=pro
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO  
>>> [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 
>>> 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  
>>> [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition 
>>> {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, 
>>> server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb 
>>> state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO  
>>> [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition 
>>> {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, 
>>> server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 
>>> state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO  
>>> [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event; 
>>>