Did you restart the cluster? You should also set 'hbase.hregion.max.filesize' to a safeguard value that is less than what your region servers can handle.
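For example, a minimal hbase-site.xml sketch for the region servers (the 10 GB value is just the number already tried in this thread, not a recommendation, and the region servers have to be restarted to pick it up):

  <property>
    <!-- example only: consider a region for split at 10 GB (10737418240 bytes); pick a value your RS can safely carry -->
    <name>hbase.hregion.max.filesize</name>
    <value>10737418240</value>
  </property>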
----------------------------------------
   Jaanai Zhang
   Best regards!


Batyrshin Alexander <0x62...@gmail.com> wrote on Sat, Sep 29, 2018 at 5:28 PM:

> Meanwhile we tried to disable region splits via the per-index-table options
> 'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'
> and hbase.hregion.max.filesize = 10737418240
> Looks like this set of options doesn't work. Some regions split at size < 2 GB.
>
> Then we tried to disable all splits via hbase shell: splitormerge_switch 'SPLIT', false
> Seems that this also doesn't work.
>
> Any ideas why we can't disable region splits?
>
> On 27 Sep 2018, at 02:52, Vincent Poon <vincent.poon...@gmail.com> wrote:
>
> We are planning a Phoenix 4.14.1 release which will have this fix.
>
> On Wed, Sep 26, 2018 at 3:36 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
>
>> Thank you. We will try somehow...
>> Is there any chance that this fix will be included in the next release for
>> HBase 1.4 (not 2.0)?
>>
>> On 27 Sep 2018, at 01:04, Ankit Singhal <ankitsingha...@gmail.com> wrote:
>>
>> You might be hitting PHOENIX-4785
>> <https://jira.apache.org/jira/browse/PHOENIX-4785>; you can apply the
>> patch on top of 4.14 and see if it fixes your problem.
>>
>> Regards,
>> Ankit Singhal
>>
>> On Wed, Sep 26, 2018 at 2:33 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
>>
>>> Any advice? Can anyone help?
>>> I can reproduce the problem and capture more logs if needed.
>>>
>>> On 21 Sep 2018, at 02:13, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>>
>>> Looks like the lock goes away 30 minutes after the index region split,
>>> so I assume this issue comes from the cache configured by this option:
>>> phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs
>>>
>>> On 21 Sep 2018, at 00:15, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>>
>>> And here is how this split looks in the Master logs:
>>>
>>> Sep 20 19:45:04 prod001 hbase[10838]: 2018-09-20 19:45:04,888 INFO [AM.ZK.Worker-pool5-t282] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=OPEN, ts=1537304859509, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461904888, server=prod
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {3e44b85ddf407da831dbb9a871496986 state=SPLITTING, ts=1537461905340, server=prod013,60020,1537304282885} to {3e44b85ddf407da831dbb9a871496986 state=SPLIT, ts=1537461905340, server=pro
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,340 INFO [AM.ZK.Worker-pool5-t284] master.RegionStates: Offlined 3e44b85ddf407da831dbb9a871496986 from prod013,60020,1537304282885
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {33cba925c7acb347ac3f5e70e839c3cb state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {33cba925c7acb347ac3f5e70e839c3cb state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,341 INFO [AM.ZK.Worker-pool5-t284] master.RegionStates: Transition {acb8f16a004a894c8706f6e12cd26144 state=SPLITTING_NEW, ts=1537461905340, server=prod013,60020,1537304282885} to {acb8f16a004a894c8706f6e12cd26144 state=OPEN, ts=1537461905341, server=
>>> Sep 20 19:45:05 prod001 hbase[10838]: 2018-09-20 19:45:05,343 INFO [AM.ZK.Worker-pool5-t284] master.AssignmentManager: Handled SPLIT event;
>>> parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., daughter a=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1
>>> Sep 20 19:47:41 prod001 hbase[10838]: 2018-09-20 19:47:41,972 INFO [prod001,60000,1537304851459_ChoreService_2] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 17.82282205608522, sum multiplier is 1102.0 min cost which need balance is 0.05
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,021 INFO [prod001,60000,1537304851459_ChoreService_1] hbase.MetaTableAccessor: Deleted IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986.
>>> Sep 20 19:47:42 prod001 hbase[10838]: 2018-09-20 19:47:42,022 INFO [prod001,60000,1537304851459_ChoreService_1] master.CatalogJanitor: Scanned 779 catalog row(s), gc'd 0 unreferenced merged region(s) and 1 unreferenced parent region(s)
>>>
>>> On 20 Sep 2018, at 21:43, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>>
>>> Looks like the problem was caused by an index region split.
>>>
>>> Index region split at prod013:
>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986., new regions: IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1537461904877.33cba925c7acb347ac3f5e70e839c3cb., IDX_MARK_O,\x107834005168\x0000000046200068=4YF!YI,1537461904877.acb8f16a004a894c8706f6e12cd26144.. Split took 0sec
>>> Sep 20 19:45:05 prod013 hbase[193055]: 2018-09-20 19:45:05,441 INFO [regionserver/prod013/10.0.0.13:60020-splits-1537400010677] regionserver.SplitRequest: Split transaction journal:
>>> Sep 20 19:45:05 prod013 hbase[193055]: STARTED at 1537461904853
>>> Sep 20 19:45:05 prod013 hbase[193055]: PREPARED at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]: BEFORE_PRE_SPLIT_HOOK at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]: AFTER_PRE_SPLIT_HOOK at 1537461904877
>>> Sep 20 19:45:05 prod013 hbase[193055]: SET_SPLITTING at 1537461904880
>>> Sep 20 19:45:05 prod013 hbase[193055]: CREATE_SPLIT_DIR at 1537461904987
>>> Sep 20 19:45:05 prod013 hbase[193055]: CLOSED_PARENT_REGION at 1537461905002
>>> Sep 20 19:45:05 prod013 hbase[193055]: OFFLINED_PARENT at 1537461905002
>>> Sep 20 19:45:05 prod013 hbase[193055]: STARTED_REGION_A_CREATION at 1537461905056
>>> Sep 20 19:45:05 prod013 hbase[193055]: STARTED_REGION_B_CREATION at 1537461905131
>>> Sep 20 19:45:05 prod013 hbase[193055]: PONR at 1537461905192
>>> Sep 20 19:45:05 prod013 hbase[193055]: OPENED_REGION_A at 1537461905249
>>> Sep 20 19:45:05 prod013 hbase[193055]: OPENED_REGION_B at 1537461905252
>>> Sep 20 19:45:05 prod013 hbase[193055]: BEFORE_POST_SPLIT_HOOK at 1537461905439
>>> Sep 20 19:45:05 prod013 hbase[193055]: AFTER_POST_SPLIT_HOOK at 1537461905439
>>> Sep 20 19:45:05 prod013 hbase[193055]: COMPLETED at 1537461905439
>>>
>>> Index update failed at prod002:
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,520 WARN [hconnection-0x4f3242a0-shared--pool32-t36014] client.AsyncProcess: #220, table=IDX_MARK_O, attempt=1/1 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException:
>>> org.apache.hadoop.hbase.NotServingRegionException: Region IDX_MARK_O,\x107834005168\x0000000046200020LWfBS4c,1536637905252.3e44b85ddf407da831dbb9a871496986. is not online on prod013,60020,1537304282885
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Sep 20 20:09:24 prod002 hbase[97285]: on prod013,60020,1537304282885, tracking started Thu Sep 20 20:09:24 MSK 2018; not retrying 1 - final failure
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x39beae45 connecting to ZooKeeper ensemble=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,549 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] zookeeper.ZooKeeper: Initiating client connection, connectString=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@3ef61f7
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,562 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Opening socket connection to server 10.0.0.3/10.0.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,570 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Socket connection established to 10.0.0.3/10.0.0.3:2181, initiating session
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,572 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020-SendThread(10.0.0.3:2181)] zookeeper.ClientCnxn: Session establishment complete on server 10.0.0.3/10.0.0.3:2181, sessionid = 0x30000e039e01c7f, negotiated timeout = 40000
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,628 INFO [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP for IDX_MARK_O due to an exception while writing updates.
>>> indexState=PENDING_DISABLE
>>> Sep 20 20:09:24 prod002 hbase[97285]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException: disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O]
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2405)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Sep 20 20:09:24 prod002 hbase[97285]: at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Sep 20 20:09:24 prod002 hbase[97285]: 2018-09-20 20:09:24,632 INFO
>>> [RpcServer.default.FPBQ.Fifo.handler=98,queue=8,port=60020] util.IndexManagementUtil: Rethrowing org.apache.hadoop.hbase.DoNotRetryIOException: ERROR 1121 (XCL21): Write to the index failed. disableIndexOnFailure=true, Failed to write to multiple index tables: [IDX_MARK_O] ,serverTimestamp=1537463364504,
>>>
>>> On 20 Sep 2018, at 21:01, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>>
>>> Our setup:
>>> HBase-1.4.7
>>> Phoenix-4.14-hbase-1.4
>>>
>>> On 20 Sep 2018, at 20:19, Batyrshin Alexander <0x62...@gmail.com> wrote:
>>>
>>> Hello,
>>> Looks like we got a deadlock with a repeating "ERROR 1120 (XCL20)" exception. At this time all indexes are ACTIVE.
>>> Can you help us make a deeper diagnosis?
>>>
>>> java.sql.SQLException: ERROR 1120 (XCL20): Writes to table blocked until index can be updated. tableName=TBL_MARK
>>> at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:494)
>>> at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
>>> at org.apache.phoenix.execute.MutationState.validateAndGetServerTimestamp(MutationState.java:815)
>>> at org.apache.phoenix.execute.MutationState.validateAll(MutationState.java:789)
>>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:981)
>>> at org.apache.phoenix.execute.MutationState.send(MutationState.java:1514)
>>> at org.apache.phoenix.execute.MutationState.commit(MutationState.java:1337)
>>> at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:670)
>>> at org.apache.phoenix.jdbc.PhoenixConnection$3.call(PhoenixConnection.java:666)
>>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>>> at org.apache.phoenix.jdbc.PhoenixConnection.commit(PhoenixConnection.java:666)
>>> at x.persistence.phoenix.PhoenixDao.$anonfun$doUpsert$1(PhoenixDao.scala:103)
>>> at scala.util.Try$.apply(Try.scala:209)
>>> at x.persistence.phoenix.PhoenixDao.doUpsert(PhoenixDao.scala:101)
>>> at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2(PhoenixDao.scala:45)
>>> at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$2$adapted(PhoenixDao.scala:45)
>>> at scala.collection.immutable.Stream.flatMap(Stream.scala:486)
>>> at scala.collection.immutable.Stream.$anonfun$flatMap$1(Stream.scala:494)
>>> at scala.collection.immutable.Stream.$anonfun$append$1(Stream.scala:252)
>>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1169)
>>> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1159)
>>> at scala.collection.immutable.Stream.length(Stream.scala:309)
>>> at scala.collection.SeqLike.size(SeqLike.scala:105)
>>> at scala.collection.SeqLike.size$(SeqLike.scala:105)
>>> at scala.collection.AbstractSeq.size(Seq.scala:41)
>>> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:285)
>>> at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:283)
>>> at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
>>> at x.persistence.phoenix.PhoenixDao.$anonfun$batchInsert$1(PhoenixDao.scala:45)
>>> at scala.util.Try$.apply(Try.scala:209)
>>> at x.persistence.phoenix.PhoenixDao.batchInsert(PhoenixDao.scala:45)
>>> at x.persistence.phoenix.PhoenixDao.$anonfun$insert$2(PhoenixDao.scala:35)
>>> at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:655)
>>> at scala.util.Success.$anonfun$map$1(Try.scala:251)
>>> at scala.util.Success.map(Try.scala:209)
>>> at scala.concurrent.Future.$anonfun$map$1(Future.scala:289)
>>> at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
>>> at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
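P.S. Regarding the observation above that the lock clears roughly 30 minutes after the index region split: as far as I know the default for phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs is 1800000 ms (30 minutes), which would match that window. If you want to experiment with a shorter TTL, a sketch for the region servers' hbase-site.xml (the 5-minute value below is only an illustration, not a tested recommendation):

  <property>
    <!-- assumed example value: 5 minutes instead of the 30-minute default -->
    <name>phoenix.coprocessor.maxMetaDataCacheTimeToLiveMs</name>
    <value>300000</value>
  </property>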