By no means am I judging Phoenix based on this. This is simply a design
trade-off (ScyllaDB goes the same route and builds global indexes). I
appreciate all the effort that has gone into Phoenix, and it was indeed
a life saver. But the technical point remains that single-node failures
have the potential to cascade to the entire cluster. That's the nature
of global indexes, not specific to Phoenix.
I apologize if my response came off as dismissing Phoenix altogether.
FWIW, I'm a big advocate of Phoenix at my org internally, albeit on a
newer version.
On Fri, Nov 2, 2018, 4:09 PM Josh Elser <els...@apache.org> wrote:
I would strongly disagree with the assertion that this is some
unavoidable problem. Yes, an inverted index is a data structure which,
by design, creates a hotspot (phrased another way, this is "data
locality").
Lots of extremely smart individuals have spent a significant amount of
time and effort in stabilizing secondary indexes in the past 1-2 years,
not to mention others spending time on a local index implementation.
Judging Phoenix in its entirety based off of an arbitrarily old version
of Phoenix is disingenuous.
On 11/2/18 2:00 PM, Neelesh wrote:
> I think this is an unavoidable problem in some sense, if global indexes
> are used. Essentially, global indexes create a graph of dependent
> region servers due to index RPC calls from one RS to another. Any
> single failure is bound to affect the entire graph, which under
> reasonable load becomes the entire HBase cluster. We had to drop global
> indexes just to keep the cluster running for more than a few days.
>
> I think Cassandra has local secondary indexes precisely because of this
> issue. Last I checked, there were significant pending improvements
> required for Phoenix local indexes, especially around read paths (not
> utilizing primary key prefixes in secondary index reads where possible,
> for example).
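>
> For illustration, the difference is only in the DDL; a minimal sketch
> in Phoenix SQL (table and column names here are hypothetical):
>
>     -- Global index: the index lives in its own HBase table, so data
>     -- writes fan out as RPCs to whichever region servers host the
>     -- index regions.
>     CREATE INDEX my_idx ON my_table (some_col);
>
>     -- Local index: index data is co-located with the data regions, so
>     -- index maintenance stays on the same region server.
>     CREATE LOCAL INDEX my_local_idx ON my_table (some_col);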
>
>
> On Thu, Sep 13, 2018, 8:12 PM Jonathan Leech <jonat...@gmail.com> wrote:
>
> This seems similar to a failure scenario I’ve seen a couple times. I
> believe after multiple restarts you got lucky and tables were brought
> up by HBase in the correct order.
>
> What happens is some kind of semi-catastrophic failure where 1 or more
> region servers go down with edits that weren’t flushed and are only in
> the WAL. These edits belong to regions whose tables have secondary
> indexes. HBase wants to replay the WAL before bringing up the region
> server. Phoenix wants to talk to the index region during this, but
> can’t. It fails enough times, then stops.
>
> The more region servers / tables / indexes affected, the more likely
> that a full restart will get stuck in a classic deadlock. A good
> old-fashioned data center outage is a great way to get started with
> this kind of problem. You might make some progress and get stuck again,
> or restart number N might get those index regions initialized before
> the main table.
>
> The sure-fire way to recover a cluster in this condition is to
> strategically disable all the tables that are failing to come up. You
> can do this from the HBase shell as long as the master is running. If I
> remember right, it’s a pain since the disable command will hang. You
> might need to disable a table, kill the shell, disable the next table,
> etc. Then restart. You’ll eventually have a cluster with all the region
> servers finally started, and a bunch of disabled tables. If you
> disabled index tables, enable one, wait for it to become available,
> e.g. its WAL edits will be replayed, then enable the associated main
> table and wait for it to come online. If HBase did its job without
> error, and your failure didn’t include losing 4 disks at once, order
> will be restored. Lather, rinse, repeat until everything is enabled and
> online.
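>
> A rough sketch of that sequence in the HBase shell (table names here
> are just examples, borrowed from the logs below):
>
>     # While the master is up, disable the tables stuck in opening:
>     disable 'KM_IDX1'
>     disable 'KM'
>     # If a disable hangs, kill the shell and relaunch it between
>     # commands, then do the full restart.
>
>     # After the restart, bring index tables up first, then the data
>     # table that depends on them:
>     enable 'KM_IDX1'
>     is_enabled 'KM_IDX1'
>     enable 'KM'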
>
> <TLDR> A big enough failure sprinkled with a little bit of bad luck
> and what seems to be a Phoenix flaw == deadlock trying to get HBase to
> start up. Fix by forcing the order in which HBase brings regions
> online. Finally, never go full restart. </TLDR>
>
> > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >
> > After the update, the Master web interface shows that every region
> > server is now on 1.4.7 and there are no RITs.
> >
> > The cluster recovered only after we restarted all region servers 4
> > times...
> >
> >> On 11 Sep 2018, at 04:08, Josh Elser <els...@apache.org> wrote:
> >>
> >> Did you update the HBase jars on all RegionServers?
> >>
> >> Make sure that you have all of the Regions assigned (no RITs).
> >> There could be a pretty simple explanation as to why the index can't
> >> be written to.
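> >>
> >> (Besides the Master UI, one way to check for unassigned regions from
> >> the command line is HBase's consistency checker:
> >>
> >>     hbase hbck
> >>
> >> which reports unassigned regions and other inconsistencies.)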
> >>
> >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
> >>> Correct me if I'm wrong.
> >>> But it looks like if you have region servers A and B, each hosting
> >>> both index and primary table regions, then a situation like this is
> >>> possible:
> >>> A and B are under writes on a table with indexes
> >>> A crashes
> >>> B fails on an index update because A is not operating, then B
> >>> starts aborting
> >>> A, after restart, tries to rebuild the index from the WAL, but B is
> >>> aborting at this time, so A starts aborting too
> >>> From this moment nothing happens (0 requests to region servers),
> >>> and A and B show as unresponsive in the Master-status web interface
> >>>> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>>
> >>>> After the update we still can't recover the HBase cluster. Our
> >>>> region servers keep ABORTING over and over:
> >>>>
> >>>> prod003:
> >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536450772841: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] regionserver.HRegionServer: ABORTING region server prod003,60020,1536450772841: Could not update the index table, killing server region because couldn't write to an index table
> >>>>
> >>>> prod004:
> >>>> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: Could not update the index table, killing server region because couldn't write to an index table
> >>>>
> >>>> prod005:
> >>>> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] regionserver.HRegionServer: ABORTING region server prod005,60020,1536446400009: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] regionserver.HRegionServer: ABORTING region server prod005,60020,1536446400009: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] regionserver.HRegionServer: ABORTING region server prod005,60020,1536446400009: Could not update the index table, killing server region because couldn't write to an index table
> >>>> Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL [RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020] regionserver.HRegionServer: ABORTING region server prod005,60020,1536446400009: Could not update the index table, killing server region because couldn't write to an index table
> >>>>
> >>>> And so on...
> >>>>
> >>>> Trace is the same everywhere:
> >>>>
> >>>> Sep 09 02:54:30 prod005 hbase[3772]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException: disableIndexOnFailure=true, Failed to write to multiple index tables: [KM_IDX1, KM_IDX2, KM_HISTORY_IDX1, KM_HISTORY_IDX2, KM_HISTORY_IDX3]
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:241)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1068)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> >>>> Sep 09 02:54:30 prod005 hbase[3772]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> >>>>
> >>>>> On 9 Sep 2018, at 01:44, Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>>>
> >>>>> Thank you.
> >>>>> We're updating our cluster right now...
> >>>>>
> >>>>>
> >>>>>> On 9 Sep 2018, at 01:39, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>
> >>>>>> It seems you should deploy HBase with the following fix:
> >>>>>>
> >>>>>> HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash
> >>>>>>
> >>>>>> 1.4.7 was recently released.
> >>>>>>
> >>>>>> FYI
> >>>>>>
> >>>>>> On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> We got this exception from the *prod006* server:
> >>>>>>
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server prod006,60020,1536235102833: Replay of WAL required. Forcing server shutdown
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]: org.apache.hadoop.hbase.DroppedSnapshotException: region: KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at java.lang.Thread.run(Thread.java:748)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]: Caused by: java.lang.NullPointerException
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at java.util.ArrayList.<init>(ArrayList.java:178)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]:         ... 9 more
> >>>>>> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL [MemStoreFlusher.1] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.c
> >>>>>>
> >>>>>> After that we got ABORTING on almost every Region Server in the
> >>>>>> cluster, with different reasons:
> >>>>>>
> >>>>>> *prod003*
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09 01:12:11,799 FATAL [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f] regionserver.HRegionServer: ABORTING region server prod003,60020,1536444066291: Exception running postOpenDeployTasks; region=88bfac1dfd807c4cd1e9c1f31b4f053f
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]: java.io.InterruptedIOException: #139, interrupted. currentNumberOfTask=8
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1484)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1031)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1033)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1023)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.MetaTableAccessor.updateLocation(MetaTableAccessor.java:1433)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.MetaTableAccessor.updateRegionLocation(MetaTableAccessor.java:1400)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:2041)
> >>>>>> Sep 09 01:12:11 prod003 hbase[11552]:         at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:329)
> >>>>>>
> >>>>>> *prod002*
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]: 2018-09-09 01:12:30,144 FATAL [RpcServer.default.FPBQ.Fifo.handler=36,queue=6,port=60020] regionserver.HRegionServer: ABORTING region server prod002,60020,1536235138673: Could not update the index table, killing server region because couldn't write to an index table
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]: org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException: disableIndexOnFailure=true, Failed to write to multiple index tables: [KM_IDX1, KM_IDX2, KM_HISTORY1, KM_HISTORY2,
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatch(UngroupedAggregateRegionObserver.java:271)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.access$000(UngroupedAggregateRegionObserver.java:164)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver$1.doMutation(UngroupedAggregateRegionObserver.java:246)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.index.PhoenixIndexFailurePolicy.doBatchWithRetries(PhoenixIndexFailurePolicy.java:455)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.handleIndexWriteException(UngroupedAggregateRegionObserver.java:929)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.commitBatchWithRetries(UngroupedAggregateRegionObserver.java:243)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.rebuildIndices(UngroupedAggregateRegionObserver.java:1077)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver.doPostScannerOpen(UngroupedAggregateRegionObserver.java:386)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.overrideDelegate(BaseScannerRegionObserver.java:239)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:287)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2843)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3080)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36613)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2354)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> >>>>>> Sep 09 01:12:30 prod002 hbase[29056]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> >>>>>>
> >>>>>>
> >>>>>> And so on...
> >>>>>>
> >>>>>> The Master-status web interface shows that contact has been
> >>>>>> lost with these aborted servers.
> >>>>>
> >>>>
> >
>