Re: ABORTING region server and following HBase cluster "crash"

2018-11-05 Thread Josh Elser
Thanks, Neelesh. It came off to me like "Phoenix is no good, Cassandra 
has something that works better".


I appreciate you taking the time to clarify! That really means a lot.

On 11/2/18 8:14 PM, Neelesh wrote:
By no means am I judging Phoenix based on this. This is simply a design
trade-off (ScyllaDB goes the same route and builds global indexes). I
appreciate all the effort that has gone into Phoenix, and it was indeed
a life saver. But the technical point remains that single-node failures
have the potential to cascade to the entire cluster. That's the nature of
global indexes, not specific to Phoenix.

I apologize if my response came off as dismissing Phoenix altogether.
FWIW, I'm a big advocate of Phoenix at my org internally, albeit for the
newer version.



Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Vincent Poon
Indexes in Phoenix should not, in theory, cause any cluster outage.  An index
write failure should just disable the index, not cause a crash.
In practice, there have been some bugs around race conditions, the most
dangerous of which accidentally trigger a KillServerOnFailurePolicy, which
then potentially cascades.
That policy is there for legacy reasons, I believe because at the time that
was the only way to keep indexes consistent: kill the RS and replay from the
WAL.
There is now a partial rebuilder which detects when an index has been
disabled due to a write failure, and asynchronously attempts to rebuild the
index.  Killing the RS is supposed to be a last-ditch effort, used only if the
index could not be disabled (because otherwise, your index is out of sync
but still active and your queries will return incorrect results).
PHOENIX-4977 has made the policy configurable.  If you would rather, in the
worst case, have your index potentially get out of sync instead of killing
RSs, you can set that to LeaveIndexActiveFailurePolicy.
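
The handling described above can be summarized as a small decision table. The
following Python sketch is a toy model, not Phoenix code: the function name,
the Outcome fields, and the assumption that WAL replay restores consistency
after a kill are illustrative. Only the two policy names come from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    server_alive: bool   # does the region server keep running?
    index_state: str     # "DISABLED" or "ACTIVE"
    index_in_sync: bool  # can queries trust the index right now?


def on_index_write_failure(index_could_be_disabled: bool, fallback_policy: str) -> Outcome:
    """Toy model of what follows a failed index write (hypothetical helper)."""
    if index_could_be_disabled:
        # Usual path: the index is disabled and rebuilt asynchronously later.
        return Outcome(server_alive=True, index_state="DISABLED", index_in_sync=False)
    if fallback_policy == "KillServerOnFailurePolicy":
        # Last resort: abort the RS. Assumption: WAL replay restores consistency.
        return Outcome(server_alive=False, index_state="ACTIVE", index_in_sync=True)
    if fallback_policy == "LeaveIndexActiveFailurePolicy":
        # The server survives, but the still-active index may be out of sync.
        return Outcome(server_alive=True, index_state="ACTIVE", index_in_sync=False)
    raise ValueError(f"unknown policy: {fallback_policy}")
```

The trade-off is visible in the last two branches: one keeps the index
trustworthy at the cost of a region server, the other keeps the server at the
cost of possibly incorrect query results.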

On Fri, Nov 2, 2018 at 5:14 PM Neelesh  wrote:

> By no means am I judging Phoenix based on this. This is simply a design
> trade-off (ScyllaDB goes the same route and builds global indexes). I
> appreciate all the effort that has gone into Phoenix, and it was indeed a
> life saver. But the technical point remains that single-node failures have
> the potential to cascade to the entire cluster. That's the nature of global
> indexes, not specific to Phoenix.
>
> I apologize if my response came off as dismissing Phoenix altogether.
> FWIW, I'm a big advocate of Phoenix at my org internally, albeit for the
> newer version.

Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Neelesh
By no means am I judging Phoenix based on this. This is simply a design
trade-off (ScyllaDB goes the same route and builds global indexes). I
appreciate all the effort that has gone into Phoenix, and it was indeed a
life saver. But the technical point remains that single-node failures have
the potential to cascade to the entire cluster. That's the nature of global
indexes, not specific to Phoenix.

I apologize if my response came off as dismissing Phoenix altogether. FWIW,
I'm a big advocate of Phoenix at my org internally, albeit for the newer
version.


On Fri, Nov 2, 2018, 4:09 PM Josh Elser  wrote:

> I would strongly disagree with the assertion that this is some
> unavoidable problem. Yes, an inverted index is a data structure which,
> by design, creates a hotspot (phrased another way, this is "data
> locality").
>
> Lots of extremely smart individuals have spent a significant amount of
> time and effort in stabilizing secondary indexes in the past 1-2 years,
> not to mention others spending time on a local index implementation.
> Judging Phoenix in its entirety based off of an arbitrarily old version
> of Phoenix is disingenuous.
Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Josh Elser
I would strongly disagree with the assertion that this is some 
unavoidable problem. Yes, an inverted index is a data structure which, 
by design, creates a hotspot (phrased another way, this is "data locality").


Lots of extremely smart individuals have spent a significant amount of 
time and effort in stabilizing secondary indexes in the past 1-2 years, 
not to mention others spending time on a local index implementation. 
Judging Phoenix in its entirety based off of an arbitrarily old version 
of Phoenix is disingenuous.


On 11/2/18 2:00 PM, Neelesh wrote:
I think this is an unavoidable problem in some sense, if global indexes
are used. Essentially, global indexes create a graph of dependent region
servers due to index RPC calls from one RS to another. Any single
failure is bound to affect the entire graph, which under reasonable load
becomes the entire HBase cluster. We had to drop global indexes just to
keep the cluster running for more than a few days.

I think Cassandra has local secondary indexes precisely because of this
issue. Last I checked, there were significant pending improvements
required for Phoenix local indexes, especially around read paths (not
utilizing primary key prefixes in secondary index reads where possible,
for example).




Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Neelesh
I think this is an unavoidable problem in some sense, if global indexes are
used. Essentially, global indexes create a graph of dependent region
servers due to index RPC calls from one RS to another. Any single failure
is bound to affect the entire graph, which under reasonable load becomes
the entire HBase cluster. We had to drop global indexes just to keep the
cluster running for more than a few days.

I think Cassandra has local secondary indexes precisely because of this
issue. Last I checked, there were significant pending improvements required
for Phoenix local indexes, especially around read paths (not utilizing
primary key prefixes in secondary index reads where possible, for example).
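
The dependency-graph argument above can be illustrated with a toy model. This
is a minimal Python sketch under a strong assumption: a region server aborts
whenever any server it sends index updates to is down (i.e. a
kill-on-failure behavior). The server names and the `cascade` helper are
hypothetical, and real clusters have retries and other policies that this
model ignores.

```python
from collections import deque
from typing import Dict, List, Set


def cascade(index_writes: Dict[str, Set[str]], first_failure: str) -> List[str]:
    """Return servers in the order they go down (breadth-first).

    index_writes[a] is the set of servers that `a` sends index updates to.
    """
    # Invert the graph: who depends on each server being up?
    dependents: Dict[str, Set[str]] = {}
    for source, targets in index_writes.items():
        for target in targets:
            dependents.setdefault(target, set()).add(source)

    down = [first_failure]
    seen = {first_failure}
    queue = deque([first_failure])
    while queue:
        failed = queue.popleft()
        # Every server writing index updates to a failed server aborts too.
        for server in sorted(dependents.get(failed, ())):
            if server not in seen:
                seen.add(server)
                down.append(server)
                queue.append(server)
    return down


if __name__ == "__main__":
    writes = {"rs1": {"rs2"}, "rs2": {"rs3"}, "rs3": {"rs1"}, "rs4": set()}
    print(cascade(writes, "rs2"))
```

In this model only servers with no index-write path to a failed server (rs4 in
the example) stay up, which is the sense in which the graph "becomes the
entire cluster" once most servers host both data and index regions.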


On Thu, Sep 13, 2018, 8:12 PM Jonathan Leech  wrote:

> This seems similar to a failure scenario I’ve seen a couple of times. I
> believe after multiple restarts you got lucky and tables were brought up by
> HBase in the correct order.
>
> What happens is some kind of semi-catastrophic failure where 1 or more
> region servers go down with edits that weren’t flushed, and are only in the
> WAL. These edits belong to regions whose tables have secondary indexes.
> HBase wants to replay the WAL before bringing up the region server. Phoenix
> wants to talk to the index region during this, but can’t. It fails enough
> times, then stops.
>
> The more region servers / tables / indexes affected, the more likely that
> a full restart will get stuck in a classic deadlock. A good old-fashioned
> data center outage is a great way to get started with this kind of problem.
> You might make some progress and get stuck again, or restart number N might
> get those index regions initialized before the main table.
>
> The surefire way to recover a cluster in this condition is to
> strategically disable all the tables that are failing to come up. You can
> do this from the HBase shell as long as the master is running. If I
> remember right, it’s a pain since the disable command will hang. You might
> need to disable a table, kill the shell, disable the next table, etc. Then
> restart. You’ll eventually have a cluster with all the region servers
> finally started, and a bunch of disabled regions. If you disabled index
> tables, enable one and wait for it to become available (i.e. its WAL edits
> will be replayed), then enable the associated main table and wait for it to
> come online. If HBase did its job without error, and your failure didn’t
> include losing 4 disks at once, order will be restored. Lather, rinse,
> repeat until everything is enabled and online.
>
> A big enough failure sprinkled with a little bit of bad luck and
> what seems to be a Phoenix flaw == deadlock trying to get HBase to start
> up. Fix by forcing the order that HBase brings regions online. Finally,
> never go full restart.
>
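
The re-enable order in the quoted recovery procedure (index tables first, then
the associated main table) can be sketched as follows. This is a minimal Python
sketch: the table mapping, the `KM` main-table name, and both helper names are
hypothetical (only `KM_IDX1` appears in the logs of this thread). `enable` and
`is_enabled` are HBase shell commands; here they are only produced as text, to
be run one at a time while waiting for each table to come online.

```python
from typing import Dict, List


def enable_order(indexes_by_table: Dict[str, List[str]]) -> List[str]:
    """For each main table, list its index tables first, then the table itself."""
    order: List[str] = []
    for table in sorted(indexes_by_table):
        order.extend(sorted(indexes_by_table[table]))
        order.append(table)
    return order


def shell_commands(order: List[str]) -> List[str]:
    """Render HBase shell lines: enable a table, then check that it is enabled."""
    commands: List[str] = []
    for name in order:
        commands.append(f"enable '{name}'")
        commands.append(f"is_enabled '{name}'")
    return commands


if __name__ == "__main__":
    for line in shell_commands(enable_order({"KM": ["KM_IDX1"]})):
        print(line)
```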
> > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com>
> wrote:
> >
> > After the update, the Master web interface shows that every region server
> is now on 1.4.7 and there are no RITs.
> >
> > The cluster recovered only after we restarted all region servers 4 times...
> >
> >> On 11 Sep 2018, at 04:08, Josh Elser  wrote:
> >>
> >> Did you update the HBase jars on all RegionServers?
> >>
> >> Make sure that you have all of the Regions assigned (no RITs). There
> could be a pretty simple explanation as to why the index can't be written
> to.
> >>
> >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
> >>> Correct me if I'm wrong.
> >>> But it looks like if region servers A and B both host index and
> primary table regions, then a situation like this is possible:
> >>> A and B are under writes on a table with indexes
> >>> A crashes
> >>> B fails on an index update because A is not operating, then B starts
> aborting
> >>> A, after restart, tries to rebuild the index from the WAL, but B is
> aborting at this time, so A starts aborting too
> >>> From this moment nothing happens (0 requests to region servers) and A
> and B are not responsive in the Master-status web interface
>  On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com
> > wrote:
> 
>  After update we still can't recover HBase cluster. Our region servers
> ABORTING over and over:
> 
>  prod003:
>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020]
> regionserver.HRegionServer: ABORTING region server
> prod003,60020,1536446665703: Could not update the index table, killing
> server region because couldn't write to an index table
>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020]
> regionserver.HRegionServer: ABORTING region server
> prod003,60020,1536446665703: Could not update the index table, killing
> server region because couldn't write to an index table
>  Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020]
> regionserver.HReg

Re: ABORTING region server and following HBase cluster "crash"

2018-10-02 Thread Batyrshin Alexander
Still observing chaining of region server restarts.

Our Phoenix version is 4.14-HBase-1.4 at commit 
https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8
 




At prod022 got this:

Oct 02 03:24:03 prod022 hbase[160534]: 2018-10-02 03:24:03,678 WARN  
[hconnection-0x4a616d85-shared--pool8-t10050] client.AsyncProcess: #21, 
table=KM_IDX1, attempt=1/1 failed=2ops, last exception: 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region 
KM_IDX1,\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1537400041091.9fdc7d07edce09b08b8d2750b24961b8.
 is not online on prod015,60020,1538417657739
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3081)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2365)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36621)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2359)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Oct 02 03:24:03 prod022 hbase[160534]:  on prod015,60020,1538417657739, 
tracking started Tue Oct 02 03:24:03 MSK 2018; not retrying 2 - final failure
Oct 02 03:24:03 prod022 hbase[160534]: 2018-10-02 03:24:03,695 INFO  
[RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
index.PhoenixIndexFailurePolicy: Successfully update INDEX_DISABLE_TIMESTAMP 
for KM_IDX1 due to an exception while writing updates. 
indexState=PENDING_DISABLE
Oct 02 03:24:03 prod022 hbase[160534]: 
org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  
disableIndexOnFailure=true, Failed to write to multiple index tables: [KM_IDX1]
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:236)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:195)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:156)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:145)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:620)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:595)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:578)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1048)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1711)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1789)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1745)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1044)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3646)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3108)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:3050)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:916)
Oct 02 03:24:03 prod022 hbase[160534]: at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:844)
Oct 02 03:24:03 prod022 hbase[160534]: at 
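
Log excerpts like the ones in this thread can be triaged mechanically. The
following is a minimal Python sketch, assuming each event starts on a single
journald-style line in the format seen above; the regular expression and the
function name are illustrative, and the sample lines are re-joined from the
wrapped excerpts in this thread (the WARN message is shortened).

```python
import re
from collections import Counter
from typing import Iterable

# journald prefix, then the HBase timestamp, level, [thread], logger, message.
LINE = re.compile(
    r"^\w{3} \d{2} [\d:]{8} (?P<host>\S+) hbase\[\d+\]: "
    r"(?P<ts>\d{4}-\d{2}-\d{2} [\d:]{8},\d{3}) (?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>[^\]]+)\] (?P<logger>\S+): (?P<msg>.*)$"
)


def aborts_per_host(lines: Iterable[str]) -> Counter:
    """Count FATAL 'ABORTING region server' events per host."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # stack-trace continuation lines and noise are skipped
        if match.group("level") == "FATAL" and match.group("msg").startswith(
            "ABORTING region server"
        ):
            counts[match.group("host")] += 1
    return counts


SAMPLE = [
    "Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL "
    "[RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] "
    "regionserver.HRegionServer: ABORTING region server "
    "prod003,60020,1536446665703: Could not update the index table, killing "
    "server region because couldn't write to an index table",
    "Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL "
    "[RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] "
    "regionserver.HRegionServer: ABORTING region server "
    "prod003,60020,1536446665703: Could not update the index table, killing "
    "server region because couldn't write to an index table",
    "Oct 02 03:24:03 prod022 hbase[160534]: 2018-10-02 03:24:03,678 WARN  "
    "[hconnection-0x4a616d85-shared--pool8-t10050] client.AsyncProcess: #21, "
    "table=KM_IDX1, attempt=1/1 failed=2ops",
]

if __name__ == "__main__":
    print(aborts_per_host(SAMPLE))
```

Note that several handler threads can log the same abort, so the count is an
upper bound on distinct abort events per host.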

Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Sergey Soldatov
Obviously yes.  If it's not configured, then the default handlers would be
used for index writes, which may lead to a distributed deadlock.

Thanks,
Sergey
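
A quick way to check for the setting discussed in this message is to look for
it in hbase-site.xml. The following is a minimal Python sketch: the property
name and class are the ones quoted in this thread, while the function names
and the inline sample file are illustrative. Whether this setting alone
resolves the aborts is not established here.

```python
import xml.etree.ElementTree as ET
from typing import Optional

PROPERTY = "hbase.region.server.rpc.scheduler.factory.class"
EXPECTED = "org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory"


def scheduler_factory(xml_text: str) -> Optional[str]:
    """Return the configured scheduler factory class, or None if it is unset."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY:
            return (prop.findtext("value") or "").strip()
    return None


def is_configured(xml_text: str) -> bool:
    """True when the property is present and points at the Phoenix factory."""
    return scheduler_factory(xml_text) == EXPECTED


# Illustrative hbase-site.xml content, not taken from any real cluster.
SAMPLE = """<configuration>
  <property>
    <name>hbase.region.server.rpc.scheduler.factory.class</name>
    <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(is_configured(SAMPLE))
```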

On Sat, Sep 15, 2018 at 11:36 AM Batyrshin Alexander <0x62...@gmail.com>
wrote:

> I've found that we still have not configured this:
>
> hbase.region.server.rpc.scheduler.factory.class
> = org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory
>
> Can this misconfiguration lead to our problems?
>
> On 15 Sep 2018, at 02:04, Sergey Soldatov 
> wrote:
>
> That was a real problem quite a long time ago (a couple of years?). I can't
> say for sure in which version it was fixed, but indexes now have priority
> over regular tables and their regions open first. So by the time we
> replay WALs for tables, all index regions are supposed to be online. If you
> see the problem on recent versions, that usually means that the cluster is
> not healthy and some of the index regions are stuck in RIT state.
>
> Thanks,
> Sergey
>

Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Batyrshin Alexander
I've found that we still have not configured this:

hbase.region.server.rpc.scheduler.factory.class = 
org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory

Could this misconfiguration have led to our problems?
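
One way to check for this setting is to scan hbase-site.xml on each RegionServer. The following is a minimal sketch, assuming the usual Hadoop-style `<configuration>`/`<property>` layout. The function names and the sample XML are hypothetical; only the property name and value are taken from the message above.

```python
import xml.etree.ElementTree as ET

# Property name and value exactly as quoted in this thread.
SCHEDULER_KEY = "hbase.region.server.rpc.scheduler.factory.class"
SCHEDULER_VALUE = "org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory"


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def scheduler_configured(xml_text):
    """True if the Phoenix RPC scheduler factory is set to the expected class."""
    return read_properties(xml_text).get(SCHEDULER_KEY) == SCHEDULER_VALUE


# Hypothetical file contents, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>hbase.region.server.rpc.scheduler.factory.class</name>
    <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(scheduler_configured(SAMPLE))
```

In practice the text would come from reading the real hbase-site.xml on every RegionServer, since the setting only helps if it is present on all of them.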

> On 15 Sep 2018, at 02:04, Sergey Soldatov  wrote:
> 
> That was the real problem quite a long time ago (couple years?). Can't say 
> for sure in which version that was fixed, but now indexes has a priority over 
> regular tables and their regions open first. So by the moment when we replay 
> WALs for tables, all index regions are supposed to be online. If you see the 
> problem on recent versions that usually means that cluster is not healthy and 
> some of the index regions stuck in RiT state.
> 
> Thanks,
> Sergey
> 

Re: ABORTING region server and following HBase cluster "crash"

2018-09-14 Thread Sergey Soldatov
Forgot to mention: this kind of problem can be mitigated by increasing the
number of threads used for opening regions. The default is 3 (?), but we
haven't seen any problems with raising it to several hundred on clusters
that have up to 2k regions per RS.
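
As a rough sketch of what such a change looks like, the snippet below renders a Hadoop-style property element for hbase-site.xml. The property name `hbase.regionserver.executor.openregion.threads` is an assumption on my part (it is not named in this thread), and the value 100 is purely illustrative; check both against the documentation for your HBase version.

```python
from xml.sax.saxutils import escape

# Assumed property name; it is not given in the thread, so verify it
# against the documentation for your HBase version.
OPEN_REGION_THREADS_KEY = "hbase.regionserver.executor.openregion.threads"


def render_property(name, value):
    """Render one Hadoop-style <property> element as text."""
    return (
        "<property>\n"
        f"  <name>{escape(name)}</name>\n"
        f"  <value>{escape(str(value))}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # 100 is an illustrative value, not a recommendation from the thread.
    print(render_property(OPEN_REGION_THREADS_KEY, 100))
```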
Thanks,
Sergey

On Fri, Sep 14, 2018 at 4:04 PM Sergey Soldatov 
wrote:

> That was the real problem quite a long time ago (couple years?). Can't say
> for sure in which version that was fixed, but now indexes has a priority
> over regular tables and their regions open first. So by the moment when we
> replay WALs for tables, all index regions are supposed to be online. If you
> see the problem on recent versions that usually means that cluster is not
> healthy and some of the index regions stuck in RiT state.
>
> Thanks,
> Sergey
>

Re: ABORTING region server and following HBase cluster "crash"

2018-09-14 Thread Sergey Soldatov
That was a real problem quite a long time ago (a couple of years?). I can't
say for sure in which version it was fixed, but indexes now have priority
over regular tables and their regions open first. So by the time we replay
WALs for tables, all index regions are supposed to be online. If you see the
problem on recent versions, that usually means the cluster is not healthy
and some of the index regions are stuck in RIT state.

Thanks,
Sergey

On Thu, Sep 13, 2018 at 8:12 PM Jonathan Leech  wrote:

> This seems similar to a failure scenario I’ve seen a couple times. I
> believe after multiple restarts you got lucky and tables were brought up by
> Hbase in the correct order.
>
> What happens is some kind of semi-catastrophic failure where 1 or more
> region servers go down with edits that weren’t flushed, and are only in the
> WAL. These edits belong to regions whose tables have secondary indexes.
> Hbase wants to replay the WAL before bringing up the region server. Phoenix
> wants to talk to the index region during this, but can’t. It fails enough
> times then stops.
>
> The more region servers / tables / indexes affected, the more likely that
> a full restart will get stuck in a classic deadlock. A good old-fashioned
> data center outage is a great way to get started with this kind of problem.
> You might make some progress and get stuck again, or restart number N might
> get those index regions initialized before the main table.
>
> The sure fire way to recover a cluster in this condition is to
> strategically disable all the tables that are failing to come up. You can
> do this from the Hbase shell as long as the master is running. If I
> remember right, it’s a pain since the disable command will hang. You might
> need to disable a table, kill the shell, disable the next table, etc. Then
> restart. You’ll eventually have a cluster with all the region servers
> finally started, and a bunch of disabled regions. If you disabled index
> tables, enable one, wait for it to become available; eg its WAL edits will
> be replayed, then enable the associated main table and wait for it to come
> online. If Hbase did it’s job without error, and your failure didn’t
> include losing 4 disks at once, order will be restored. Lather, rinse,
> repeat until everything is enabled and online.
>
>  A big enough failure sprinkled with a little bit of bad luck and
> what seems to be a Phoenix flaw == deadlock trying to get HBASE to start
> up. Fix by forcing the order that Hbase brings regions online. Finally,
> never go full restart. 
>

Re: ABORTING region server and following HBase cluster "crash"

2018-09-13 Thread Jonathan Leech
This seems similar to a failure scenario I’ve seen a couple of times. I believe 
that after multiple restarts you got lucky and the tables were brought up by 
HBase in the correct order. 

What happens is some kind of semi-catastrophic failure where 1 or more region 
servers go down with edits that weren’t flushed and are only in the WAL. These 
edits belong to regions whose tables have secondary indexes. HBase wants to 
replay the WAL before bringing up the region server. Phoenix wants to talk to 
the index region during this, but can’t. It fails enough times, then stops. 

The more region servers / tables / indexes affected, the more likely that a 
full restart will get stuck in a classic deadlock. A good old-fashioned data 
center outage is a great way to get started with this kind of problem. You 
might make some progress and get stuck again, or restart number N might get 
those index regions initialized before the main table. 

The surefire way to recover a cluster in this condition is to strategically 
disable all the tables that are failing to come up. You can do this from the 
HBase shell as long as the master is running. If I remember right, it’s a pain 
since the disable command will hang. You might need to disable a table, kill 
the shell, disable the next table, etc. Then restart. You’ll eventually have a 
cluster with all the region servers finally started, and a bunch of disabled 
regions. If you disabled index tables, enable one and wait for it to become 
available (i.e., for its WAL edits to be replayed), then enable the associated 
main table and wait for it to come online. If HBase did its job without error, 
and your failure didn’t include losing 4 disks at once, order will be restored. 
Lather, rinse, repeat until everything is enabled and online. 

A big enough failure sprinkled with a little bit of bad luck and what 
seems to be a Phoenix flaw == deadlock trying to get HBase to start up. Fix by 
forcing the order in which HBase brings regions online. Finally, never go full 
restart. 
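
The ordering described above can be sketched as a small generator of HBase shell commands. This is only an illustration under stated assumptions: the table names and the `recovery_plan` helper are hypothetical, the mapping of data tables to their index tables has to be supplied by hand, and the waiting steps are left as comments because they depend on how you check region availability.

```python
def recovery_plan(index_tables_by_data_table):
    """Build HBase shell commands for the disable/enable ordering above.

    index_tables_by_data_table maps each data table name to a list of its
    index table names (hypothetical input; supply your own mapping).
    """
    commands = []
    # Step 1: disable every table that is failing to come up.
    for data_table, index_tables in index_tables_by_data_table.items():
        commands.append(f"disable '{data_table}'")
        for index_table in index_tables:
            commands.append(f"disable '{index_table}'")
    # Step 2: the restart itself happens outside the HBase shell.
    commands.append("# restart the region servers, then continue")
    # Step 3: for each data table, enable its indexes first, then the table.
    for data_table, index_tables in index_tables_by_data_table.items():
        for index_table in index_tables:
            commands.append(f"enable '{index_table}'")
            commands.append(f"# wait until '{index_table}' is online")
        commands.append(f"enable '{data_table}'")
        commands.append(f"# wait until '{data_table}' is online")
    return commands


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for line in recovery_plan({"TABLE_A": ["IDX_A1", "IDX_A2"]}):
        print(line)
```

Since the disable command may hang, as noted above, the generated commands may need to be run one at a time rather than as a single script.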

> On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> 
> After update web interface at Master show that every region server now 1.4.7 
> and no RITS.
> 
> Cluster recovered only when we restart all regions servers 4 times...
> 

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Batyrshin Alexander
After the update, the Master web interface shows that every region server is 
now on 1.4.7 and there are no RITs.

The cluster recovered only after we restarted all region servers 4 times...

> On 11 Sep 2018, at 04:08, Josh Elser  wrote:
> 
> Did you update the HBase jars on all RegionServers?
> 
> Make sure that you have all of the Regions assigned (no RITs). There could be 
> a pretty simple explanation as to why the index can't be written to.
> 

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Josh Elser

Did you update the HBase jars on all RegionServers?

Make sure that you have all of the Regions assigned (no RITs). There 
could be a pretty simple explanation as to why the index can't be 
written to.


On 9/9/18 3:46 PM, Batyrshin Alexander wrote:

Correct me if I'm wrong.

But it looks like, if region servers A and B both host index and 
primary table regions, a situation like this is possible:


A and B are under write load on a table with indexes.
A crashes.
B fails on an index update because A is not operating, so B starts aborting.
A, after restart, tries to rebuild the index from the WAL, but B is 
aborting at this time, so A starts aborting too.
From this moment nothing happens (0 requests to region servers), and A 
and B are not responsive in the Master status web interface.
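
The scenario above can be sketched as a toy model of abort propagation. This is a deliberate simplification and not how HBase or Phoenix is implemented: it assumes every server is under constant write load, that a server aborts as soon as any server it must send index updates to is down, and it ignores restarts. The server names and the `cascade` helper are hypothetical.

```python
def cascade(index_targets, first_failure):
    """Return the set of servers that end up aborted in this toy model.

    index_targets maps a server to the servers hosting the index regions
    it must write to. A server aborts when any of its targets is down.
    """
    down = {first_failure}
    changed = True
    while changed:
        changed = False
        for server, targets in index_targets.items():
            if server not in down and any(t in down for t in targets):
                down.add(server)
                changed = True
    return down


if __name__ == "__main__":
    # Hypothetical layout: B writes index updates to A and C, and so on.
    layout = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
    print(sorted(cascade(layout, "A")))
```

In this layout a single failure of A takes down B, and then C, which matches the observation that one crash can spread across every server connected through index writes.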



On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com 
> wrote:


After update we still can't recover HBase cluster. Our region servers 
ABORTING over and over:


prod003:
Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536446665703: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536446665703: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536446665703: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536446665703: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536450772841: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod003,60020,1536450772841: Could not update the index table, 
killing server region because couldn't write to an index table


prod004:
Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod004,60020,1536446387325: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod004,60020,1536446387325: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod004,60020,1536446387325: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod004,60020,1536446387325: Could not update the index table, 
killing server region because couldn't write to an index table


prod005:
Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod005,60020,153644649: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod005,60020,153644649: Could not update the index table, 
killing server region because couldn't write to an index table
Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region 
server prod005,60020,153644649: Could not update the index table, 
killing server region because couldn't write to an index table
S

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Jaanai Zhang
The root cause cannot be determined from the latest log information. The index
might have been corrupted, and it seems the region servers keep aborting
because of the index failure handling policy.


   Yun Zhang
   Best regards!



Batyrshin Alexander <0x62...@gmail.com> wrote on Mon, Sep 10, 2018 at 3:46 AM:

> Correct me if I'm wrong.
>
> But it looks like if region servers A and B each host both index and
> primary table regions, the following situation is possible:
>
> A and B are under writes on a table with indexes
> A crashes
> B fails an index update because A is not operating, so B starts
> aborting
> A, after restart, tries to rebuild the index from the WAL, but B is
> aborting at this time, so A starts aborting too
> From this moment nothing happens (0 requests to region servers) and A and
> B are not responsive in the Master-status web interface
>
>
> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
>
> After the update we still can't recover the HBase cluster. Our region
> servers keep ABORTING over and over:
>
> prod003:
> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536446665703: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536446665703: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536446665703: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536446665703: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536450772841: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536450772841: Could not update the index table,
> killing server region because couldn't write to an index table
>
> prod004:
> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod004,60020,1536446387325: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod004,60020,1536446387325: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod004,60020,1536446387325: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod004,60020,1536446387325: Could not update the index table,
> killing server region because couldn't write to an index table
>
> prod005:
> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod005,60020,153644649: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020]
> regionserver.HRegionServer: ABORTING region
> server prod005,60020,153644649: Could not update the index table,
> killing server region because couldn't write to an index table
> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL
> [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020]
> regionserver.HRegion

Re: ABORTING region server and following HBase cluster "crash"

2018-09-09 Thread Batyrshin Alexander
Correct me if I'm wrong.

But it looks like if region servers A and B each host both index and primary
table regions, the following situation is possible:

A and B are under writes on a table with indexes
A crashes
B fails an index update because A is not operating, so B starts aborting
A, after restart, tries to rebuild the index from the WAL, but B is aborting
at this time, so A starts aborting too
From this moment nothing happens (0 requests to region servers) and A and B
are not responsive in the Master-status web interface
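
The scenario above can be sketched as a toy model. This is only an illustration under stated assumptions, not how HBase or Phoenix is implemented: each server is assumed to have exactly one "index peer", a live server is assumed to abort as soon as that peer is down, and a restart is assumed to fail whenever the peer is still down. The function name `simulate` and the server names are hypothetical.

```python
def simulate(servers, first_failure, restart_attempts=2):
    """Toy model of the cascade described above.

    servers maps a server name to the peer that hosts its index regions
    (a simplifying assumption; real region placement is many-to-many).
    """
    up = {name: True for name in servers}
    log = []

    up[first_failure] = False
    log.append(f"{first_failure} crashes")

    # Live servers abort when the peer holding their index is down.
    # Repeat until a full pass causes no further aborts.
    changed = True
    while changed:
        changed = False
        for name, peer in servers.items():
            if up[name] and not up[peer]:
                up[name] = False
                log.append(f"{name} aborts: index write to {peer} failed")
                changed = True

    # Restart attempts: WAL replay needs to update the index on the peer,
    # so a restart only succeeds if the peer is up at that moment.
    for _ in range(restart_attempts):
        for name, peer in servers.items():
            if not up[name]:
                if up[peer]:
                    up[name] = True
                    log.append(f"{name} restarts")
                else:
                    log.append(
                        f"{name} restart aborts: WAL replay cannot update index on {peer}"
                    )
    return up, log


if __name__ == "__main__":
    state, events = simulate({"A": "B", "B": "A"}, first_failure="A")
    for event in events:
        print(event)
    print(state)
```

In this model, two servers that depend on each other never come back once one of them crashes, which is the deadlock described above.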


> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
> 
> After the update we still can't recover the HBase cluster. Our region servers
> keep ABORTING over and over:
> 
> prod003:
> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536446665703: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536446665703: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536446665703: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536446665703: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536450772841: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536450772841: Could not update the index table, killing server 
> region because couldn't write to an index table
> 
> prod004:
> Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod004,60020,1536446387325: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod004,60020,1536446387325: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod004,60020,1536446387325: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod004,60020,1536446387325: Could not update the index table, killing server 
> region because couldn't write to an index table
> 
> prod005:
> Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod005,60020,153644649: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod005,60020,153644649: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] 
> regionserver.HRegionServer: ABORTING region server 
> prod005,60020,153644649: Could not update the index table, killing server 
> region because couldn't write to an index table
> Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL 
> [RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020] 
> reg

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Batyrshin Alexander
After the update we still can't recover the HBase cluster. Our region servers
keep ABORTING over and over:

prod003:
Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:52:19 prod003 hbase[1440]: 2018-09-09 02:52:19,224 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=82,queue=2,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536446665703: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=95,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536450772841: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:55:18 prod003 hbase[957]: 2018-09-09 02:55:18,793 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=97,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region server prod003,60020,1536450772841: 
Could not update the index table, killing server region because couldn't write 
to an index table

prod004:
Sep 09 02:52:13 prod004 hbase[4890]: 2018-09-09 02:52:13,541 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=83,queue=3,port=60020] 
regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:52:50 prod004 hbase[4890]: 2018-09-09 02:52:50,264 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=75,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:53:40 prod004 hbase[4890]: 2018-09-09 02:53:40,709 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=66,queue=6,port=60020] 
regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:54:00 prod004 hbase[4890]: 2018-09-09 02:54:00,060 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=89,queue=9,port=60020] 
regionserver.HRegionServer: ABORTING region server prod004,60020,1536446387325: 
Could not update the index table, killing server region because couldn't write 
to an index table

prod005:
Sep 09 02:52:50 prod005 hbase[3772]: 2018-09-09 02:52:50,661 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=65,queue=5,port=60020] 
regionserver.HRegionServer: ABORTING region server prod005,60020,153644649: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:53:27 prod005 hbase[3772]: 2018-09-09 02:53:27,542 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=90,queue=0,port=60020] 
regionserver.HRegionServer: ABORTING region server prod005,60020,153644649: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:54:00 prod005 hbase[3772]: 2018-09-09 02:53:59,915 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=7,queue=7,port=60020] 
regionserver.HRegionServer: ABORTING region server prod005,60020,153644649: 
Could not update the index table, killing server region because couldn't write 
to an index table
Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL 
[RpcServer.default.FPBQ.Fifo.handler=16,queue=6,port=60020] 
regionserver.HRegionServer: ABORTING region server prod005,60020,153644649: 
Could not update the index table, killing server region because couldn't write 
to an index table

And so on...
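
For anyone who wants to summarize output like this, here is a minimal sketch that counts FATAL entries per host and process ID. It assumes the journald-style prefix seen in the excerpts above (`Sep 09 02:51:27 prod003 hbase[1440]: ... FATAL`) and that lines are not `>`-quoted. The names `FATAL_LINE` and `count_fatals` are made up for this example.

```python
import re
from collections import Counter

# Matches only the first line of each entry, e.g.
# "Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL"
FATAL_LINE = re.compile(
    r"^\w{3} \d{2} \d{2}:\d{2}:\d{2} (?P<host>\S+) hbase\[(?P<pid>\d+)\]: "
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} FATAL"
)


def count_fatals(lines):
    """Count FATAL log entries per (host, pid) pair."""
    counts = Counter()
    for line in lines:
        match = FATAL_LINE.match(line.strip())
        if match:
            counts[(match["host"], int(match["pid"]))] += 1
    return counts


# A few lines copied from the excerpts above.
sample = """\
Sep 09 02:52:28 prod003 hbase[1440]: 2018-09-09 02:52:28,922 FATAL
[RpcServer.default.FPBQ.Fifo.handler=94,queue=4,port=60020]
Sep 09 02:55:02 prod003 hbase[957]: 2018-09-09 02:55:02,096 FATAL
Sep 09 02:54:30 prod005 hbase[3772]: 2018-09-09 02:54:30,058 FATAL
""".splitlines()

if __name__ == "__main__":
    for (host, pid), n in sorted(count_fatals(sample).items()):
        print(host, pid, n)
```

A host that appears with more than one PID (prod003 with 1440 and then 957) was restarted between those entries and went back to aborting.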

Trace is the same everywhere:

Sep 09 02:54:30 prod005 hbase[3772]: 
org.apache.phoenix.hbase.index.exception.MultiIndexWriteFailureException:  
disableIndexOnFailure=true, Failed to write to multiple index tables: [KM_IDX1, 
KM_IDX2, KM_HISTORY_IDX1, KM_HISTORY_IDX2, KM_HISTORY_IDX3]
Sep 09 02:54:30 prod005 hbase[3772]: at 
org.apache.phoenix.hbase.index.write.TrackingParallelWriterIndexCommitter.write(TrackingParallelWriterIndexCommitter.java:235)
Sep 09 02:54:30 prod005 hbase[3772]: at 
org.apache.phoenix.hbase.index.write.In

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Batyrshin Alexander
Thank you.
We're updating our cluster right now...


> On 9 Sep 2018, at 01:39, Ted Yu wrote:
> 
> It seems you should deploy hbase with the following fix:
> 
> HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash
> 
> 1.4.7 was recently released.
> 
> FYI
> 
> On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com> wrote:
>  Hello,
> 
> We got this exception from the prod006 server
> 
> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL 
> [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server 
> prod006,60020,1536235102833: Replay of WAL required. Forcing server shutdown
> Sep 09 00:38:02 prod006 hbase[18907]: 
> org.apache.hadoop.hbase.DroppedSnapshotException: region: 
> KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> java.lang.Thread.run(Thread.java:748)
> Sep 09 00:38:02 prod006 hbase[18907]: Caused by: 
> java.lang.NullPointerException
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> java.util.ArrayList.<init>(ArrayList.java:178)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
> Sep 09 00:38:02 prod006 hbase[18907]: at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
> Sep 09 00:38:02 prod006 hbase[18907]: ... 9 more
> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL 
> [MemStoreFlusher.1] regionserver.HRegionServer: RegionServer abort: loaded 
> coprocessors are: 
> [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator, 
> org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.c
> 
> After that we got ABORTING on almost every region server in the cluster,
> for different reasons:
> 
> prod003
> Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09 01:12:11,799 FATAL 
> [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f] 
> regionserver.HRegionServer: ABORTING region server 
> prod003,60020,1536444066291: Exception running postOpenDeployTasks; 
> region=88bfac1dfd807c4cd1e9c1f31b4f053f
> Sep 09 01:12:11 prod003 hbase[11552]: java.io.InterruptedIOException: #139, 
> interrupted. currentNumberOfTask=8
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
> Sep 09 01:12:11 prod003 hbase[11552]: at 
> org.apache.hadoop.hba

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Ted Yu
It seems you should deploy hbase with the following fix:

HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash

1.4.7 was recently released.

FYI

On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com>
wrote:

>  Hello,
>
> We got this exception from the *prod006* server
>
> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL
> [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server
> prod006,60020,1536235102833: Replay of WAL required. Forcing server shutdown
> Sep 09 00:38:02 prod006 hbase[18907]:
> org.apache.hadoop.hbase.DroppedSnapshotException:
> region: 
> KM,c\xEF\xBF\xBD\x16I7\xEF\xBF\xBD\x0A"A\xEF\xBF\xBDd\xEF\xBF\xBD\xEF\xBF\xBD\x19\x07t,1536178245576.60c121ba50e67f2429b9ca2ba2a11bad.
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2322)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2284)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2170)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2095)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:508)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:478)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:76)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:264)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> java.lang.Thread.run(Thread.java:748)
> Sep 09 00:38:02 prod006 hbase[18907]: Caused by:
> java.lang.NullPointerException
> Sep 09 00:38:02 prod006 hbase[18907]: at
> java.util.ArrayList.<init>(ArrayList.java:178)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.StoreScanner.updateReaders(StoreScanner.java:863)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HStore.notifyChangedReadersObservers(HStore.java:1172)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HStore.updateStorefiles(HStore.java:1145)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HStore.access$900(HStore.java:122)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.commit(HStore.java:2505)
> Sep 09 00:38:02 prod006 hbase[18907]: at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2600)
> Sep 09 00:38:02 prod006 hbase[18907]: ... 9 more
> Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL
> [MemStoreFlusher.1] regionserver.HRegionServer: RegionServer abort: loaded
> coprocessors
> are: [org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver, org.apache.phoenix.c
>
> After that we got ABORTING on almost every region server in the cluster,
> for different reasons:
>
> *prod003*
> Sep 09 01:12:11 prod003 hbase[11552]: 2018-09-09 01:12:11,799 FATAL
> [PostOpenDeployTasks:88bfac1dfd807c4cd1e9c1f31b4f053f]
> regionserver.HRegionServer: ABORTING region
> server prod003,60020,1536444066291: Exception running postOpenDeployTasks;
> region=88bfac1dfd807c4cd1e9c1f31b4f053f
> Sep 09 01:12:11 prod003 hbase[11552]: java.io.InterruptedIOException:
> #139, interrupted. currentNumberOfTask=8
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1853)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.AsyncProcess.waitForMaximumCurrentTasks(AsyncProcess.java:1823)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1899)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:250)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:213)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1484)
> Sep 09 01:12:11 prod003 hbase[11552]: at
> org.apache.hadoop.hbase.client.HTable.put(HTable.java:1031)
> Sep 09 01:12:11 p