Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Sergey Soldatov
Obviously, yes. If it's not configured, the default handlers are used for index
writes, which may lead to a distributed deadlock.
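
For reference, a minimal hbase-site.xml sketch of what that would look like on the
region servers. The scheduler factory entry is the property you mention; the
controller factory entry is the companion setting commonly recommended alongside
it, so please verify the exact pairing against the Phoenix documentation for your
version:

<!-- Use Phoenix's RPC scheduler so index writes get dedicated handlers
     instead of the default handler pool -->
<property>
  <name>hbase.region.server.rpc.scheduler.factory.class</name>
  <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
</property>
<!-- Companion setting usually documented together with the scheduler factory -->
<property>
  <name>hbase.region.server.rpc.controllerfactory.class</name>
  <value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value>
</property>

Both entries go into hbase-site.xml on every region server and take effect only
after the region servers are restarted.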

Thanks,
Sergey

On Sat, Sep 15, 2018 at 11:36 AM Batyrshin Alexander <0x62...@gmail.com>
wrote:

> I've found that we still haven't configured this:
>
> hbase.region.server.rpc.scheduler.factory.class
> = org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory
>
> Can this misconfiguration lead to our problems?
>
> On 15 Sep 2018, at 02:04, Sergey Soldatov wrote:
>
> That was a real problem quite a long time ago (a couple of years?). I can't
> say for sure in which version it was fixed, but indexes now have priority
> over regular tables and their regions open first. So by the time we replay
> WALs for tables, all index regions are supposed to be online. If you see the
> problem on recent versions, it usually means that the cluster is not healthy
> and some of the index regions are stuck in the RIT state.
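
A quick way to check for that (a sketch assuming an HBase 1.x shell; the exact
output format differs between versions) is to ask for a detailed cluster status
and look at the regionsInTransition section, or to run hbck for a consistency
report; the Master web UI shows the same information:

$ hbase shell
hbase(main):001:0> status 'detailed'   # a non-zero regionsInTransition count means stuck regions
hbase(main):002:0> exit
$ hbase hbck                           # consistency report, including regions that are not deployed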
>
> Thanks,
> Sergey
>
> On Thu, Sep 13, 2018 at 8:12 PM Jonathan Leech  wrote:
>
>> This seems similar to a failure scenario I’ve seen a couple of times. I
>> believe after multiple restarts you got lucky and the tables were brought up
>> by HBase in the correct order.
>>
>> What happens is some kind of semi-catastrophic failure where one or more
>> region servers go down with edits that weren’t flushed and are only in the
>> WAL. These edits belong to regions whose tables have secondary indexes.
>> HBase wants to replay the WAL before bringing up the region server. Phoenix
>> wants to talk to the index region during this, but can’t. It fails enough
>> times, then stops.
>>
>> The more region servers / tables / indexes affected, the more likely that
>> a full restart will get stuck in a classic deadlock. A good old-fashioned
>> data center outage is a great way to get started with this kind of problem.
>> You might make some progress and get stuck again, or restart number N might
>> get those index regions initialized before the main table.
>>
>> The surefire way to recover a cluster in this condition is to
>> strategically disable all the tables that are failing to come up. You can
>> do this from the HBase shell as long as the master is running. If I
>> remember right, it’s a pain since the disable command will hang. You might
>> need to disable a table, kill the shell, disable the next table, etc. Then
>> restart. You’ll eventually have a cluster with all the region servers
>> finally started, and a bunch of disabled tables. If you disabled index
>> tables, enable one and wait for it to become available (i.e., its WAL edits
>> will be replayed), then enable the associated main table and wait for it to
>> come online. If HBase did its job without error, and your failure didn’t
>> include losing 4 disks at once, order will be restored. Lather, rinse,
>> repeat until everything is enabled and online.
>>
>> A big enough failure, sprinkled with a little bit of bad luck and what
>> seems to be a Phoenix flaw, == deadlock trying to get HBase to start up.
>> Fix by forcing the order in which HBase brings regions online. Finally,
>> never go full restart.
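
For concreteness, a rough sketch of that disable/enable sequence in the HBase
shell, with hypothetical table names (DATA_TABLE and its Phoenix index table
DATA_IDX); as described above, the disable calls may hang, in which case kill
the shell and continue with the next table:

$ hbase shell
hbase(main):001:0> disable 'DATA_TABLE'    # may hang; kill the shell and keep going
hbase(main):002:0> disable 'DATA_IDX'      # repeat for every table that fails to come up
# ... perform the full cluster restart, then re-enable, index tables first ...
hbase(main):001:0> enable 'DATA_IDX'       # its WAL edits are replayed as it comes online
hbase(main):002:0> is_enabled 'DATA_IDX'   # wait until this returns true
hbase(main):003:0> enable 'DATA_TABLE'     # then the data table
hbase(main):004:0> is_enabled 'DATA_TABLE'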
>>
>> > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com>
>> wrote:
>> >
>> > After the update, the Master web interface shows that every region server
>> > is now on 1.4.7 and there are no RITs.
>> >
>> > The cluster recovered only after we restarted all region servers 4 times...
>> >
>> >> On 11 Sep 2018, at 04:08, Josh Elser  wrote:
>> >>
>> >> Did you update the HBase jars on all RegionServers?
>> >>
>> >> Make sure that you have all of the Regions assigned (no RITs). There
>> could be a pretty simple explanation as to why the index can't be written
>> to.
>> >>
>> >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
>> >>> Correct me if I’m wrong, but it looks like if you have region servers A
>> >>> and B that host an index and its primary table, then a situation like
>> >>> this is possible:
>> >>> A and B are under writes on a table with indexes
>> >>> A crashes
>> >>> B fails on an index update because A is not operating, so B starts
>> >>> aborting
>> >>> After restart, A tries to rebuild the index from the WAL, but B is
>> >>> aborting at that moment, so A starts aborting too
>> >>> From this moment nothing happens (0 requests to region servers) and A
>> >>> and B are unresponsive in the Master status web interface
>> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
>> 
>> After the update we still can't recover the HBase cluster. Our region
>> servers keep ABORTING over and over:
>> 
>>  prod003:
>>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
>> [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020]
>> regionserver.HRegionServer: ABORTING region server
>> prod003,60020,1536446665703: Could not update the index table, killing
>> server region because couldn't write to an index table
>>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL
>> [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port

Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Batyrshin Alexander
I've found that we still haven't configured this:

hbase.region.server.rpc.scheduler.factory.class = 
org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory

Can this misconfiguration lead to our problems?

> On 15 Sep 2018, at 02:04, Sergey Soldatov  wrote:
> 
> That was a real problem quite a long time ago (a couple of years?). I can't say 
> for sure in which version it was fixed, but indexes now have priority over 
> regular tables and their regions open first. So by the time we replay WALs for 
> tables, all index regions are supposed to be online. If you see the problem on 
> recent versions, it usually means that the cluster is not healthy and some of 
> the index regions are stuck in the RIT state.
> 
> Thanks,
> Sergey
> 
> On Thu, Sep 13, 2018 at 8:12 PM Jonathan Leech wrote:
> This seems similar to a failure scenario I’ve seen a couple of times. I believe 
> after multiple restarts you got lucky and the tables were brought up by HBase in 
> the correct order.
> 
> What happens is some kind of semi-catastrophic failure where one or more region 
> servers go down with edits that weren’t flushed and are only in the WAL. These 
> edits belong to regions whose tables have secondary indexes. HBase wants to 
> replay the WAL before bringing up the region server. Phoenix wants to talk to 
> the index region during this, but can’t. It fails enough times, then stops.
> 
> The more region servers / tables / indexes affected, the more likely that a 
> full restart will get stuck in a classic deadlock. A good old-fashioned data 
> center outage is a great way to get started with this kind of problem. You 
> might make some progress and get stuck again, or restart number N might get 
> those index regions initialized before the main table. 
> 
> The surefire way to recover a cluster in this condition is to strategically 
> disable all the tables that are failing to come up. You can do this from the 
> HBase shell as long as the master is running. If I remember right, it’s a 
> pain since the disable command will hang. You might need to disable a table, 
> kill the shell, disable the next table, etc. Then restart. You’ll eventually 
> have a cluster with all the region servers finally started, and a bunch of 
> disabled tables. If you disabled index tables, enable one and wait for it to 
> become available (i.e., its WAL edits will be replayed), then enable the 
> associated main table and wait for it to come online. If HBase did its job 
> without error, and your failure didn’t include losing 4 disks at once, order 
> will be restored. Lather, rinse, repeat until everything is enabled and 
> online.
> 
> A big enough failure, sprinkled with a little bit of bad luck and what 
> seems to be a Phoenix flaw, == deadlock trying to get HBase to start up. Fix 
> by forcing the order in which HBase brings regions online. Finally, never go 
> full restart.
> 
> > On Sep 10, 2018, at 7:30 PM, Batyrshin Alexander <0x62...@gmail.com> wrote:
> > 
> > After the update, the Master web interface shows that every region server is 
> > now on 1.4.7 and there are no RITs.
> > 
> > The cluster recovered only after we restarted all region servers 4 times...
> > 
> >> On 11 Sep 2018, at 04:08, Josh Elser wrote:
> >> 
> >> Did you update the HBase jars on all RegionServers?
> >> 
> >> Make sure that you have all of the Regions assigned (no RITs). There could 
> >> be a pretty simple explanation as to why the index can't be written to.
> >> 
> >>> On 9/9/18 3:46 PM, Batyrshin Alexander wrote:
> >>> Correct me if I’m wrong, but it looks like if you have region servers A 
> >>> and B that host an index and its primary table, then a situation like 
> >>> this is possible:
> >>> A and B are under writes on a table with indexes
> >>> A crashes
> >>> B fails on an index update because A is not operating, so B starts aborting
> >>> After restart, A tries to rebuild the index from the WAL, but B is aborting 
> >>> at that moment, so A starts aborting too
> >>> From this moment nothing happens (0 requests to region servers) and A and 
> >>> B are unresponsive in the Master status web interface
> On 9 Sep 2018, at 04:38, Batyrshin Alexander <0x62...@gmail.com> wrote:
>  
> After the update we still can't recover the HBase cluster. Our region servers 
> keep ABORTING over and over:
>  
>  prod003:
>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
>  [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] 
>  regionserver.HRegionServer: ABORTING region server 
>  prod003,60020,1536446665703: Could not update the index table, killing 
>  server region because couldn't write to an index table
>  Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL 
>  [RpcServer.default.FPBQ.Fifo.handler=77,queue=7,port=60020] 
>  regionserver.HRegionServer: ABORTING r