Re: ABORTING region server and following HBase cluster "crash"

2018-11-05 Thread Josh Elser
Thanks, Neelesh. It came off to me like "Phoenix is no good, Cassandra has something that works better". I appreciate you taking the time to clarify! That really means a lot. On 11/2/18 8:14 PM, Neelesh wrote: By no means am I judging Phoenix based on this. This is simply a design trade-off

Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Vincent Poon
Indexes in Phoenix should not, in theory, cause any cluster outage. An index write failure should just disable the index, not cause a crash. In practice, there have been some bugs around race conditions, the most dangerous of which accidentally trigger a KillServerOnFailurePolicy, which then
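As a point of reference for the behavior Vincent describes, here is a minimal hbase-site.xml sketch of the index-failure handling knobs; the two property names are assumed from the Phoenix 4.x secondary-indexing documentation and should be verified against your release:

    <!-- hbase-site.xml on the region servers (assumed Phoenix 4.x property names):
         on an index write failure, disable the index and rebuild it in the
         background instead of escalating further -->
    <property>
      <name>phoenix.index.failure.handling.rebuild</name>
      <value>true</value>
    </property>
    <!-- do not block data-table writes while the index is disabled and rebuilding -->
    <property>
      <name>phoenix.index.failure.block.write</name>
      <value>false</value>
    </property>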

Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Neelesh
By no means am I judging Phoenix based on this. This is simply a design trade-off (ScyllaDB goes the same route and builds global indexes). I appreciate all the effort that has gone into Phoenix, and it was indeed a lifesaver. But the technical point remains that single-node failures have

Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Josh Elser
I would strongly disagree with the assertion that this is some unavoidable problem. Yes, an inverted index is a data structure which, by design, creates a hotspot (phrased another way, this is "data locality"). Lots of extremely smart individuals have spent a significant amount of time and

Re: ABORTING region server and following HBase cluster "crash"

2018-11-02 Thread Neelesh
I think this is an unavoidable problem in some sense, if global indexes are used. Essentially, global indexes create a graph of dependent region servers due to index RPC calls from one RS to another. Any single failure is bound to affect the entire graph, which under reasonable load becomes the

Re: ABORTING region server and following HBase cluster "crash"

2018-10-02 Thread Batyrshin Alexander
We are still observing cascading region server restarts. Our Phoenix version is 4.14-HBase-1.4 at commit https://github.com/apache/phoenix/commit/52893c240e4f24e2bfac0834d35205f866c16ed8 On prod022 we got this: Oct

Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Sergey Soldatov
Obviously yes. If it's not configured, then the default handlers are used for index writes, which may lead to a distributed deadlock. Thanks, Sergey On Sat, Sep 15, 2018 at 11:36 AM Batyrshin Alexander <0x62...@gmail.com> wrote: > I've found that we still haven't configured this: > >

Re: ABORTING region server and following HBase cluster "crash"

2018-09-15 Thread Batyrshin Alexander
I've found that we still haven't configured this: hbase.region.server.rpc.scheduler.factory.class = org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory Could this misconfiguration lead to our problems? > On 15 Sep 2018, at 02:04, Sergey Soldatov wrote: > > That was the real problem quite a
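For context, this setting belongs in hbase-site.xml on every RegionServer. A minimal sketch follows; the second property is the companion setting from the Phoenix secondary-indexing docs and is included here as an assumption to double-check for your version:

    <!-- hbase-site.xml on each RegionServer: route Phoenix index and metadata
         RPCs to dedicated handler pools instead of the default handlers -->
    <property>
      <name>hbase.region.server.rpc.scheduler.factory.class</name>
      <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
    </property>
    <property>
      <name>hbase.rpc.controllerfactory.class</name>
      <value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value>
    </property>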

Re: ABORTING region server and following HBase cluster "crash"

2018-09-14 Thread Sergey Soldatov
Forgot to mention: that kind of problem can be mitigated by increasing the number of threads for opening regions. By default, it's 3 (?), but we haven't seen any problems with increasing it up to several hundred for clusters that have up to 2k regions per RS. Thanks, Sergey On Fri, Sep 14, 2018 at
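A sketch of the tuning Sergey mentions, assuming the standard HBase property for the open-region executor (the exact name and the default of 3 should be verified for your HBase version):

    <!-- hbase-site.xml on the region servers: allow more regions to be opened
         in parallel so index regions come online quickly during recovery -->
    <property>
      <name>hbase.regionserver.executor.openregion.threads</name>
      <value>100</value>
    </property>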

Re: ABORTING region server and following HBase cluster "crash"

2018-09-14 Thread Sergey Soldatov
That was the real problem quite a long time ago (a couple of years?). I can't say for sure in which version it was fixed, but now indexes have priority over regular tables and their regions open first. So by the time we replay WALs for tables, all index regions are supposed to be online. If you

Re: ABORTING region server and following HBase cluster "crash"

2018-09-13 Thread Jonathan Leech
This seems similar to a failure scenario I’ve seen a couple of times. I believe after multiple restarts you got lucky and the tables were brought up by HBase in the correct order. What happens is some kind of semi-catastrophic failure where one or more region servers go down with edits that weren’t

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Batyrshin Alexander
After the update, the Master web interface shows that every region server is now on 1.4.7 and there are no RITs. The cluster recovered only after we restarted all region servers 4 times... > On 11 Sep 2018, at 04:08, Josh Elser wrote: > > Did you update the HBase jars on all RegionServers? > > Make sure that you have

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Josh Elser
Did you update the HBase jars on all RegionServers? Make sure that you have all of the Regions assigned (no RITs). There could be a pretty simple explanation as to why the index can't be written to. On 9/9/18 3:46 PM, Batyrshin Alexander wrote: Correct me if I'm wrong, but it looks like if you

Re: ABORTING region server and following HBase cluster "crash"

2018-09-10 Thread Jaanai Zhang
The root cause could not be determined from the log information. The index might have been corrupted, and it seems the server keeps aborting because of the index handler failure policy. Yun Zhang Best regards! Batyrshin Alexander

Re: ABORTING region server and following HBase cluster "crash"

2018-09-09 Thread Batyrshin Alexander
Correct me if I'm wrong, but it looks like if you have region servers A and B that host the index and the primary table, the following situation is possible: A and B are under writes on a table with indexes; A crashes; B fails on an index update because A is not operating, then B starts aborting; A, after restart, tries to

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Batyrshin Alexander
After the update we still can't recover the HBase cluster. Our region servers keep ABORTING over and over: prod003: Sep 09 02:51:27 prod003 hbase[1440]: 2018-09-09 02:51:27,395 FATAL [RpcServer.default.FPBQ.Fifo.handler=92,queue=2,port=60020] regionserver.HRegionServer: ABORTING region server

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Batyrshin Alexander
Thank you. We're updating our cluster right now... > On 9 Sep 2018, at 01:39, Ted Yu wrote: > > It seems you should deploy hbase with the following fix: > > HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash > > 1.4.7 was recently released. > > FYI > > On Sat, Sep 8, 2018 at

Re: ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Ted Yu
It seems you should deploy HBase with the following fix: HBASE-21069 NPE in StoreScanner.updateReaders causes RS to crash. 1.4.7 was recently released. FYI On Sat, Sep 8, 2018 at 3:32 PM Batyrshin Alexander <0x62...@gmail.com> wrote: > Hello, > > We got this exception from *prod006* server >

ABORTING region server and following HBase cluster "crash"

2018-09-08 Thread Batyrshin Alexander
Hello, We got this exception from the prod006 server: Sep 09 00:38:02 prod006 hbase[18907]: 2018-09-09 00:38:02,532 FATAL [MemStoreFlusher.1] regionserver.HRegionServer: ABORTING region server prod006,60020,1536235102833: Replay of WAL required. Forcing server shutdown Sep 09 00:38:02 prod006