Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-05-24 Thread Adam Sjøgren
Udo writes:

> https://hbase.apache.org/book.html#data.block.encoding.types contains
> a detailed description of compression in HBase. The final solution for
> us was to configure the snappy codec in hbase-site.xml:
>
> <property>
>   <name>hbase.io.compress.snappy.codec</name>
>   <value>org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec</value>
> </property>

Thanks - we didn't have the codec defined explicitly before, and Snappy
worked via the native libraries, but it seems to be mandatory to
configure the codec now.

It works for me when explicitly setting either .aircompressor. or
.xerial. now - so that's a nice solution :-)
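(For anyone else hitting this, a quick sketch for checking the native
library side on a node - checknative ships with Hadoop, though its output
varies by version:

  hadoop checknative -a

If Snappy shows up as false there, the pure-Java codecs configured via
hbase.io.compress.snappy.codec are the way to go.)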


  Thanks for the tip!

Adam

-- 
 "How come we play war and not peace?"  Adam Sjøgren
 "Too few role models."a...@koldfront.dk



Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-05-04 Thread Duo Zhang
OK, so in the end the problem was that the Snappy compression codec was
not loaded, so all the regions with snappy compression enabled could not
come online?

It is good that you finally found the root cause.

For the bin tarballs with or without the hadoop3 suffix, the difference
is the version of the Hadoop jars bundled in the tarball.

In HBase we make use of some internal Hadoop classes, so even though we
have done some reflection work to make sure the code can run against
different Hadoop versions, a drop-in replacement is impossible; in
particular, if you replace the hadoop2 jars with hadoop3 jars, you will
hit errors when starting the HBase cluster.

So starting from HBase 2.5.x we decided to publish two types of
tarballs: the tarballs without the hadoop3 suffix are built with
hadoop2, typically Hadoop 2.10.2, and the hadoop3 tarballs are built
with hadoop3; for branch-2.5 it is Hadoop 3.2.4, for the upcoming 2.6 it
is Hadoop 3.3.5.

Notice that for later branch-2.x releases we will keep this pattern, but
for the future HBase 3.x we will not have separate hadoop3 tarballs, as
hadoop2 support has been dropped in HBase 3.x.
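A quick way to check which Hadoop version a given tarball actually
bundles (a sketch; the exact jar names can differ between releases):

  tar tzf hbase-2.5.8-bin.tar.gz | grep -o 'hadoop-common-[0-9.]*jar' | head -1
  tar tzf hbase-2.5.8-hadoop3-bin.tar.gz | grep -o 'hadoop-common-[0-9.]*jar' | head -1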

Thanks.

Udo Offermann wrote on 2024-05-02 (Thu) at 22:13:
>
> The problem was actually with the Snappy codec or the native Snappy
> libraries. After configuring the Java Snappy implementation, the cluster
> started without any problems.
>
> I have a final question regarding the HBase distributions. Can you please
> tell me the difference between these two:
> bin: https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-bin.tar.gz
> and
> hadoop3-bin: 
> https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-hadoop3-bin.tar.gz
>
> I can't find a description of this. The same applies to the client libraries 
> client-bin and hadoop3-client-bin.
>
>
> Best regards
> Udo
>
>
>
> > On 30.04.2024 at 04:42, 张铎(Duo Zhang) wrote:
> >
> > Oh, there is a typo, I mean the ServerCrashProcedure should not block other
> > procedures if it is in the claim replication queue stage.
> >
> > 张铎(Duo Zhang) wrote on 2024-04-30 (Tue) at 10:41:
> >
> >> Sorry for the pain: the procedure store was a big problem before HBase
> >> 2.3, so we did a big refactoring in HBase 2.3+, and the resulting
> >> migration makes the upgrade a bit complicated.
> >>
> >> On the upgrade, you do not need to mix up HBase and Hadoop; you can
> >> upgrade them separately. Second, a rolling upgrade is also a bit
> >> complicated, so I suggest you try a full shutdown/startup upgrade
> >> first; once you have successfully done that, you can start trying a
> >> rolling upgrade.
> >>
> >> For your scenario, I suggest you first upgrade Hadoop, including
> >> namenode and datanode; HBase should be functional after that upgrade.
> >> Then, as discussed above, turn off the balancer and check the master
> >> page to make sure there are no RITs and no procedures, then shut down
> >> the master, and then shut down all the region servers. After that,
> >> start the master (no need to wait for the master to finish starting
> >> up, since that relies on the meta region being online, which requires
> >> at least one live region server) and then all the region servers, and
> >> see if the cluster can go back to normal.
> >>
> >> On the ServerCrashProcedure: it is blocked in the claim replication
> >> queue stage, which should not block other procedures, as the region
> >> assignment should have already been finished. Does your cluster have
> >> replication peers? If not, it is a bit strange why your procedure is
> >> blocked in the claim replication queue stage…
> >>
> >> Thanks.
> >>
> >> Udo Offermann wrote on 2024-04-29 (Mon) at 21:26:
> >>
> >>> This time we made progress.
> >>> I first upgraded the master, both Hadoop- and HBase-wise (after making
> >>> sure that there were no regions in transition and no running
> >>> procedures), while keeping Zookeeper running. The master was started
> >>> with the new version 2.8.5, reporting that there are 6 nodes with an
> >>> inconsistent version (which was to be expected). Now the startup
> >>> process completes with "Starting cluster schema service COMPLETE",
> >>> all regions were assigned and the cluster seemed to be stable.
> >>>
> >>> Again there were no regions in transition and no procedures running,
> >>> and so I started to upgrade the data nodes one by one.
> >>> The problem now is that the new region servers are not assigned any
> >>> regions except for 3: hbase:namespace, hbase:meta and one of our
> >>> application-level tables (which is empty most of the time).
> >>> The more data nodes I migrated, the more regions accumulated on the
> >>> nodes running the old version, until the last old data node was
> >>> hosting all regions except for those 3.
> >>>
> >>>
> >>>
> >>> After all regions had been transitioned I migrated the last node,
> >>> which left all regions in transition, looking like this one:
> >>>
> >>> 2185  2184  WAITING_TIMEOUT  seritrack
> >>> TransitRegionStateProcedure table=tt_items,
> >>> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36
> >>> CEST 2024

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-05-02 Thread Udo Offermann
Hi Adam, 

https://hbase.apache.org/book.html#data.block.encoding.types contains a 
detailed description of compression in HBase. The final solution for us was to 
configure the snappy codec in hbase-site.xml:


<property>
  <name>hbase.io.compress.snappy.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.aircompressor.SnappyCodec</value>
</property>


It seems to me that this is now covered completely internally. Apart from
the fact that there are no longer any native libraries in
hadoop/lib/native, we have not changed anything in the Hadoop
configuration.
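A quick sanity check for the codec (a sketch; CompressionTest ships with
HBase and the local path here is just an example):

  hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/snappy-test snappy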

Best regards
Udo

> On 02.05.2024 at 16:38, Adam Sjøgren wrote:
> 
> Udo writes:
> 
>> Then I saw that the region servers had problems with Snappy
>> compression. I'm not sure, but I believe the native Snappy libs were
>> part of the previous Hadoop distribution, at least they are not
>> included in the current one. After copying them over it seems to work
>> now. But what is the recommended way to enable snappy compression in
>> HBase now?
> 
> I am very interested in this as well, as I have been having problems
> upgrading Hadoop past 3.2.4 with HBase due to snappy compression not
> working¹.
> 
> How/what did you copy over?
> 
> 
>  Best regards,
> 
>Adam
> 
> 
> ¹ https://lists.apache.org/thread/jpm5m6wv8odg69worhdh187h1hy2vr9p
> 
> -- 
> "Ni kan skratta om ni vill Adam Sjøgren
>  Håna oss, vi rör oss, ni står still" a...@koldfront.dk
> 



Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-05-02 Thread Adam Sjøgren
Udo writes:

> Then I saw that the region servers had problems with Snappy
> compression. I'm not sure, but I believe the native Snappy libs were
> part of the previous Hadoop distribution, at least they are not
> included in the current one. After copying them over it seems to work
> now. But what is the recommended way to enable snappy compression in
> HBase now?

I am very interested in this as well, as I have been having problems
upgrading Hadoop past 3.2.4 with HBase due to snappy compression not
working¹.

How/what did you copy over?


  Best regards,

Adam


¹ https://lists.apache.org/thread/jpm5m6wv8odg69worhdh187h1hy2vr9p

-- 
 "Ni kan skratta om ni vill Adam Sjøgren
  Håna oss, vi rör oss, ni står still" a...@koldfront.dk



Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-05-02 Thread Udo Offermann
The problem was actually with the Snappy codec or the native Snappy
libraries. After configuring the Java Snappy implementation, the cluster
started without any problems.

I have a final question regarding the HBase distributions. Can you please
tell me the difference between these two:
bin: https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-bin.tar.gz 
and
hadoop3-bin: 
https://www.apache.org/dyn/closer.lua/hbase/2.5.8/hbase-2.5.8-hadoop3-bin.tar.gz

I can't find a description of this. The same applies to the client libraries 
client-bin and hadoop3-client-bin.


Best regards
Udo



> On 30.04.2024 at 04:42, 张铎(Duo Zhang) wrote:
> 
> Oh, there is a typo, I mean the ServerCrashProcedure should not block other
> procedures if it is in the claim replication queue stage.
> 
> 张铎(Duo Zhang) wrote on 2024-04-30 (Tue) at 10:41:
> 
>> Sorry for the pain: the procedure store was a big problem before HBase
>> 2.3, so we did a big refactoring in HBase 2.3+, and the resulting
>> migration makes the upgrade a bit complicated.
>> 
>> On the upgrade, you do not need to mix up HBase and Hadoop; you can
>> upgrade them separately. Second, a rolling upgrade is also a bit
>> complicated, so I suggest you try a full shutdown/startup upgrade first;
>> once you have successfully done that, you can start trying a rolling
>> upgrade.
>> 
>> For your scenario, I suggest you first upgrade Hadoop, including
>> namenode and datanode; HBase should be functional after that upgrade.
>> Then, as discussed above, turn off the balancer and check the master
>> page to make sure there are no RITs and no procedures, then shut down
>> the master, and then shut down all the region servers. After that, start
>> the master (no need to wait for the master to finish starting up, since
>> that relies on the meta region being online, which requires at least one
>> live region server) and then all the region servers, and see if the
>> cluster can go back to normal.
>> 
>> On the ServerCrashProcedure: it is blocked in the claim replication
>> queue stage, which should not block other procedures, as the region
>> assignment should have already been finished. Does your cluster have
>> replication peers? If not, it is a bit strange why your procedure is
>> blocked in the claim replication queue stage…
>> 
>> Thanks.
>> 
>> Udo Offermann wrote on 2024-04-29 (Mon) at 21:26:
>> 
>>> This time we made progress.
>>> I first upgraded the master, both Hadoop- and HBase-wise (after making
>>> sure that there were no regions in transition and no running
>>> procedures), while keeping Zookeeper running. The master was started
>>> with the new version 2.8.5, reporting that there are 6 nodes with an
>>> inconsistent version (which was to be expected). Now the startup
>>> process completes with "Starting cluster schema service COMPLETE",
>>> all regions were assigned and the cluster seemed to be stable.
>>>
>>> Again there were no regions in transition and no procedures running,
>>> and so I started to upgrade the data nodes one by one.
>>> The problem now is that the new region servers are not assigned any
>>> regions except for 3: hbase:namespace, hbase:meta and one of our
>>> application-level tables (which is empty most of the time).
>>> The more data nodes I migrated, the more regions accumulated on the
>>> nodes running the old version, until the last old data node was
>>> hosting all regions except for those 3.
>>> 
>>> 
>>> 
>>> After all regions had been transitioned I migrated the last node,
>>> which left all regions in transition, looking like this one:
>>>
>>> 2185  2184  WAITING_TIMEOUT  seritrack
>>> TransitRegionStateProcedure table=tt_items,
>>> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36
>>> CEST 2024   Mon Apr 29 14:59:44 CEST 2024   pid=2185, ppid=2184,
>>> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE,
>>> locked=true; TransitRegionStateProcedure table=tt_items,
>>> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN
>>> 
>>> They are all waiting on this one:
>>> 
>>> 2184  WAITING  seritrack  ServerCrashProcedure
>>> datanode06ct.gmd9.intern,16020,1714378085579   Mon Apr 29 14:12:36 CEST
>>> 2024   Mon Apr 29 14:12:36 CEST 2024   pid=2184,
>>> state=WAITING:SERVER_CRASH_CLAIM_REPLICATION_QUEUES, locked=true;
>>> ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579,
>>> splitWal=true, meta=false
>>> 
>>> Again „ServerCrashProcedure“! Why are they not processed?
>>> Why is it so hard to upgrade the cluster? Is it worthwhile to take the
>>> next stable version 2.5.8?
>>> And, btw, what is the difference between the two distributions „bin“
>>> and „hadoop3-bin“?
>>> 
>>> Best regards
>>> Udo
>>> 
>>> 
>>> 
>>> 
>>> 
 On 28.04.2024 at 03:03, 张铎(Duo Zhang) wrote:
 
 Better turn it off, and observe the master page until there are no RITs
 and no other procedures, then call hbase-daemon.sh stop master, and
 then hbase-daemon.sh stop regionserver.
 
 I'm not 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-30 Thread Udo Offermann
Found it:

in hbase-site.xml:

<property>
  <name>hbase.io.compress.snappy.codec</name>
  <value>org.apache.hadoop.hbase.io.compress.xerial.SnappyCodec</value>
</property>
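A quick way to verify the setting end-to-end from the hbase shell (a
sketch; the table name is just an example):

  echo "create 'snappy_test', {NAME => 'f', COMPRESSION => 'SNAPPY'}
  put 'snappy_test', 'r1', 'f:q', 'v1'
  flush 'snappy_test'" | hbase shell -n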


> On 30.04.2024 at 16:32, Udo Offermann wrote:
> 
> I think we finally made it.
> There were a few more problems: First, I made sure that the class paths were 
> clean - Classpath hygiene in Java is the be-all and end-all ;-)
> Then I saw that the region servers had problems with Snappy compression. I'm 
> not sure, but I believe the native Snappy libs were part of the previous 
> Hadoop distribution, at least they are not included in the current one. After 
> copying them over it seems to work now. But what is the recommended way to 
> enable snappy compression in HBase now?
> 
> 
> I noticed another small error on the Master Web UI: The „Regions in 
> transition“ JSP throws a NullPointerException when clicking on the link:
> 
> http://master1ct:16010/rits.jsp 
> HTTP ERROR 500 java.lang.NullPointerException
> URI:  /rits.jsp
> STATUS:   500
> MESSAGE:  java.lang.NullPointerException
> SERVLET:  org.apache.hadoop.hbase.generated.master.rits_jsp
> CAUSED BY:java.lang.NullPointerException
> Caused by:
> 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.generated.master.rits_jsp._jspService(rits_jsp.java:113)
>   at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:111)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1450)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
>   at 
> org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:117)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.apache.hadoop.hbase.http.SecurityHeadersFilter.doFilter(SecurityHeadersFilter.java:65)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:49)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1521)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:47)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
>   at 
> org.apache.hbase.thirdparty.org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
>   at 
> 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-30 Thread Udo Offermann
I think we finally made it.
There were a few more problems: First, I made sure that the class paths were 
clean - Classpath hygiene in Java is the be-all and end-all ;-)
Then I saw that the region servers had problems with Snappy compression. I'm 
not sure, but I believe the native Snappy libs were part of the previous Hadoop 
distribution, at least they are not included in the current one. After copying 
them over it seems to work now. But what is the recommended way to enable 
snappy compression in HBase now?


I noticed another small error on the Master Web UI: The „Regions in transition“ 
JSP throws a NullPointerException when clicking on the link:

http://master1ct:16010/rits.jsp 
HTTP ERROR 500 java.lang.NullPointerException
URI:/rits.jsp
STATUS: 500
MESSAGE:java.lang.NullPointerException
SERVLET:org.apache.hadoop.hbase.generated.master.rits_jsp
CAUSED BY:  java.lang.NullPointerException
Caused by:

java.lang.NullPointerException
at 
org.apache.hadoop.hbase.generated.master.rits_jsp._jspService(rits_jsp.java:113)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:111)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1450)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
at 
org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:117)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.apache.hadoop.hbase.http.SecurityHeadersFilter.doFilter(SecurityHeadersFilter.java:65)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:49)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1521)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:47)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
at 
org.apache.hbase.thirdparty.org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-29 Thread Duo Zhang
Oh, there is a typo in my previous mail: I meant that the
ServerCrashProcedure should not block other procedures if it is in the
claim replication queue stage.

张铎(Duo Zhang) wrote on 2024-04-30 (Tue) at 10:41:

> Sorry for the pain: the procedure store was a big problem before HBase
> 2.3, so we did a big refactoring in HBase 2.3+, and the resulting
> migration makes the upgrade a bit complicated.
>
> On the upgrade, you do not need to mix up HBase and Hadoop; you can
> upgrade them separately. Second, a rolling upgrade is also a bit
> complicated, so I suggest you try a full shutdown/startup upgrade first;
> once you have successfully done that, you can start trying a rolling
> upgrade.
>
> For your scenario, I suggest you first upgrade Hadoop, including
> namenode and datanode; HBase should be functional after that upgrade.
> Then, as discussed above, turn off the balancer and check the master
> page to make sure there are no RITs and no procedures, then shut down
> the master, and then shut down all the region servers. After that, start
> the master (no need to wait for the master to finish starting up, since
> that relies on the meta region being online, which requires at least one
> live region server) and then all the region servers, and see if the
> cluster can go back to normal.
>
> On the ServerCrashProcedure: it is blocked in the claim replication
> queue stage, which should not block other procedures, as the region
> assignment should have already been finished. Does your cluster have
> replication peers? If not, it is a bit strange why your procedure is
> blocked in the claim replication queue stage…
>
> Thanks.
>
> Udo Offermann wrote on 2024-04-29 (Mon) at 21:26:
>
>> This time we made progress.
>> I first upgraded the master, both Hadoop- and HBase-wise (after making
>> sure that there were no regions in transition and no running
>> procedures), while keeping Zookeeper running. The master was started
>> with the new version 2.8.5, reporting that there are 6 nodes with an
>> inconsistent version (which was to be expected). Now the startup
>> process completes with "Starting cluster schema service COMPLETE",
>> all regions were assigned and the cluster seemed to be stable.
>>
>> Again there were no regions in transition and no procedures running,
>> and so I started to upgrade the data nodes one by one.
>> The problem now is that the new region servers are not assigned any
>> regions except for 3: hbase:namespace, hbase:meta and one of our
>> application-level tables (which is empty most of the time).
>> The more data nodes I migrated, the more regions accumulated on the
>> nodes running the old version, until the last old data node was
>> hosting all regions except for those 3.
>>
>>
>>
>> After all regions had been transitioned I migrated the last node,
>> which left all regions in transition, looking like this one:
>>
>> 2185  2184  WAITING_TIMEOUT  seritrack
>>  TransitRegionStateProcedure table=tt_items,
>> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36
>> CEST 2024   Mon Apr 29 14:59:44 CEST 2024   pid=2185, ppid=2184,
>> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE,
>> locked=true; TransitRegionStateProcedure table=tt_items,
>> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN
>>
>> They are all waiting on this one:
>>
>> 2184  WAITING  seritrack  ServerCrashProcedure
>> datanode06ct.gmd9.intern,16020,1714378085579   Mon Apr 29 14:12:36 CEST
>> 2024   Mon Apr 29 14:12:36 CEST 2024   pid=2184,
>> state=WAITING:SERVER_CRASH_CLAIM_REPLICATION_QUEUES, locked=true;
>> ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579,
>> splitWal=true, meta=false
>>
>> Again „ServerCrashProcedure“! Why are they not processed?
>> Why is it so hard to upgrade the cluster? Is it worthwhile to take the
>> next stable version 2.5.8?
>> And, btw, what is the difference between the two distributions „bin“
>> and „hadoop3-bin“?
>>
>> Best regards
>> Udo
>>
>>
>>
>>
>>
> > On 28.04.2024 at 03:03, 张铎(Duo Zhang) wrote:
>> >
> > Better turn it off, and observe the master page until there are no RITs
>> > and no other procedures, then call hbase-daemon.sh stop master, and
>> > then hbase-daemon.sh stop regionserver.
>> >
> > I'm not 100% sure about the shell command, you'd better search for it
> > and try it yourself. The key here is to stop the master first and make
> > sure there are no procedures, so we can safely remove the
> > MasterProcWALs, and then stop
>> > all region servers.
>> >
>> > Thanks.
>> >
> > Udo Offermann wrote on 2024-04-26 (Fri) at 23:34:
>> >>
>> >> I know, but is it necessary or beneficial to turn it off - and if so -
>> when?
>> >> And what is your recommendation about stopping the region servers? Just
>> >> hbase-daemon.sh stop regionserver
>> >> or
> >> graceful_stop.sh localhost
>> >> ?
>> >>
> >>> On 26.04.2024 at 17:22, 张铎(Duo Zhang) wrote:
>> >>>
> >>> Turning off the balancer is to make sure that the balancer will not
>> >>> schedule any procedures to balance the cluster.
>> >>>
>> >>> Udo 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-29 Thread Duo Zhang
Sorry for the pain: the procedure store was a big problem before HBase
2.3, so we did a big refactoring in HBase 2.3+, and the resulting
migration makes the upgrade a bit complicated.

On the upgrade, you do not need to mix up HBase and Hadoop, you can
upgrade them separately. Second, a rolling upgrade is also a bit
complicated, so I suggest you try a full shutdown/startup upgrade first;
once you have successfully done that, you can start trying a rolling
upgrade.

For your scenario, I suggest you first upgrade Hadoop, including namenode
and datanode; HBase should be functional after that upgrade. And then, as
discussed above, turn off the balancer and check the master page to make
sure there are no RITs and no procedures, then shut down the master, and
then shut down all the region servers. After that, start the master (no
need to wait for the master to finish starting up, since that relies on
the meta region being online, which requires at least one live region
server) and then all the region servers, to see if the cluster can go
back to normal.
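A minimal sketch of that restart sequence (the daemon scripts ship with
HBase; run each command on the appropriate host):

  echo "balance_switch false" | hbase shell -n   # turn off the balancer
  # wait until the master UI shows no RITs and no procedures
  hbase-daemon.sh stop master
  hbase-daemon.sh stop regionserver              # on every region server
  hbase-daemon.sh start master                   # new version
  hbase-daemon.sh start regionserver             # on every region server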

On the ServerCrashProcedure: it is blocked in the claim replication queue
stage, which should not block other procedures, as the region assignment
should have already been finished. Does your cluster have replication
peers? If not, it is a bit strange why your procedure is blocked in the
claim replication queue stage…
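(Checking for peers is a one-liner from the hbase shell:

  echo "list_peers" | hbase shell -n

An empty list means no replication peers are configured.)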

Thanks.

Udo Offermann wrote on 2024-04-29 (Mon) at 21:26:

> This time we made progress.
> I first upgraded the master, both Hadoop- and HBase-wise (after making
> sure that there were no regions in transition and no running
> procedures), while keeping Zookeeper running. The master was started
> with the new version 2.8.5, reporting that there are 6 nodes with an
> inconsistent version (which was to be expected). Now the startup process
> completes with "Starting cluster schema service COMPLETE",
> all regions were assigned and the cluster seemed to be stable.
>
> Again there were no regions in transition and no procedures running, and
> so I started to upgrade the data nodes one by one.
> The problem now is that the new region servers are not assigned any
> regions except for 3: hbase:namespace, hbase:meta and one of our
> application-level tables (which is empty most of the time).
> The more data nodes I migrated, the more regions accumulated on the
> nodes running the old version, until the last old data node was hosting
> all regions except for those 3.
>
>
>
> After all regions had been transitioned I migrated the last node, which
> left all regions in transition, looking like this one:
>
> 2185  2184  WAITING_TIMEOUT  seritrack
>  TransitRegionStateProcedure table=tt_items,
> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN Mon Apr 29 14:12:36
> CEST 2024   Mon Apr 29 14:59:44 CEST 2024   pid=2185, ppid=2184,
> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE,
> locked=true; TransitRegionStateProcedure table=tt_items,
> region=d7a411647663dd9e0fc972c7e14088a5, ASSIGN
>
> They are all waiting on this one:
>
> 2184  WAITING  seritrack  ServerCrashProcedure
> datanode06ct.gmd9.intern,16020,1714378085579   Mon Apr 29 14:12:36 CEST
> 2024   Mon Apr 29 14:12:36 CEST 2024   pid=2184,
> state=WAITING:SERVER_CRASH_CLAIM_REPLICATION_QUEUES, locked=true;
> ServerCrashProcedure datanode06ct.gmd9.intern,16020,1714378085579,
> splitWal=true, meta=false
>
> Again „ServerCrashProcedure“! Why are they not processed?
> Why is it so hard to upgrade the cluster? Is it worthwhile to take the
> next stable version 2.5.8?
> And, btw, what is the difference between the two distributions „bin“
> and „hadoop3-bin“?
>
> Best regards
> Udo
>
>
>
>
>
> > On 28.04.2024 at 03:03, 张铎(Duo Zhang) wrote:
> >
> > Better turn it off, and observe the master page until there are no RITs
> > and no other procedures, then call hbase-daemon.sh stop master, and
> > then hbase-daemon.sh stop regionserver.
> >
> > I'm not 100% sure about the shell command, you'd better search for it
> > and try it yourself. The key here is to stop the master first and make
> > sure there are no procedures, so we can safely remove the
> > MasterProcWALs, and then stop
> > all region servers.
> >
> > Thanks.
> >
> > Udo Offermann wrote on 2024-04-26 (Fri) at 23:34:
> >>
> >> I know, but is it necessary or beneficial to turn it off - and if so -
> when?
> >> And what is your recommendation about stopping the region servers? Just
> >> hbase-daemon.sh stop regionserver
> >> or
> >> graceful_stop.sh localhost
> >> ?
> >>
> >>> On 26.04.2024 at 17:22, 张铎(Duo Zhang) wrote:
> >>>
> >>> Turning off the balancer is to make sure that the balancer will not
> >>> schedule any procedures to balance the cluster.
> >>>
> >>> Udo Offermann wrote on 2024-04-26 (Fri) at 23:03:
> 
 and what about turning off the HBase balancer before stopping the HMaster?
> 
> On 26.04.2024 at 17:00, Udo Offermann <
udo.offerm...@zfabrik.de> wrote:
> >
> > So there is no need for
> >
> > hbase/bin/graceful_stop.sh localhost
> 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-23 Thread Udo Offermann
Hi Duo, 

To be honest, we only use one master, so there is no way to swap them.
After the attempt that threw the NPE we ran the hbck tool again; this time
no NPE was thrown, but nothing else happened either - the log looked just
like the one I've sent you, only without the exception stack trace. The
problem also remained the same.

I think we will just go back to our snapshot and try the migration again
from the start.



> On 23.04.2024 at 09:36, 张铎(Duo Zhang) wrote:
> 
> Strange. I checked the code, and it seems we get the NPE on this line:
> 
> https://github.com/apache/hbase/blob/4d7ce1aac724fbf09e526fc422b5a11e530c32f0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2872
> 
> Could you please confirm that you are connecting to the correct active
> master, the one which is hanging? It seems that you are connecting to
> the backup master...
> 
> Thanks.
> 
> 张铎(Duo Zhang) wrote on 2024-04-23 (Tue) at 15:31:
>> 
>> Ah, an NPE usually means a code bug, so there is no simple way to fix
>> it; we need to take a deep look at the code :(
>> 
>> Sorry.
>> 
>> Udo Offermann wrote on 2024-04-22 (Mon) at 15:32:
>>> 
>>> Unfortunately not.
>>> I've found the node hosting the meta region and was able to run hbck
>>> scheduleRecoveries using hbase-operator-tools-1.2.0.
>>> The tool however stops with an NPE:
>>> 
>>> 09:22:00.532 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable 
>>> to load native-hadoop library for your platform... using builtin-java 
>>> classes where applicable
>>> 09:22:00.703 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation 
>>> - hbase.client.pause.cqtbe is deprecated. Instead, use 
>>> hbase.client.pause.server.overloaded
>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56,
>>>  built on 2023-10-05 10:34 UTC
>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:host.name=HBaseMaster.gmd9.intern
>>> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.version=1.8.0_402
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.vendor=Red Hat, Inc.
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
>>> environment:java.io.tmpdir=/tmp
>>> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
>>> 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-23 Thread Duo Zhang
Strange. I checked the code, and it seems we get the NPE on this line:

https://github.com/apache/hbase/blob/4d7ce1aac724fbf09e526fc422b5a11e530c32f0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java#L2872

Could you please confirm that you are connecting to the correct active
master, the one which is hanging? It seems that you are connecting to the
backup master...
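(A quick way to confirm which master is currently active - a sketch;
hbase zkcli ships with HBase and /hbase/master is the default znode:

  hbase zkcli get /hbase/master

The content is protobuf-framed, but the active master's host name is
readable in it.)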

Thanks.

张铎(Duo Zhang) wrote on 2024-04-23 (Tue) at 15:31:
>
> Ah, an NPE usually means a code bug, so there is no simple way to fix
> it; we need to take a deep look at the code :(
>
> Sorry.
>
> Udo Offermann wrote on 2024-04-22 (Mon) at 15:32:
> >
> > Unfortunately not.
> > I've found the node hosting the meta region and was able to run hbck
> > scheduleRecoveries using hbase-operator-tools-1.2.0.
> > The tool however stops with an NPE:
> >
> > 09:22:00.532 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable 
> > to load native-hadoop library for your platform... using builtin-java 
> > classes where applicable
> > 09:22:00.703 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation 
> > - hbase.client.pause.cqtbe is deprecated. Instead, use 
> > hbase.client.pause.server.overloaded
> > 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56,
> >  built on 2023-10-05 10:34 UTC
> > 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:host.name=HBaseMaster.gmd9.intern
> > 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.version=1.8.0_402
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.vendor=Red Hat, Inc.
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.io.tmpdir=/tmp
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:java.compiler=
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:os.name=Linux
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> > environment:os.arch=amd64
> > 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> > 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-23 Thread Duo Zhang
Ah, an NPE usually means a code bug, so there is no simple way to fix
it; we need to take a deep look at the code :(

Sorry.

Udo Offermann wrote on 2024-04-22 (Mon) at 15:32:
>
> Unfortunately not.
> I've found the node hosting the meta region and was able to run hbck
> scheduleRecoveries using hbase-operator-tools-1.2.0.
> The tool however stops with an NPE:
>
> 09:22:00.532 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> 09:22:00.703 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - 
> hbase.client.pause.cqtbe is deprecated. Instead, use 
> hbase.client.pause.server.overloaded
> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56, 
> built on 2023-10-05 10:34 UTC
> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:host.name=HBaseMaster.gmd9.intern
> 09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.version=1.8.0_402
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.vendor=Red Hat, Inc.
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.io.tmpdir=/tmp
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:java.compiler=
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:os.name=Linux
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:os.arch=amd64
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:os.version=4.18.0-513.18.1.el8_9.x86_64
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:user.name=seritrack
> 09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
> org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
> environment:user.home=/opt/seritrack
> 09:22:00.766 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-22 Thread Udo Offermann
Unfortunately not.
I've found the node hosting the meta region and was able to run hbck
scheduleRecoveries using hbase-operator-tools-1.2.0.
The tool however stops with an NPE:

09:22:00.532 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
09:22:00.703 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - 
hbase.client.pause.cqtbe is deprecated. Instead, use 
hbase.client.pause.server.overloaded
09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:zookeeper.version=3.8.3-6ad6d364c7c0bcf0de452d54ebefa3058098ab56, 
built on 2023-10-05 10:34 UTC
09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:host.name=HBaseMaster.gmd9.intern
09:22:00.765 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.version=1.8.0_402
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.vendor=Red Hat, Inc.
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.402.b06-2.el8.x86_64/jre
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.class.path=hbase-operator-tools-1.2.0/hbase-hbck2/hbase-hbck2-1.2.0.jar:hbase/conf:/opt/seritrack/tt/jdk/lib/tools.jar:/opt/seritrack/tt/nosql/hbase:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-mapreduce-2.5.7.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/audience-annotations-0.13.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/commons-logging-1.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/htrace-core4-4.1.0-incubating.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jcl-over-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/jul-to-slf4j-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-api-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-context-1.15.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/opentelemetry-semconv-1.15.0-alpha.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/slf4j-api-1.7.33.jar:/opt/seritrack/tt/nosql/hbase/lib/shaded-clients/hbase-shaded-client-2.5.7.jar:/opt/seritrack/tt/nosql/pl_nosql_ext/libs/pl_nosql_ext-3.0.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-1.2-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-api-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-core-2.17.2.jar:/opt/seritrack/tt/nosql/hbase/lib/client-facing-thirdparty/log4j-slf4j-impl-2.17.2.jar:/opt/seritrack/tt/prometheus_exporters/jmx_exporter/jmx_prometheus_javaagent.jar
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.library.path=/opt/seritrack/tt/nosql/hadoop/lib/native
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.io.tmpdir=/tmp
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:java.compiler=
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:os.name=Linux
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:os.arch=amd64
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:os.version=4.18.0-513.18.1.el8_9.x86_64
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:user.name=seritrack
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:user.home=/opt/seritrack
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - Client 
environment:user.dir=/opt/seritrack/tt/nosql_3.0
09:22:00.766 [ReadOnlyZKClient-HBaseMaster:2181@0x7d9f158f] INFO  
org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper - 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-21 Thread Duo Zhang
Before upgrading, disable the balancer, make sure there is no region
in transition, and make sure there are no dead region servers currently
being processed, i.e., no ServerCrashProcedure.
This is to make sure that there are no procedures before shutting down,
so it is safe to just remove all the MasterProcWALs.

Then you can stop both masters, active and standby, remove the
MasterProcWALs directory if it is too big, and then start the master
with the new code.

Usually in this way the new master can start successfully.
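A minimal sketch of that sequence (paths are examples; adjust the HDFS
path to your hbase.rootdir):

  echo "balance_switch false" | hbase shell -n
  # check the master UI: no RITs, no ServerCrashProcedure, no other procedures
  hbase-daemon.sh stop master            # on the active and the standby master
  hdfs dfs -rm -r /hbase/MasterProcWALs  # only safe because nothing was pending
  hbase-daemon.sh start master           # start the master with the new code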

Thanks.

Udo Offermann wrote on 2024-04-20 (Sat) at 23:32:
>
> Thank you, I can check on Monday.
>
> This is the upgrade of the test system and serves as training for the upgrade 
> of the production system. What do we need to do to prevent this problem?
>
> We had some problems starting zookeeper after the upgrade and I had to
> start it with "zookeeper.snapshot.trust.empty=true".
> BTW, is it ok to delete the zookeeper directory?
>
> Best regards
> Udo
>
>
> > On 20.04.2024 at 15:53, 张铎(Duo Zhang) wrote:
> >
> > OK, it was waitForMetaOnline.
> >
> > Maybe the problem is that you did have some unfinished procedures
> > before upgrading, like a ServerCrashProcedure, but then you deleted
> > all the procedure WALs, so the ServerCrashProcedure is also gone and
> > meta can never come online.
> >
> > Please check the /hbase/meta-region-server znode on zookeeper and dump
> > its content; it is protobuf based, but you can still see the encoded
> > name of the server which hosts the meta region.
> >
> > Then use HBCK2 to schedule an SCP for this region server, to see if
> > that fixes the problem.
> >
> > https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md
> >
> > This is the documentation for HBCK2; you should use the
> > scheduleRecoveries command.
> >
> > Hope this could fix your problem.
> >
> > Thread 92 (master/masterserver:16000:becomeActiveMaster):
> >  State: TIMED_WAITING
> >  Blocked count: 165
> >  Waited count: 404
> >  Stack:
> >java.lang.Thread.sleep(Native Method)
> >org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
> >org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1358)
> >
> > org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1328)
> >
> > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1069)
> >
> > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2405)
> >org.apache.hadoop.hbase.master.HMaster.lambda$null$0(HMaster.java:565)
> >
> > org.apache.hadoop.hbase.master.HMaster$$Lambda$265/1598878738.run(Unknown
> > Source)
> >org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
> >org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
> >org.apache.hadoop.hbase.master.HMaster.lambda$run$1(HMaster.java:562)
> >
> > org.apache.hadoop.hbase.master.HMaster$$Lambda$264/1129144214.run(Unknown
> > Source)
> >java.lang.Thread.run(Thread.java:750)
> >
Udo Offermann <udo.offerm...@zfabrik.de> wrote on 2024-04-20 (Sat) at
21:13:
> >>
> >> Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri
> >> Apr 19 10:55:22 CEST 2024
> >>
> >>
> >> Version Info:
> >> ===
> >> HBase 2.5.7
> >> Source code repository
> >> git://buildbox.localdomain/home/apurtell/tmp/RM/hbase
> >> revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
> >> Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
> >> From source with checksum
> >> 1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
> >> Hadoop 2.10.2
> >> Source code repository Unknown
> >> revision=965fd380006fa78b2315668fbc7eb432e1d8200f
> >> Compiled by ubuntu on 2022-05-25T00:12Z
> >>
> >>
> >> Tasks:
> >> ===
> >> Task: Master startup
> >> Status: RUNNING:Starting assignment manager
> >> Running for 954s
> >>
> >> Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
> >> Status: COMPLETE:Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY,
> >> failureReason:Nothing to flush,flush seq id14
> >> Completed 49s ago
> >> Ran for 0s
> >>
> >> Task: RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000
> >> Status: WAITING:Waiting for a call
> >> Running for 951s
> >>
> >> Task: RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000
> >> Status: WAITING:Waiting for a call
> >> Running for 951s
> >>
> >>
> >>
> >> Servers:
> >> ===
> >> servername1ct.gmd9.intern,16020,1713514863737: requestsPerSecond=0.0,
> >> numberOfOnlineRegions=0, usedHeapMB=37.0MB, maxHeapMB=2966.0MB,
> >> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
> >> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
> >> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
> >> filteredReadRequestsCount=0, writeRequestsCount=0, 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Udo Offermann
Thank you, I can check on Monday. 

This is the upgrade of the test system and serves as training for the upgrade 
of the production system. What do we need to do to prevent this problem? 

We had some problems starting zookeeper after the upgrade and I had to start it 
with "zookeeper.snapshot.trust.empty=true“. 
BTW, is it ok to delete the zookeeper directory?
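
(For reference, a sketch of how the flag can be passed — assuming your
ZooKeeper picks up SERVER_JVMFLAGS from conf/zookeeper-env.sh, which may
differ per distribution, and that the flag is removed again once ZooKeeper
has written a fresh snapshot:)

# conf/zookeeper-env.sh — illustrative location
export SERVER_JVMFLAGS="-Dzookeeper.snapshot.trust.empty=true"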

Best regards
Udo


> On 20.04.2024 at 15:53, 张铎(Duo Zhang) wrote:
> 
> OK, it was waitForMetaOnline.
> 
> Maybe the problem is that you did have some valid procedures before
> upgrading, like a ServerCrashProcedure, but then you deleted all the
> procedure WALs, so the ServerCrashProcedure is gone too and meta can
> never come online.
> 
> Please check the /hbase/meta-region-server znode on ZooKeeper and dump
> its content; it is protobuf based, but you can still read the encoded
> name of the server which hosts the meta region.
> 
> Then use HBCK2 to schedule an SCP for this region server, to see if
> that fixes the problem.
> 
> https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md
> 
> This is the documentation for HBCK2; you should use the scheduleRecoveries command.
> 
> Hope this could fix your problem.
> 
> Thread 92 (master/masterserver:16000:becomeActiveMaster):
>  State: TIMED_WAITING
>  Blocked count: 165
>  Waited count: 404
>  Stack:
>java.lang.Thread.sleep(Native Method)
>org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
>org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1358)
> 
> org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1328)
> 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1069)
> 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2405)
>org.apache.hadoop.hbase.master.HMaster.lambda$null$0(HMaster.java:565)
> 
> org.apache.hadoop.hbase.master.HMaster$$Lambda$265/1598878738.run(Unknown
> Source)
>org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
>org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
>org.apache.hadoop.hbase.master.HMaster.lambda$run$1(HMaster.java:562)
> 
> org.apache.hadoop.hbase.master.HMaster$$Lambda$264/1129144214.run(Unknown
> Source)
>java.lang.Thread.run(Thread.java:750)
> 
> Udo Offermann <udo.offerm...@zfabrik.de> wrote on Sat, Apr 20, 2024 at 21:13:
>> 
>> Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri
>> Apr 19 10:55:22 CEST 2024
>> 
>> 
>> Version Info:
>> ===
>> HBase 2.5.7
>> Source code repository
>> git://buildbox.localdomain/home/apurtell/tmp/RM/hbase
>> revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
>> Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
>> From source with checksum
>> 1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
>> Hadoop 2.10.2
>> Source code repository Unknown
>> revision=965fd380006fa78b2315668fbc7eb432e1d8200f
>> Compiled by ubuntu on 2022-05-25T00:12Z
>> 
>> 
>> Tasks:
>> ===
>> Task: Master startup
>> Status: RUNNING:Starting assignment manager
>> Running for 954s
>> 
>> Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
>> Status: COMPLETE:Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY,
>> failureReason:Nothing to flush,flush seq id14
>> Completed 49s ago
>> Ran for 0s
>> 
>> Task: RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000
>> Status: WAITING:Waiting for a call
>> Running for 951s
>> 
>> Task: RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000
>> Status: WAITING:Waiting for a call
>> Running for 951s
>> 
>> 
>> 
>> Servers:
>> ===
>> servername1ct.gmd9.intern,16020,1713514863737: requestsPerSecond=0.0,
>> numberOfOnlineRegions=0, usedHeapMB=37.0MB, maxHeapMB=2966.0MB,
>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
>> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
>> filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
>> currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
>> servername2ct.gmd9.intern,16020,1713514925960: requestsPerSecond=0.0,
>> numberOfOnlineRegions=0, usedHeapMB=20.0MB, maxHeapMB=2966.0MB,
>> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
>> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
>> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
>> filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
>> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
>> currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
>> servername3ct.gmd9.intern,16020,1713514937151: 

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Duo Zhang
OK, it was waitForMetaOnline.

Maybe the problem is that you did have some valid procedures before
upgrading, like a ServerCrashProcedure, but then you deleted all the
procedure WALs, so the ServerCrashProcedure is gone too and meta can
never come online.

Please check the /hbase/meta-region-server znode on ZooKeeper and dump
its content; it is protobuf based, but you can still read the encoded
name of the server which hosts the meta region.

Then use HBCK2 to schedule an SCP for this region server, to see if
that fixes the problem.

https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md

This is the documentation for HBCK2; you should use the scheduleRecoveries command.
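
Something along these lines (a sketch — the znode path assumes the default
zookeeper.znode.parent of /hbase, and the server name below is illustrative,
taken from your status dump; adjust the hbck2 jar path to wherever you
downloaded it):

# Dump the znode that records which region server hosts meta:
hbase zkcli get /hbase/meta-region-server

# Schedule an SCP for that server via HBCK2:
hbase hbck -j /path/to/hbase-hbck2.jar scheduleRecoveries servername1ct.gmd9.intern,16020,1713514863737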

Hope this could fix your problem.

Thread 92 (master/masterserver:16000:becomeActiveMaster):
  State: TIMED_WAITING
  Blocked count: 165
  Waited count: 404
  Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:125)
org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1358)

org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1328)

org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1069)

org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2405)
org.apache.hadoop.hbase.master.HMaster.lambda$null$0(HMaster.java:565)

org.apache.hadoop.hbase.master.HMaster$$Lambda$265/1598878738.run(Unknown
Source)
org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:177)
org.apache.hadoop.hbase.master.HMaster.lambda$run$1(HMaster.java:562)

org.apache.hadoop.hbase.master.HMaster$$Lambda$264/1129144214.run(Unknown
Source)
java.lang.Thread.run(Thread.java:750)

Udo Offermann wrote on Sat, Apr 20, 2024 at 21:13:
>
> Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri
> Apr 19 10:55:22 CEST 2024
>
>
> Version Info:
> ===
> HBase 2.5.7
> Source code repository
> git://buildbox.localdomain/home/apurtell/tmp/RM/hbase
> revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
> Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
> From source with checksum
> 1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
> Hadoop 2.10.2
> Source code repository Unknown
> revision=965fd380006fa78b2315668fbc7eb432e1d8200f
> Compiled by ubuntu on 2022-05-25T00:12Z
>
>
> Tasks:
> ===
> Task: Master startup
> Status: RUNNING:Starting assignment manager
> Running for 954s
>
> Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
> Status: COMPLETE:Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY,
> failureReason:Nothing to flush,flush seq id14
> Completed 49s ago
> Ran for 0s
>
> Task: RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000
> Status: WAITING:Waiting for a call
> Running for 951s
>
> Task: RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000
> Status: WAITING:Waiting for a call
> Running for 951s
>
>
>
> Servers:
> ===
> servername1ct.gmd9.intern,16020,1713514863737: requestsPerSecond=0.0,
> numberOfOnlineRegions=0, usedHeapMB=37.0MB, maxHeapMB=2966.0MB,
> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
> filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
> currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
> servername2ct.gmd9.intern,16020,1713514925960: requestsPerSecond=0.0,
> numberOfOnlineRegions=0, usedHeapMB=20.0MB, maxHeapMB=2966.0MB,
> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
> filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
> currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
> servername3ct.gmd9.intern,16020,1713514937151: requestsPerSecond=0.0,
> numberOfOnlineRegions=0, usedHeapMB=67.0MB, maxHeapMB=2966.0MB,
> numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
> maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
> storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
> filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
> totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
> currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
> servername4ct.gmd9.intern,16020,1713514968019: requestsPerSecond=0.0,
> numberOfOnlineRegions=0, usedHeapMB=24.0MB, maxHeapMB=2966.0MB,

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Udo Offermann
Master status for masterserver.gmd9.intern,16000,1713515965162 as of Fri
Apr 19 10:55:22 CEST 2024


Version Info:
===
HBase 2.5.7
Source code repository
git://buildbox.localdomain/home/apurtell/tmp/RM/hbase
revision=6788f98356dd70b4a7ff766ea7a8298e022e7b95
Compiled by apurtell on Thu Dec 14 15:59:16 PST 2023
From source with checksum
1501d7fdf72398791ee335a229d099fc972cea7c2a952da7622eb087ddf975361f107cbbbee5d0ad6f603466e9afa1f4fd242ffccbd4371eb0b56059bb3b5402
Hadoop 2.10.2
Source code repository Unknown
revision=965fd380006fa78b2315668fbc7eb432e1d8200f
Compiled by ubuntu on 2022-05-25T00:12Z


Tasks:
===
Task: Master startup
Status: RUNNING:Starting assignment manager
Running for 954s

Task: Flushing master:store,,1.1595e783b53d99cd5eef43b6debb2682.
Status: COMPLETE:Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY,
failureReason:Nothing to flush,flush seq id14
Completed 49s ago
Ran for 0s

Task: RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000
Status: WAITING:Waiting for a call
Running for 951s

Task: RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000
Status: WAITING:Waiting for a call
Running for 951s



Servers:
===
servername1ct.gmd9.intern,16020,1713514863737: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=37.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername2ct.gmd9.intern,16020,1713514925960: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=20.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername3ct.gmd9.intern,16020,1713514937151: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=67.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername4ct.gmd9.intern,16020,1713514968019: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=24.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername5ct.gmd9.intern,16020,1713514979294: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=58.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]
servername6ct.gmd9.intern,16020,1713514994770: requestsPerSecond=0.0,
numberOfOnlineRegions=0, usedHeapMB=31.0MB, maxHeapMB=2966.0MB,
numberOfStores=0, numberOfStorefiles=0, storeRefCount=0,
maxCompactedStoreFileRefCount=0, storefileUncompressedSizeMB=0,
storefileSizeMB=0, memstoreSizeMB=0, readRequestsCount=0,
filteredReadRequestsCount=0, writeRequestsCount=0, rootIndexSizeKB=0,
totalStaticIndexSizeKB=0, totalStaticBloomSizeKB=0, totalCompactingKVs=0,
currentCompactedKVs=0, compactionProgressPct=NaN, coprocessors=[]


Regions-in-transition:
===


Executors:
===
  Status for executor:
Executor-4-MASTER_META_SERVER_OPERATIONS-master/masterserver:16000
  ===
  0 events queued, 0 running
  Status for executor:
Executor-6-MASTER_SNAPSHOT_OPERATIONS-master/masterserver:16000
  ===
  0 events queued, 0 running
  Status for executor:

Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Duo Zhang
Just post it somewhere so we can check it.

Udo Offermann wrote on Sat, Apr 20, 2024 at 20:25:
>
> I do have the dump file from the web UI. I can send it all, or you can tell
> me which threads you are interested in. Fortunately they all have meaningful names.
>
> 张铎(Duo Zhang) wrote on Sat, Apr 20, 2024 at 14:13:
>
> > What is the jstack result for HMaster while hanging? Wait on the
> > namespace table online or meta table online?
> >
> > Udo Offermann wrote on Sat, Apr 20, 2024 at 19:43:
> > >
> > > Hello everyone,
> > >
> > > We are upgrading our Hadoop/HBase cluster from Hadoop 2.8.5 & HBase 2.2.5
> > > to Hadoop 3.3.6 & HBase 2.5.7
> > >
> > > The Hadoop upgrade worked well, but unfortunately we have problems with
> > > the HBase upgrade, because the master hangs on startup inside the
> > > "Starting assignment manager" task.
> > >
> > > After 15 minutes the following message appears in the log file:
> > >
> > > Master failed to complete initialization after 900000ms. Please
> > > consider submitting a bug report including a thread dump of this
> > > process.
> > >
> > >
> > > We face the same problem as Adam did a couple of weeks ago ("Rolling
> > > upgrade from HBase 2.2.2 to 2.5.8 [typo corrected]: There are 2336
> > > corrupted procedures") and we fixed it in the same way, by deleting
> > > the MasterProcWALs folder in HDFS.
> > >
> > > I can provide an HMaster dump and a dump of one of the data nodes!
> > >
> > > How can we proceed with the upgrade?
> > >
> > > Thanks and best regards
> > > Udo
> >


Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Udo Offermann
I do have the dump file from the web UI. I can send it all, or you can tell me
which threads you are interested in. Fortunately they all have meaningful names.
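
For anyone following along, the dump can also be fetched directly from the
master's info server (a sketch; the /dump servlet lives on the master info
port, 16010 by default — the hostname here is illustrative):

curl http://masterserver.gmd9.intern:16010/dump > hmaster-dump.txt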

张铎(Duo Zhang) wrote on Sat, Apr 20, 2024 at 14:13:

> What is the jstack result for HMaster while hanging? Wait on the
> namespace table online or meta table online?
>
> Udo Offermann wrote on Sat, Apr 20, 2024 at 19:43:
> >
> > Hello everyone,
> >
> > We are upgrading our Hadoop/HBase cluster from Hadoop 2.8.5 & HBase 2.2.5
> > to Hadoop 3.3.6 & HBase 2.5.7
> >
> > The Hadoop upgrade worked well, but unfortunately we have problems with
> > the HBase upgrade, because the master hangs on startup inside the
> > "Starting assignment manager" task.
> >
> > After 15 minutes the following message appears in the log file:
> >
> > Master failed to complete initialization after 900000ms. Please
> > consider submitting a bug report including a thread dump of this
> > process.
> >
> >
> > We face the same problem as Adam did a couple of weeks ago ("Rolling
> > upgrade from HBase 2.2.2 to 2.5.8 [typo corrected]: There are 2336
> > corrupted procedures") and we fixed it in the same way, by deleting
> > the MasterProcWALs folder in HDFS.
> >
> > I can provide an HMaster dump and a dump of one of the data nodes!
> >
> > How can we proceed with the upgrade?
> >
> > Thanks and best regards
> > Udo
>


Re: HBase Master hangs on startup during upgrade from 2.2.5 to 2.5.7

2024-04-20 Thread Duo Zhang
What is the jstack result for HMaster while hanging? Wait on the
namespace table online or meta table online?

Udo Offermann wrote on Sat, Apr 20, 2024 at 19:43:
>
> Hello everyone,
>
> We are upgrading our Hadoop/HBase cluster from Hadoop 2.8.5 & HBase 2.2.5
> to Hadoop 3.3.6 & HBase 2.5.7
>
> The Hadoop upgrade worked well, but unfortunately we have problems with the
> HBase upgrade, because the master hangs on startup inside the "Starting
> assignment manager" task.
>
> After 15 minutes the following message appears in the log file:
>
> Master failed to complete initialization after 900000ms. Please
> consider submitting a bug report including a thread dump of this
> process.
>
>
> We face the same problem as Adam did a couple of weeks ago ("Rolling
> upgrade from HBase 2.2.2 to 2.5.8 [typo corrected]: There are 2336
> corrupted procedures") and we fixed it in the same way, by deleting
> the MasterProcWALs folder in HDFS.
>
> I can provide an HMaster dump and a dump of one of the data nodes!
>
> How can we proceed with the upgrade?
>
> Thanks and best regards
> Udo