Hi Sean,

Here is the hdfs content: https://pastebin.com/EqK1zhEe

I unfortunately don't have HDFS audit logs :( And I cleaned HBase logs
before the last upgrade test, so RCA will be difficult :-/

JMS

Le ven. 29 mars 2019 à 16:01, Sean Busbey <bus...@apache.org> a écrit :

> So all we have in hbase:meta is an entry for each table that claims
> they're all in enabled state.
>
> And the info column family is totally empty?  I believe this is a
> failure state we don't have tooling for yet. Can you upload and link
> the results of running hdfs dfs -ls -R on the /hbase directory?
>
> Do you happen to have HDFS auditing turned on and logs that go back a
> few weeks? I'd be curious about how we got into this state. The only
> way I've seen it happen thus far is when folks disabled the safety
> that keeps hbck1 from running.
>
> On Fri, Mar 29, 2019 at 9:40 AM Jean-Marc Spaggiari
> <jean-m...@spaggiari.org> wrote:
> >
> > Hi Sean,
> >
> > Thanks again for keeping an eye on that.
> >
> > I think the META content has been lost somewhere in the process.
> >
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:42
> > /hbase/data/hbase/meta/.tabledesc
> > -rw-r--r--   3 hbase supergroup       1447 2019-03-12 15:42
> > /hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:42
> > /hbase/data/hbase/meta/.tmp
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:49
> > /hbase/data/hbase/meta/1588230740
> > -rw-r--r--   3 hbase supergroup         32 2019-03-12 15:40
> > /hbase/data/hbase/meta/1588230740/.regioninfo
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:40
> > /hbase/data/hbase/meta/1588230740/info
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:40
> > /hbase/data/hbase/meta/1588230740/recovered.edits
> > -rw-r--r--   3 hbase supergroup          0 2019-03-12 15:40
> > /hbase/data/hbase/meta/1588230740/recovered.edits/2.seqid
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:42
> > /hbase/data/hbase/meta/1588230740/rep_barrier
> > drwxr-xr-x   - hbase supergroup          0 2019-03-12 15:47
> > /hbase/data/hbase/meta/1588230740/table
> > -rw-r--r--   3 hbase supergroup       5454 2019-03-12 15:47
> > /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc
> >
> > And this is the content of the file:
> > hbase@node2:~$ hbase hfile -p -f
> > /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc
> > 2019-03-29 12:38:36,028 INFO  [main] metrics.MetricRegistries: Loaded
> > MetricRegistries class
> > org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
> > K: customers/table:state/1552419727646/Put/vlen=2/seqid=903258414 V:
> > \x08\x00
> > K: dns/table:state/1552419727462/Put/vlen=2/seqid=903258404 V: \x08\x00
> > K: email/table:state/1552419727691/Put/vlen=2/seqid=903258416 V: \x08\x00
> > K: email_proposed/table:state/1552419727602/Put/vlen=2/seqid=903258410 V:
> > \x08\x00
> > K: ew_table/table:state/1552419727527/Put/vlen=2/seqid=903258406 V:
> \x08\x00
> > K: hbase:acl/table:state/1552419727547/Put/vlen=2/seqid=903258407 V:
> > \x08\x00
> > K: hbase:namespace/table:state/1552419727382/Put/vlen=2/seqid=903258402
> V:
> > \x08\x00
> > K: page/table:state/1552419727669/Put/vlen=2/seqid=903258415 V: \x08\x00
> > K: pageAvro/table:state/1552419727572/Put/vlen=2/seqid=903258408 V:
> \x08\x00
> > K: pageMini/table:state/1552419727591/Put/vlen=2/seqid=903258409 V:
> \x08\x00
> > K: pageSpark/table:state/1552419727867/Put/vlen=2/seqid=903258417 V:
> > \x08\x00
> > K: page_crc/table:state/1552419727635/Put/vlen=2/seqid=903258413 V:
> \x08\x00
> > K: page_duplicate/table:state/1552419727613/Put/vlen=2/seqid=903258411 V:
> > \x08\x00
> > K: page_proposed/table:state/1552419727175/Put/vlen=2/seqid=903258401 V:
> > \x08\x00
> > K: tree/table:state/1552419727502/Put/vlen=2/seqid=903258405 V: \x08\x00
> > K: work_proposed/table:state/1552419727402/Put/vlen=2/seqid=903258403 V:
> > \x08\x00
> > K: work_sent/table:state/1552419727624/Put/vlen=2/seqid=903258412 V:
> > \x08\x00
> > Scanned kv count -> 17
> >
> > Seems that it's still aware of the tables. But I don't see any reference
> to
> > any server...
> >
> > JMS
> >
> >
> > Le ven. 29 mars 2019 à 12:25, Sean Busbey <bus...@apache.org> a écrit :
> >
> > > Okay I read the logs again and we're in a weird failure state.
> > >
> > > 1) Master comes up
> > > 2) Master schedules SCP for all RS
> > > 3) Master recovers meta
> > > 4) SCP for every server claims AM currently thinks 0 regions were
> > > assigned to each server.
> > > 5) Master successfully finishes WAL splitting from dead RS and works
> > > through prior split attempts that died?
> > > 6) WAL recovery from every RS says there are no edits for any region
> > > 7) No Assignments are scheduled out of the SCP because each believes
> > > there were no regions hosted on the server that's being processed
> > > 6) Master reports all SCP have completed successfully
> > > 7) Master times out at initializing
> > >
> > > Could you link to a scan of meta? it'll include server names, table
> > > names, and region information, so I'm not sure if any of those are too
> > > sensitive?
> > >
> > > On Thu, Mar 14, 2019 at 11:36 AM Jean-Marc Spaggiari
> > > <jean-m...@spaggiari.org> wrote:
> > > >
> > > > Updated logs are there: https://pastebin.com/1UrTA8JS
> > > >
> > > > They really look like exactly the same as the previous version :-/
> > > >
> > > > There is no warning, no error, nothing :(
> > > >
> > > > JMS
> > > >
> > > > Le jeu. 14 mars 2019 à 13:38, Sean Busbey <bus...@apache.org> a
> écrit :
> > > >
> > > > > We still need to find out why hbase:namespace is not online. Did
> the
> > > > > logs complaining about being unable to assign regions not include
> any
> > > > > thing about the region(s) for the namespace table?
> > > > >
> > > > > Can you upload updated logs?
> > > > >
> > > > > If there's no mention of it then that sounds like we need an hbck2
> > > > > command to output the current assignment state of a region.
> > > > >
> > > > > On Thu, Mar 14, 2019 at 11:57 AM Jean-Marc Spaggiari
> > > > > <jean-m...@spaggiari.org> wrote:
> > > > > >
> > > > > > I stopped all the region servers, started the master. It was
> > > complaining
> > > > > > about not being able to assign regions. Then started region
> servers,
> > > but
> > > > > > after 5 minutes got the same error :-/
> > > > > >
> > > > > > 2019-03-14 12:46:38,586 ERROR
> [master/node2:60000:becomeActiveMaster]
> > > > > > master.HMaster: Failed to become active master
> > > > > > java.lang.IllegalStateException: Expected the service
> > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service
> has
> > > > > FAILED
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347)
> > > > > > at
> > > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595)
> > > > > > at java.lang.Thread.run(Thread.java:748)
> > > > > > Caused by: java.io.IOException: Timedout 300000ms waiting for
> > > namespace
> > > > > > table to be assigned and enabled: tableName=hbase:namespace,
> > > > > state=ENABLED
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339)
> > > > > > ... 4 more
> > > > > >
> > > > > > Then stopped all, configured the maintenance mode, started all,
> get
> > > the
> > > > > > same error. I tried to bounce the RS within those 5 minutes
> without
> > > any
> > > > > > difference. I still get the same exception after 5 minutes:
> > > > > > 2019-03-14 12:55:35,167 ERROR
> [master/node2:60000:becomeActiveMaster]
> > > > > > master.HMaster: ***** ABORTING master node2.distparser.com
> > > > > ,60000,1552582220013:
> > > > > > Unhandled exception. Starting shutdown. *****
> > > > > > java.lang.IllegalStateException: Expected the service
> > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service
> has
> > > > > FAILED
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347)
> > > > > > at
> > > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595)
> > > > > > at java.lang.Thread.run(Thread.java:748)
> > > > > > Caused by: java.io.IOException: Timedout 300000ms waiting for
> > > namespace
> > > > > > table to be assigned and enabled: tableName=hbase:namespace,
> > > > > state=ENABLED
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339)
> > > > > > ... 4 more
> > > > > >
> > > > > > I validated the maintenance mode:
> > > > > > 2019-03-14 12:50:25,421 INFO  [main] master.HMaster: Detected
> > > > > > hbase.master.maintenance_mode=true via configuration.
> > > > > >
> > > > > > And now removed it.
> > > > > >
> > > > > > What can I try next?
> > > > > >
> > > > > > I know I can always rename /hbase to /hbase_old and bulkload the
> > > HFiles
> > > > > > back to the table, that's not a big deal, but I'm curious to see
> if
> > > we
> > > > > can
> > > > > > get that working...
> > > > > >
> > > > > >
> > > > > >
> > > > > > BTW, when we try to access the Master webUI while it starts we
> are
> > > > > getting
> > > > > > an exception:
> > > > > > 2019-03-14 12:50:40,782 WARN  [qtp196717412-78]
> > > servlet.ServletHandler:
> > > > > > /master-status
> > > > > > java.lang.NullPointerException
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:326)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:397)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:388)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:79)
> > > > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> > > > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> > > > > > at
> > > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1780)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:112)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
> > > > > > at
> > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:48)
> > > > > >
> > > > > >
> > > > > > Le jeu. 14 mars 2019 à 08:07, Wellington Chevreuil <
> > > > > > wellington.chevre...@gmail.com> a écrit :
> > > > > >
> > > > > > > Yeah, as I suspected in my previous comment, for this type of
> > > > > timeouts, the
> > > > > > > maintenance mode wouldn't give any help. It's weird that AM
> starts
> > > but
> > > > > > > apparently does nothing until the namespace 5 mins timeout is
> > > reached:
> > > > > > > ...
> > > > > > > 2019-03-12 20:53:45,942 INFO
> > > [master/node2:60000:becomeActiveMaster]
> > > > > > > assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > 2019-03-12 20:54:45,725 INFO
> > > > > > > [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0]
> > > > > > > zookeeper.ZooKeeper: Session: 0x16911bd542a02a2 closed
> > > > > > > 2019-03-12 20:54:45,725 INFO
> > > > > > > [ReadOnlyZKClient-latitude.distparser.com:2181
> > > @0x7ea9b2c0-EventThread]
> > > > > > > zookeeper.ClientCnxn: EventThread shut down for session:
> > > > > 0x16911bd542a02a2
> > > > > > > 2019-03-12 20:58:46,603 ERROR
> > > [master/node2:60000:becomeActiveMaster]
> > > > > > > master.HMaster: Failed to become active master
> > > > > > > java.lang.IllegalStateException: Expected the service
> > > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the
> service
> > > has
> > > > > FAILED
> > > > > > > ...
> > > > > > >
> > > > > > > I would expect namespace region to be processed by this call
> > > > > > > <
> > > > > > >
> > > > >
> > >
> https://github.com/apache/hbase/blob/2.2.0-RC0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1088
> > > > > > > >,
> > > > > > > if
> > > > > > > namespace region is offline, as the timeout suggests. Also odd
> is
> > > that
> > > > > we
> > > > > > > don;'t see any logs suggesting offlined regions are getting
> > > assigned.
> > > > > Maybe
> > > > > > > all regions are already online on RSes? But then master should
> had
> > > > > figured
> > > > > > > that out. Have you already tried restart all RSes? That could
> kick
> > > some
> > > > > > > reassignments.
> > > > > > >
> > > > > > > Em qui, 14 de mar de 2019 às 02:21, Jean-Marc Spaggiari <
> > > > > > > jean-m...@spaggiari.org> escreveu:
> > > > > > >
> > > > > > > > Hi Wellington,
> > > > > > > >
> > > > > > > > Indeed, the META is now deployed.  I found the namespace
> region
> > > > > encoded
> > > > > > > > name using hdfs dfs -ls -R /hbase/data/hbase/namespace and it
> > > gives
> > > > > > > > me 7f4a480f47f98300185d1ae2ff663295. But here again, HBCK
> doesn't
> > > > > want to
> > > > > > > > do anything because the master is initializing :( I tried
> with ad
> > > > > without
> > > > > > > > the maintenant flag and I get the same result.
> > > > > > > >
> > > > > > > > On HBCK2 side: PleaseHoldException: Master is initializing
> > > > > > > > On the master side, it just stoped after 5 minutes trying to
> > > assign
> > > > > > > > namespace :(
> > > > > > > >
> > > > > > > > JMS
> > > > > > > >
> > > > > > > >
> > > > > > > > Le mer. 13 mars 2019 à 12:04, Wellington Chevreuil <
> > > > > > > > wellington.chevre...@gmail.com> a écrit :
> > > > > > > >
> > > > > > > > > "1588230740" would be the meta region name, not namespace.
> It
> > > seems
> > > > > > > meta
> > > > > > > > is
> > > > > > > > > already online, per below log:
> > > > > > > > > ...
> > > > > > > > > 2019-03-12 20:53:41,037 INFO
> > > > > [master/node2:60000:becomeActiveMaster]
> > > > > > > > > master.HMaster: hbase:meta {1588230740 state=OPEN,
> > > > > ts=1552438420570,
> > > > > > > > > server=
> > > > > > > > > node7.distparser.com,16020,1552421510124}
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > The maintenance mode I suggested before was to have master
> > > doing
> > > > > > > minimum
> > > > > > > > > required stuff while attempting to getting meta/namespace
> > > online,
> > > > > but I
> > > > > > > > > guess it wouldn't be able to avoid such timeouts. Below
> message
> > > > > also
> > > > > > > > means
> > > > > > > > > AM could read meta table, giving another indication meta is
> > > fine:
> > > > > > > > > ...
> > > > > > > > > 2019-03-12 20:53:45,942 INFO
> > > > > [master/node2:60000:becomeActiveMaster]
> > > > > > > > > assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > Now issue is namespace table. For some reason, AM is not
> able
> > > to
> > > > > kick
> > > > > > > APs
> > > > > > > > > before the 5 minutes timeout exceeds, and that's probably
> why
> > > > > namespace
> > > > > > > > > table never comes available:
> > > > > > > > > ...
> > > > > > > > > 2019-03-12 20:53:45,942 INFO
> > > > > [master/node2:60000:becomeActiveMaster]
> > > > > > > > > assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > > > 2019-03-12 20:54:45,725 INFO
> > > > > > > > > [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0]
> > > > > > > > > zookeeper.ZooKeeper: Session: 0x16911bd542a02a2 closed
> > > > > > > > > 2019-03-12 20:54:45,725 INFO
> > > > > > > > > [ReadOnlyZKClient-latitude.distparser.com:2181
> > > > > @0x7ea9b2c0-EventThread]
> > > > > > > > > zookeeper.ClientCnxn: EventThread shut down for session:
> > > > > > > > 0x16911bd542a02a2
> > > > > > > > > 2019-03-12 20:58:46,603 ERROR
> > > > > [master/node2:60000:becomeActiveMaster]
> > > > > > > > > master.HMaster: Failed to become active master
> > > > > > > > > java.lang.IllegalStateException: Expected the service
> > > > > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the
> > > service
> > > > > has
> > > > > > > > FAILED
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > You may be able to force namespace region coming online
> with
> > > hbck2
> > > > > > > > assigns
> > > > > > > > > command. You would need to find out the namespace region
> name
> > > > > first,
> > > > > > > you
> > > > > > > > > can either scan meta table or check the region dir name in
> hdfs
> > > > > with
> > > > > > > > "hdfs
> > > > > > > > > dfs -ls -R /hbase | grep namespace", in order to pass it
> as a
> > > > > param for
> > > > > > > > >
> > > > > > > > > Em qua, 13 de mar de 2019 às 13:00, Jean-Marc Spaggiari <
> > > > > > > > > jean-m...@spaggiari.org> escreveu:
> > > > > > > > >
> > > > > > > > > > Hi Sean,
> > > > > > > > > >
> > > > > > > > > > I tried. I looked-up the region name for base:namespace
> like
> > > > > this:
> > > > > > > > > >
> > > > > > > > > > hdfs dfs -ls /hbase/data/hbase/meta/
> > > > > > > > > >
> > > > > > > > > > And found the region to be 1588230740.
> > > > > > > > > >
> > > > > > > > > > The master dies after 5 minutes, so I start the master,
> wait
> > > 2
> > > > > > > minutes
> > > > > > > > to
> > > > > > > > > > be sure it's up, and run the following command:
> > > > > > > > > >
> > > > > > > > > > bin/hbase hbck -j
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> test/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
> > > > > > > > > > assigns 1588230740
> > > > > > > > > >
> > > > > > > > > > But HBCK2 doesn't like it:
> > > > > > > > > > 08:57:35.273 [main] INFO
> > > > > > > > > > org.apache.hadoop.hbase.client.RpcRetryingCallerImpl -
> Call
> > > > > > > exception,
> > > > > > > > > > tries=9, retries=16, started=29322 ms ago,
> cancelled=false,
> > > > > > > > > > msg=org.apache.hadoop.hbase.PleaseHoldException: Master
> is
> > > > > > > initializing
> > > > > > > > > > at
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3057)
> > > > > > > > > > at
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.MasterRpcServices.getClusterStatus(MasterRpcServices.java:942)
> > > > > > > > > > at
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
> > > > > > > > > > at
> > > org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> > > > > > > > > > at
> > > > > org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> > > > > > > > > > at
> > > > > > > > > >
> > > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> > > > > > > > > > at
> > > > > > > > > >
> > > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It keeps retrying and after 16 times it stopped saying
> the
> > > > > master is
> > > > > > > > not
> > > > > > > > > > initialized.
> > > > > > > > > >
> > > > > > > > > > On the WebUI I can see that there is a single region
> > > assigned,
> > > > > the
> > > > > > > META
> > > > > > > > > > region.
> > > > > > > > > >
> > > > > > > > > > Also, here is the HDFS structure of my META table. Sounds
> > > like
> > > > > some
> > > > > > > > parts
> > > > > > > > > > got lost in the process (The info content).
> > > > > > > > > >
> > > > > > > > > > hbase@node2:~/hbase-2.2.0$ hdfs dfs -ls -R
> > > > > /hbase/data/hbase/meta/
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:42
> > > > > > > > > > /hbase/data/hbase/meta/.tabledesc
> > > > > > > > > > -rw-r--r--   3 hbase supergroup       1447 2019-03-12
> 15:42
> > > > > > > > > > /hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:42
> > > > > > > > > > /hbase/data/hbase/meta/.tmp
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:49
> > > > > > > > > > /hbase/data/hbase/meta/1588230740
> > > > > > > > > > -rw-r--r--   3 hbase supergroup         32 2019-03-12
> 15:40
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/.regioninfo
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:40
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/info
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:40
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/recovered.edits
> > > > > > > > > > -rw-r--r--   3 hbase supergroup          0 2019-03-12
> 15:40
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/recovered.edits/2.seqid
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:42
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/rep_barrier
> > > > > > > > > > drwxr-xr-x   - hbase supergroup          0 2019-03-12
> 15:47
> > > > > > > > > > /hbase/data/hbase/meta/1588230740/table
> > > > > > > > > > -rw-r--r--   3 hbase supergroup       5454 2019-03-12
> 15:47
> > > > > > > > > >
> > > > > > > >
> > > > >
> > >
> /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc
> > > > > > > > > >
> > > > > > > > > > What will be the next best step?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > JMS
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Le mer. 13 mars 2019 à 08:45, Sean Busbey <
> bus...@apache.org>
> > > a
> > > > > > > écrit
> > > > > > > > :
> > > > > > > > > >
> > > > > > > > > > > Okay so master thinks hbase:namespace is already
> enabled,
> > > but
> > > > > no RS
> > > > > > > > > > > believes it should be hosting the regions.
> > > > > > > > > > >
> > > > > > > > > > > Can you find the region name for the hbase:namespace
> > > region and
> > > > > > > issue
> > > > > > > > > > > an hbck2 assigns command for it?
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Mar 12, 2019 at 8:26 PM Jean-Marc Spaggiari
> > > > > > > > > > > <jean-m...@spaggiari.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > It doesn't say that much :(
> > > > > > > > > > > >
> > > > > > > > > > > > hbase@node2:~/hbase-2.2.0$ cat
> > > > > logs/hbase-hbase-master-node2.log
> > > > > > > > |
> > > > > > > > > > > grep -i
> > > > > > > > > > > > namespace
> > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms
> > > waiting for
> > > > > > > > > namespace
> > > > > > > > > > > > table to be assigned and enabled:
> > > tableName=hbase:namespace,
> > > > > > > > > > > state=ENABLED
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms
> > > waiting for
> > > > > > > > > namespace
> > > > > > > > > > > > table to be assigned and enabled:
> > > tableName=hbase:namespace,
> > > > > > > > > > > state=ENABLED
> > > > > > > > > > > > at
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > > > > > > >
> > > > > > > > > > > > I cleared the logs before restarting the instance.
> That
> > > all
> > > > > what
> > > > > > > it
> > > > > > > > > > says
> > > > > > > > > > > > about namespace.
> > > > > > > > > > > >
> > > > > > > > > > > > Full logs are available there:
> > > https://pastebin.com/9j2Rzdcg
> > > > > > > > > > > >
> > > > > > > > > > > > Le mar. 12 mars 2019 à 20:47, Sean Busbey <
> > > > > sean.bus...@gmail.com
> > > > > > > >
> > > > > > > > a
> > > > > > > > > > > écrit :
> > > > > > > > > > > >
> > > > > > > > > > > > > okay so the master spent ~5 minutes waiting to see
> if
> > > it
> > > > > could
> > > > > > > > get
> > > > > > > > > > the
> > > > > > > > > > > > > namespace table working. when it couldn't it
> aborted.
> > > > > > > > > > > > >
> > > > > > > > > > > > > can you look back over that 5 minutes and see what
> the
> > > > > master
> > > > > > > had
> > > > > > > > > to
> > > > > > > > > > > > > say about the namespace table? did the master think
> > > some
> > > > > > > > particular
> > > > > > > > > > > > > server should have it open already? was it waiting
> for
> > > > > someone
> > > > > > > to
> > > > > > > > > > > > > finish opening or closing it?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Mar 12, 2019 at 6:39 PM Jean-Marc Spaggiari
> > > > > > > > > > > > > <jean-m...@spaggiari.org> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Le mar. 12 mars 2019 à 19:25, Sean Busbey <
> > > > > bus...@apache.org
> > > > > > > >
> > > > > > > > a
> > > > > > > > > > > écrit :
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > your command above points at the wrong jar
> from the
> > > > > hbck2
> > > > > > > > repo.
> > > > > > > > > > > it's
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > pointing at the one where you need to manually
> > > assemble
> > > > > all
> > > > > > > the
> > > > > > > > > > > > > > > dependencies it has.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You want the one that does not say "original"
> in
> > > the
> > > > > name.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ha!!! That's why! Way easier ;)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > indeed, this works even without removing all
> > > environment
> > > > > > > > > variables:
> > > > > > > > > > > > > >  bin/hbase hbck -j
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> test/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Can you confirm it's the one in
> > > > > > > > > > > > > > > > > the bin tarball? what does the version
> command
> > > > > output?
> > > > > > > > What
> > > > > > > > > > > does
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > mapredcp command output? What does the cli
> help
> > > > > for the
> > > > > > > > > hbase
> > > > > > > > > > > > > command
> > > > > > > > > > > > > > > > > show?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  hbase@node2:~/hbase-2.2.0$ hbase mapredcp
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> /home/hbase/hbase-2.2.0/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.2.0.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/slf4j-api-1.7.25.jar
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > that looks great now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Once you correct the hbck2 jar above I think
> > > you'll be
> > > > > good
> > > > > > > > for
> > > > > > > > > > > > > invoking
> > > > > > > > > > > > > > > HBCK2.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Next, what does the initializing master say
> it's
> > > > > doing? It
> > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > > > > on the master UI near the bottom. If it hasn't
> made
> > > > > > > progress
> > > > > > > > > > since
> > > > > > > > > > > > > > > your last update it'll be waiting for the
> > > > > hbase:namespace
> > > > > > > > > table.
> > > > > > > > > > > If it
> > > > > > > > > > > > > > > is, find the region and see what the last few
> > > messages
> > > > > in
> > > > > > > the
> > > > > > > > > > > master
> > > > > > > > > > > > > > > log are about that region.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The master died some times ago. It dies after 5
> > > minutes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2019-03-12 19:35:58,568 ERROR
> > > > > > > > > > [master/node2:60000:becomeActiveMaster]
> > > > > > > > > > > > > > master.HMaster: Failed to become active master
> > > > > > > > > > > > > > java.lang.IllegalStateException: Expected the
> service
> > > > > > > > > > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING,
> but
> > > the
> > > > > > > > service
> > > > > > > > > > has
> > > > > > > > > > > > > FAILED
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347)
> > > > > > > > > > > > > > at
> > > > > > > > > > >
> > > > > > >
> > > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595)
> > > > > > > > > > > > > > at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms
> > > > > waiting for
> > > > > > > > > > > namespace
> > > > > > > > > > > > > > table to be assigned and enabled:
> > > > > tableName=hbase:namespace,
> > > > > > > > > > > > > state=ENABLED
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
> > > > > > > > > > > > > > at
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339)
> > > > > > > > > > > > > > ... 4 more
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I just restarted it. I can see the meta table
> being
> > > > > > > assigned. I
> > > > > > > > > can
> > > > > > > > > > > > > access
> > > > > > > > > > > > > > the WebUI and I don't see any initializing
> > > information.
> > > > > On
> > > > > > > the
> > > > > > > > > > table
> > > > > > > > > > > > > > section, I don't see anything, in any tab.
> However,
> > > when
> > > > > > > doing
> > > > > > > > > > > "list" on
> > > > > > > > > > > > > > the shell, I can see my tables. But I can not
> scan
> > > them.
> > > > > > > > Scanning
> > > > > > > > > > any
> > > > > > > > > > > > > table
> > > > > > > > > > > > > > gives :
> > > > > > > > > > > > > > hbase(main):001:0> scan 'hbase:namespace'
> > > > > > > > > > > > > > ROW                                   COLUMN+CELL
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ERROR: Unknown table hbase:namespace!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For usage try 'help "scan"'
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Took 1.0395 seconds
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > JMS
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Sean
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
>

Reply via email to