[ 
https://issues.apache.org/jira/browse/HBASE-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736350#comment-16736350
 ] 

stack commented on HBASE-21519:
-------------------------------

[~pankaj2461] I don't know (or remember) this part of the code well. The 
createProvider change makes sense -- setting the provider to be the meta 
provider if it has the right group name. The getWal change, though, is a bit 
strange. I'd think the provider id for the group WAL would have been set 
already? Should we be selecting the meta provider higher up someplace rather 
than in here, looking at the region to see if it is meta and then selecting 
the meta provider?
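
To make the "higher up" idea concrete, here is a rough, self-contained sketch.
Every name in it (WalSelectionSketch, WalProvider, NamedProvider, Factory) is
hypothetical and made up for illustration; it is not the HBase API and not the
patch. It only shows the meta-vs-default decision being made once, at the
factory level, so the individual providers never need to inspect the region.

```java
// Illustrative sketch only: these types are hypothetical, not HBase classes.
class WalSelectionSketch {
    // Minimal stand-in for a WAL provider: maps a region name to a WAL label.
    interface WalProvider {
        String walFor(String regionName);
    }

    // A provider that simply tags the region name with its own id.
    static final class NamedProvider implements WalProvider {
        private final String id;

        NamedProvider(String id) {
            this.id = id;
        }

        @Override
        public String walFor(String regionName) {
            return id + ":" + regionName;
        }
    }

    // The factory owns both providers and picks one per request.
    static final class Factory {
        private final WalProvider metaProvider = new NamedProvider("meta");
        private final WalProvider defaultProvider = new NamedProvider("default");

        // The meta check happens here, once, rather than inside each provider.
        String getWal(String regionName, boolean isMetaRegion) {
            WalProvider chosen = isMetaRegion ? metaProvider : defaultProvider;
            return chosen.walFor(regionName);
        }
    }

    public static void main(String[] args) {
        Factory factory = new Factory();
        System.out.println(factory.getWal("hbase:meta", true));
        System.out.println(factory.getWal("hbase:namespace", false));
    }
}
```

Whether the real code can be arranged this way depends on how the group
provider is wired up, which is the open question above.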

This must have been a hard one to figure. Good on you.

> Namespace region is never assigned in a HM failover scenario and HM abort 
> always due to init timeout
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21519
>                 URL: https://issues.apache.org/jira/browse/HBASE-21519
>             Project: HBase
>          Issue Type: Bug
>          Components: master, wal
>    Affects Versions: 2.1.1
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 2.2.0
>
>         Attachments: HBASE-21519.branch-2.patch
>
>
> In our test env we found that the namespace region is never assigned in an 
> HM failover scenario when the multiwal feature is enabled:
> {noformat}
> 2018-11-28 01:38:28,085 WARN [master/HM-1:16000:becomeActiveMaster] 
> master.HMaster: 
> hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. is NOT 
> online; state={31f6d3383af09e18e1e81ca02a93de15 state=OPEN, 
> ts=1543340156928, server=RS-2,16020,1543339824397}; 
> ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> {noformat}
> Finally, the HM aborts with the following error:
> {noformat}
> 2018-11-28 01:39:16,858 ERROR 
> [ActiveMasterInitializationMonitor-1543338648565] master.HMaster: Master 
> failed to complete initialization after 240000ms. Please consider submitting 
> a bug report including a thread dump of this process.
> 2018-11-28 01:39:18,980 ERROR 
> [ActiveMasterInitializationMonitor-1543338648565] master.HMaster: Zombie 
> Master exiting. Thread dump to stdout
> {noformat}
> Stack trace:
> {noformat}
> Thread 102 (master/HM-1:16000:becomeActiveMaster):
>  State: TIMED_WAITING
>  Blocked count: 100
>  Waited count: 246
>  Stack:
>  java.lang.Thread.sleep(Native Method)
>  org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:148)
>  org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1166)
>  org.apache.hadoop.hbase.master.HMaster.waitForNamespaceOnline(HMaster.java:1187)
>  org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1044)
>  org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2285)
>  org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:590)
>  org.apache.hadoop.hbase.master.HMaster$$Lambda$40/1078246575.run(Unknown Source)
>  java.lang.Thread.run(Thread.java:745)
> {noformat}
>  
> Steps to reproduce:
>  1) Set up an HBase cluster with 1/2 HM (say HM-1) and 2 RS (say RS-1 & RS-2).
>  2) Enable the multiwal feature with the following configuration and start 
> the cluster:
> {noformat}
>  <property>
>    <name>hbase.wal.provider</name>
>    <value>multiwal</value>
>  </property>
>  <property>
>    <name>hbase.wal.regiongrouping.strategy</name>
>    <value>identity</value>
>  </property>
> {noformat}
>  3) Make sure the meta and namespace regions are assigned to different RSs, 
> say RS-1 & RS-2 respectively.
>  4) Create table 't1'.
>  5) Flush the meta table explicitly.
>  6) Kill RS-2, so that during the RS-2 SCP all regions, including the 
> namespace region, are assigned to RS-1.
>  7) Now kill RS-1 before a meta flush happens. At this point both RS-2 & RS-1 
> are shut down.
>  8) Stop the HM and start RS-1 & RS-2.
>  9) Now start the HM.
> The meta region is assigned successfully, but the HM keeps waiting for the 
> namespace region to come online (Master startup cannot progress, in 
> holding-pattern until region onlined) and aborts on timeout.
> Observation:
>  1) After step 3, the namespace region was assigned to RS-2 and the meta 
> entry was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339860920, value=RS-2:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339860920, value=1543339824397
> {noformat}
>  2) After step 6, the namespace region was assigned to RS-1 and the meta 
> entry was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339880920, value=RS-1:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339880920, value=1543339829288
> {noformat}
>  3) After step 9, the meta entry for the namespace region was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339860920, value=RS-2:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339860920, value=1543339824397
> {noformat}
> During SCP we split the meta log based on a filter:
> {noformat}
>  /**
>   * Specialized method to handle the splitting for meta WAL
>   * @param serverNames logs belonging to these servers will be split
>   */
>  public void splitMetaLog(final Set<ServerName> serverNames) throws IOException {
>    splitLog(serverNames, META_FILTER);
>  }
> {noformat}
> So in this case the meta log split is skipped under the multiwal provider, 
> because the WAL file suffix won't be .meta.
> I will analyze it further.
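
The suffix-based skip described in the report can be sketched in isolation as
follows. This is a simplified stand-in, not the HBase code: MetaFilterSketch,
selectForMetaSplit and the WAL file names are all hypothetical, and the only
detail taken from the report is that the filter keys on a ".meta" suffix.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch only: class, method and file names are hypothetical.
class MetaFilterSketch {
    // Suffix the report says the meta filter looks for.
    static final String META_SUFFIX = ".meta";

    // Stand-in for META_FILTER: accepts only names ending in ".meta".
    static final Predicate<String> META_FILTER = name -> name.endsWith(META_SUFFIX);

    // Returns the WAL file names a meta-only split would pick up.
    static List<String> selectForMetaSplit(List<String> walFileNames) {
        return walFileNames.stream().filter(META_FILTER).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> wals = List.of(
            // Hypothetical name carrying the ".meta" suffix: selected.
            "RS-1%2C16020%2C1543339829288.1543339830000.meta",
            // Hypothetical group-named WAL without the suffix: skipped.
            "RS-1%2C16020%2C1543339829288.some-group.1543339830000");
        System.out.println(selectForMetaSplit(wals));
    }
}
```

Any WAL that holds meta edits but does not end in ".meta" falls through a
filter of this shape, which matches the observation that the meta entry for
the namespace region went back to RS-2 after step 9.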



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
