[ 
https://issues.apache.org/jira/browse/HBASE-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722527#comment-16722527
 ] 

Pankaj Kumar commented on HBASE-21519:
--------------------------------------

The problem occurs when multiwal is enabled. During RegionGroupingProvider 
init, providerId is always set to the server name. But when the meta region is 
opened, the WAL group is selected based on the condition below in 
RegionGroupingProvider:

https://github.com/apache/hbase/blob/3180a6864aa6a019fc1ec21ae78f009475bacea1/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/RegionGroupingProvider.java#L192

Since providerId is set to the server name in RegionGroupingProvider, the WAL 
group won't be META_WAL_GROUP_NAME (meta), and the meta WAL is created as a 
normal WAL (without the .meta extension). During meta SCP we split logs based 
on the meta filter, so the meta WAL file is not split. The namespace region 
then can't be reassigned, because according to meta (based on the HFile) it was 
on a dead server whose SCP had already finished.
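
To make the mismatch concrete, here is a minimal standalone Java sketch of the 
behaviour described above. It is not the HBase source: the class 
WalGroupSketch, its methods and the simplified file-name format are 
hypothetical, and the constant values (".meta" as the meta provider id and 
suffix, "meta" as the meta group name) are assumptions based on the names used 
in this comment.

```java
public class WalGroupSketch {
    // Assumed constants, named after the identifiers mentioned in this comment.
    public static final String META_WAL_PROVIDER_ID = ".meta";
    public static final String META_WAL_GROUP_NAME = "meta";

    // Group selection as described: only providerId is consulted, so a provider
    // initialised with the server name never returns the meta group.
    public static String selectGroup(String providerId, String encodedRegionName) {
        if (META_WAL_PROVIDER_ID.equals(providerId)) {
            return META_WAL_GROUP_NAME;
        }
        return encodedRegionName; // "identity" strategy: one group per region
    }

    // Simplified, hypothetical naming: only the meta group gets the .meta suffix.
    public static String walFileName(String serverName, String group, long ts) {
        String suffix = META_WAL_GROUP_NAME.equals(group) ? META_WAL_PROVIDER_ID : "";
        return serverName + "." + group + "." + ts + suffix;
    }

    // Stand-in for the meta filter applied when splitting meta logs during SCP.
    public static boolean isMetaWal(String fileName) {
        return fileName.endsWith(META_WAL_PROVIDER_ID);
    }

    public static void main(String[] args) {
        String server = "RS-1,16020,1543339829288"; // server name used as providerId
        String metaRegion = "1588230740";           // encoded name of hbase:meta

        String group = selectGroup(server, metaRegion);
        String file = walFileName(server, group, 1543339880920L);
        System.out.println(group + " -> " + file + " metaWal=" + isMetaWal(file));
    }
}
```

In this model the meta region's WAL file does not carry the .meta suffix, so a 
filter that only accepts .meta files passes over it.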

In the attached patch, the WAL group is initialized based on the region, not 
the providerId. A UT is also attached to reproduce the problem.
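
As a rough illustration of that direction (one possible reading, not the 
attached patch itself), the sketch below picks the group from the region. The 
class RegionBasedGroupSketch and its nested Region type are hypothetical, and 
the "meta" group name is an assumption taken from the name mentioned above.

```java
public class RegionBasedGroupSketch {
    // Assumed value of the meta WAL group name.
    public static final String META_WAL_GROUP_NAME = "meta";
    public static final String META_TABLE_NAME = "hbase:meta";

    // Hypothetical minimal region descriptor: table name plus encoded region name.
    public static final class Region {
        public final String table;
        public final String encodedName;

        public Region(String table, String encodedName) {
            this.table = table;
            this.encodedName = encodedName;
        }
    }

    // The group is decided from the region itself, independent of providerId.
    public static String selectGroup(Region region) {
        if (region != null && META_TABLE_NAME.equals(region.table)) {
            return META_WAL_GROUP_NAME;
        }
        // "identity" strategy for every other region; empty id when no region is given.
        return region == null ? "" : region.encodedName;
    }

    public static void main(String[] args) {
        Region meta = new Region("hbase:meta", "1588230740");
        Region ns = new Region("hbase:namespace", "31f6d3383af09e18e1e81ca02a93de15");
        System.out.println(selectGroup(meta));
        System.out.println(selectGroup(ns));
    }
}
```

With the group chosen this way, the meta region maps to the meta group whatever 
providerId the provider was initialized with.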

Please correct me if I missed something.

> Namespace region is never assigned in a HM failover scenario and HM abort 
> always due to init timeout
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21519
>                 URL: https://issues.apache.org/jira/browse/HBASE-21519
>             Project: HBase
>          Issue Type: Bug
>          Components: master, wal
>    Affects Versions: 2.1.1
>            Reporter: Pankaj Kumar
>            Assignee: Pankaj Kumar
>            Priority: Critical
>             Fix For: 2.2.0
>
>         Attachments: HBASE-21519.branch-2.patch
>
>
> In our test env we found that the namespace region is never assigned in an HM 
> failover scenario when the multiwal feature is enabled:
> {noformat}
> 2018-11-28 01:38:28,085 WARN [master/HM-1:16000:becomeActiveMaster] 
> master.HMaster: 
> hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. is NOT 
> online; state=\{31f6d3383af09e18e1e81ca02a93de15 state=OPEN, 
> ts=1543340156928, server=RS-2,16020,1543339824397}; 
> ServerCrashProcedures=false. Master startup cannot progress, in 
> holding-pattern until region onlined.
> {noformat}
> Finally, the HM aborts with the following error:
> {noformat}
> 2018-11-28 01:39:16,858 ERROR 
> [ActiveMasterInitializationMonitor-1543338648565] master.HMaster: Master 
> failed to complete initialization after 240000ms. Please consider submitting 
> a bug report including a thread dump of this process.
> 2018-11-28 01:39:18,980 ERROR 
> [ActiveMasterInitializationMonitor-1543338648565] master.HMaster: Zombie 
> Master exiting. Thread dump to stdout
> {noformat}
> Stack trace:
> {noformat}
> Thread 102 (master/HM-1:16000:becomeActiveMaster):
>  State: TIMED_WAITING
>  Blocked count: 100
>  Waited count: 246
>  Stack:
>  java.lang.Thread.sleep(Native Method)
>  org.apache.hadoop.hbase.util.Threads.sleep(Threads.java:148)
>  org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1166)
>  org.apache.hadoop.hbase.master.HMaster.waitForNamespaceOnline(HMaster.java:1187)
>  org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1044)
>  org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2285)
>  org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:590)
>  org.apache.hadoop.hbase.master.HMaster$$Lambda$40/1078246575.run(Unknown Source)
>  java.lang.Thread.run(Thread.java:745)
> {noformat}
>  
> Steps to reproduce:
>  1) Set up an HBase cluster with 1 or 2 HMs (say HM-1) and 2 RSs (say RS-1 & RS-2)
>  2) Enable the multiwal feature with the following configuration settings and 
> start the cluster:
> {noformat}
>  <property>
>    <name>hbase.wal.provider</name>
>    <value>multiwal</value>
>  </property>
>  <property>
>    <name>hbase.wal.regiongrouping.strategy</name>
>    <value>identity</value>
>  </property>
> {noformat}
>  3) Make sure the meta and namespace regions are assigned to different RSs, 
> say RS-1 & RS-2 respectively.
>  4) Create table 't1'
>  5) Flush the meta table explicitly
>  6) Kill RS-2, so that during RS-2 SCP all its regions, including the 
> namespace region, are assigned to RS-1.
>  7) Now kill RS-1 before a meta flush happens. Both RS-2 & RS-1 are now shut 
> down.
>  8) Stop the HM and start RS-1 & RS-2.
>  9) Now start the HM.
> The meta region is assigned successfully, but the HM keeps waiting for the 
> namespace region to come online (Master startup cannot progress, in 
> holding-pattern until region onlined) and aborts with a timeout.
> Observation:
>  1) After step-3 the namespace region was assigned to RS-2 and the meta entry 
> was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339860920, value=RS-2:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339860920, value=1543339824397
> {noformat}
>  2) After step-6 the namespace region was assigned to RS-1 and the meta entry 
> was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339880920, value=RS-1:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339880920, value=1543339829288
> {noformat}
>  3) After step-9, the meta entry for the namespace region was as follows:
> {noformat}
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:server, timestamp=1543339860920, value=RS-2:16020
>  hbase:namespace,,1543339859614.31f6d3383af09e18e1e81ca02a93de15. 
> column=info:serverstartcode, timestamp=1543339860920, value=1543339824397
> {noformat}
> During SCP we split the meta log based on a filter:
> {noformat}
>  /**
>   * Specialized method to handle the splitting for meta WAL
>   * @param serverNames logs belonging to these servers will be split
>   */
>  public void splitMetaLog(final Set<ServerName> serverNames) throws IOException {
>    splitLog(serverNames, META_FILTER);
>  }
> {noformat}
> So in this case the meta log split is skipped with the multiwal provider, 
> because the suffix won't be .meta.
> I will analyze it further.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
