[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-10-11 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: (was: 2.6.0)
   (was: 3.0.0-beta-1)

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --
>
> Key: HBASE-27382
> URL: https://issues.apache.org/jira/browse/HBASE-27382
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.0, 1.7.2, 2.4.14
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>
> We are running a version based on 1.7.2 in our production environment and we 
> encountered this issue recently.
> We colocate the namenode and the region server holding the hbase:meta table on a set 
> of 5 master nodes. Coincidentally, the active namenode and the region server holding 
> the meta table were on the same physical node, and that node went down due to a 
> hardware issue. We have suboptimal HDFS-level timeouts configured, so whenever the 
> active namenode goes down it takes around 12-15 minutes for the HDFS client within 
> HBase to connect to the new active namenode. As a result, all the region servers had 
> problems connecting to the new active namenode for about 15 minutes.
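> For context, how quickly the HDFS client inside HBase fails over is governed by the 
> standard HDFS client failover and IPC settings. Below is a minimal sketch with the 
> stock defaults (illustrative values only, not the production configuration described 
> above). Note that with ipc.client.connect.timeout=20s and 
> ipc.client.connect.max.retries.on.timeouts=45, a single connection attempt to an 
> unreachable namenode can take 45 x 20s = 15 minutes to fail, which is consistent 
> with the delay described above:
> {noformat}
> <!-- hdfs-site.xml / core-site.xml, client side (illustrative defaults) -->
> <property>
>   <!-- Retries of the whole failover sequence before the client gives up -->
>   <name>dfs.client.failover.max.attempts</name>
>   <value>15</value>
> </property>
> <property>
>   <!-- Exponential backoff between failover attempts: base and cap (ms) -->
>   <name>dfs.client.failover.sleep.base.millis</name>
>   <value>500</value>
> </property>
> <property>
>   <name>dfs.client.failover.sleep.max.millis</name>
>   <value>15000</value>
> </property>
> <property>
>   <!-- Per-attempt connect timeout to a namenode (ms) -->
>   <name>ipc.client.connect.timeout</name>
>   <value>20000</value>
> </property>
> <property>
>   <!-- Connect attempts on timeout before the connection is declared failed -->
>   <name>ipc.client.connect.max.retries.on.timeouts</name>
>   <value>45</value>
> </property>
> {noformat}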
> Below is the sequence of events:
> 1. The host running the active namenode and hbase:meta went down at +2022-09-09 
> 16:56:56,878+.
> 2. HMaster started running a ServerCrashProcedure at +2022-09-09 16:59:05,696+:
> {noformat}
> 2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] 
> procedure2.ProcedureExecutor - Procedure ServerCrashProcedure 
> serverName=,61020,1662714013670, shouldSplitWal=true, 
> carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to 
> the store.
> 2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - 
> Added=,61020,1662714013670 to dead servers, submitted shutdown 
> handler to be executed meta=true
> 2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - 
> Started processing ,61020,1662714013670; numProcessing=1
> 2022-09-09 16:59:05,712 INFO  [ProcedureExecutor-0] 
> procedure.ServerCrashProcedure - Start processing crashed 
> ,61020,1662714013670
> {noformat}
> 3. SplitLogManager created 2 split log tasks in ZooKeeper:
> {noformat}
> 2022-09-09 16:59:06,049 INFO  [ProcedureExecutor-1] master.SplitLogManager - 
> Started splitting 2 logs in 
> [hdfs:///hbase/WALs/,61020,1662714013670-splitting]
>  for [,61020,1662714013670]
> 2022-09-09 16:59:06,081 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - put up splitlog task at znode 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 16:59:06,093 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - put up splitlog task at znode 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662739251611.meta
> {noformat}
> 4. The first split log task is the more interesting of the two: 
> +/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta+
> 5. Since all the region servers were having problems connecting to the active 
> namenode, the SplitLogManager tried a total of 4 times to assign this task (the 
> initial attempt plus 3 resubmits, governed by hbase.splitlog.max.resubmit; see the 
> configuration sketch after the log excerpt below) and then finally gave up.
> {noformat}
> -- try 1 -
> 2022-09-09 16:59:06,205 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by ,61020,1662540522069
> -- try 2 -
> 2022-09-09 17:01:06,642 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - resubmitting task 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:01:06,666 DEBUG [main-EventThread] 
> coordination.SplitLogManagerCoordination - task not yet acquired 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
>  ver = 2
> 2022-09-09 17:01:06,715 INFO  [main-EventThread] 
> coordination.SplitLogManagerCoordination - task 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
>  acquired by ,61020,1662530684713
> -- try 3 -
> 2022-09-09 17:03:07,643 INFO  [ager__ChoreService_1] 
> coordination.SplitLogManagerCoordination - resubmitting task 
> /hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:03:07,687 DEBUG [main-EventThread] 
> 
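> The resubmission behaviour above is controlled by two SplitLogManager settings, shown 
> here with what should be their stock defaults (illustrative only; the production 
> values may differ). The roughly two-minute gap between resubmits in the log excerpt 
> matches the manager timeout:
> {noformat}
> <!-- hbase-site.xml (illustrative defaults) -->
> <property>
>   <!-- Number of times a split task is resubmitted before the manager gives up -->
>   <name>hbase.splitlog.max.resubmit</name>
>   <value>3</value>
> </property>
> <property>
>   <!-- How long a worker may sit on a task before it is resubmitted (ms) -->
>   <name>hbase.splitlog.manager.timeout</name>
>   <value>120000</value>
> </property>
> {noformat}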

[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-10-11 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: (was: 2.4.18)
   (was: 2.5.6)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-05-25 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.5.6
   (was: 2.5.5)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-03-30 Thread Tak-Lon (Stephen) Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tak-Lon (Stephen) Wu updated HBASE-27382:
-
Fix Version/s: 2.4.18
   (was: 2.4.17)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-03-27 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.5.5
   (was: 2.5.4)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-01-29 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-27382:
--
Fix Version/s: 2.4.17
   (was: 2.4.16)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-01-28 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-27382:
--
Fix Version/s: 2.4.16
   (was: 2.4.17)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-01-17 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.5.4
   (was: 2.5.3)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2023-01-17 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.4.17
   (was: 2.4.16)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-10-06 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.4.16
   (was: 2.4.15)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-10-04 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.5.2
   (was: 2.5.1)


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-09-21 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Fix Version/s: 2.6.0
   2.5.1
   3.0.0-alpha-4
   2.4.15


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-09-21 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27382:

Affects Version/s: 2.4.14
   2.5.0
   (was: 1.6.0)
   (was: 1.7.1)

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --
>
> Key: HBASE-27382
> URL: https://issues.apache.org/jira/browse/HBASE-27382
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.0, 1.7.2, 2.4.14
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>

[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-09-21 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-27382:
-
Affects Version/s: 1.7.1
   1.6.0

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --
>
> Key: HBASE-27382
> URL: https://issues.apache.org/jira/browse/HBASE-27382
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.6.0, 1.7.1, 1.7.2
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>

[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-09-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-27382:
-
Description: 
We are running some version of 1.7.2 in our production environment and 
encountered this issue recently.
We colocate the namenode and the region server holding the hbase:meta table on 
a set of 5 master nodes. Coincidentally, the active namenode and the region 
server holding the meta table were on the same physical node, and that node 
went down due to a hardware issue. We have suboptimal HDFS-level timeouts 
configured, so whenever the active namenode goes down it takes around 12-15 
minutes for the HDFS client within HBase to connect to the new active 
namenode. As a result, all the region servers had trouble connecting to the 
new active namenode for about 15 minutes.
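
For context, this window is governed by the HDFS client rather than by HBase. 
With stock Hadoop defaults, a client whose current namenode host dies without 
sending TCP resets can spend roughly ipc.client.connect.timeout (20 s) x 
ipc.client.connect.max.retries.on.timeouts (45) = 15 minutes retrying the dead 
address before failing over, which is the usual explanation for a 12-15 minute 
failover window like this one. A hedged sketch of the client-side settings 
commonly tuned to shorten it; the values below are illustrative examples, not 
our production configuration:
{noformat}
<!-- core-site.xml: per-address connect behaviour of the Hadoop IPC client -->
<property>
  <name>ipc.client.connect.timeout</name>
  <value>5000</value>   <!-- default 20000 ms -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>5</value>      <!-- default 45 -->
</property>

<!-- hdfs-site.xml: how aggressively the HA proxy provider fails over -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>     <!-- default 15 -->
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>    <!-- default 500 -->
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>  <!-- default 15000 -->
</property>
{noformat}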

Below is the sequence of events:

1. The host running the active namenode and hbase:meta went down at 
+2022-09-09 16:56:56,878+
2. HMaster started running ServerCrashProcedure at +2022-09-09 16:59:05,696+
{noformat}
2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] 
procedure2.ProcedureExecutor - Procedure ServerCrashProcedure 
serverName=,61020,1662714013670, shouldSplitWal=true, 
carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to 
the store.

2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - 
Added=,61020,1662714013670 to dead servers, submitted shutdown 
handler to be executed meta=true

2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - Started 
processing ,61020,1662714013670; numProcessing=1
2022-09-09 16:59:05,712 INFO  [ProcedureExecutor-0] 
procedure.ServerCrashProcedure - Start processing crashed 
,61020,1662714013670
{noformat}

3. SplitLogManager created 2 split log tasks in ZooKeeper.

{noformat}
2022-09-09 16:59:06,049 INFO  [ProcedureExecutor-1] master.SplitLogManager - 
Started splitting 2 logs in 
[hdfs:///hbase/WALs/,61020,1662714013670-splitting]
 for [,61020,1662714013670]

2022-09-09 16:59:06,081 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - put up splitlog task at znode 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 16:59:06,093 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - put up splitlog task at znode 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662739251611.meta
{noformat}
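
While splitting is in progress these tasks can be inspected directly. A quick 
sketch using the ZooKeeper shell bundled with HBase, assuming the default 
znode layout that the paths above suggest; <task> is a placeholder for one of 
the child znode names shown in the log:
{noformat}
# open the ZooKeeper CLI that ships with HBase, pointed at the cluster quorum
hbase zkcli
  ls /hbase/splitWAL            # one child znode per outstanding split task
  get /hbase/splitWAL/<task>    # shows the current owner and task state
                                # (UNASSIGNED / OWNED / RESIGNED / DONE / ERR)
{noformat}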


4. The first split log task is more interesting: 
+/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta+

5. Since all the region servers were having problems connecting to the active 
namenode, SplitLogManager tried a total of 4 times to assign this task (one 
initial attempt plus 3 resubmits, configured by hbase.splitlog.max.resubmit) 
and then finally gave up.
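
The cadence visible in the log excerpt below lines up with two SplitLogManager 
settings: a task is resubmitted once it has made no progress within 
hbase.splitlog.manager.timeout (which, as far as we can tell, defaults to 
120000 ms, hence the roughly 2-minute gaps), and resubmission stops after 
hbase.splitlog.max.resubmit attempts. A hedged sketch, assuming the defaults 
were in effect:
{noformat}
<!-- hbase-site.xml -->
<property>
  <name>hbase.splitlog.manager.timeout</name>
  <value>120000</value>  <!-- task considered stuck after 2 minutes -->
</property>
<property>
  <name>hbase.splitlog.max.resubmit</name>
  <value>3</value>       <!-- 1 initial assignment + 3 resubmits = 4 tries -->
</property>
{noformat}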

{noformat}
-- try 1 -
2022-09-09 16:59:06,205 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662540522069

-- try 2 -

2022-09-09 17:01:06,642 INFO  [ager__ChoreService_1] 
coordination.SplitLogManagerCoordination - resubmitting task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 17:01:06,666 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - task not yet acquired 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 ver = 2

2022-09-09 17:01:06,715 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662530684713

-- try 3 -

2022-09-09 17:03:07,643 INFO  [ager__ChoreService_1] 
coordination.SplitLogManagerCoordination - resubmitting task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta

2022-09-09 17:03:07,687 DEBUG [main-EventThread] 
coordination.SplitLogManagerCoordination - task not yet acquired 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 ver = 4

2022-09-09 17:03:07,738 INFO  [main-EventThread] 
coordination.SplitLogManagerCoordination - task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta
 acquired by ,61020,1662542355806


-- try 4 -
2022-09-09 17:05:08,684 INFO  [ager__ChoreService_1] 
coordination.SplitLogManagerCoordination - resubmitting task 
/hbase/splitWAL/WALs%2F%2C61020%2C1662714013670-splitting%2F%252C61020%252C1662714013670.meta.1662735651285.meta


[jira] [Updated] (HBASE-27382) Cluster completely down due to WAL splitting failing for hbase:meta table.

2022-09-21 Thread Rushabh Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HBASE-27382:
-
Summary: Cluster completely down due to WAL splitting failing for 
hbase:meta table.  (was: Cluster completely down due to wal splitting failing 
for hbase:meta table.)

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --
>
> Key: HBASE-27382
> URL: https://issues.apache.org/jira/browse/HBASE-27382
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.7.2
>Reporter: Rushabh Shah
>Assignee: Rushabh Shah
>Priority: Major
>