[ https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin reassigned HBASE-22289:
----------------------------------------

    Assignee: Sergey Shelukhin

> WAL-based log splitting resubmit threshold results in a task being stuck forever
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-22289
>                 URL: https://issues.apache.org/jira/browse/HBASE-22289
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.1.0, 1.5.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>             Fix For: 2.1.5
>
>
> Not sure if this is handled better in procedure based WAL splitting; in any
> case it affects versions before that.
> The problem is not in ZK as such but in internal state tracking in master, it
> seems.
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO [master/<master>:17000.splitLogManager..Chore.1] coordination.SplitLogManagerCoordination: Resubmitting task <path>.1555831286638
> {noformat}
> worker-rs, split fails
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter: Processed 24 edits across 2 regions; edits skipped=457; log file=<path>.1555831286638, length=2156363702, corrupted=false, progress failed=true
> {noformat}
> Master (not sure about the delay of the acquired-message; at any rate it
> seems to detect the failure fine from this server)
> {noformat}
> 2019-04-21 02:11:14,928 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO [master/<master>:17000.splitLogManager..Chore.1] coordination.SplitLogManagerCoordination: Skipping resubmissions of task <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that this task is stuck in the limbo forever with the old worker, and
> never resubmitted.
> RS never logs anything else for this task.
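
The "Skipping resubmissions ... because threshold 3 reached" message, followed by a task that never moves again, is consistent with a resubmit path that returns early at the threshold and leaves the rest of the task state untouched. The sketch below is a deliberately simplified model of that behaviour, not the HBase code: the class `ResubmitThresholdSketch`, its `Task` fields, and the `resubmit` method are all hypothetical names chosen to mirror the fields seen in the logs.

```java
/** Hypothetical, simplified model of master-side task tracking; not the actual HBase classes. */
public class ResubmitThresholdSketch {
    // Matches "threshold 3" in the master log; the real value is configurable.
    static final int RESUBMIT_THRESHOLD = 3;

    static class Task {
        String curWorkerName;           // worker that last acquired the task
        int resubmits;                  // resubmissions counted so far
        String status = "in_progress";
    }

    /** Returns true if the task was handed back so another worker can pick it up. */
    static boolean resubmit(Task task) {
        if (task.resubmits >= RESUBMIT_THRESHOLD) {
            // Threshold reached: resubmission is skipped and nothing else changes,
            // so curWorkerName still names the old worker and status stays in_progress.
            return false;
        }
        task.resubmits++;
        task.curWorkerName = null;      // task becomes unassigned again
        return true;
    }

    public static void main(String[] args) {
        Task task = new Task();
        // Four acquire/fail cycles: the first three are resubmitted, the fourth is skipped.
        for (int attempt = 1; attempt <= 4; attempt++) {
            task.curWorkerName = "worker-rs";
            boolean resubmitted = resubmit(task);
            System.out.println("attempt " + attempt
                + " resubmitted=" + resubmitted
                + " cur_worker_name=" + task.curWorkerName
                + " resubmits=" + task.resubmits
                + " status=" + task.status);
        }
    }
}
```

In this model the fourth cycle leaves the task assigned to `worker-rs` with `resubmits` at 3 and no further transition available, which is the shape of the stuck state reported here. Whether the real code path looks like this is an assumption.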
> Killing the RS on the worker unblocked the task and some other server did the
> split very quickly, so seems like master doesn't clear the worker name in its
> internal state when hitting the threshold... master never restarted so
> restarting the master might have also cleared it.
> This is extracted from splitlogmanager log messages, note the times.
> {noformat}
> 2019-04-21 02:2 1555831286638=last_update = 1555837874928 last_version = 11 cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20,
> ....
> 2019-04-22 11:1 1555831286638=last_update = 1555837874928 last_version = 11 cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
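
For readers checking "note the times": `last_update` in the dump is an epoch-millisecond value, and decoding it shows which log line it corresponds to. The snippet below is only a sketch of that check. The class and method names are made up for illustration, and the UTC-7 offset is an assumption about the timezone the cluster logs were written in, not something stated in the report.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Illustrative helper for decoding the last_update field; names are hypothetical. */
public class LastUpdateSketch {
    /** Epoch milliseconds rendered as an ISO-8601 UTC instant. */
    static String toUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    /** Epoch milliseconds rendered as wall-clock time at a fixed whole-hour UTC offset. */
    static LocalDateTime toLocal(long epochMillis, int offsetHours) {
        return Instant.ofEpochMilli(epochMillis)
            .atOffset(ZoneOffset.ofHours(offsetHours))
            .toLocalDateTime();
    }

    public static void main(String[] args) {
        long lastUpdate = 1555837874928L;   // last_update from both dump entries
        System.out.println(toUtc(lastUpdate));          // 2019-04-21T09:11:14.928Z
        // Assumption: logs are in UTC-7.
        System.out.println(toLocal(lastUpdate, -7));    // 2019-04-21T02:11:14.928
    }
}
```

Under that timezone assumption the value lands exactly on the 02:11:14,928 "acquired by" message, and the minutes, seconds, and milliseconds match for any whole-hour offset. Both dump entries, more than a day apart, carry the same `last_update`, so the master recorded no change to the task after that acquisition.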