Sergey Shelukhin created HIVE-21676:
---
Summary: use a system table as an alternative proc store
Key: HIVE-21676
URL: https://issues.apache.org/jira/browse/HIVE-21676
Project: Hive
Issue Type: Bug
Reporter: Sergey Shelukhin
We keep hitting these issues:
{noformat}
2019-04-30 23:41:52,164 INFO [master/master:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Starting 16 core workers (bigger of cpus/4 or 16)
with max (burst) worker count=160
2019-04-30 23:41:52,171 INFO [master/master:17000:becomeActiveMaster]
util.FSHDFSUtils: Recover lease on dfs file
.../MasterProcWALs/pv2-0481.log
2019-04-30 23:41:52,176 INFO [master/master:17000:becomeActiveMaster]
util.FSHDFSUtils: Recovered lease, attempt=0 on
file=.../MasterProcWALs/pv2-0481.log after 5ms
2019-04-30 23:41:52,288 INFO [master/master:17000:becomeActiveMaster]
util.FSHDFSUtils: Recover lease on dfs file
.../MasterProcWALs/pv2-0482.log
2019-04-30 23:41:52,289 INFO [master/master:17000:becomeActiveMaster]
util.FSHDFSUtils: Recovered lease, attempt=0 on
file=.../MasterProcWALs/pv2-0482.log after 1ms
2019-04-30 23:41:52,373 INFO [master/master:17000:becomeActiveMaster]
wal.WALProcedureStore: Rolled new Procedure Store WAL, id=483
2019-04-30 23:41:52,375 INFO [master/master:17000:becomeActiveMaster]
procedure2.ProcedureExecutor: Recovered WALProcedureStore lease in 206msec
2019-04-30 23:41:52,782 INFO [master/master:17000:becomeActiveMaster]
wal.ProcedureWALFormatReader: Read 1556 entries in
.../MasterProcWALs/pv2-0482.log
2019-04-30 23:41:55,370 INFO [master/master:17000:becomeActiveMaster]
wal.ProcedureWALFormatReader: Read 28113 entries in
.../MasterProcWALs/pv2-0481.log
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster]
wal.WALProcedureTree: Missing stack id 166, max stack id is 181, root procedure
is Procedure(pid=289380, ppid=-1,
class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,384 ERROR [master/master:17000:becomeActiveMaster]
wal.WALProcedureTree: Missing stack id 178, max stack id is 181, root procedure
is Procedure(pid=289380, ppid=-1,
class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
2019-04-30 23:41:55,389 ERROR [master/master:17000:becomeActiveMaster]
wal.WALProcedureTree: Missing stack id 359, max stack id is 360, root procedure
is Procedure(pid=285640, ppid=-1,
class=org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure)
{noformat}
After this, the affected procedures are lost and the cluster is stuck permanently.
The log shows no errors writing these files and no issues reading them
from HDFS, so it's purely a data loss issue in the store's own structure.
I was thinking about debugging it, but on second thought, what we are trying to
do is store PB state by key.
Coincidentally, we have an "HBase" facility that we already deploy, that does
just that... and it even has a WAL implementation. I don't know why we cannot
use it for procedure state and have to invent another complex implementation of
a KV store inside a KV store.
In all/most cases, we don't even support rollback and use the latest state, but
if we need multiple versions, this HBase product even supports that!
I think we should add an hbase:proc table that would be maintained similarly to
meta. Maintaining it, especially given the existing code for meta, should be much
simpler than a separate store impl.
This should be pluggable and optional via the ProcStore interface (made more
abstract as relevant: update state, scan state, get).
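
To make the proposed shape concrete, here is a rough sketch of what a more abstract store interface with just those three operations could look like. Everything in it is an assumption for illustration: the method signatures, the InMemoryProcStore class, and the choice of a sorted map as a stand-in for an hbase:proc table are hypothetical and do not reflect the existing ProcedureStore code.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.BiConsumer;

// Hypothetical interface: procedure state is an opaque serialized blob
// (e.g. PB bytes) keyed by procedure id.
interface ProcStore {
    // Overwrite the stored state for a procedure; only the latest state is kept.
    void update(long procId, byte[] serializedState);

    // Fetch the latest state for one procedure, if any.
    Optional<byte[]> get(long procId);

    // Visit every stored procedure in ascending id order (used on master startup).
    void scan(BiConsumer<Long, byte[]> visitor);
}

// In-memory stand-in for a table-backed store. A real implementation would
// issue puts, gets, and scans against a system table instead of a local map.
final class InMemoryProcStore implements ProcStore {
    private final ConcurrentSkipListMap<Long, byte[]> rows = new ConcurrentSkipListMap<>();

    @Override
    public void update(long procId, byte[] serializedState) {
        // Copy so later changes to the caller's array do not alter stored state.
        rows.put(procId, serializedState.clone());
    }

    @Override
    public Optional<byte[]> get(long procId) {
        byte[] state = rows.get(procId);
        return state == null ? Optional.empty() : Optional.of(state.clone());
    }

    @Override
    public void scan(BiConsumer<Long, byte[]> visitor) {
        rows.forEach((id, state) -> visitor.accept(id, state.clone()));
    }
}
```

With this shape, recovery is a single scan over the rows rather than a reconstruction of procedure stacks from multiple WAL files, since each row already holds the latest state for its procedure.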