[
https://issues.apache.org/jira/browse/MESOS-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395290#comment-17395290
]
Charles Natali commented on MESOS-10198:
----------------------------------------
Hi [~kiranjshetty], sorry for the delay, I know it's been a while.
{noformat}
Nov 12 08:36:49 servername mesos-master[20037]: mesos-master:
./db/skiplist.h:344: void leveldb::SkipList<Key, Comparator>::Insert(const
Key&) [with Key = const char*; Comparator = leveldb::MemTable::KeyComparator]:
Assertion `x == __null || !Equal(key, x->key)' failed.
{noformat}
This points to a corruption of the on-disk leveldb database. It's been a long
time, but do you remember:
- was this specific error present in all the masters' logs?
- did the hosts maybe crash prior to that?
- I guess it's too late now, but it would have been interesting to see the log
from the first time the masters crashed.
Looking at our code, it's not clear to me how we could introduce a leveldb
corruption - the only possibilities I can think of are a leveldb bug, or some
unrelated code ending up writing to the leveldb file descriptors under specific
conditions, which could cause such a corruption.
But having it occur across all masters seems very unlikely.
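In case it helps, here is a minimal sketch of how one could check a replica's
leveldb store for corruption offline, using the plain leveldb C++ API with
paranoid checks and checksum-verified reads. The program and the default path
are assumptions for illustration - point it at the replicated_log directory
under the master's --work_dir, and stop the master first so nothing else has
the database open:
{noformat}
// check_replica_db.cpp (hypothetical helper, not part of Mesos)
// Build: g++ -std=c++11 check_replica_db.cpp -lleveldb -o check_replica_db
#include <iostream>
#include <string>

#include <leveldb/db.h>
#include <leveldb/iterator.h>
#include <leveldb/options.h>

int main(int argc, char** argv) {
  // Default path is an assumption; pass the real <work_dir>/replicated_log.
  const std::string path =
      argc > 1 ? argv[1] : "/var/lib/mesos/replicated_log";

  leveldb::Options options;
  // paranoid_checks makes leveldb treat detected corruption as a hard error
  // during open/recovery instead of silently ignoring it.
  options.paranoid_checks = true;

  leveldb::DB* db = nullptr;
  leveldb::Status status = leveldb::DB::Open(options, path, &db);
  if (!status.ok()) {
    std::cerr << "Open failed: " << status.ToString() << std::endl;
    return 1;
  }

  // Walk every key with checksum verification to force all sstables and the
  // write-ahead log to be read back.
  leveldb::ReadOptions read_options;
  read_options.verify_checksums = true;
  leveldb::Iterator* it = db->NewIterator(read_options);

  size_t keys = 0;
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    ++keys;
  }

  if (!it->status().ok()) {
    std::cerr << "Scan failed: " << it->status().ToString() << std::endl;
  } else {
    std::cout << "Scanned " << keys << " keys without errors." << std::endl;
  }

  delete it;
  delete db;
  return 0;
}
{noformat}
If the open or the full scan reports a corruption status, that would confirm
the on-disk store itself is damaged independently of Mesos. In that case I'd
expect wiping the corrupted replicated_log directory and letting that replica
catch up from the others to be the way out, but only while a quorum of healthy
replicas exists - please double-check before trying it.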
> Mesos-master service is activating state
> ----------------------------------------
>
> Key: MESOS-10198
> URL: https://issues.apache.org/jira/browse/MESOS-10198
> Project: Mesos
> Issue Type: Task
> Affects Versions: 1.9.0
> Reporter: Kiran J Shetty
> Priority: Major
>
> The mesos-master service is stuck in the activating state on all 3 master
> nodes, which in turn is making Marathon restart frequently. In the logs I can
> see the entries below.
> Mesos-master logs:
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a864206a9
> mesos::internal::log::ReplicaProcess::ReplicaProcess()
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a86420854
> mesos::internal::log::Replica::Replica()
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6a65
> mesos::internal::log::LogProcess::LogProcess()
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6e34
> mesos::log::Log::Log()
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a3ec72 main
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a82075555
> __libc_start_main
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a40d0a (unknown)
> Nov 12 08:36:29 servername systemd[1]: mesos-master.service: main process
> exited, code=killed, status=6/ABRT
> Nov 12 08:36:29 servername systemd[1]: Unit mesos-master.service entered
> failed state.
> Nov 12 08:36:29 servername systemd[1]: mesos-master.service failed.
> Nov 12 08:36:49 servername systemd[1]: mesos-master.service holdoff time
> over, scheduling restart.
> Nov 12 08:36:49 servername systemd[1]: Stopped Mesos Master.
> Nov 12 08:36:49 servername systemd[1]: Started Mesos Master.
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.633597 20024
> logging.cpp:201] INFO level logging started!
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634446 20024
> main.cpp:243] Build: 2019-10-21 12:10:14 by centos
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634460 20024
> main.cpp:244] Version: 1.9.0
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634466 20024
> main.cpp:247] Git tag: 1.9.0
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634470 20024
> main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.636653 20024
> main.cpp:345] Using 'hierarchical' allocator
> Nov 12 08:36:49 servername mesos-master[20037]: mesos-master:
> ./db/skiplist.h:344: void leveldb::SkipList<Key, Comparator>::Insert(const
> Key&) [with Key = const char*; Comparator =
> leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key,
> x->key)' failed.
> Nov 12 08:36:49 servername mesos-master[20037]: *** Aborted at 1605150409
> (unix time) try "date -d @1605150409" if you are using GNU date ***
> Nov 12 08:36:49 servername mesos-master[20037]: PC: @ 0x7fdee16ed387
> __GI_raise
> Nov 12 08:36:49 servername mesos-master[20037]: *** SIGABRT (@0x4e38)
> received by PID 20024 (TID 0x7fdee720ea00) from PID 20024; stack trace: ***
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee1fb2630 (unknown)
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16ed387 __GI_raise
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16eea78 __GI_abort
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e61a6
> __assert_fail_base
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e6252
> __GI___assert_fail
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3dc2
> leveldb::SkipList<>::Insert()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3735
> leveldb::MemTable::Add()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00168
> leveldb::WriteBatch::Iterate()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00424
> leveldb::WriteBatchInternal::InsertInto()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5ce8575
> leveldb::DBImpl::RecoverLogFile()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec0fc
> leveldb::DBImpl::Recover()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec3fa
> leveldb::DB::Open()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a0f877
> mesos::internal::log::LevelDBStorage::restore()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a817a2
> mesos::internal::log::ReplicaProcess::restore()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a846a9
> mesos::internal::log::ReplicaProcess::ReplicaProcess()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a84854
> mesos::internal::log::Replica::Replica()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a1aa65
> mesos::internal::log::LogProcess::LogProcess()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a1ae34
> mesos::log::Log::Log()
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x559ab80e0c72 main
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16d9555
> __libc_start_main
> Nov 12 08:36:49 servername mesos-master[20037]: @ 0x559ab80e2d0a (unknown)
> Nov 12 08:36:49 servername systemd[1]: mesos-master.service: main process
> exited, code=killed, status=6/ABRT
> Nov 12 08:36:49 servername systemd[1]: Unit mesos-master.service entered
> failed state.
> Nov 12 08:36:49 servername systemd[1]: mesos-master.service failed.
>
>
> Marathon logs:
> Nov 12 08:09:44 servername marathon[25752]: *
> Actor[akka://marathon/user/reviveOffers#-1362265983|#-1362265983]
> (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-19)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,103] ERROR
> Lost leadership; crashing
> (mesosphere.marathon.core.election.ElectionServiceImpl:marathon-akka.actor.default-dispatcher-25)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,104] INFO
> ExpungeOverdueLostTasksActor has stopped
> (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-15)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,112] INFO
> shutting down with exit code 103
> (mesosphere.marathon.core.base.RichRuntime:scala-execution-context-global-101)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,117] INFO
> Suspending scheduler actor
> (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-20)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,117] ERROR
> Unhandled message in suspend: class
> mesosphere.marathon.core.launchqueue.impl.RateLimiterActor$Unsubscribe$
> (mesosphere.marathon.core.leadership.impl.WhenLeaderActor:marathon-akka.actor.default-dispatcher-21)
> Nov 12 08:09:44 servername marathon[25752]: [2020-11-12 08:09:44,121] INFO
> Now standing by. Closing existing handles and rejecting new.
> (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-6)
> Nov 12 08:09:44 servername systemd[1]: marathon.service: main process
> exited, code=exited, status=103/n/a
> Nov 12 08:09:44 servername systemd[1]: Unit marathon.service entered failed
> state.
> Nov 12 08:09:44 servername systemd[1]: marathon.service failed.