[ https://issues.apache.org/jira/browse/MESOS-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453459#comment-16453459 ]

Chun-Hung Hsiao commented on MESOS-8839:
----------------------------------------

I did some digging in the leveldb code and believe the crash is caused by not 
properly cleaning up {{LevelDBStorage}} instances.

The error message {{lock <meta_slave_dir>/resource_provider_registry/LOCK 
already held by process}} comes from 
https://github.com/google/leveldb/blob/master/util/env_posix.cc#L532,
which is triggered when the {{PosixLockTable}} already contains the lock 
filename: https://github.com/google/leveldb/blob/master/util/env_posix.cc#L378
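
Condensed and lightly paraphrased from the linked file (exact names and details 
may differ between leveldb versions), the relevant logic is roughly:
{noformat}
// util/env_posix.cc (simplified paraphrase)
class PosixLockTable {
 public:
  bool Insert(const std::string& fname) {
    MutexLock l(&mu_);
    return locked_files_.insert(fname).second;  // false if already present
  }
  void Remove(const std::string& fname) {
    MutexLock l(&mu_);
    locked_files_.erase(fname);
  }
 private:
  port::Mutex mu_;
  std::set<std::string> locked_files_;
};

Status PosixEnv::LockFile(const std::string& fname, FileLock** lock) {
  ...
  if (!locks_.Insert(fname)) {
    // The filename is already in this process' lock table, regardless of
    // which DB instance inserted it.
    result = Status::IOError("lock " + fname, "already held by process");
  }
  ...
}
{noformat}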

The {{PosixLockTable}} is managed by {{Env}}, which defaults to a singleton:
https://github.com/google/leveldb/blob/master/include/leveldb/options.h#L63
https://github.com/google/leveldb/blob/master/util/env_posix.cc#L752
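
So unless a caller passes a custom {{Env}} through {{Options::env}} (Mesos does 
not), every database opened in the process shares one {{PosixEnv}} and hence one 
lock table, roughly:
{noformat}
// util/env_posix.cc (simplified paraphrase)
Env* Env::Default() {
  static PosixEnv* default_env = new PosixEnv();  // one Env (and one
  return default_env;                             // PosixLockTable) per process
}

// include/leveldb/options.h: Options::env defaults to Env::Default().
{noformat}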

This means that the Linux process running the tests contains a single 
{{PosixLockTable}}, and the same lock filename 
({{<meta_slave_dir>/resource_provider_registry/LOCK}}) had already been added 
earlier. I do observe that many crashes happen in tests that involve restarting 
agents, so apparently something is not cleaned up when we restart the slave.

Let's understand more about when the lock filename is added to the singleton 
{{PosixLockTable}} and when it is removed.
When we open a database, the lock filename is added through the 
{{DBImpl::Recover()}} function:
https://github.com/google/leveldb/blob/master/db/db_impl.cc#L1508
https://github.com/google/leveldb/blob/master/db/db_impl.cc#L287
The lock filename is only removed when either the DB instance is destructed or 
the caller destroys the database:
https://github.com/google/leveldb/blob/master/db/db_impl.cc#L161
https://github.com/google/leveldb/blob/master/db/db_impl.cc#L1570
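
In other words (again paraphrasing the linked lines), the lock's lifetime is 
tied to the lifetime of the {{DBImpl}} instance:
{noformat}
// db/db_impl.cc (simplified paraphrase)
Status DBImpl::Recover(VersionEdit* edit, bool* save_manifest) {
  ...
  Status s = env_->LockFile(LockFileName(dbname_), &db_lock_);  // adds to the lock table
  ...
}

DBImpl::~DBImpl() {
  ...
  if (db_lock_ != NULL) {
    env_->UnlockFile(db_lock_);  // removes the filename from the lock table
  }
}
{noformat}
({{DestroyDB()}} performs the equivalent unlock when the caller destroys the 
database.)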

So my hypothesis is that when the agent is restarted, the previous 
{{LevelDBStorage}},
and hence its {{LevelDBStorageProcess}}, are not destructed:
https://github.com/apache/mesos/blob/master/src/state/leveldb.cpp#L89
https://github.com/apache/mesos/blob/master/src/state/leveldb.cpp#L289
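
The underlying leveldb behavior is easy to demonstrate outside of Mesos with a 
small standalone program against the public leveldb API (the database path 
below is just a placeholder): a second open of the same database from the same 
process fails with exactly this error until the first handle is deleted.
{noformat}
// Standalone demo (not Mesos code): the second open fails because ~DBImpl is
// what removes the lock filename from the process-wide lock table.
#include <iostream>

#include <leveldb/db.h>

int main() {
  leveldb::Options options;
  options.create_if_missing = true;

  leveldb::DB* db1 = NULL;
  leveldb::DB* db2 = NULL;

  leveldb::Status s = leveldb::DB::Open(options, "/tmp/lock_demo", &db1);
  std::cout << "first open:  " << s.ToString() << std::endl;  // OK

  s = leveldb::DB::Open(options, "/tmp/lock_demo", &db2);
  std::cout << "second open: " << s.ToString() << std::endl;
  // IO error: lock /tmp/lock_demo/LOCK: already held by process

  delete db1;  // Unlocks the file and removes it from the lock table.

  s = leveldb::DB::Open(options, "/tmp/lock_demo", &db2);
  std::cout << "third open:  " << s.ToString() << std::endl;  // OK
  delete db2;

  return 0;
}
{noformat}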

The following steps reliably reproduce the problem:
# Make all cores fully loaded (on a 48-core machine):
{noformat}
docker run --rm --name busy alpine sh -c 'for i in `seq 1 48`; do sh -c "while true; do true; done" & done; wait'
{noformat}
I use Docker to make it easy to terminate all the busy processes.
# Run the following test in repetition:
{noformat}
sh bin/mesos-tests.sh --gtest_filter='MasterTest.AgentRestartNoReregisterRateLimit' --gtest_repeat=-1 --gtest_break_on_failure --verbose
{noformat}

For this particular test, the fix is simple: explicitly destroy the previous 
{{cluster::Slave}} (and with it its {{LevelDBStorage}}) before the agent is 
restarted:
{noformat}
--- a/src/tests/master_tests.cpp
+++ b/src/tests/master_tests.cpp
@@ -7537,6 +7537,7 @@ TEST_F(MasterTest, AgentRestartNoReregisterRateLimit)
   // Terminate the agent abruptly. This causes the master -> agent
   // socket to break on the master side.
   slave.get()->terminate();
+  slave->reset();

   Future<ReregisterSlaveMessage> reregisterSlave =
     DROP_PROTOBUF(ReregisterSlaveMessage(), _, _);
{noformat}

> Resource provider manager registrar recovery can race with agent on agent 
> state leading to hard failures
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8839
>                 URL: https://issues.apache.org/jira/browse/MESOS-8839
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.6.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>
> When running in the agent the resource provider manager persists its state 
> into the agent's state. The agent uses a LevelDB state which protects against 
> concurrent access. The way we modelled LevelDB, a {{fetch}} while a lock is 
> held leads to a failed {{Future}} result. When the resource provider 
> manager encounters a failed recovery, it emits a fatal error, e.g.,
> {noformat}
> 11:48:26 F0425 11:48:26.650568 26819 manager.cpp:254] Failed to recover 
> resource provider manager registry: Failed: IO error: lock 
> /tmp/ParentChildContainerTypeAndContentType_AgentContainerAPITest_RecoverNestedContainer_10_HXbQCK/meta/slaves/6645885c-050a-4518-b896-a20b3e72a070-S0/resource_provider_registry/LOCK:
>  already held by process
> 11:48:26 *** Check failure stack trace: ***{noformat}
> We should not fail hard for such recoverable failure scenarios.


