[ https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732473#comment-16732473 ]
Vinod Kone commented on MESOS-7622: ----------------------------------- [~kaysoky] Did you fix this recently? > Agent can crash if a HTTP executor tries to retry subscription in running > state. > -------------------------------------------------------------------------------- > > Key: MESOS-7622 > URL: https://issues.apache.org/jira/browse/MESOS-7622 > Project: Mesos > Issue Type: Bug > Components: agent, executor > Affects Versions: 1.2.2 > Reporter: Aaron Wood > Priority: Critical > > It is possible that a running executor might retry its subscribe request. > This can lead to a crash if it previously had any launched tasks. Note that > the executor would still be able to subscribe again when the agent process > restarts and is recovering. > {code} > sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave > --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime > --image_providers=docker --image_provisioner_backend=overlay > --containerizers=mesos --launcher_dir=$(pwd) > --executor_environment_variables='{"LD_LIBRARY_PATH": > "/home/aaron/Code/src/mesos/build/src/.libs"}' > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by > aaron > I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0 > I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected > I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state > I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice > `mesos_executors.slice` > I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver > I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: > cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret > I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' > failed; this is the output: > sh: 1: hadoop: not found > I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin > 'hadoop' as it could not be created: Failed to create HDFS client: Failed to > execute 'hadoop version 2>&1'; the command was either not found or exited > with a non-zero exit status: 127 > I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend > 'overlay' > I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on > (1)@127.0.1.1:5051 > I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: > --appc_simple_discovery_uri_prefix="http://" > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}" > --executor_registration_timeout="1mins" > --executor_reregistration_timeout="2secs" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname_lookup="true" > --http_command_executor="false" --http_heartbeat_interval="30secs" > --image_providers="docker" --image_provisioner_backend="overlay" > --initialize_driver_logging="true" > --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime" > --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" > --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" > --max_completed_executors_per_framework="150" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" > --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" > --strict="true" --switch_user="true" --systemd_enable_support="true" > --systemd_runtime_directory="/run/systemd/system" --version="false" > --work_dir="/tmp/slave" > I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; > mem(*):6956; disk(*):41113; ports(*):[31000-32000] > I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [ ] > I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64 > I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending > status updates > I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from > '/tmp/slave/meta' > I0605 14:58:23.787698 10713 status_update_manager.cpp:203] Recovering status > update manager > I0605 14:58:23.787755 10713 containerizer.cpp:582] Recovering containerizer > I0605 14:58:23.788499 10714 linux_launcher.cpp:291] Recovered container > 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.788805 10714 linux_launcher.cpp:373] > 0fbfe58f-a1bf-4128-83ed-81c2901299b7 is a known orphaned container > I0605 14:58:23.790591 10714 memory.cpp:479] Started listening for OOM events > for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.791183 10714 memory.cpp:590] Started listening on 'low' memory > pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.791714 10714 memory.cpp:590] Started listening on 'medium' > memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.792260 10714 memory.cpp:590] Started listening on 'critical' > memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.792899 10717 provisioner.cpp:372] Recovered container > 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.793344 10717 metadata_manager.cpp:236] Successfully loaded 1 > Docker images > I0605 14:58:23.793432 10717 provisioner.cpp:416] Provisioner recovery complete > I0605 14:58:23.793514 10717 containerizer.cpp:939] Cleaning up orphan > container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.793536 10717 containerizer.cpp:2101] Destroying container > 0fbfe58f-a1bf-4128-83ed-81c2901299b7 in RUNNING state > I0605 14:58:23.793669 10717 slave.cpp:6127] Finished recovery > I0605 14:58:23.793853 10711 linux_launcher.cpp:505] Asked to destroy > container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.794039 10717 slave.cpp:945] New master detected at > master@10.0.2.15:5050 > I0605 14:58:23.794100 10717 slave.cpp:969] No credentials provided. > Attempting to register without authentication > I0605 14:58:23.794109 10717 slave.cpp:980] Detecting new master > I0605 14:58:23.794131 10717 status_update_manager.cpp:177] Pausing sending > status updates > I0605 14:58:23.794304 10711 linux_launcher.cpp:548] Using freezer to destroy > cgroup mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.794937 10717 cgroups.cpp:2692] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.795821 10711 cgroups.cpp:1405] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after > 850944ns > I0605 14:58:23.813447 10714 cgroups.cpp:2710] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.816241 10714 cgroups.cpp:1434] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after > 2.750976ms > I0605 14:58:23.851388 10711 containerizer.cpp:2507] Container > 0fbfe58f-a1bf-4128-83ed-81c2901299b7 has exited > W0605 14:58:23.853590 10712 disk.cpp:448] Ignoring cleanup for unknown > container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:23.856833 10711 provisioner.cpp:576] Destroying container rootfs > at > '/tmp/slave/provisioner/containers/0fbfe58f-a1bf-4128-83ed-81c2901299b7/backends/overlay/rootfses/ade95f11-60ec-47ee-9e4e-bcbc63cc5eaa' > for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 > I0605 14:58:24.572808 10712 slave.cpp:1148] Registered with master > master@10.0.2.15:5050; given agent ID eb661932-585d-453a-bc64-a720e05ce634-S0 > I0605 14:58:24.572888 10711 status_update_manager.cpp:184] Resuming sending > status updates > I0605 14:58:24.573446 10712 slave.cpp:1206] Forwarding total oversubscribed > resources {} > I0605 14:58:24.579311 10711 slave.cpp:1632] Got assigned task > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:24.580113 10711 slave.cpp:1913] Authorizing task > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:24.580293 10711 slave.cpp:2100] Launching task > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:24.580703 10711 paths.cpp:573] Trying to chown > '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a' > to user 'root' > I0605 14:58:24.581671 10711 slave.cpp:7086] Launching executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 with resources cpus(*)(allocated: > *):0.1; mem(*)(allocated: *):32 in work directory > '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a' > I0605 14:58:24.581881 10711 slave.cpp:2795] Launching container > cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a for executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:24.582072 10712 containerizer.cpp:1056] Starting container > cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.582391 10711 slave.cpp:2329] Queued task > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:24.583070 10712 provisioner.cpp:459] Provisioning image rootfs > '/tmp/slave/provisioner/containers/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/backends/overlay/rootfses/736ac735-51b1-4bc9-892b-15330d0784f9' > for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a using overlay backend > I0605 14:58:24.587167 10715 memory.cpp:479] Started listening for OOM events > for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.587508 10715 memory.cpp:590] Started listening on 'low' memory > pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.587805 10715 memory.cpp:590] Started listening on 'medium' > memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.588145 10715 memory.cpp:590] Started listening on 'critical' > memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.588714 10715 memory.cpp:199] Updated > 'memory.soft_limit_in_bytes' to 64MB for container > cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.589285 10715 memory.cpp:228] Updated 'memory.limit_in_bytes' > to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.589588 10715 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus > 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:24.591984 10711 linux_launcher.cpp:429] Launching container > cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a and cloning with namespaces CLONE_NEWNS > I0605 14:58:24.593714 10711 systemd.cpp:96] Assigned child process '10736' to > 'mesos_executors.slice' > I0605 14:58:24.595461 10712 containerizer.cpp:1722] Checkpointing container's > forked pid 10736 to > '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/pids/forked.pid' > I0605 14:58:25.247808 10718 slave.cpp:3825] Got registration for executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459 > I0605 14:58:25.248884 10715 disk.cpp:208] Updating the disk resources for > container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a to cpus(*)(allocated: *):0.2; > mem(*)(allocated: *):64; disk(*)(allocated: *):1024 > I0605 14:58:25.249397 10716 memory.cpp:199] Updated > 'memory.soft_limit_in_bytes' to 64MB for container > cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:25.250066 10716 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus > 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a > I0605 14:58:25.250365 10715 slave.cpp:2542] Sending queued task > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' to executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459 > I0605 14:58:25.253556 10716 slave.cpp:4295] Handling status update > TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task > testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459 > I0605 14:58:25.254091 10714 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task > testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:25.254287 10714 status_update_manager.cpp:834] Checkpointing > UPDATE for status update TASK_RUNNING (UUID: > 0292182c-35e6-4df8-8583-fe11f726c211) for task > testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:25.254456 10714 slave.cpp:4735] Forwarding the update > TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task > testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to master@10.0.2.15:5050 > I0605 14:58:25.254534 10714 slave.cpp:4645] Sending acknowledgement for > status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for > task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to executor(1)@127.0.1.1:42459 > I0605 14:58:25.260128 10712 status_update_manager.cpp:395] Received status > update acknowledgement (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task > testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:25.260184 10712 status_update_manager.cpp:834] Checkpointing ACK > for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) > for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 > I0605 14:58:25.352159 10712 http.cpp:1115] HTTP POST for > /slave(1)/api/v1/executor from 10.0.2.15:55512 with > User-Agent='mesos-framework-sdk' > I0605 14:58:25.352342 10712 slave.cpp:3609] Received Subscribe request for > HTTP executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459 > I0605 14:58:25.352388 10712 slave.cpp:3672] Creating a marker file for HTTP > based executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework > b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 (via HTTP) at path > '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/http.marker' > F0605 14:58:25.352444 10712 slave.cpp:3685] Check failed: > executor->launchedTasks.empty() Newly registered executor > 'testapp-cc6e64001fee44e3a20d7a15149d8b34' has launched tasks > *** Check failure stack trace: *** > @ 0x7fb835fe30ed google::LogMessage::Fail() > @ 0x7fb835fe4ef0 google::LogMessage::SendToLog() > @ 0x7fb835fe2cb3 google::LogMessage::Flush() > @ 0x7fb835fe5939 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fb8346e126c mesos::internal::slave::Slave::subscribe() > @ 0x7fb8345f3169 mesos::internal::slave::Http::executor() > @ 0x7fb8346b1107 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_ > @ 0x7fb833964a10 > _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestEEEENKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data > @ 0x7fb83465fffe > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseEEEEclIRSt8functionIFS9_vEEEES9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_ > @ 0x7fb8339607ec process::ProcessBase::visit() > @ 0x7fb83396c03b process::ProcessManager::resume() > @ 0x7fb8339755bf > _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7fb832a3e83f (unknown) > @ 0x7fb83596d6da start_thread > @ 0x7fb8324ad17f (unknown) > Aborted (core dumped) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)