[jira] [Commented] (MESOS-4999) Mesos (or Marathon) lost tasks
[ https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217726#comment-15217726 ] Peter Kolloch commented on MESOS-4999: -- "Deleting" tasks in Marathon really means that Marathon submits kills for these tasks to Mesos. It will not update or delete the tasks immediately; it waits for a notification from Mesos. Superficially, this looks like a Mesos agent died. In that case, it often takes a long time (up to 10 minutes or more, depending on your configuration) until Mesos responds to a kill with a "TASK_KILLED" or "TASK_LOST". Therefore, it looks as if Marathon does not respond to the kill. Ideally, we would expose another task state such as "task kill sent" in the Marathon API so that the user sees what is going on, but this is not the case yet. Sorry for the confusion. [I cannot verify this hypothesis easily without the Marathon logs]
> Mesos (or Marathon) lost tasks
> --
>
> Key: MESOS-4999
> URL: https://issues.apache.org/jira/browse/MESOS-4999
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.27.2
> Environment: mesos - 0.27.0
> marathon - 0.15.2
> 189 mesos slaves with Ubuntu 14.04.2 on HP ProLiant DL380 Gen9,
> CPU - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @2.50GHz (48 cores (with hyperthreading)),
> RAM - 264G,
> Storage - 3.0T on RAID on HP Smart Array P840 Controller,
> HDD - 12 x HP EH0600JDYTL,
> Network - 2 x Intel Corporation Ethernet 10G 2P X710
> Reporter: Sergey Galkin
> Attachments: agent-mesos-docker-logs.tar.xz, masternode-1-mesos-marathon-log.tar.xz, masternode-3-mesos-marathon-log.tar.xz, mesos-nodes.png
>
>
> After many create/delete application cycles with docker instances through the Marathon API, I have a lot of lost tasks after the last *deletion of all applications in Marathon*.
> They fall into three types:
> 1. Tasks hang in STAGED status. I don't see these tasks in 'docker ps' on the slave, and _service docker restart_ on the mesos slave did not fix them.
> 2. Tasks stay RUNNING because docker hangs and can't delete these instances (a lot of
> {code}
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> {code}
> in stdout); _docker stop ID_ hangs, and these tasks can be fixed by _service docker restart_ on the mesos slave.
> 3. Tasks stay RUNNING after _service docker restart_ on the mesos slave.
> Screenshot attached
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
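The kill flow described in the comment above can be sketched as a tiny state machine with the proposed "task kill sent" state made explicit. This is an illustrative model only, not Marathon's actual code; the class and state names are invented for the sketch:

```python
# Illustrative model of the kill flow: "deleting" only submits a kill, and the
# task stays visible until a Mesos status update arrives (possibly minutes
# later if the agent is unreachable).
from enum import Enum

class TaskState(Enum):
    RUNNING = "running"
    KILL_SENT = "kill_sent"   # kill submitted to Mesos, no confirmation yet
    KILLED = "killed"         # Mesos answered with TASK_KILLED
    LOST = "lost"             # Mesos answered with TASK_LOST

class Task:
    def __init__(self):
        self.state = TaskState.RUNNING

    def request_kill(self):
        # Marathon's "delete": submit a kill, but do not remove the task yet.
        if self.state is TaskState.RUNNING:
            self.state = TaskState.KILL_SENT

    def on_status_update(self, mesos_status):
        # Only a status update from Mesos moves the task out of KILL_SENT.
        if self.state is TaskState.KILL_SENT:
            if mesos_status == "TASK_KILLED":
                self.state = TaskState.KILLED
            elif mesos_status == "TASK_LOST":
                self.state = TaskState.LOST

task = Task()
task.request_kill()
print(task.state)                    # still KILL_SENT: looks "stuck" to the user
task.on_status_update("TASK_KILLED")
print(task.state)                    # KILLED only once Mesos confirms
```

A state like KILL_SENT is exactly what the comment wishes the Marathon API exposed, so the user can distinguish "kill in flight" from "Marathon did nothing".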
[jira] [Commented] (MESOS-4999) Mesos (or Marathon) lost tasks
[ https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217713#comment-15217713 ] Peter Kolloch commented on MESOS-4999: -- Hi Sergey, the files that you provided do not really include the Marathon logs -- the upstart/marathon.log.* files only contain the startup command. So if the problem is in Marathon, it is hard to diagnose in this fashion. What version of Marathon are you using?
> [quoted issue description trimmed]
[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine
[ https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989299#comment-14989299 ] Peter Kolloch commented on MESOS-3793: -- The last log line (Failed to locate systemd runtime directory: /run/systemd/system) makes it look as if mesos depends on systemd. Is that correct and expected? > Cannot start mesos local on a Debian GNU/Linux 8 docker machine > --- > > Key: MESOS-3793 > URL: https://issues.apache.org/jira/browse/MESOS-3793 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.25.0 > Environment: Debian GNU/Linux 8 docker machine > Reporter: Matthias Veit > Assignee: Jojy Varghese > Labels: mesosphere > > We updated the mesos version to 0.25.0 in our Marathon docker image that > runs our integration tests. > We use mesos local for those tests. This fails with this message: > {noformat} > root@a06e4b4eb776:/marathon# mesos local > I1022 18:42:26.852485 136 leveldb.cpp:176] Opened db in 6.103258ms > I1022 18:42:26.853302 136 leveldb.cpp:183] Compacted db in 765740ns > I1022 18:42:26.853343 136 leveldb.cpp:198] Created db iterator in 9001ns > I1022 18:42:26.853355 136 leveldb.cpp:204] Seeked to beginning of db in > 1287ns > I1022 18:42:26.853366 136 leveldb.cpp:273] Iterated through 0 keys in the > db in ns > I1022 18:42:26.853406 136 replica.cpp:744] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1022 18:42:26.853775 141 recover.cpp:449] Starting replica recovery > I1022 18:42:26.853862 141 recover.cpp:475] Replica is in EMPTY status > I1022 18:42:26.854751 138 replica.cpp:641] Replica in EMPTY status received > a broadcasted recover request > I1022 18:42:26.854856 140 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1022 18:42:26.855002 140 recover.cpp:566] Updating replica status to > STARTING > I1022 18:42:26.855655 138 master.cpp:376] Master > a3f39818-1bda-4710-b96b-2a60ed4d12b8 (a06e4b4eb776) started on > 172.17.0.14:5050 > I1022 
18:42:26.855680 138 master.cpp:378] Flags at startup: > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="false" > --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" > --registry_strict="false" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" > --work_dir="/tmp/mesos/local/AK0XpG" --zk_session_timeout="10secs" > I1022 18:42:26.855790 138 master.cpp:425] Master allowing unauthenticated > frameworks to register > I1022 18:42:26.855803 138 master.cpp:430] Master allowing unauthenticated > slaves to register > I1022 18:42:26.855815 138 master.cpp:467] Using default 'crammd5' > authenticator > W1022 18:42:26.855829 138 authenticator.cpp:505] No credentials provided, > authentication requests will be refused > I1022 18:42:26.855840 138 authenticator.cpp:512] Initializing server SASL > I1022 18:42:26.856442 136 containerizer.cpp:143] Using isolation: > posix/cpu,posix/mem,filesystem/posix > I1022 18:42:26.856943 140 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.888185ms > I1022 18:42:26.856987 140 replica.cpp:323] Persisted replica status to > STARTING > I1022 18:42:26.857115 140 recover.cpp:475] Replica is in STARTING status > I1022 18:42:26.857270 140 replica.cpp:641] Replica in STARTING status > received a broadcasted recover request > I1022 18:42:26.857312 140 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1022 18:42:26.857368 140 recover.cpp:566] Updating replica status to VOTING > 
I1022 18:42:26.857781 140 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 371121ns > I1022 18:42:26.857841 140 replica.cpp:323] Persisted replica status to > VOTING > I1022 18:42:26.857895 140 recover.cpp:580] Successfully joined the Paxos > group > I1022 18:42:26.857928 140 recover.cpp:464] Recover process terminated > I1022 18:42:26.862455 137 master.cpp:1603] The newly elected leader is > master@172.17.0.14:5050 with id a3f39818-1bda-4710-b96b-2a60ed4d12b8 > I1022 18:42:26.862498 137 master.cpp:1616] Elected as the leading master! > I1022 18:42:26.862511 137 master.cpp:1376] Recovering from registrar > I1022 18:42:26.862560 137 registrar.cpp:309] Recovering
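The "Failed to locate systemd runtime directory: /run/systemd/system" message in the comment above matches a well-known convention: per sd_booted(3), a system is considered to be running systemd exactly when the directory /run/systemd/system exists, and that directory is normally absent inside a plain Docker container. A minimal sketch of that convention (not Mesos's actual C++ check):

```python
# Sketch of the sd_booted(3) convention the log message points at:
# systemd counts as running iff /run/systemd/system exists. Inside an
# ordinary Docker container this directory is usually absent, which is
# consistent with what mesos local reports here.
import os

def systemd_is_running(runtime_dir="/run/systemd/system"):
    return os.path.isdir(runtime_dir)

print(systemd_is_running())  # usually False in a container without systemd
```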
[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine
[ https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989308#comment-14989308 ] Peter Kolloch commented on MESOS-3793: -- [~karlkfi] Is it correct that you encountered this problem, too? Did you find a workaround?
> [...] > I1022 18:42:26.862560 137 registrar.cpp:309] Recovering registrar > Failed to create a containerizer: Could not create
[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine
[ https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989332#comment-14989332 ] Peter Kolloch commented on MESOS-3793: -- I found this related CHANGELOG entry (https://github.com/apache/mesos/blob/master/CHANGELOG#L109): {code} * [MESOS-3425] - Modify LinuxLauncher to support Systemd. {code} Maybe MESOS-3425 introduced a hard dependency on systemd utilities? Maybe MESOS-1159 is about fixing that, but I am not sure.
> [quoted issue description trimmed]
[jira] [Created] (MESOS-3744) Master crashes when tearing down framework
Peter Kolloch created MESOS-3744: Summary: Master crashes when tearing down framework Key: MESOS-3744 URL: https://issues.apache.org/jira/browse/MESOS-3744 Project: Mesos Issue Type: Bug Components: allocation Affects Versions: 0.23.0 Reporter: Peter Kolloch Here is an excerpt of the startup of the affected mesos master, since it contains the software versions in use: Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started! Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0 Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 7d15294f46b5062c59818f4d062044ac04349dc1 Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 10366ns Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db in 78606ns Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal 
mesos-master[18936]: I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log positions 125 -> 126 with 0 holes and 0 unlearned Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5 Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client environment:os.name=Linux Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client environment:os.arch=4.0.5 Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Fri Jul 10 01:01:50 UTC 2015 Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@733: Client environment:user.name=(null) Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@741: Client environment:user.home=/root Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@753: Client environment:user.dir=/ Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=127.0.0.1:2181 sessionTimeout=1 watcher=0x7f0532095480 sessionId=0 sessionPasswd= context=0x7f0504001130 flags=0 Oct 15 13:13:38 
ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.160876 18936 main.cpp:383] Starting Mesos master Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5 Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f0528cd3700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:2181] Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015
[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework
[ https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Kolloch updated MESOS-3744: - Attachment: master-fail.log
> Master crashes when tearing down framework
> --
>
> Key: MESOS-3744
> URL: https://issues.apache.org/jira/browse/MESOS-3744
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Affects Versions: 0.23.0
> Reporter: Peter Kolloch
> Attachments: master-fail.log
>
>
> [quoted description trimmed]
[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework
[ https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Kolloch updated MESOS-3744: - Description: The crash happened shortly after calling teardown. The teardown was initiated by using httpie with: http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK" Below you will find the master-fail.log over the relevant time interval. Here are the last log lines before the mesos master died: Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: total.resources.contains(slaveId) Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: *** Check failure stack trace: *** Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860169fd google::LogMessage::Fail() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd18601889d google::LogMessage::SendToLog() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860165ec google::LogMessage::Flush() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860191be google::LogMessageFatal::~LogMessageFatal() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186af3ea0 mesos::internal::master::allocator::DRFSorter::remove() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1869d6dec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbdab9 process::ProcessManager::resume() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbddaf process::schedule() Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1852bc66c (unknown) Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal 
mesos-master[23032]: @ 0x7fd184fff2ed (unknown) I am not sure if it matters, but in this case multiple framework instances registered with the same framework name. [startup excerpt trimmed; it repeats the software-version excerpt from the original issue description]
[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework
[ https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Kolloch updated MESOS-3744:
-
Description:
The crash happened shortly after calling teardown. The teardown was initiated by using httpie with:
http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
Below you will find the master-fail.log over the relevant time interval. Here are the last log lines before the mesos master died:
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: *** Check failure stack trace: ***
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860169fd google::LogMessage::Fail()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd18601889d google::LogMessage::SendToLog()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860165ec google::LogMessage::Flush()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860191be google::LogMessageFatal::~LogMessageFatal()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186af3ea0 mesos::internal::master::allocator::DRFSorter::remove()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1869d6dec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbdab9 process::ProcessManager::resume()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbddaf process::schedule()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1852bc66c (unknown)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd184fff2ed (unknown)
Here is an excerpt of the startup of the affected mesos master version because it does contain the software versions in use:
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 7d15294f46b5062c59818f4d062044ac04349dc1
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 10366ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db in 78606ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log positions 125 -> 126 with 0 holes and 0 unlearned
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client environment:os.arch=4.0.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@725: Client
[jira] [Commented] (MESOS-3744) Master crashes when tearing down framework
[ https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958920#comment-14958920 ] Peter Kolloch commented on MESOS-3744:
--
Probably a duplicate of MESOS-3719

> Master crashes when tearing down framework
> --
>
> Key: MESOS-3744
> URL: https://issues.apache.org/jira/browse/MESOS-3744
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Affects Versions: 0.23.0
> Reporter: Peter Kolloch
> Attachments: master-fail.log
>
>
> The crash happened shortly after calling teardown. The teardown was initiated by using httpie with:
> http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
> Below you will find the master-fail.log over the relevant time interval. Here are the last log lines before the mesos master died:
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: *** Check failure stack trace: ***
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860169fd google::LogMessage::Fail()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd18601889d google::LogMessage::SendToLog()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860165ec google::LogMessage::Flush()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860191be google::LogMessageFatal::~LogMessageFatal()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186af3ea0 mesos::internal::master::allocator::DRFSorter::remove()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1869d6dec mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbdab9 process::ProcessManager::resume()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbddaf process::schedule()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1852bc66c (unknown)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd184fff2ed (unknown)
> I am not sure if it matters but in this case multiple framework instances registered with the same framework name.
> Here is an excerpt of the startup of the affected mesos master version because it does contain the software versions in use:
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 7d15294f46b5062c59818f4d062044ac04349dc1
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 10366ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db in 78606ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log positions 125 -> 126 with 0 holes and 0 unlearned
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Oct 15 13:13:38
[jira] [Commented] (MESOS-2802) Prevent immediate reuse of network ports for different tasks
[ https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636930#comment-14636930 ] Peter Kolloch commented on MESOS-2802:
--
Hi [~bmahler], for some reason I missed the update notification, sorry.

A grace period can be made secure. If you refresh your load balancer configuration every 30s (or by listening to update events) and your grace period is 2min, it is very unlikely that you connect to an old application by accident. If you want to be certain, you could adjust your load balancer such that it refuses requests if its configuration is older than 100s. For this solution, you only have to adjust one software component, the load balancer. Nothing else has to be adjusted.

An alternative would be to set an HTTP header in the load balancer (e.g. X-App-Id: my-app) and modify _ALL_ applications accepting HTTP requests in your mesos cluster to reject requests that do not have the correct X-App-Id field. While theoretically possible, this is hard to achieve and, even worse, it is easy to forget adjusting one of your applications which was never meant to be available to the outside world.

Can you think of a more practical solution than mine for solving this problem at the application level?

Prevent immediate reuse of network ports for different tasks

Key: MESOS-2802
URL: https://issues.apache.org/jira/browse/MESOS-2802
Project: Mesos
Issue Type: Improvement
Reporter: Peter Kolloch
Labels: mesosphere

Currently, if a task finishes or dies, another task might reuse the same port immediately afterwards. If another task or a load balancer connects to this port, still expecting the old task, there might be unpleasant surprises. For example, imagine that a visitor of your Mesos-hosted web page sees your internal reporting tool instead of your company's marketing material when hitting your page during an update.

To make this less likely, Marathon contains code which tries to randomize dynamically assigned ports. This is a workaround at best and we would like to get rid of this code. I imagine that other frameworks might include similar code.

As a solution, I propose a grace period for ports. If a task dies, the associated ports resources should not immediately go back into the resource pool. Instead, Mesos should wait for a configurable time and only then offer them for new tasks again. If you then specify a grace period of 2 minutes and update your service discovery load balancer every 30 seconds, you can be reasonably sure that no improper port reuse occurs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
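The staleness check suggested in the comment above (refuse traffic once the routing configuration is older than 100 s, comfortably inside the 2-minute port grace period) can be sketched in a few lines of Python. The constants and the function name are illustrative, not part of any Mesos or Marathon API:

```python
import time

# Values from the discussion above (all in seconds); assumptions, not defaults.
GRACE_PERIOD = 120      # ports are withheld for 2 minutes after a task dies
REFRESH_INTERVAL = 30   # the load balancer re-reads its config every 30 seconds
MAX_CONFIG_AGE = 100    # refuse requests if the config is older than 100 seconds

def should_serve(config_loaded_at, now=None):
    """Return True if the load balancer's config is fresh enough to serve.

    Because MAX_CONFIG_AGE < GRACE_PERIOD, any port this config still routes
    to was released less than GRACE_PERIOD seconds ago, so it cannot yet have
    been handed to a different task.
    """
    now = time.time() if now is None else now
    return (now - config_loaded_at) <= MAX_CONFIG_AGE
```

The point of the comment is that only this one component needs the check; none of the backend applications have to be modified.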
[jira] [Updated] (MESOS-2802) Prevent immediate reuse of network ports for different tasks
[ https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Kolloch updated MESOS-2802:
-
Labels: mesosphere (was: )

Prevent immediate reuse of network ports for different tasks

Key: MESOS-2802
URL: https://issues.apache.org/jira/browse/MESOS-2802
Project: Mesos
Issue Type: Improvement
Reporter: Peter Kolloch
Labels: mesosphere

Currently, if a task finishes or dies, another task might reuse the same port immediately afterwards. If another task or a load balancer connects to this port, still expecting the old task, there might be unpleasant surprises. For example, imagine that a visitor of your Mesos-hosted web page sees your internal reporting tool instead of your company's marketing material when hitting your page during an update.

To make this less likely, Marathon contains code which tries to randomize dynamically assigned ports. This is a workaround at best and we would like to get rid of this code. I imagine that other frameworks might include similar code.

As a solution, I propose a grace period for ports. If a task dies, the associated ports resources should not immediately go back into the resource pool. Instead, Mesos should wait for a configurable time and only then offer them for new tasks again. If you then specify a grace period of 2 minutes and update your service discovery load balancer every 30 seconds, you can be reasonably sure that no improper port reuse occurs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2859) Semantics of CommandInfo shell/value/arguments are very confusing
[ https://issues.apache.org/jira/browse/MESOS-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Kolloch updated MESOS-2859:
-
Labels: mesosphere (was: )

Semantics of CommandInfo shell/value/arguments are very confusing
-

Key: MESOS-2859
URL: https://issues.apache.org/jira/browse/MESOS-2859
Project: Mesos
Issue Type: Documentation
Reporter: Peter Kolloch
Labels: mesosphere

CommandInfo includes the following fields:

optional bool shell = 6 [default = true];
optional string value = 3;
repeated string arguments = 7;

There is some documentation for them which explains their behavior for the command executor but not for the docker executor: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L280

The two executors behave quite differently when you use shell=false and arguments. For the command executor, executing echo $PORT withOUT variable substitution could be achieved with shell=false value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]. See https://github.com/apache/mesos/blob/0.22.1/src/launcher/executor.cpp#L245

For the docker executor, using the same arguments with the ubuntu image (no default entrypoint) would result in executing /usr/bin/echo /usr/bin/echo $PORT, which is rather confusing. See https://github.com/apache/mesos/blob/0.22.1/src/docker/docker.cpp#L451-L457

For the command executor, I would propose to emphasize that for all sane use cases `arguments(0)` should be equal to `value` if you use shell=false. It would also help to include some examples, e.g.:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]

In the case of docker you actually need to distinguish between containers with a default entrypoint and the ones without. With the ubuntu image (without a default entrypoint) examples could be:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false value=/usr/bin/echo arguments=[$PORT] OR arguments=[/usr/bin/echo, $PORT]

Thanks!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
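The divergence described in the ticket can be captured in a small, simplified model (Python for illustration only; these helpers are not Mesos APIs, they merely mirror the argv assembly the report describes):

```python
def command_executor_argv(shell, value, arguments):
    """Simplified model of the command executor (launcher/executor.cpp):
    with shell=True the value is run via `sh -c`; with shell=False,
    `arguments` is used as the complete argv, which is why arguments[0]
    should normally repeat `value`."""
    if shell:
        return ["/bin/sh", "-c", value]
    return list(arguments)

def docker_executor_argv(shell, value, arguments):
    """Simplified model of the docker executor for an image WITHOUT a
    default entrypoint (docker/docker.cpp): `value` becomes the command
    and `arguments` are appended after it."""
    if shell:
        return ["/bin/sh", "-c", value]
    return [value] + list(arguments)

# The same CommandInfo yields different effective command lines:
cmd = command_executor_argv(False, "/usr/bin/echo", ["/usr/bin/echo", "$PORT"])
dkr = docker_executor_argv(False, "/usr/bin/echo", ["/usr/bin/echo", "$PORT"])
# cmd runs: /usr/bin/echo $PORT            (literal $PORT, no substitution)
# dkr runs: /usr/bin/echo /usr/bin/echo $PORT   <- the confusing case
```

This makes the ticket's point concrete: with shell=false, the command executor treats arguments as the whole argv, while the docker executor prepends value, so repeating the binary in arguments doubles it.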
[jira] [Commented] (MESOS-2308) Task reconciliation API should support data partitioning
[ https://issues.apache.org/jira/browse/MESOS-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611935#comment-14611935 ] Peter Kolloch commented on MESOS-2308:
--
This might or might not be related: Currently, if you call reconcileTasks with an empty list, I do not know of any way to tell when the reconciliation has finished. This would be a really nice feature, since a framework would then not have to persist task state anymore: it could recover that state from Mesos on startup. Without this feature, a framework like Marathon might try to scale up an application unnecessarily because it has not yet received information about all tasks.

Would this be solved by this issue, or should I create a separate ticket for that? I would also be happy to know about any workarounds that might solve this problem.

Task reconciliation API should support data partitioning

Key: MESOS-2308
URL: https://issues.apache.org/jira/browse/MESOS-2308
Project: Mesos
Issue Type: Story
Reporter: Bill Farner

The {{reconcileTasks}} API call requires the caller to specify a collection of {{TaskStatus}}es, with the option to provide an empty collection to retrieve the master's entire state. Retrieving the entire state is the only mechanism for the scheduler to learn that there are tasks running it does not know about; however, this call does not allow incremental querying. The result would be that the master may need to send many thousands of status updates, and the scheduler would have to handle them. It would be ideal if the scheduler had a means to partition these requests so it can control the pace of these status updates.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
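A scheduler-side approximation of the pacing requested in this story is to reconcile known tasks in bounded batches. A minimal sketch follows; the function name and batch size are illustrative, `driver` is assumed to expose `reconcileTasks(statuses)` as in the Mesos scheduler driver API, and note this does not help discover tasks the scheduler has never heard of (only the empty-list call does that):

```python
def reconcile_in_batches(driver, task_statuses, batch_size=100):
    """Send explicit reconciliation requests in bounded batches so the
    master never has to answer for more than batch_size tasks at once.

    task_statuses: the TaskStatus-like objects the scheduler knows about.
    Returns the list of batches that were sent, for inspection.
    """
    batches = [task_statuses[i:i + batch_size]
               for i in range(0, len(task_statuses), batch_size)]
    for batch in batches:
        driver.reconcileTasks(batch)
    return batches
```

The scheduler controls the pace by choosing batch_size (and, if desired, sleeping between batches), which is exactly the control the story asks the master-side API to provide.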
[jira] [Commented] (MESOS-2374) Support relative host paths for container volumes
[ https://issues.apache.org/jira/browse/MESOS-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596407#comment-14596407 ] Peter Kolloch commented on MESOS-2374:
--
This is a great idea which was suggested multiple times by users of Marathon. See https://github.com/mesosphere/marathon/issues/1694 for the latest instance.

Support relative host paths for container volumes
-

Key: MESOS-2374
URL: https://issues.apache.org/jira/browse/MESOS-2374
Project: Mesos
Issue Type: Improvement
Components: containerization, docker
Affects Versions: 0.21.1
Reporter: Mike Babineau

There is no convenient way to mount sandbox subdirectories (such as unpacked archives from fetched URIs) as container volumes. While it is possible to access sandbox subdirectories via $MESOS_SANDBOX, this presumes the container is expecting $MESOS_SANDBOX to be passed in. Furthermore, it also expects that the container already knows the resulting subdirectory paths. Unfortunately, since the archives are extracted by the fetcher, operators cannot control these paths. Path changes to the extracted archive must be accompanied by a container image change.

One potential solution: Add support for relative paths to the containerizer. If the containerizer is given a relative host path, it simply prepends the sandbox path before passing it to Docker (or similar).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
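The prepending behavior proposed in the ticket is essentially a one-liner; here is a sketch of it in Python (a hypothetical helper for illustration, not actual Mesos containerizer code):

```python
import os.path

def resolve_volume_host_path(host_path, sandbox_dir):
    """Proposed containerizer behavior: interpret a relative host path
    relative to the task sandbox before handing it to Docker; absolute
    paths pass through unchanged."""
    if os.path.isabs(host_path):
        return host_path
    return os.path.join(sandbox_dir, host_path)
```

With this rule, a volume like `host_path: extracted/conf` would always resolve inside the sandbox, regardless of where the fetcher placed the sandbox on the agent.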
[jira] [Created] (MESOS-2859) Semantics of CommandInfo shell/value/arguments are very confusing
Peter Kolloch created MESOS-2859:

Summary: Semantics of CommandInfo shell/value/arguments are very confusing
Key: MESOS-2859
URL: https://issues.apache.org/jira/browse/MESOS-2859
Project: Mesos
Issue Type: Documentation
Reporter: Peter Kolloch

CommandInfo includes the following fields:

optional bool shell = 6 [default = true];
optional string value = 3;
repeated string arguments = 7;

There is some documentation for them which explains their behavior for the command executor but not for the docker executor: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L280

The two executors behave quite differently when you use shell=false and arguments. For the command executor, executing echo $PORT withOUT variable substitution could be achieved with shell=false value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]. See https://github.com/apache/mesos/blob/0.22.1/src/launcher/executor.cpp#L245

For the docker executor, using the same arguments with the ubuntu image (no default entrypoint) would result in executing /usr/bin/echo /usr/bin/echo $PORT, which is rather confusing. See https://github.com/apache/mesos/blob/0.22.1/src/docker/docker.cpp#L451-L457

For the command executor, I would propose to emphasize that for all sane use cases `arguments(0)` should be equal to `value` if you use shell=false. It would also help to include some examples, e.g.:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]

In the case of docker you actually need to distinguish between containers with a default entrypoint and the ones without. With the ubuntu image (without a default entrypoint) examples could be:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false value=/usr/bin/echo arguments=[$PORT] OR arguments=[/usr/bin/echo, $PORT]

Thanks!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2802) Prevent immediate reuse of network ports for different tasks
[ https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570477#comment-14570477 ] Peter Kolloch commented on MESOS-2802:
--
Hi Niklas, Hi Adam,

do you have a suggestion for how to ensure security without a grace period for ports? I'd be happy to hear it.

Otherwise, I see your point about resource utilization problems if you have many short-lived tasks, but that's typically not an issue. Let's say your typical task has one port and you reserve 50,000 ports for Mesos on each slave. That might not be typical but is possible. With a grace period of 2min you are talking about a sustained ~416 task launches per second on one slave until you are temporarily out of port resources. If that is not sufficient, you could maybe use multiple IPs on that host.

MESOS-2018 would allow frameworks to solve this themselves by implementing the port grace periods. That's good. Unfortunately, this would not solve the port starvation problem but move the implementation burden to every single framework. And, what's worse, if they forget to implement it, they are insecure by default.

Prevent immediate reuse of network ports for different tasks

Key: MESOS-2802
URL: https://issues.apache.org/jira/browse/MESOS-2802
Project: Mesos
Issue Type: Improvement
Reporter: Peter Kolloch

Currently, if a task finishes or dies, another task might reuse the same port immediately afterwards. If another task or a load balancer connects to this port, still expecting the old task, there might be unpleasant surprises. For example, imagine that a visitor of your Mesos-hosted web page sees your internal reporting tool instead of your company's marketing material when hitting your page during an update.

To make this less likely, Marathon contains code which tries to randomize dynamically assigned ports. This is a workaround at best and we would like to get rid of this code. I imagine that other frameworks might include similar code.

As a solution, I propose a grace period for ports. If a task dies, the associated ports resources should not immediately go back into the resource pool. Instead, Mesos should wait for a configurable time and only then offer them for new tasks again. If you then specify a grace period of 2 minutes and update your service discovery load balancer every 30 seconds, you can be reasonably sure that no improper port reuse occurs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
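The ~416 launches/second figure quoted in the comment above follows directly from the pool size and the grace period; as a quick back-of-the-envelope check:

```python
# Assumptions from the comment: 50,000 ports reserved per slave, one port
# per task, and a 2-minute grace period during which a dead task's port
# is withheld from offers.
ports_per_slave = 50_000
grace_seconds = 2 * 60

# At steady state every launch withholds one port for the full grace
# period, so the pool runs dry once the sustained launch rate reaches
# ports_per_slave / grace_seconds launches per second.
max_sustained_rate = ports_per_slave // grace_seconds
print(max_sustained_rate)  # 416 launches per second per slave
```

Below that rate the slave never runs out of port resources; above it, ports become temporarily scarce until the grace period drains.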
[jira] [Created] (MESOS-2802) Prevent immediate reuse of network ports for different tasks
Peter Kolloch created MESOS-2802:

Summary: Prevent immediate reuse of network ports for different tasks
Key: MESOS-2802
URL: https://issues.apache.org/jira/browse/MESOS-2802
Project: Mesos
Issue Type: Improvement
Reporter: Peter Kolloch

Currently, if a task finishes or dies, another task might reuse the same port immediately afterwards. If another task or a load balancer connects to this port, still expecting the old task, there might be unpleasant surprises. For example, imagine that a visitor of your Mesos-hosted web page sees your internal reporting tool instead of your company's marketing material when hitting your page during an update.

To make this less likely, Marathon contains code which tries to randomize dynamically assigned ports. This is a workaround at best and we would like to get rid of this code. I imagine that other frameworks might include similar code.

As a solution, I propose a grace period for ports. If a task dies, the associated ports resources should not immediately go back into the resource pool. Instead, Mesos should wait for a configurable time and only then offer them for new tasks again. If you then specify a grace period of 2 minutes and update your service discovery load balancer every 30 seconds, you can be reasonably sure that no improper port reuse occurs.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)