[jira] [Commented] (MESOS-4999) Mesos (or Marathon) lost tasks

2016-03-30 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217726#comment-15217726
 ] 

Peter Kolloch commented on MESOS-4999:
--

"Deleting" tasks in Marathon really means that Marathon submits kills for these 
tasks to Mesos. It will not update or delete the tasks immediately but it will 
wait for a notification from Mesos.
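
To make that flow concrete, here is a minimal sketch, assuming the standard
Marathon v2 REST API (deleting an app is a DELETE on /v2/apps/{appId}); the
helper name is hypothetical:

```python
# Minimal sketch of the flow described above, assuming the standard
# Marathon v2 REST API. The helper name is hypothetical.

def delete_app_request(app_id):
    """Build the HTTP method and path that 'deleting' an app translates to.

    Marathon then issues kills to Mesos for the app's tasks and only drops
    them once Mesos reports TASK_KILLED / TASK_LOST back.
    """
    return ("DELETE", "/v2/apps/" + app_id.strip("/"))

method, path = delete_app_request("/my-group/my-app")
print(method, path)  # DELETE /v2/apps/my-group/my-app
```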

Superficially, this looks like a Mesos agent died. In that case, it often takes 
a long time (e.g. up to 10 minutes or more, depending on your configuration) 
until Mesos responds to a kill with a "TASK_KILLED" or "TASK_LOST". Therefore, 
it looks as if Marathon does not respond to the kill.
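
A back-of-the-envelope sketch of where that delay comes from, assuming the
default agent health-check flags (exact semantics vary by Mesos version):

```python
# Back-of-the-envelope sketch of the kill delay after an agent dies,
# assuming default agent health-check flags (semantics vary by version).
slave_ping_timeout_s = 15      # --slave_ping_timeout="15secs"
max_slave_ping_timeouts = 5    # --max_slave_ping_timeouts="5"
reregister_timeout_min = 10    # --slave_reregister_timeout="10mins"

# The master declares the agent unhealthy only after all pings time out:
health_check_s = slave_ping_timeout_s * max_slave_ping_timeouts
print(health_check_s)  # 75 (seconds before the agent is deemed unreachable)

# On top of that, the agent has up to reregister_timeout_min minutes to
# re-register before its tasks are reported TASK_LOST, which is why a kill
# can appear to hang for roughly 10 minutes or more.
```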

Ideally, we would expose another task state such as "task kill sent" in the 
Marathon API so that the user sees what is going on. But this is not the case 
yet. Sorry for the confusion.

[I cannot verify this hypothesis easily without the Marathon logs]

> Mesos (or Marathon) lost tasks
> --
>
> Key: MESOS-4999
> URL: https://issues.apache.org/jira/browse/MESOS-4999
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.2
> Environment: mesos - 0.27.0
> marathon - 0.15.2
> 189 mesos slaves with Ubuntu 14.04.2 on HP ProLiant DL380 Gen9,
> CPU - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @2.50GHz (48 cores (with 
> hyperthreading))
> RAM - 264G,
> Storage - 3.0T on RAID on HP Smart Array P840 Controller,
> HDD - 12 x HP EH0600JDYTL
> Network - 2 x Intel Corporation Ethernet 10G 2P X710,
>Reporter: Sergey Galkin
> Attachments: agent-mesos-docker-logs.tar.xz, 
> masternode-1-mesos-marathon-log.tar.xz, 
> masternode-3-mesos-marathon-log.tar.xz, mesos-nodes.png
>
>
> After many create/delete application cycles with Docker instances through the 
> Marathon API, I have a lot of lost tasks after the final *deletion of all 
> applications in Marathon*.
> They fall into three types:
> 1. Tasks hang in STAGED status. I don't see these tasks in 'docker ps' on the 
> slave, and _service docker restart_ on the mesos slave did not fix these tasks.
> 2. RUNNING because docker hangs and can't delete these instances  (a lot of 
> {code}
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> {code}
>  in stdout); 
> _docker stop ID_ hangs, and these tasks can be fixed by _service docker 
> restart_ on the mesos slave.
> 3. Tasks still RUNNING after _service docker restart_ on the mesos slave.
> Screenshot attached 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4999) Mesos (or Marathon) lost tasks

2016-03-30 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217713#comment-15217713
 ] 

Peter Kolloch commented on MESOS-4999:
--

Hi Sergey, the files that you provided do not really include the Marathon logs 
-- the upstart/marathon.log.* files only include the startup command. So if the 
problem is in Marathon, it is hard to diagnose in this fashion. What version of 
Marathon are you using?






[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine

2015-11-04 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989299#comment-14989299
 ] 

Peter Kolloch commented on MESOS-3793:
--

The last log line (Failed to locate systemd runtime directory: 
/run/systemd/system) suggests that Mesos depends on systemd. Is that correct 
and expected?


[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine

2015-11-04 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989308#comment-14989308
 ] 

Peter Kolloch commented on MESOS-3793:
--

[~karlkfi] Is it correct that you encountered this problem, too? Did you find a 
workaround?

> Cannot start mesos local on a Debian GNU/Linux 8 docker machine
> ---
>
> Key: MESOS-3793
> URL: https://issues.apache.org/jira/browse/MESOS-3793
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
> Environment: Debian GNU/Linux 8 docker machine
>Reporter: Matthias Veit
>Assignee: Jojy Varghese
>  Labels: mesosphere
>
> We updated the mesos version to 0.25.0 in our Marathon docker image, that 
> runs our integration tests.
> We use mesos local for those tests. This fails with this message:
> {noformat}
> root@a06e4b4eb776:/marathon# mesos local
> I1022 18:42:26.852485   136 leveldb.cpp:176] Opened db in 6.103258ms
> I1022 18:42:26.853302   136 leveldb.cpp:183] Compacted db in 765740ns
> I1022 18:42:26.853343   136 leveldb.cpp:198] Created db iterator in 9001ns
> I1022 18:42:26.853355   136 leveldb.cpp:204] Seeked to beginning of db in 
> 1287ns
> I1022 18:42:26.853366   136 leveldb.cpp:273] Iterated through 0 keys in the 
> db in ns
> I1022 18:42:26.853406   136 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1022 18:42:26.853775   141 recover.cpp:449] Starting replica recovery
> I1022 18:42:26.853862   141 recover.cpp:475] Replica is in EMPTY status
> I1022 18:42:26.854751   138 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I1022 18:42:26.854856   140 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1022 18:42:26.855002   140 recover.cpp:566] Updating replica status to 
> STARTING
> I1022 18:42:26.855655   138 master.cpp:376] Master 
> a3f39818-1bda-4710-b96b-2a60ed4d12b8 (a06e4b4eb776) started on 
> 172.17.0.14:5050
> I1022 18:42:26.855680   138 master.cpp:378] Flags at startup: 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_slaves="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_slave_ping_timeouts="5" --quiet="false" 
> --recovery_slave_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" 
> --registry_strict="false" --root_submissions="true" 
> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
> --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/tmp/mesos/local/AK0XpG" --zk_session_timeout="10secs"
> I1022 18:42:26.855790   138 master.cpp:425] Master allowing unauthenticated 
> frameworks to register
> I1022 18:42:26.855803   138 master.cpp:430] Master allowing unauthenticated 
> slaves to register
> I1022 18:42:26.855815   138 master.cpp:467] Using default 'crammd5' 
> authenticator
> W1022 18:42:26.855829   138 authenticator.cpp:505] No credentials provided, 
> authentication requests will be refused
> I1022 18:42:26.855840   138 authenticator.cpp:512] Initializing server SASL
> I1022 18:42:26.856442   136 containerizer.cpp:143] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix
> I1022 18:42:26.856943   140 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 1.888185ms
> I1022 18:42:26.856987   140 replica.cpp:323] Persisted replica status to 
> STARTING
> I1022 18:42:26.857115   140 recover.cpp:475] Replica is in STARTING status
> I1022 18:42:26.857270   140 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I1022 18:42:26.857312   140 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1022 18:42:26.857368   140 recover.cpp:566] Updating replica status to VOTING
> I1022 18:42:26.857781   140 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 371121ns
> I1022 18:42:26.857841   140 replica.cpp:323] Persisted replica status to 
> VOTING
> I1022 18:42:26.857895   140 recover.cpp:580] Successfully joined the Paxos 
> group
> I1022 18:42:26.857928   140 recover.cpp:464] Recover process terminated
> I1022 18:42:26.862455   137 master.cpp:1603] The newly elected leader is 
> master@172.17.0.14:5050 with id a3f39818-1bda-4710-b96b-2a60ed4d12b8
> I1022 18:42:26.862498   137 master.cpp:1616] Elected as the leading master!
> I1022 18:42:26.862511   137 master.cpp:1376] Recovering from registrar
> I1022 18:42:26.862560   137 registrar.cpp:309] Recovering registrar
> Failed to create a containerizer: Could not create 

[jira] [Commented] (MESOS-3793) Cannot start mesos local on a Debian GNU/Linux 8 docker machine

2015-11-04 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989332#comment-14989332
 ] 

Peter Kolloch commented on MESOS-3793:
--

I found this related CHANGELOG entry 
(https://github.com/apache/mesos/blob/master/CHANGELOG#L109):

{code}
  * [MESOS-3425] - Modify LinuxLauncher to support Systemd.
{code}

Maybe MESOS-3425 introduced a hard dependency on systemd utilities? MESOS-1159 
may be about fixing that, but I am not sure.
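
For what it's worth, the error message points at a check along these lines:
systemd creates /run/systemd/system at boot, and that directory is typically
absent inside a Docker container. A hypothetical sketch (not the actual Mesos
code):

```python
import os

def systemd_runtime_present(root="/"):
    """Heuristic sketch (hypothetical, not the actual Mesos code): systemd
    creates /run/systemd/system at boot, so its absence suggests the process
    is not running under systemd -- typical inside a Docker container, and
    consistent with the 'Failed to locate systemd runtime directory' error."""
    return os.path.isdir(os.path.join(root, "run/systemd/system"))

print(systemd_runtime_present("/nonexistent"))  # False
```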


[jira] [Created] (MESOS-3744) Master crashes when tearing down framework

2015-10-15 Thread Peter Kolloch (JIRA)
Peter Kolloch created MESOS-3744:


 Summary: Master crashes when tearing down framework
 Key: MESOS-3744
 URL: https://issues.apache.org/jira/browse/MESOS-3744
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Affects Versions: 0.23.0
Reporter: Peter Kolloch


Here is an excerpt from the startup log of the affected mesos master, because 
it contains the software versions in use:

Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 
7d15294f46b5062c59818f4d062044ac04349dc1
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 
10366ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db 
in 78606ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log 
positions 125 -> 126 with 0 holes and 0 unlearned
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client 
environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client 
environment:os.arch=4.0.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@725: Client 
environment:os.version=#2 SMP Fri Jul 10 01:01:50 UTC 2015
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@733: Client 
environment:user.name=(null)
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@741: Client 
environment:user.home=/root
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@753: Client 
environment:user.dir=/
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@zookeeper_init@786: 
Initiating client connection, host=127.0.0.1:2181 sessionTimeout=1 
watcher=0x7f0532095480 sessionId=0 sessionPasswd= context=0x7f0504001130 
flags=0
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.160876 18936 main.cpp:383] Starting Mesos master
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@716: Client 
environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,161:18936(0x7f0528cd3700):ZOO_INFO@check_events@1703: 
initiated connection to server [127.0.0.1:2181]
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 

[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework

2015-10-15 Thread Peter Kolloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kolloch updated MESOS-3744:
-
Attachment: master-fail.log

> Master crashes when tearing down framework
> --
>
> Key: MESOS-3744
> URL: https://issues.apache.org/jira/browse/MESOS-3744
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.23.0
>Reporter: Peter Kolloch
> Attachments: master-fail.log

[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework

2015-10-15 Thread Peter Kolloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kolloch updated MESOS-3744:
-
Description: 
The crash happened shortly after calling teardown. The teardown was initiated 
by using httpie with:

{code}
http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
{code}
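
For reference, the httpie invocation above sends a form-encoded POST; a hedged
sketch of the equivalent request, with a hypothetical helper:

```python
from urllib.parse import urlencode

def teardown_request(master_base_url, framework_id):
    """Build the same request that the httpie command sends: a form-encoded
    POST with a single frameworkId field. The helper name is hypothetical."""
    url = master_base_url.rstrip("/") + "/teardown"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = urlencode({"frameworkId": framework_id})
    return url, headers, body

url, headers, body = teardown_request("http://master:5050/master", "fw-123")
print(body)  # frameworkId=fw-123
```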

Below you will find the master-fail.log over the relevant time interval. Here 
are the last log lines before the mesos master died:

{noformat}
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: *** Check failure stack trace: ***
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860169fd  google::LogMessage::Fail()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd18601889d  google::LogMessage::SendToLog()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860165ec  google::LogMessage::Flush()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1860191be  google::LogMessageFatal::~LogMessageFatal()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186af3ea0  mesos::internal::master::allocator::DRFSorter::remove()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1869d6dec  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbdab9  process::ProcessManager::resume()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd186fbddaf  process::schedule()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd1852bc66c  (unknown)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 0x7fd184fff2ed  (unknown)
{noformat}

I am not sure if it matters, but in this case multiple framework instances 
registered with the same framework name.
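
To illustrate the failed check, here is a deliberately simplified model (not
the real DRFSorter code) in which removing resources for a slaveId that is no
longer tracked trips the same assertion, e.g. if a framework's resources are
removed twice:

```python
# Deliberately simplified model of the failed check (an illustration, not
# the real DRFSorter code): the sorter tracks per-agent resource totals,
# and removing resources for a slaveId it no longer tracks trips the
# equivalent of CHECK(total.resources.contains(slaveId)).

class Sorter:
    def __init__(self):
        self.total = {}  # slave_id -> resource count

    def add(self, slave_id, resources):
        self.total[slave_id] = self.total.get(slave_id, 0) + resources

    def remove(self, slave_id, resources):
        # Mirrors: Check failed: total.resources.contains(slaveId)
        assert slave_id in self.total, "total.resources.contains(slaveId)"
        self.total[slave_id] -= resources
        if self.total[slave_id] == 0:
            del self.total[slave_id]

s = Sorter()
s.add("slave-1", 4)
s.remove("slave-1", 4)  # fine: the entry is dropped once it reaches zero
# Removing the same framework's resources a second time -- e.g. if two
# framework entries ended up sharing state -- would now fail the assertion.
```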

2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client 
environment:os.arch=4.0.5
Oct 15 13:13:38 

[jira] [Updated] (MESOS-3744) Master crashes when tearing down framework

2015-10-15 Thread Peter Kolloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kolloch updated MESOS-3744:
-
Description: 
The crash happened shortly after calling teardown. The teardown was initiated 
by using httpie with:

http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"

Below you will find the master-fail.log over the relevant time interval. Here 
are the last log lines before the mesos master died:

Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: 
total.resources.contains(slaveId)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
*** Check failure stack trace: ***
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd1860169fd  google::LogMessage::Fail()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd18601889d  google::LogMessage::SendToLog()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd1860165ec  google::LogMessage::Flush()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd1860191be  google::LogMessageFatal::~LogMessageFatal()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd186af3ea0  mesos::internal::master::allocator::DRFSorter::remove()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd1869d6dec  
mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd186fbdab9  process::ProcessManager::resume()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd186fbddaf  process::schedule()
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd1852bc66c  (unknown)
Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @ 
0x7fd184fff2ed  (unknown)

Here is an excerpt from the startup log of the affected mesos master, since it 
contains the software versions in use:

Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 
7d15294f46b5062c59818f4d062044ac04349dc1
Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 
10366ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db 
in 78606ns
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log 
positions 125 -> 126 with 0 holes and 0 unlearned
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client 
environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client 
environment:os.name=Linux
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client 
environment:os.arch=4.0.5
Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@725: Client 

[jira] [Commented] (MESOS-3744) Master crashes when tearing down framework

2015-10-15 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14958920#comment-14958920
 ] 

Peter Kolloch commented on MESOS-3744:
--

Probably a duplicate of MESOS-3719

> Master crashes when tearing down framework
> --
>
> Key: MESOS-3744
> URL: https://issues.apache.org/jira/browse/MESOS-3744
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.23.0
>Reporter: Peter Kolloch
> Attachments: master-fail.log
>
>
> The crash happened shortly after calling teardown. The teardown was initiated 
> by using httpie with:
> http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
> Below you will find the master-fail.log over the relevant time interval. Here 
> are the last log lines before the mesos master died:
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: 
> total.resources.contains(slaveId)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> *** Check failure stack trace: ***
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd1860169fd  google::LogMessage::Fail()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd18601889d  google::LogMessage::SendToLog()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd1860165ec  google::LogMessage::Flush()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd1860191be  google::LogMessageFatal::~LogMessageFatal()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd186af3ea0  mesos::internal::master::allocator::DRFSorter::remove()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd1869d6dec  
> mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd186fbdab9  process::ProcessManager::resume()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd186fbddaf  process::schedule()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd1852bc66c  (unknown)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: 
> @ 0x7fd184fff2ed  (unknown)
> I am not sure if it matters but in this case multiple framework instances 
> registered with the same framework name.
> Here is an excerpt from the startup log of the affected mesos master, since 
> it contains the software versions in use:
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 
> 7d15294f46b5062c59818f4d062044ac04349dc1
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 
> 10366ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the 
> db in 78606ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log 
> positions 125 -> 126 with 0 holes and 0 unlearned
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 
> 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> Oct 15 13:13:38 

[jira] [Commented] (MESOS-2802) Prevent immediate reuse of network ports for different tasks

2015-07-22 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636930#comment-14636930
 ] 

Peter Kolloch commented on MESOS-2802:
--

Hi [~bmahler], for some reason I missed the update notification, sorry.

A grace period can be made secure. If you refresh your load balancer 
configuration every 30s (or by listening to update events) and your grace 
period is 2min, it is very unlikely that you connect to an old application by 
accident. If you want to be certain, you could adjust your load balancer such 
that it refuses requests if its configuration is older than 100s. For this 
solution, you only have to adjust one software component, the load balancer. 
Nothing else has to be adjusted.
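A minimal sketch of the staleness guard described above. All names and the 
concrete numbers (30s refresh, 100s max age, 120s grace period) are taken from 
this comment as assumptions, not from any real load balancer:

```python
import time

REFRESH_S = 30        # how often the backend list is refreshed
MAX_AGE_S = 100       # must stay well below the port-reuse grace period
GRACE_PERIOD_S = 120  # proposed port grace period

class LoadBalancerConfig:
    def __init__(self):
        self.backends = []
        self.refreshed_at = float("-inf")

    def refresh(self, backends, now=None):
        """Called every REFRESH_S seconds with the current backend list."""
        self.backends = list(backends)
        self.refreshed_at = time.time() if now is None else now

    def pick_backend(self, now=None):
        """Fail closed: a stale config may reference recycled ports."""
        now = time.time() if now is None else now
        if now - self.refreshed_at > MAX_AGE_S:
            raise RuntimeError("config stale, refusing request")
        if not self.backends:
            raise RuntimeError("no backends available")
        return self.backends[0]
```

Because MAX_AGE_S (100s) is shorter than the grace period (120s), any backend 
the balancer routes to is guaranteed to have held its port within the grace 
window.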

An alternative would be to set an HTTP header in the load balancer (e.g. 
X-App-Id: my-app) and modify _ALL_ applications accepting HTTP requests in your 
mesos cluster to reject requests that do not have the correct X-App-Id field. 
While theoretically possible, this is hard to achieve and, even worse, it is 
easy to forget to adjust one of your applications that was never meant to be 
available to the outside world.

Can you think of a more practical solution than mine for solving this problem 
at the application level?

 Prevent immediate reuse of network ports for different tasks
 

 Key: MESOS-2802
 URL: https://issues.apache.org/jira/browse/MESOS-2802
 Project: Mesos
  Issue Type: Improvement
Reporter: Peter Kolloch
  Labels: mesosphere

 Currently, if a task finishes or dies, another task might reuse the same port 
 immediately afterwards. If another task or a load balancer connects to this 
 port, still expecting the old task, there might be unpleasant surprises.
 For example, imagine that a visitor of your Mesos hosted web page sees your 
 internal reporting tool instead of your company market material when hitting 
 your page during an update.
 To make this less likely, Marathon contains code which tries to randomize 
 dynamically assigned ports. This is a workaround at best and we would like to 
 get rid of this code. I imagine that other frameworks might include similar 
 code.
 As a solution, I propose a grace period for ports. If a task dies, the 
 associated ports resources should not immediately go back into the resource 
 pool. Instead, Mesos should wait for a configurable time and only then offer 
 them for new tasks again.
 If you then specify a grace period of 2 minutes and update your service 
 discovery load balancer every 30 seconds, you can be reasonably sure that no 
 improper port reuse occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2802) Prevent immediate reuse of network ports for different tasks

2015-07-02 Thread Peter Kolloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kolloch updated MESOS-2802:
-
Labels: mesosphere  (was: )

 Prevent immediate reuse of network ports for different tasks
 

 Key: MESOS-2802
 URL: https://issues.apache.org/jira/browse/MESOS-2802
 Project: Mesos
  Issue Type: Improvement
Reporter: Peter Kolloch
  Labels: mesosphere

 Currently, if a task finishes or dies, another task might reuse the same port 
 immediately afterwards. If another task or a load balancer connects to this 
 port, still expecting the old task, there might be unpleasant surprises.
 For example, imagine that a visitor of your Mesos hosted web page sees your 
 internal reporting tool instead of your company market material when hitting 
 your page during an update.
 To make this less likely, Marathon contains code which tries to randomize 
 dynamically assigned ports. This is a workaround at best and we would like to 
 get rid of this code. I imagine that other frameworks might include similar 
 code.
 As a solution, I propose a grace period for ports. If a task dies, the 
 associated ports resources should not immediately go back into the resource 
 pool. Instead, Mesos should wait for a configurable time and only then offer 
 them for new tasks again.
 If you then specify a grace period of 2 minutes and update your service 
 discovery load balancer every 30 seconds, you can be reasonably sure that no 
 improper port reuse occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2859) Semantics of CommandInfo shell/value/arguments are very confusing

2015-07-02 Thread Peter Kolloch (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Kolloch updated MESOS-2859:
-
Labels: mesosphere  (was: )

 Semantics of CommandInfo shell/value/arguments are very confusing
 -

 Key: MESOS-2859
 URL: https://issues.apache.org/jira/browse/MESOS-2859
 Project: Mesos
  Issue Type: Documentation
Reporter: Peter Kolloch
  Labels: mesosphere

 CommandInfo includes the following fields:
   optional bool shell = 6 [default = true];
   optional string value = 3;
   repeated string arguments = 7;
 There is some documentation for them which explains their behavior for the 
 command executor but not for the docker executor:
 https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L280
 The two executors behave quite differently when you use shell=false and arguments.
 For the command executor, executing echo $PORT withOUT variable 
 substitution could be achieved with shell=false value=/usr/bin/echo 
 arguments=[/usr/bin/echo, $PORT]. See 
 https://github.com/apache/mesos/blob/0.22.1/src/launcher/executor.cpp#L245
 For the docker executor, using the same arguments with the ubuntu image (no 
 default entrypoint) would result in executing /usr/bin/echo /usr/bin/echo 
 $PORT which is rather confusing. See 
 https://github.com/apache/mesos/blob/0.22.1/src/docker/docker.cpp#L451-L457
 For the command executor, I would propose to emphasize that for all sane use 
 cases `arguments(0)` should be equal to `value` if you use shell = false.
 It would also help to include some examples, e.g.:
 * Executing python -m SimpleHTTPServer $PORT with variable substitution = 
 shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
 * Executing echo $PORT withOUT variable substitution = shell=false 
 value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]
 In the case of docker you actually need to distinguish between containers 
 with a default entrypoint and the ones without.
 With the ubuntu image (without a default entrypoint) examples could be:
 * Executing python -m SimpleHTTPServer $PORT with variable substitution = 
 shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
 * Executing echo $PORT withOUT variable substitution = shell=false 
 value=/usr/bin/echo arguments=[$PORT] OR arguments=[/usr/bin/echo, 
 $PORT] 
 Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2308) Task reconciliation API should support data partitioning

2015-07-02 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611935#comment-14611935
 ] 

Peter Kolloch commented on MESOS-2308:
--

This might or might not be related:

Currently, if you call reconcileTasks with an empty list, I do not know of any 
way to know when the reconciliation has finished. 

This would be a really nice feature since a framework would not have to persist 
task state anymore because it can recover that state from Mesos on startup. 
Without this feature, a framework like Marathon might try to scale up an 
application unnecessarily because it has not yet received information about all 
tasks.

Would this be solved by this issue, or should I create a separate ticket for 
that? I would also be happy to know about any workarounds that might solve 
this problem.
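One heuristic workaround, sketched below under assumed names (this is not a 
Mesos API): after triggering implicit reconciliation (reconcileTasks with an 
empty list), treat the master's view as fully replayed once no status update 
has arrived for a quiet period. This trades startup latency for safety and can 
still be wrong under heavy update load:

```python
def reconciliation_settled(update_times, quiet_s, now):
    """Return True once no status update has arrived for quiet_s seconds.

    update_times: timestamps of status updates received since the
    implicit reconciliation was triggered. An empty list means the
    quiet period has trivially elapsed (no tasks to learn about).
    """
    last = max(update_times) if update_times else float("-inf")
    return now - last >= quiet_s
```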

 Task reconciliation API should support data partitioning
 

 Key: MESOS-2308
 URL: https://issues.apache.org/jira/browse/MESOS-2308
 Project: Mesos
  Issue Type: Story
Reporter: Bill Farner

 The {{reconcileTasks}} API call requires the caller to specify a collection 
 of {{TaskStatus}}es, with the option to provide an empty collection to 
 retrieve the master's entire state.  Retrieving the entire state is the only 
  mechanism for the scheduler to learn that there are tasks running that it does 
  not know about; however, this call does not allow incremental querying. The 
 result would be that the master may need to send many thousands of status 
 updates, and the scheduler would have to handle them.  It would be ideal if 
 the scheduler had a means to partition these requests so it can control the 
 pace of these status updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2374) Support relative host paths for container volumes

2015-06-22 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596407#comment-14596407
 ] 

Peter Kolloch commented on MESOS-2374:
--

This is a great idea which was suggested multiple times by users of Marathon. 
See

https://github.com/mesosphere/marathon/issues/1694

for the latest instance.

 Support relative host paths for container volumes
 -

 Key: MESOS-2374
 URL: https://issues.apache.org/jira/browse/MESOS-2374
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker
Affects Versions: 0.21.1
Reporter: Mike Babineau

 There is no convenient way to mount sandbox subdirectories (such as unpacked 
 archives from fetched URIs) as container volumes.
 While it is possible to access sandbox subdirectories via $MESOS_SANDBOX, 
 this presumes the container is expecting $MESOS_SANDBOX to be passed in. 
 Furthermore, it also expects the container already knows the resulting 
 subdirectory paths. Unfortunately, since the archives are extracted by the 
 fetcher, operators can not control these paths. Path changes to the extracted 
 archive must be accompanied by a container image change.
 One potential solution:
 Add support for relative paths to the containerizer. If the containerizer is 
 given a relative host path, it simply prepends the sandbox path before 
 passing it to Docker (or similar).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2859) Semantics of CommandInfo shell/value/arguments are very confusing

2015-06-11 Thread Peter Kolloch (JIRA)
Peter Kolloch created MESOS-2859:


 Summary: Semantics of CommandInfo shell/value/arguments are very 
confusing
 Key: MESOS-2859
 URL: https://issues.apache.org/jira/browse/MESOS-2859
 Project: Mesos
  Issue Type: Documentation
Reporter: Peter Kolloch


CommandInfo includes the following fields:

  optional bool shell = 6 [default = true];
  optional string value = 3;
  repeated string arguments = 7;

There is some documentation for them which explains their behavior for the 
command executor but not for the docker executor:

https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L280

The two executors behave quite differently when you use shell=false and arguments.

For the command executor, executing echo $PORT withOUT variable substitution 
could be achieved with shell=false value=/usr/bin/echo 
arguments=[/usr/bin/echo, $PORT]. See 
https://github.com/apache/mesos/blob/0.22.1/src/launcher/executor.cpp#L245

For the docker executor, using the same arguments with the ubuntu image (no 
default entrypoint) would result in executing /usr/bin/echo /usr/bin/echo 
$PORT which is rather confusing. See 
https://github.com/apache/mesos/blob/0.22.1/src/docker/docker.cpp#L451-L457

For the command executor, I would propose to emphasize that for all sane use 
cases `arguments(0)` should be equal to `value` if you use shell = false.

It would also help to include some examples, e.g.:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = 
shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false 
value=/usr/bin/echo arguments=[/usr/bin/echo, $PORT]

In the case of docker you actually need to distinguish between containers with 
a default entrypoint and the ones without.

With the ubuntu image (without a default entrypoint) examples could be:

* Executing python -m SimpleHTTPServer $PORT with variable substitution = 
shell=true value=python -m SimpleHTTPServer $PORT, arguments are ignored
* Executing echo $PORT withOUT variable substitution = shell=false 
value=/usr/bin/echo arguments=[$PORT] OR arguments=[/usr/bin/echo, 
$PORT] 

Thanks!
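A toy model of the difference described above, based on the linked 0.22.1 
sources but heavily simplified (these functions are illustrative, not the real 
Mesos code):

```python
def command_executor_argv(value, arguments, shell):
    # Command executor (src/launcher/executor.cpp): with shell=false,
    # `value` is the executable and `arguments` is the FULL argv, so
    # arguments[0] conventionally repeats `value`.
    if shell:
        return ["sh", "-c", value]
    return list(arguments)

def docker_run_argv(image, value, arguments, shell):
    # Docker executor (src/docker/docker.cpp): `value` and ALL
    # `arguments` are appended after the image, so with no entrypoint
    # the argv[0] duplicate from the command-executor convention shows
    # up as a literal extra argument.
    if shell:
        return ["docker", "run", image, "/bin/sh", "-c", value]
    return ["docker", "run", image, value] + list(arguments)
```

With shell=false, value=/usr/bin/echo, arguments=[/usr/bin/echo, $PORT], the 
command executor runs `echo $PORT` (no substitution), while the docker model 
produces the confusing `/usr/bin/echo /usr/bin/echo $PORT`.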



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2802) Prevent immediate reuse of network ports for different tasks

2015-06-03 Thread Peter Kolloch (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570477#comment-14570477
 ] 

Peter Kolloch commented on MESOS-2802:
--

Hi Niklas, Hi Adam,

do you have a suggestion for how to ensure security without a grace period for 
ports? I'd be happy to hear it. 

Otherwise, I see your point about resource utilization problems if you have 
many short-lived tasks but that's typically not an issue. Let's say your 
typical task has one port and you reserve 50,000 ports for Mesos on each slave. 
That might not be typical, but it is possible. With a grace period of 2min you are 
talking about sustained ~416 task launches per second on one slave until you 
are out of port resources temporarily. If that is not sufficient, you could 
maybe use multiple IPs on that host.
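The back-of-the-envelope numbers above check out:

```python
# 50,000 ports drained at a steady launch rate, each port then
# unavailable for a 120s grace period; the pool only starves if the
# sustained launch rate exceeds ports / grace_period.
ports_per_slave = 50_000
grace_period_s = 120
max_sustained_launches_per_s = ports_per_slave / grace_period_s  # ~416.7
```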

MESOS-2018 would allow frameworks to solve this themselves by implementing the 
port grace periods. That's good.

Unfortunately, this would not solve the port starvation problem but move the 
implementation burden to every single framework. And, worst of all, if they 
forget to implement it, they are insecure by default. 

 Prevent immediate reuse of network ports for different tasks
 

 Key: MESOS-2802
 URL: https://issues.apache.org/jira/browse/MESOS-2802
 Project: Mesos
  Issue Type: Improvement
Reporter: Peter Kolloch

 Currently, if a task finishes or dies, another task might reuse the same port 
 immediately afterwards. If another task or a load balancer connects to this 
 port, still expecting the old task, there might be unpleasant surprises.
 For example, imagine that a visitor of your Mesos hosted web page sees your 
 internal reporting tool instead of your company market material when hitting 
 your page during an update.
 To make this less likely, Marathon contains code which tries to randomize 
 dynamically assigned ports. This is a workaround at best and we would like to 
 get rid of this code. I imagine that other frameworks might include similar 
 code.
 As a solution, I propose a grace period for ports. If a task dies, the 
 associated ports resources should not immediately go back into the resource 
 pool. Instead, Mesos should wait for a configurable time and only then offer 
 them for new tasks again.
 If you then specify a grace period of 2 minutes and update your service 
 discovery load balancer every 30 seconds, you can be reasonably sure that no 
 improper port reuse occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2802) Prevent immediate reuse of network ports for different tasks

2015-06-02 Thread Peter Kolloch (JIRA)
Peter Kolloch created MESOS-2802:


 Summary: Prevent immediate reuse of network ports for different 
tasks
 Key: MESOS-2802
 URL: https://issues.apache.org/jira/browse/MESOS-2802
 Project: Mesos
  Issue Type: Improvement
Reporter: Peter Kolloch


Currently, if a task finishes or dies, another task might reuse the same port 
immediately afterwards. If another task or a load balancer connects to this 
port, still expecting the old task, there might be unpleasant surprises.

For example, imagine that a visitor of your Mesos hosted web page sees your 
internal reporting tool instead of your company market material when hitting 
your page during an update.

To make this less likely, Marathon contains code which tries to randomize 
dynamically assigned ports. This is a workaround at best and we would like to 
get rid of this code. I imagine that other frameworks might include similar 
code.

As a solution, I propose a grace period for ports. If a task dies, the 
associated ports resources should not immediately go back into the resource 
pool. Instead, Mesos should wait for a configurable time and only then offer 
them for new tasks again.

If you then specify a grace period of 2 minutes and update your service 
discovery load balancer every 30 seconds, you can be reasonably sure that no 
improper port reuse occurs.
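The proposed grace period could be sketched as follows (a minimal illustration 
with assumed names, not actual Mesos allocator code): released ports are parked 
in a cooldown heap and only become offerable again once the grace period has 
elapsed.

```python
import heapq

class PortPool:
    def __init__(self, ports, grace_s=120):
        self.free = set(ports)
        self.grace_s = grace_s
        self.cooling = []  # min-heap of (ready_at, port)

    def _reclaim(self, now):
        # Move ports whose grace period has expired back into the pool.
        while self.cooling and self.cooling[0][0] <= now:
            _, port = heapq.heappop(self.cooling)
            self.free.add(port)

    def acquire(self, now):
        self._reclaim(now)
        if not self.free:
            return None  # temporarily starved: all ports cooling down
        return self.free.pop()

    def release(self, port, now):
        # Do not return the port immediately; park it for grace_s seconds.
        heapq.heappush(self.cooling, (now + self.grace_s, port))
```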



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)