[jira] [Commented] (MESOS-4492) Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation
[ https://issues.apache.org/jira/browse/MESOS-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205908#comment-15205908 ] Fan Du commented on MESOS-4492: --- [~greggomann] Any further comments about the review? :) > Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation > -- > > Key: MESOS-4492 > URL: https://issues.apache.org/jira/browse/MESOS-4492 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor > > This ticket aims to enable users or operators to inspect offer operation statistics > such as RESERVE, UNRESERVE, CREATE, and DESTROY; the current implementation only > supports LAUNCH. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
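As a rough illustration of what such per-operation counters could look like, here is a hedged sketch using libprocess's metrics primitives, in the same spirit as the existing LAUNCH-related counters; the metric names, the struct, and its placement are illustrative assumptions, not the contents of the review under discussion.

{code}
#include <string>

#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

// Hypothetical per-operation counters; the metric keys are illustrative and
// are not taken from the review under discussion.
struct OfferOperationMetrics
{
  OfferOperationMetrics()
    : reserve("master/messages_reserve"),
      unreserve("master/messages_unreserve"),
      create_volumes("master/messages_create_volumes"),
      destroy_volumes("master/messages_destroy_volumes")
  {
    process::metrics::add(reserve);
    process::metrics::add(unreserve);
    process::metrics::add(create_volumes);
    process::metrics::add(destroy_volumes);
  }

  ~OfferOperationMetrics()
  {
    process::metrics::remove(reserve);
    process::metrics::remove(unreserve);
    process::metrics::remove(create_volumes);
    process::metrics::remove(destroy_volumes);
  }

  process::metrics::Counter reserve;
  process::metrics::Counter unreserve;
  process::metrics::Counter create_volumes;
  process::metrics::Counter destroy_volumes;
};

// The master's offer-operation handling would then bump the matching
// counter, e.g. `++metrics.reserve;` when applying a RESERVE operation.
{code}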
[jira] [Commented] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205905#comment-15205905 ] Fan Du commented on MESOS-4981: --- [~anandmazumdar] Thanks, I have added [~vinodkone] as reviewer. > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. We should correctly be incrementing these counters for PID based > frameworks as was the case previously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Created] (MESOS-4948) MasterMaintenanceTest.InverseOffers is flaky
Sorry, I didn't realize the summary of this ticket had changed since this email was generated, and the plan is now to migrate the test to the new scheduler library. On Mon, Mar 21, 2016 at 11:21 PM, Benjamin Mahler wrote: > +joseph > > Have you seen this? > > On Tue, Mar 15, 2016 at 7:13 AM, Greg Mann (JIRA) wrote: > >> Greg Mann created MESOS-4948: >> >> >> Summary: MasterMaintenanceTest.InverseOffers is flaky >> Key: MESOS-4948 >> URL: https://issues.apache.org/jira/browse/MESOS-4948 >> Project: Mesos >> Issue Type: Bug >> Components: tests >> Environment: Ubuntu 14.04, using gcc, with libevent and SSL >> enabled (on ASF CI) >> Reporter: Greg Mann >> >> >> This seems to be an issue distinct from the other tickets that have been >> filed on this test. Failed log is included below; while the core dump >> appears just after the start of >> {{MasterMaintenanceTest.InverseOffersFilters}}, it looks to me like the >> segfault is triggered by one of the callbacks called at the end of >> {{MasterMaintenanceTest.InverseOffers}}. >> >> {code} >> [ RUN ] MasterMaintenanceTest.InverseOffers >> I0315 04:16:50.786032 2681 leveldb.cpp:174] Opened db in 125.361171ms >> I0315 04:16:50.836374 2681 leveldb.cpp:181] Compacted db in 50.254411ms >> I0315 04:16:50.836470 2681 leveldb.cpp:196] Created db iterator in >> 25917ns >> I0315 04:16:50.836488 2681 leveldb.cpp:202] Seeked to beginning of db in >> 3291ns >> I0315 04:16:50.836498 2681 leveldb.cpp:271] Iterated through 0 keys in >> the db in 253ns >> I0315 04:16:50.836549 2681 replica.cpp:779] Replica recovered with log >> positions 0 -> 0 with 1 holes and 0 unlearned >> I0315 04:16:50.837474 2702 recover.cpp:447] Starting replica recovery >> I0315 04:16:50.837565 2681 cluster.cpp:183] Creating default 'local' >> authorizer >> I0315 04:16:50.838191 2702 recover.cpp:473] Replica is in EMPTY status >> I0315 04:16:50.839532 2704 replica.cpp:673] Replica in EMPTY status >> received a broadcasted recover request from (4784)@172.17.0.4:39845 >> I0315 04:16:50.839754 2705 recover.cpp:193] Received a recover response >> from a replica in EMPTY status >> I0315 04:16:50.841893 2704 recover.cpp:564] Updating replica status to >> STARTING >> I0315 04:16:50.842566 2703 master.cpp:376] Master >> c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on >> 172.17.0.4:39845 >> I0315 04:16:50.842644 2703 master.cpp:378] Flags at startup: --acls="" >> --allocation_interval="1secs" --allocator="HierarchicalDRF" >> --authenticate="false" --authenticate_http="true" >> --authenticate_slaves="true" --authenticators="crammd5" >> --authorizers="local" --credentials="/tmp/DE2Uaw/credentials" >> --framework_sorter="drf" --help="false" --hostname_lookup="true" >> --http_authenticators="basic" --initialize_driver_logging="true" >> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" >> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" >> --max_slave_ping_timeouts="5" --quiet="false" >> --recovery_slave_removal_limit="100%" --registry="replicated_log" >> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" >> --registry_strict="true" --root_submissions="true" >> --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" >> --user_sorter="drf" --version="false" >> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" >> --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs" >> I0315 04:16:50.843168 2703 master.cpp:425] Master allowing >> unauthenticated frameworks to register >> I0315 04:16:50.843227 2703 
master.cpp:428] Master only allowing >> authenticated slaves to register >> I0315 04:16:50.843302 2703 credentials.hpp:35] Loading credentials for >> authentication from '/tmp/DE2Uaw/credentials' >> I0315 04:16:50.843737 2703 master.cpp:468] Using default 'crammd5' >> authenticator >> I0315 04:16:50.843969 2703 master.cpp:537] Using default 'basic' HTTP >> authenticator >> I0315 04:16:50.844177 2703 master.cpp:571] Authorization enabled >> I0315 04:16:50.844360 2708 hierarchical.cpp:144] Initialized >> hierarchical allocator process >> I0315 04:16:50.844430 2708 whitelist_watcher.cpp:77] No whitelist given >> I0315 04:16:50.848227 2703 master.cpp:1806] The newly elected leader is >> master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1 >> I0315 04:16:50.848269 2703 master.cpp:1819] Elected as the leading >> master! >> I0315 04:16:50.848292 2703 master.cpp:1508] Recovering from registrar >> I0315 04:16:50.848563 2703 registrar.cpp:307] Recovering registrar >> I0315 04:16:50.876277 2711 leveldb.cpp:304] Persisting metadata (8 >> bytes) to leveldb took 34.178445ms >> I0315 04:16:50.876365 2711 replica.cpp:320] Persisted replica status to >> STARTING >> I0315 04:16:50.876776 2711 recover.cpp:473] Replica is in STARTING status >> I0315 04:16:50.87
Re: [jira] [Created] (MESOS-4948) MasterMaintenanceTest.InverseOffers is flaky
+joseph Have you seen this? On Tue, Mar 15, 2016 at 7:13 AM, Greg Mann (JIRA) wrote: > Greg Mann created MESOS-4948: > > > Summary: MasterMaintenanceTest.InverseOffers is flaky > Key: MESOS-4948 > URL: https://issues.apache.org/jira/browse/MESOS-4948 > Project: Mesos > Issue Type: Bug > Components: tests > Environment: Ubuntu 14.04, using gcc, with libevent and SSL > enabled (on ASF CI) > Reporter: Greg Mann > > > This seems to be an issue distinct from the other tickets that have been > filed on this test. Failed log is included below; while the core dump > appears just after the start of > {{MasterMaintenanceTest.InverseOffersFilters}}, it looks to me like the > segfault is triggered by one of the callbacks called at the end of > {{MasterMaintenanceTest.InverseOffers}}. > > {code} > [ RUN ] MasterMaintenanceTest.InverseOffers > I0315 04:16:50.786032 2681 leveldb.cpp:174] Opened db in 125.361171ms > I0315 04:16:50.836374 2681 leveldb.cpp:181] Compacted db in 50.254411ms > I0315 04:16:50.836470 2681 leveldb.cpp:196] Created db iterator in 25917ns > I0315 04:16:50.836488 2681 leveldb.cpp:202] Seeked to beginning of db in > 3291ns > I0315 04:16:50.836498 2681 leveldb.cpp:271] Iterated through 0 keys in > the db in 253ns > I0315 04:16:50.836549 2681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0315 04:16:50.837474 2702 recover.cpp:447] Starting replica recovery > I0315 04:16:50.837565 2681 cluster.cpp:183] Creating default 'local' > authorizer > I0315 04:16:50.838191 2702 recover.cpp:473] Replica is in EMPTY status > I0315 04:16:50.839532 2704 replica.cpp:673] Replica in EMPTY status > received a broadcasted recover request from (4784)@172.17.0.4:39845 > I0315 04:16:50.839754 2705 recover.cpp:193] Received a recover response > from a replica in EMPTY status > I0315 04:16:50.841893 2704 recover.cpp:564] Updating replica status to > STARTING > I0315 04:16:50.842566 2703 master.cpp:376] Master > c326bc68-2581-48d4-9dc4-0d6f270bdda1 (01fcd642f65f) started on > 172.17.0.4:39845 > I0315 04:16:50.842644 2703 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_http="true" > --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/DE2Uaw/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" > --registry_strict="true" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" > --work_dir="/tmp/DE2Uaw/master" --zk_session_timeout="10secs" > I0315 04:16:50.843168 2703 master.cpp:425] Master allowing > unauthenticated frameworks to register > I0315 04:16:50.843227 2703 master.cpp:428] Master only allowing > authenticated slaves to register > I0315 04:16:50.843302 2703 credentials.hpp:35] Loading credentials for > authentication from '/tmp/DE2Uaw/credentials' > I0315 04:16:50.843737 2703 master.cpp:468] Using default 'crammd5' > authenticator > I0315 04:16:50.843969 2703 
master.cpp:537] Using default 'basic' HTTP > authenticator > I0315 04:16:50.844177 2703 master.cpp:571] Authorization enabled > I0315 04:16:50.844360 2708 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0315 04:16:50.844430 2708 whitelist_watcher.cpp:77] No whitelist given > I0315 04:16:50.848227 2703 master.cpp:1806] The newly elected leader is > master@172.17.0.4:39845 with id c326bc68-2581-48d4-9dc4-0d6f270bdda1 > I0315 04:16:50.848269 2703 master.cpp:1819] Elected as the leading master! > I0315 04:16:50.848292 2703 master.cpp:1508] Recovering from registrar > I0315 04:16:50.848563 2703 registrar.cpp:307] Recovering registrar > I0315 04:16:50.876277 2711 leveldb.cpp:304] Persisting metadata (8 bytes) > to leveldb took 34.178445ms > I0315 04:16:50.876365 2711 replica.cpp:320] Persisted replica status to > STARTING > I0315 04:16:50.876776 2711 recover.cpp:473] Replica is in STARTING status > I0315 04:16:50.878779 2706 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (4786)@172.17.0.4:39845 > I0315 04:16:50.879240 2706 recover.cpp:193] Received a recover response > from a replica in STARTING status > I0315 04:16:50.880100 2701 recover.cpp:564] Updating replica status to > VOTING > I0315
Re: [jira] [Created] (MESOS-4635) CoordinatorTest.AppendDiscarded is flaky
+jie Jie, have you seen this? On Wed, Feb 10, 2016 at 7:46 AM, Greg Mann (JIRA) wrote: > Greg Mann created MESOS-4635: > > > Summary: CoordinatorTest.AppendDiscarded is flaky > Key: MESOS-4635 > URL: https://issues.apache.org/jira/browse/MESOS-4635 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.27.0 > Environment: Ubuntu 14.04 with clang > Reporter: Greg Mann > > > Just saw this failure on the ASF Jenkins CI: > > {code} > [ RUN ] CoordinatorTest.AppendDiscarded > I0210 09:34:39.188288 31550 leveldb.cpp:174] Opened db in 2.043145ms > I0210 09:34:39.189136 31550 leveldb.cpp:181] Compacted db in 811003ns > I0210 09:34:39.189182 31550 leveldb.cpp:196] Created db iterator in 27506ns > I0210 09:34:39.189208 31550 leveldb.cpp:202] Seeked to beginning of db in > 10415ns > I0210 09:34:39.189224 31550 leveldb.cpp:271] Iterated through 0 keys in > the db in 8230ns > I0210 09:34:39.189260 31550 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0210 09:34:39.190004 31577 leveldb.cpp:304] Persisting metadata (8 bytes) > to leveldb took 471666ns > I0210 09:34:39.190028 31577 replica.cpp:320] Persisted replica status to > VOTING > I0210 09:34:39.192812 31550 leveldb.cpp:174] Opened db in 2.215995ms > I0210 09:34:39.193488 31550 leveldb.cpp:181] Compacted db in 660244ns > I0210 09:34:39.193528 31550 leveldb.cpp:196] Created db iterator in 23068ns > I0210 09:34:39.193554 31550 leveldb.cpp:202] Seeked to beginning of db in > 10451ns > I0210 09:34:39.193570 31550 leveldb.cpp:271] Iterated through 0 keys in > the db in 7996ns > I0210 09:34:39.193603 31550 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0210 09:34:39.194510 31569 leveldb.cpp:304] Persisting metadata (8 bytes) > to leveldb took 393072ns > I0210 09:34:39.194537 31569 replica.cpp:320] Persisted replica status to > VOTING > I0210 09:34:39.196895 31550 leveldb.cpp:174] Opened db in 1.804552ms > I0210 09:34:39.198554 31550 leveldb.cpp:181] Compacted db in 1.642208ms > I0210 09:34:39.198593 31550 leveldb.cpp:196] Created db iterator in 19381ns > I0210 09:34:39.198633 31550 leveldb.cpp:202] Seeked to beginning of db in > 35677ns > I0210 09:34:39.198673 31550 leveldb.cpp:271] Iterated through 1 keys in > the db in 26460ns > I0210 09:34:39.198703 31550 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0210 09:34:39.200898 31550 leveldb.cpp:174] Opened db in 2.09532ms > I0210 09:34:39.202641 31550 leveldb.cpp:181] Compacted db in 1.7251ms > I0210 09:34:39.202697 31550 leveldb.cpp:196] Created db iterator in 39337ns > I0210 09:34:39.202836 31550 leveldb.cpp:202] Seeked to beginning of db in > 34194ns > I0210 09:34:39.202965 31550 leveldb.cpp:271] Iterated through 1 keys in > the db in 39383ns > I0210 09:34:39.203088 31550 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0210 09:34:39.204413 31573 replica.cpp:493] Replica received implicit > promise request from (2636)@172.17.0.2:58132 with proposal 1 > I0210 09:34:39.204494 31572 replica.cpp:493] Replica received implicit > promise request from (2637)@172.17.0.2:58132 with proposal 1 > I0210 09:34:39.204854 31573 leveldb.cpp:304] Persisting metadata (8 bytes) > to leveldb took 417201ns > I0210 09:34:39.204880 31573 replica.cpp:342] Persisted promised to 1 > I0210 09:34:39.205060 31572 leveldb.cpp:304] Persisting metadata (8 bytes) > to leveldb took 471800ns > I0210 09:34:39.205087 31572 replica.cpp:342] Persisted 
promised to 1 > I0210 09:34:39.205577 31582 coordinator.cpp:238] Coordinator attempting to > fill missing positions > I0210 09:34:39.206393 31579 replica.cpp:388] Replica received explicit > promise request from (2638)@172.17.0.2:58132 for position 0 with proposal > 2 > I0210 09:34:39.206569 31578 replica.cpp:388] Replica received explicit > promise request from (2639)@172.17.0.2:58132 for position 0 with proposal > 2 > I0210 09:34:39.206840 31579 leveldb.cpp:341] Persisting action (8 bytes) > to leveldb took 335263ns > I0210 09:34:39.206881 31579 replica.cpp:712] Persisted action at 0 > I0210 09:34:39.207236 31578 leveldb.cpp:341] Persisting action (8 bytes) > to leveldb took 442481ns > I0210 09:34:39.207258 31578 replica.cpp:712] Persisted action at 0 > I0210 09:34:39.208065 31577 replica.cpp:537] Replica received write > request for position 0 from (2640)@172.17.0.2:58132 > I0210 09:34:39.208160 31568 replica.cpp:537] Replica received write > request for position 0 from (2641)@172.17.0.2:58132 > I0210 09:34:39.208206 31568 leveldb.cpp:436] Reading position from leveldb > took 67699ns > I0210 09:34:39.208117 31577 leveldb.cpp:436] Reading position from leveldb > took 225587ns > I0210 09:34:39.208647 31568 leveldb.cpp:341] Persisting action (14 bytes) > to leveldb took 374594ns > I0210 09:
[jira] [Commented] (MESOS-4989) Design document for docker volume driver
[ https://issues.apache.org/jira/browse/MESOS-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205867#comment-15205867 ] Guangya Liu commented on MESOS-4989: Plan to publish it this week after it passes internal review. > Design document for docker volume driver > > > Key: MESOS-4989 > URL: https://issues.apache.org/jira/browse/MESOS-4989 > Project: Mesos > Issue Type: Task >Reporter: Guangya Liu >Assignee: Guangya Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Updated] (MESOS-4984) MasterTest.SlavesEndpointTwoSlaves is flaky
I'm finding these are getting lost in the noise; when filing test issue tickets, it would be great to 'git blame' the code and figure out who to notify to fix the test. Could you do that here? On Fri, Mar 18, 2016 at 3:55 PM, Neil Conway (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/MESOS-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > Neil Conway updated MESOS-4984: > --- > Labels: flaky-test mesosphere tech-debt (was: flaky-test mesosphere) > > > MasterTest.SlavesEndpointTwoSlaves is flaky > > --- > > > > Key: MESOS-4984 > > URL: https://issues.apache.org/jira/browse/MESOS-4984 > > Project: Mesos > > Issue Type: Bug > > Components: tests > >Reporter: Neil Conway > > Labels: flaky-test, mesosphere, tech-debt > > Attachments: slaves_endpoint_flaky_4984_verbose_log.txt > > > > > > Observed on Arch Linux with GCC 6, running in a virtualbox VM: > > [ RUN ] MasterTest.SlavesEndpointTwoSlaves > > /mesos-2/src/tests/master_tests.cpp:1710: Failure > > Value of: array.get().values.size() > > Actual: 1 > > Expected: 2u > > Which is: 2 > > [ FAILED ] MasterTest.SlavesEndpointTwoSlaves (86 ms) > > Seems to fail non-deterministically, perhaps more often when there is > concurrent CPU load on the machine. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters
[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205860#comment-15205860 ] Deshi Xiao commented on MESOS-3548: --- FYI: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/federation.md > Investigate federations of Mesos masters > > > Key: MESOS-3548 > URL: https://issues.apache.org/jira/browse/MESOS-3548 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: federation, mesosphere, multi-dc > > In a large Mesos installation, the operator might want to ensure that even if > the Mesos masters are inaccessible or failed, new tasks can still be > scheduled (across multiple different frameworks). HA masters are only a > partial solution here: the masters might still be inaccessible due to a > correlated failure (e.g., Zookeeper misconfiguration/human error). > To support this, we could support the notion of "hierarchies" or > "federations" of Mesos masters. In a Mesos installation with 10k machines, > the operator might configure 10 Mesos masters (each of which might be HA) to > manage 1k machines each. Then an additional "meta-Master" would manage the > allocation of cluster resources to the 10 masters. Hence, the failure of any > individual master would impact 1k machines at most. The meta-master might not > have a lot of work to do: e.g., it might be limited to occasionally > reallocating cluster resources among the 10 masters, or ensuring that newly > added cluster resources are allocated among the masters as appropriate. > Hence, the failure of the meta-master would not prevent any of the individual > masters from scheduling new tasks. A single framework instance probably > wouldn't be able to use more resources than have been assigned to a single > Master, but that seems like a reasonable restriction. > This feature might also be a good fit for a multi-datacenter deployment of > Mesos: each Mesos master instance would manage a single DC. Naturally, > reducing the traffic between frameworks and the meta-master would be > important for performance reasons in a configuration like this. > Operationally, this might be simpler if Mesos processes were self-hosting > ([MESOS-3547]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3547) Investigate self-hosting Mesos processes
[ https://issues.apache.org/jira/browse/MESOS-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205799#comment-15205799 ] Deshi Xiao commented on MESOS-3547: --- What, use Mesos to run the Mesos master itself? Could you please share a more detailed design doc here? > Investigate self-hosting Mesos processes > > > Key: MESOS-3547 > URL: https://issues.apache.org/jira/browse/MESOS-3547 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Neil Conway > Labels: mesosphere > > Right now, Mesos master and slave nodes are managed differently: they use > different binaries and startup scripts and require different ops procedures. > Some of this asymmetry is essential, but perhaps not all of it is. If Mesos > supported a concept of "persistent tasks" (see [MESOS-3545]), it might be > possible to implement the Mesos master as such a task -- this might help > unify the ops procedures between a master and a slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4899) Mesos slave crash after killing docker container
[ https://issues.apache.org/jira/browse/MESOS-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205798#comment-15205798 ] Jie Yu commented on MESOS-4899: --- Have yo configured your slave's work_dir? If not configured, --work_dir=/tmp/mesos. Is it possible that some other entity in the system was deleting your files under slave's work_dir (e.g., systemd-tmpfiles-clean.service)? > Mesos slave crash after killing docker container > > > Key: MESOS-4899 > URL: https://issues.apache.org/jira/browse/MESOS-4899 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.1 >Reporter: Giulio D'Ippolito >Priority: Blocker > > I have experienced an issue where a Mesos slave crashed because a docker task > could not be killed properly. > I'm using marathon to launch the task. > The setup is the following: > OS verison: Centos 7.2 > Docker version: 1.8.2 > Mesos-slave: 0.27.1 > Mesos-master: 0.27.1 > Marathon 0.15.3 > The mesos slave crashed (which is not great). This is the log from the mesos > slave (both mesos slave had the same issue): > {code} > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.441756 > 30870 slave.cpp:1890] Asked to kill task > giulio.httpd.test.tag.a47c55be-db01-11e5-a92a-0242eb705eb2 of framework > ef1354df-7ecc-41ac-82d8 > -d7536e319ea2- > Mar 07 15:25:16 marathon_mesos-2 docker[13019]: > time="2016-03-07T15:25:16.553925302Z" level=info msg="POST > /v1.20/containers/mesos-d0f20f55-bc6e-43e8-babb-250b0176f5f6-S145.104860a3-4630-4cbb-8e68-87ecca13fcad/s > top?t=0" > Mar 07 15:25:16 marathon_mesos-2 docker[13019]: > time="2016-03-07T15:25:16.559932661Z" level=info msg="Container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a failed to > exit within 0 seconds of > SIGTERM - using the force" > Mar 07 15:25:16 marathon_mesos-2 systemd[1]: Stopped docker container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a. > Mar 07 15:25:16 marathon_mesos-2 systemd[1]: Stopping docker container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a. > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > failed to find device 106 'vethd55899a' with udev > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > new Veth device (carrier: OFF, driver: 'veth', ifindex: 106) > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > link disconnected > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing address > record for fe80::b06a:2fff:fecc:87f1 on vetha772033. > Mar 07 15:25:16 marathon_mesos-2 kernel: device vetha772033 left promiscuous > mode > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing workstation > service for vethd55899a. > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing workstation > service for vetha772033. 
> Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > failed to disable userspace IPv6LL address handling > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (docker0): > bridge port vetha772033 was detached > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > released from master docker0 > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > failed to disable userspace IPv6LL address handling > Mar 07 15:25:16 marathon_mesos-2 kernel: XFS (dm-8): Unmounting Filesystem > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.879240 > 30868 slave.cpp:3001] Handling status update TASK_KILLED (UUID: > 1b9446db-dcdc-47a7-aa05-0372e84e1b4d) for task giulio.httpd.test.tag.a47 > c55be-db01-11e5-a92a-0242eb705eb2 of framework > ef1354df-7ecc-41ac-82d8-d7536e319ea2- from > executor(1)@XXX.XXX.XXX.XXX:57543 > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: E0307 15:25:16.880144 > 30868 slave.cpp:3205] Failed to update resources for container > 104860a3-4630-4cbb-8e68-87ecca13fcad of executor 'giulio.httpd.test.tag.a > 47c55be-db01-11e5-a92a-0242eb705eb2' running task > giulio.httpd.test.tag.a47c55be-db01-11e5-a92a-0242eb705eb2 on status update > for terminal task, destroying container: Failed to determine cgroup for the > 'cpu' sub > system: Failed to read /proc/4390/cgroup: Failed to open file > '/proc/4390/cgroup': No such file or directory > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.880388 > 30868 status_update_manager.cpp:320] Received status update TASK_KILLED > (UUID: 1b9446db-d
[jira] [Comment Edited] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205795#comment-15205795 ] Jie Yu edited comment on MESOS-4827 at 3/22/16 4:53 AM: From the log, it looks like files under /tmp/mesos/meta have been deleted by some entity (not Mesos itself). Are you on CentOS/RHEL? Can you double check your systemd-tmpfiles-clean.service to see if it was cleaning files under /tmp? Also, after you switch to use --work_dir=/var/lib/mesos, let us know if you still have this problem. was (Author: jieyu): From the log, it looks like files under /tmp/mesos/meta have been deleted by some entity (not Mesos itself). Are you on CentOS/RHEL? Can you double check your systemd-tmpfiles-clean.service to see if it was cleaning files under /tmp? > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue were originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > In short, the problem is that when we destroy/re-deploy a docker-containerized > task, the mesos-slave gets killed from time to time. It happened in our > production environment and I can't reproduce it. > Please refer to the post on StackOverflow about the error message I got and > details of the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205795#comment-15205795 ] Jie Yu commented on MESOS-4827: --- From the log, it looks like files under /tmp/mesos/meta have been deleted by some entity (not Mesos itself). Are you on CentOS/RHEL? Can you double check your systemd-tmpfiles-clean.service to see if it was cleaning files under /tmp? > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue were originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > In short, the problem is that when we destroy/re-deploy a docker-containerized > task, the mesos-slave gets killed from time to time. It happened in our > production environment and I can't reproduce it. > Please refer to the post on StackOverflow about the error message I got and > details of the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4989) Design document for docker volume driver
[ https://issues.apache.org/jira/browse/MESOS-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205775#comment-15205775 ] Deshi Xiao commented on MESOS-4989: --- Any update on this? > Design document for docker volume driver > > > Key: MESOS-4989 > URL: https://issues.apache.org/jira/browse/MESOS-4989 > Project: Mesos > Issue Type: Task >Reporter: Guangya Liu >Assignee: Guangya Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4879) Update glog patch to support PowerPC LE
[ https://issues.apache.org/jira/browse/MESOS-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205685#comment-15205685 ] Chen Zhiwei commented on MESOS-4879: Thanks, it seems this patch can't catch up with your PR now; maybe in the next glog release. > Update glog patch to support PowerPC LE > --- > > Key: MESOS-4879 > URL: https://issues.apache.org/jira/browse/MESOS-4879 > Project: Mesos > Issue Type: Improvement >Reporter: Chen Zhiwei >Assignee: Chen Zhiwei > > This is part of the PowerPC LE porting effort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4997) Current approach to protobuf enums does not support upgrades.
[ https://issues.apache.org/jira/browse/MESOS-4997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4997: --- Component/s: technical debt > Current approach to protobuf enums does not support upgrades. > - > > Key: MESOS-4997 > URL: https://issues.apache.org/jira/browse/MESOS-4997 > Project: Mesos > Issue Type: Bug > Components: technical debt >Reporter: Benjamin Mahler >Priority: Critical > > Some users were opting in to the > [TASK_KILLING_STATE|https://github.com/apache/mesos/blob/0.28.0/include/mesos/v1/mesos.proto#L259-L272] > capability introduced in 0.28.0. When the scheduler tries to register with > the TASK_KILLING_STATE capability against a 0.27.0 master, the master drops > the message and prints the following: > {noformat} > [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message > of type "mesos.scheduler.Call" because it is missing required fields: > subscribe.framework_info.capabilities[0].type > {noformat} > It turns out that our approach to enums in general does not allow for > backwards compatibility. For example: > {code} > message Capability { > enum Type { > REVOCABLE_RESOURCES = 1; > TASK_KILLING_STATE = 2; // New! > } > required Type type = 1; > } > {code} > Using a required enum is problematic because protobuf will strip unknown enum > values during de-serialization: > https://developers.google.com/protocol-buffers/docs/proto#updating > {quote} > enum is compatible with int32, uint32, int64, and uint64 in terms of wire > format (note that values will be truncated if they don't fit), but be aware > that client code may treat them differently when the message is deserialized. > Notably, unrecognized enum values are discarded when the message is > deserialized, which makes the field's has.. accessor return false and its > getter return the first value listed in the enum definition. However, an > integer field will always preserve its value. Because of this, you need to be > very careful when upgrading an integer to an enum in terms of receiving out > of bounds enum values on the wire. > {quote} > The suggestion on the protobuf mailing list is to use optional enum fields > and include an UNKNOWN value as the first entry in the enum list (and/or > explicitly specifying it as the default): > https://groups.google.com/forum/#!msg/protobuf/NhUjBfDyGmY/pf294zMi2bIJ > The updated version of Capability would be: > {code} > message Capability { > enum Type { > UNKNOWN = 0; > REVOCABLE_RESOURCES = 1; > TASK_KILLING_STATE = 2; // New! > } > optional Type type = 1; > } > {code} > Note that the first entry in an enum list is the default value, even if its > number is not the lowest (unless {{\[default = \]}} is explicitly > specified). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4997) Current approach to protobuf enums does not support upgrades.
Benjamin Mahler created MESOS-4997: -- Summary: Current approach to protobuf enums does not support upgrades. Key: MESOS-4997 URL: https://issues.apache.org/jira/browse/MESOS-4997 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Priority: Critical Some users were opting in to the [TASK_KILLING_STATE|https://github.com/apache/mesos/blob/0.28.0/include/mesos/v1/mesos.proto#L259-L272] capability introduced in 0.28.0. When the scheduler tries to register with the TASK_KILLING_STATE capability against a 0.27.0 master, the master drops the message and prints the following: {noformat} [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "mesos.scheduler.Call" because it is missing required fields: subscribe.framework_info.capabilities[0].type {noformat} It turns out that our approach to enums in general does not allow for backwards compatibility. For example: {code} message Capability { enum Type { REVOCABLE_RESOURCES = 1; TASK_KILLING_STATE = 2; // New! } required Type type = 1; } {code} Using a required enum is problematic because protobuf will strip unknown enum values during de-serialization: https://developers.google.com/protocol-buffers/docs/proto#updating {quote} enum is compatible with int32, uint32, int64, and uint64 in terms of wire format (note that values will be truncated if they don't fit), but be aware that client code may treat them differently when the message is deserialized. Notably, unrecognized enum values are discarded when the message is deserialized, which makes the field's has.. accessor return false and its getter return the first value listed in the enum definition. However, an integer field will always preserve its value. Because of this, you need to be very careful when upgrading an integer to an enum in terms of receiving out of bounds enum values on the wire. {quote} The suggestion on the protobuf mailing list is to use optional enum fields and include an UNKNOWN value as the first entry in the enum list (and/or explicitly specifying it as the default): https://groups.google.com/forum/#!msg/protobuf/NhUjBfDyGmY/pf294zMi2bIJ The updated version of Capability would be: {code} message Capability { enum Type { UNKNOWN = 0; REVOCABLE_RESOURCES = 1; TASK_KILLING_STATE = 2; // New! } optional Type type = 1; } {code} Note that the first entry in an enum list is the default value, even if its number is not the lowest (unless {{\[default = \]}} is explicitly specified). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
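For illustration, here is a hedged sketch of how a receiver could guard against capability values it does not recognize once the field is optional with an UNKNOWN default, following the quoted protobuf behavior (an unrecognized wire value leaves the optional field unset); this is not the actual master code, and the UNKNOWN value only exists under the proposed change above.

{code}
#include <glog/logging.h>

#include <mesos/mesos.pb.h>

#include <stout/foreach.hpp>

// Sketch: with an optional enum plus an UNKNOWN default (the proposed
// Capability above), a value the receiver does not recognize no longer
// makes the whole message unparseable; the field simply reads back as
// unset/UNKNOWN and can be skipped.
void inspectCapabilities(const mesos::FrameworkInfo& frameworkInfo)
{
  foreach (const mesos::FrameworkInfo::Capability& capability,
           frameworkInfo.capabilities()) {
    if (!capability.has_type() ||
        capability.type() == mesos::FrameworkInfo::Capability::UNKNOWN) {
      LOG(WARNING) << "Ignoring framework capability of an unknown type";
      continue;
    }

    // Handle the capabilities this version knows about, e.g.
    // REVOCABLE_RESOURCES or TASK_KILLING_STATE.
  }
}
{code}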
[jira] [Comment Edited] (MESOS-4993) FetcherTest.ExtractZipFile assumes `unzip` is installed
[ https://issues.apache.org/jira/browse/MESOS-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205435#comment-15205435 ] Tomasz Janiszewski edited comment on MESOS-4993 at 3/22/16 12:25 AM: - https://reviews.apache.org/r/45134/ I added condition in tests to skip them if {{unzip -v}} returns error. But I think it should be done in {{./configure}} or in cmake to check whether unzip is installed and set {{gtest_filter}}. Any hints, how to achieve it? or maybe some better solutions? was (Author: janisz): https://reviews.apache.org/r/45134/ I added condition in tests to skip them if `unzip -v` returns error. But I think it should be done in `./configure` or in `cmake` to check whether `unzip` is installed and set `--gtest_filter="-FetcherTest.*Zip*"`. Any hints, how to achieve it? or maybe some better solutions? > FetcherTest.ExtractZipFile assumes `unzip` is installed > --- > > Key: MESOS-4993 > URL: https://issues.apache.org/jira/browse/MESOS-4993 > Project: Mesos > Issue Type: Task > Components: fetcher, tests >Reporter: Neil Conway >Assignee: Tomasz Janiszewski > Labels: mesosphere > > {noformat} > [ RUN ] FetcherTest.ExtractZipFile > W0322 06:46:42.086458 3635 fetcher.cpp:805] Begin fetcher log (stderr in > sandbox) for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc from running > command: /home/vagrant/build-mesos-2/src/mesos-fetcher > I0322 06:46:41.895934 3653 fetcher.cpp:424] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/0OaVy1\/from\/yMUPVR.zip"}}],"sandbox_directory":"\/tmp\/0OaVy1"} > I0322 06:46:41.896709 3653 fetcher.cpp:379] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896719 3653 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0322 06:46:41.896729 3653 fetcher.cpp:187] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896738 3653 fetcher.cpp:167] Copying resource with command:cp > '/tmp/0OaVy1/from/yMUPVR.zip' '/tmp/0OaVy1/yMUPVR.zip' > I0322 06:46:41.899859 3653 fetcher.cpp:84] Extracting with command: unzip -o > -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' > sh: unzip: command not found > Failed to fetch '/tmp/0OaVy1/from/yMUPVR.zip': Failed to extract: command > unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' exited with status: 32512 > End fetcher log for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc > E0322 06:46:42.087045 3635 fetcher.cpp:520] Failed to run mesos-fetcher: > Failed to fetch all URIs for container 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' > with exit status: 256 > /mesos-2/src/tests/fetcher_tests.cpp:688: Failure > (fetch).failure(): Failed to fetch all URIs for container > 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 > [ FAILED ] FetcherTest.ExtractZipFile (227 ms) > [--] 1 test from FetcherTest (227 ms total) > {noformat} > Similarly: > {noformat} > [ FAILED ] FetcherTest.ExtractZipFile > [ FAILED ] FetcherTest.ExtractInvalidZipFile > [ FAILED ] FetcherTest.ExtractZipFileWithDuplicatedEntries > {noformat} > We should handle missing {{unzip}} more gracefully, e.g., skip the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4993) FetcherTest.ExtractZipFile assumes `unzip` is installed
[ https://issues.apache.org/jira/browse/MESOS-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205435#comment-15205435 ] Tomasz Janiszewski commented on MESOS-4993: --- https://reviews.apache.org/r/45134/ I added condition in tests to skip them if `unzip -v` returns error. But I think it should be done in `./configure` or in `cmake` to check whether `unzip` is installed and set `--gtest_filter="-FetcherTest.*Zip*"`. Any hints, how to achieve it? or maybe some better solutions? > FetcherTest.ExtractZipFile assumes `unzip` is installed > --- > > Key: MESOS-4993 > URL: https://issues.apache.org/jira/browse/MESOS-4993 > Project: Mesos > Issue Type: Task > Components: fetcher, tests >Reporter: Neil Conway >Assignee: Tomasz Janiszewski > Labels: mesosphere > > {noformat} > [ RUN ] FetcherTest.ExtractZipFile > W0322 06:46:42.086458 3635 fetcher.cpp:805] Begin fetcher log (stderr in > sandbox) for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc from running > command: /home/vagrant/build-mesos-2/src/mesos-fetcher > I0322 06:46:41.895934 3653 fetcher.cpp:424] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/0OaVy1\/from\/yMUPVR.zip"}}],"sandbox_directory":"\/tmp\/0OaVy1"} > I0322 06:46:41.896709 3653 fetcher.cpp:379] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896719 3653 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0322 06:46:41.896729 3653 fetcher.cpp:187] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896738 3653 fetcher.cpp:167] Copying resource with command:cp > '/tmp/0OaVy1/from/yMUPVR.zip' '/tmp/0OaVy1/yMUPVR.zip' > I0322 06:46:41.899859 3653 fetcher.cpp:84] Extracting with command: unzip -o > -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' > sh: unzip: command not found > Failed to fetch '/tmp/0OaVy1/from/yMUPVR.zip': Failed to extract: command > unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' exited with status: 32512 > End fetcher log for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc > E0322 06:46:42.087045 3635 fetcher.cpp:520] Failed to run mesos-fetcher: > Failed to fetch all URIs for container 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' > with exit status: 256 > /mesos-2/src/tests/fetcher_tests.cpp:688: Failure > (fetch).failure(): Failed to fetch all URIs for container > 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 > [ FAILED ] FetcherTest.ExtractZipFile (227 ms) > [--] 1 test from FetcherTest (227 ms total) > {noformat} > Similarly: > {noformat} > [ FAILED ] FetcherTest.ExtractZipFile > [ FAILED ] FetcherTest.ExtractInvalidZipFile > [ FAILED ] FetcherTest.ExtractZipFileWithDuplicatedEntries > {noformat} > We should handle missing {{unzip}} more gracefully, e.g., skip the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
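To make the discussed guard concrete, here is a hedged sketch of a test-level check for unzip; the actual change on r/45134 may look different, and calling plain system(3) with `unzip -v` is just one plausible way to probe for the binary.

{code}
#include <cstdlib>

// Hypothetical helper: probe for `unzip` by running `unzip -v` through the
// shell and checking that it exits cleanly.
static bool unzipAvailable()
{
  return ::system("unzip -v > /dev/null 2>&1") == 0;
}

// A guard along these lines at the top of the Zip-related fetcher tests
// would skip the test body instead of failing when unzip is missing:
//
//   if (!unzipAvailable()) {
//     LOG(WARNING) << "Skipping test: 'unzip' is not installed";
//     return;
//   }
{code}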
[jira] [Updated] (MESOS-4996) 'containerizer->update' will always fail after killing a docker container.
[ https://issues.apache.org/jira/browse/MESOS-4996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4996: -- Component/s: docker containerization > 'containerizer->update' will always fail after killing a docker container. > -- > > Key: MESOS-4996 > URL: https://issues.apache.org/jira/browse/MESOS-4996 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0, 0.26.0, 0.27.0, > 0.27.1, 0.28.0, 0.27.2 >Reporter: Jie Yu >Priority: Critical > Labels: mesosphere > > Here is the sequence of events: > 1) the framework does a killTask > 2) killTask is handled by the docker executor > 3) the docker executor calls docker->stop > 4) docker container terminated > 5) docker executor sends TASK_KILLED to the agent > 6) since TASK_KILLED is terminal, agent calls containerizer->update() > 7) DockerContainerizerProcess::update is called > 8) Since pid is known, it tries to get the cgroups associated with the pid > 9) Since pid has gone, cgroups::hierarchy("cpu") will return Error > 10) We got "Failed to determine the cgroup hierarchy where the 'cpu' > subsystem is mounted: Failed to read /proc/4390/cgroup: Failed to open file > '/proc/4390/cgroup': No such file or directory" > 11) containerizer->update fail, agent will call containerizer->destroy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4996) 'containerizer->update' will always fail after killing a docker container.
Jie Yu created MESOS-4996: - Summary: 'containerizer->update' will always fail after killing a docker container. Key: MESOS-4996 URL: https://issues.apache.org/jira/browse/MESOS-4996 Project: Mesos Issue Type: Bug Affects Versions: 0.27.2, 0.28.0, 0.27.1, 0.27.0, 0.26.0, 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0 Reporter: Jie Yu Priority: Critical Here is the sequence of events: 1) the framework does a killTask 2) killTask is handled by the docker executor 3) the docker executor calls docker->stop 4) docker container terminated 5) docker executor sends TASK_KILLED to the agent 6) since TASK_KILLED is terminal, agent calls containerizer->update() 7) DockerContainerizerProcess::update is called 8) Since pid is known, it tries to get the cgroups associated with the pid 9) Since pid has gone, cgroups::hierarchy("cpu") will return Error 10) We got "Failed to determine the cgroup hierarchy where the 'cpu' subsystem is mounted: Failed to read /proc/4390/cgroup: Failed to open file '/proc/4390/cgroup': No such file or directory" 11) containerizer->update fail, agent will call containerizer->destroy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
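For illustration only, a hedged sketch of the kind of guard steps 8-10 suggest is missing: check whether the executor pid still exists before resolving its 'cpu' cgroup, and treat a vanished pid for a terminal task as a no-op rather than an error. The helper name and shape are assumptions, not the fix that will land.

{code}
#include <string>

#include <stout/error.hpp>
#include <stout/none.hpp>
#include <stout/os.hpp>
#include <stout/path.hpp>
#include <stout/result.hpp>
#include <stout/stringify.hpp>
#include <stout/try.hpp>

#include "linux/cgroups.hpp"

// Hypothetical helper: resolve the 'cpu' cgroup hierarchy for a container
// update, but report a pid that has already exited (step 4 above) as None()
// so the caller can treat the update as a no-op for the terminal task
// instead of an error that leads to containerizer->destroy (step 11).
Result<std::string> cpuHierarchyForUpdate(pid_t pid)
{
  if (!os::exists(path::join("/proc", stringify(pid)))) {
    // The docker container is already gone; nothing left to update.
    return None();
  }

  Try<std::string> hierarchy = cgroups::hierarchy("cpu");
  if (hierarchy.isError()) {
    return Error(hierarchy.error());
  }

  return hierarchy.get();
}
{code}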
[jira] [Updated] (MESOS-4994) Overloads for defering member function invocations match too tightly
[ https://issues.apache.org/jira/browse/MESOS-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4994: Description: libprocess' {{defer}} provides the following overloads {code} Deferred defer(const Process& process, void (T::*method)()) Deferred defer(const Process* process, void (T::*method)()) Deferred()> defer(const PID& pid, Future (T::*method)()) Deferred()> defer(const Process& process, Future (T::*method)()) Deferred defer(const PID& pid, void (T::*method)()) Deferred()> defer(const PID& pid, R (T::*method)()) Deferred()> defer(const Process& process, R (T::*method)()) Deferred()> defer(const Process* process, R (T::*method)()) Deferred()> defer(const Process* process, Future (T::*method)()) _Deferred()> defer(const UPID& pid, F&& f) {code} Here all overloads but the last require that the first argument is (in a LSP sense) either a {{Process}} or a {{PID}}, and only member functions of {{T}} are supported. Consider the following setup, {code} struct Base : public Process { double b() { return 0; } // non-const due to MESOS-4995 }; struct Derived : public Base { double d() { return 0; } // non-const due to MESOS-4995 }; {code} We can then {{defer}} for a {{Base}} like {code} defer(base, &Base::b); {code} which invokes an overload taking {{const Process&}} and {{R (T::*method)()}}, but on the other hand use of {{Derived}} is more cumbersome, {code} defer(derived.self(), &Derived::d); {code} This effectively performs an explicit cast of the {{Derived}} to a {{PID}} so the overload taking {{const PID&}} and {{R (T::*method)()}} can be taken. The overload taking for the {{Base}} case cannot be taken here since while {{Derived}} is a {{Process}}, there exists no {{&Base::d}} (like one would have expected). We should investigate ways to allow the same {{self}}-less {{defer}} syntax for both {{Base}} and {{Derived}}-like classes. This might be possible by decoupling the types {{T}} for the first and second arguments. was: libprocess' {{defer}} provides the following overloads {code} Deferred defer(const Process& process, void (T::*method)()) Deferred defer(const Process* process, void (T::*method)()) Deferred()> defer(const PID& pid, Future (T::*method)()) Deferred()> defer(const Process& process, Future (T::*method)()) Deferred defer(const PID& pid, void (T::*method)()) Deferred()> defer(const PID& pid, R (T::*method)()) Deferred()> defer(const Process& process, R (T::*method)()) Deferred()> defer(const Process* process, R (T::*method)()) Deferred()> defer(const Process* process, Future (T::*method)()) _Deferred()> defer(const UPID& pid, F&& f) {code} Here all overloads but the last require that the first argument is (in a LSP sense) either a {{Process}} or a {{PID}}, and only member functions of {{T}} are supported. Consider the following setup, {code} struct Base : public Process { double b() { return 0; } // non-const due to MESOS- }; struct Derived : public Base { double d() { return 0; } // non-const due to MESOS- }; {code} We can then {{defer}} for a {{Base}} like {code} defer(base, &Base::b); {code} which invokes an overload taking {{const Process&}} and {{R (T::*method)()}}, but on the other hand use of {{Derived}} is more cumbersome, {code} defer(derived.self(), &Derived::d); {code} This effectively performs an explicit cast of the {{Derived}} to a {{PID}} so the overload taking {{const PID&}} and {{R (T::*method)()}} can be taken. 
The overload taking for the {{Base}} case cannot be taken here since while {{Derived}} is a {{Process}}, there exists no {{&Base::d}} (like one would have expected). We should investigate ways to allow the same {{self}}-less {{defer}} syntax for both {{Base}} and {{Derived}}-like classes. This might be possible by decoupling the types {{T}} for the first and second arguments. > Overloads for defering member function invocations match too tightly > > > Key: MESOS-4994 > URL: https://issues.apache.org/jira/browse/MESOS-4994 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier >Priority: Minor > Labels: beginner, mesosphere, newbie, newbie++ > > libprocess' {{defer}} provides the following overloads > {code} > Deferred defer(const Process& process, void (T::*method)()) > Deferred defer(const Process* process, void (T::*method)()) > Deferred()> defer(const PID& pid, Future (T::*method)()) > Deferred()> defer(const Process& process, Future > (T::*method)()) > Deferred defer(const PID& pid, void (T::*method)()) > Deferred()> defer(const PID& pid, R (T::*method)()) > Deferred()> defer(const Process& process, R (T::*method)()) > Deferred()> defer(const Process* process, R (T::*method)()) > Deferred()> defer(const Process* proc
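The JIRA-to-mail conversion appears to have stripped the template parameters out of the code blocks above, so here is a best-effort reconstruction of the {{Base}}/{{Derived}} example for readability; the template arguments are inferred from the surrounding text and from the shapes of libprocess's {{defer}} overloads, so treat this as an approximation rather than the ticket's verbatim snippet.

{code}
#include <process/defer.hpp>
#include <process/process.hpp>

using process::Process;
using process::defer;

// The ticket's example with the template arguments restored (best effort).
struct Base : public Process<Base>
{
  double b() { return 0; } // non-const due to MESOS-4995
};

struct Derived : public Base
{
  double d() { return 0; } // non-const due to MESOS-4995
};

void example(Base& base, Derived& derived)
{
  // Straightforward: T deduces to Base from both the process argument and
  // the member-function pointer.
  defer(base, &Base::b);

  // The cumbersome spelling the ticket describes for Derived: the process
  // argument has to be turned into a PID via self(), because Derived is a
  // Process<Base> while &Derived::d is a member pointer of Derived, so the
  // overloads taking a Process<T> cannot deduce a single T from both sides.
  defer(derived.self(), &Derived::d);
}
{code}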
[jira] [Commented] (MESOS-2372) Test script for verifying compatibility between Mesos components
[ https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205322#comment-15205322 ] Greg Mann commented on MESOS-2372: -- I posted a patch which is based off of [~nnielsen]'s previous work: https://reviews.apache.org/r/44229/ > Test script for verifying compatibility between Mesos components > > > Key: MESOS-2372 > URL: https://issues.apache.org/jira/browse/MESOS-2372 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Greg Mann > Labels: mesosphere, tests, upgrade > > While our current unit/integration test suite catches functional bugs, it > doesn't catch compatibility bugs (e.g, MESOS-2371). This is really crucial to > provide operators the ability to do seamless upgrades on live clusters. > We should have a test suite / framework (ideally running on CI vetting each > review on RB) that tests upgrade paths between master, slave, scheduler and > executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4995) Make it possible to directly defer invocations of const member functions
[ https://issues.apache.org/jira/browse/MESOS-4995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4995: Description: Currently libprocess' {{defer}} provides no overloads to invoke {{const}} member functions. This has led to a situation where often effectively {{const}} getters are not made {{const}}, purely to allow straight-forward usage of {{defer}}, and leads to surprising API choices motivated only by limitations in low-level infrastructure (here: {{defer}}). We should augment {{defer}} with overloads that allow deferring invocations of {{const}} member functions, and tighten up the interfaces of existing code where possible. was: libprocess' {{defer}} provides the following overloads {code} Deferred defer(const Process& process, void (T::*method)()) Deferred defer(const Process* process, void (T::*method)()) Deferred()> defer(const PID& pid, Future (T::*method)()) Deferred()> defer(const Process& process, Future (T::*method)()) Deferred defer(const PID& pid, void (T::*method)()) Deferred()> defer(const PID& pid, R (T::*method)()) Deferred()> defer(const Process& process, R (T::*method)()) Deferred()> defer(const Process* process, R (T::*method)()) Deferred()> defer(const Process* process, Future (T::*method)()) _Deferred()> defer(const UPID& pid, F&& f) {code} Here all overloads but the last require that the first argument is (in a LSP sense) either a {{Process}} or a {{PID}}, and only member functions of {{T}} are supported. Consider the following setup, {code} struct Base : public Process { double b() { return 0; } // non-const due to MESOS- }; struct Derived : public Base { double d() { return 0; } // non-const due to MESOS- }; {code} We can then {{defer}} for a {{Base}} like {code} defer(base, &Base::b); {code} which invokes an overload taking {{const Process&}} and {{R (T::*method)()}}, but on the other hand use of {{Derived}} is more cumbersome, {code} defer(derived.self(), &Derived::d); {code} This effectively performs an explicit cast of the {{Derived}} to a {{PID&}} and {{R (T::*method)()}} can be taken. The overload taking for the {{Base}} case cannot be taken here since while {{Derived}} is a {{Process}}, there exists no {{&Base::d}} (like one would have expected). We should investigate ways to allow the same {{self}}-less {{defer}} syntax for both {{Base}} and {{Derived}}-like classes. This might be possible by decoupling the types {{T}} for the first and second arguments. > Make it possible to directly defer invocations of const member functions > > > Key: MESOS-4995 > URL: https://issues.apache.org/jira/browse/MESOS-4995 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier > Labels: beginner, mesosphere, newbie, newbie++ > > Currently libprocess' {{defer}} provides no overloads to invoke {{const}} > member functions. > This has led to a situation where often effectively {{const}} getters are > not made {{const}}, purely to allow straight-forward usage of {{defer}}, and > leads to surprising API choices motivated only by limitations in low-level > infrastructure (here: {{defer}}). > We should augment {{defer}} with overloads that allow deferring invocations > of {{const}} member functions, and tighten up the interfaces of existing code > where possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
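As a standalone illustration of why separate overloads are needed at all (plain C++, not libprocess code, with made-up function names): a parameter declared as {{R (T::*method)()}} will not bind a pointer to a {{const}}-qualified member function, which is why {{defer}} needs {{R (T::*method)() const}} variants alongside the existing ones.

{code}
#include <iostream>

// Stand-ins for the existing and the proposed defer overload shapes.
template <typename T, typename R>
R invokeNonConst(T& object, R (T::*method)())           // existing shape
{
  return (object.*method)();
}

template <typename T, typename R>
R invokeConst(const T& object, R (T::*method)() const)  // proposed shape
{
  return (object.*method)();
}

struct Metrics
{
  // An "effectively const" getter of the kind the ticket mentions.
  double snapshot() const { return 42.0; }
};

int main()
{
  Metrics metrics;

  // invokeNonConst(metrics, &Metrics::snapshot);  // error: a pointer to a
  //                                               // const member function
  //                                               // does not match
  //                                               // R (T::*)()

  std::cout << invokeConst(metrics, &Metrics::snapshot) << std::endl;
  return 0;
}
{code}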
[jira] [Assigned] (MESOS-2372) Test suite for verifying compatibility between Mesos components
[ https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-2372: Assignee: Greg Mann > Test suite for verifying compatibility between Mesos components > --- > > Key: MESOS-2372 > URL: https://issues.apache.org/jira/browse/MESOS-2372 > Project: Mesos > Issue Type: Epic >Reporter: Vinod Kone >Assignee: Greg Mann > Labels: mesosphere > > While our current unit/integration test suite catches functional bugs, it > doesn't catch compatibility bugs (e.g, MESOS-2371). This is really crucial to > provide operators the ability to do seamless upgrades on live clusters. > We should have a test suite / framework (ideally running on CI vetting each > review on RB) that tests upgrade paths between master, slave, scheduler and > executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4951) Enable actors to pass an authentication realm to libprocess
[ https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205317#comment-15205317 ] Greg Mann commented on MESOS-4951: -- After some discussion with [~bmahler], we settled on the following approach: a parameter can be added to {{process::initialize()}} which sets the authentication realm for libprocess. This allows the master/agent/tests to set the realm at initialization time. Since {{process::initialize()}} is called frequently throughout libprocess, we must also alter the function to return a boolean which indicates whether or not a particular invocation of {{initialize}} was the *first* invocation. This will allow us to assert during startup that we set the authentication realm as desired. > Enable actors to pass an authentication realm to libprocess > --- > > Key: MESOS-4951 > URL: https://issues.apache.org/jira/browse/MESOS-4951 > Project: Mesos > Issue Type: Bug > Components: libprocess, slave >Reporter: Greg Mann >Assignee: Greg Mann > Labels: authentication, http, mesosphere, security > > To prepare for MESOS-4902, the Mesos master and agent need a way to pass the > desired authentication realm to libprocess. Since some endpoints (like > {{/profiler/*}}) get installed in libprocess, the master/agent should be able > to specify during initialization what authentication realm the > libprocess-level endpoints will be authenticated under. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
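A hedged sketch of how the described approach might look from a caller's perspective; both the extra realm argument and the bool return on {{process::initialize()}} are the proposed changes from this comment, not the current libprocess API, and the realm string is a placeholder.

{code}
#include <glog/logging.h>

#include <process/process.hpp>

int main()
{
  // Proposed (not current) signature: the second argument names the
  // authentication realm for libprocess-installed endpoints such as
  // /profiler/*, and the return value reports whether this call was the
  // one that actually performed initialization.
  bool initialized = process::initialize(
      "master",               // delegate, as today
      "mesos-master-realm");  // placeholder realm string

  // Only the first invocation wins, so assert that the realm really was
  // set by this call.
  CHECK(initialized)
    << "process::initialize() was already called elsewhere; "
    << "the authentication realm could not be set here";

  // ... continue with master startup ...

  return 0;
}
{code}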
[jira] [Commented] (MESOS-4956) Add authentication to /files endpoints
[ https://issues.apache.org/jira/browse/MESOS-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205312#comment-15205312 ] Greg Mann commented on MESOS-4956: -- After some discussion with [~bmahler], we decided that a good approach would be to pass an authentication realm to the {{Files}} process via its constructor. This allows the master/agent's 'main.cpp' file, or the relevant routines in the test suite, to specify the realm for this process during initialization. Patches implementing this approach are forthcoming. > Add authentication to /files endpoints > -- > > Key: MESOS-4956 > URL: https://issues.apache.org/jira/browse/MESOS-4956 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: authentication, http, mesosphere, security > > To protect access (authz) to master/agent logs as well as executor sandboxes, > we need authentication on the /files endpoints. > Adding HTTP authentication to these endpoints is a bit complicated since they > are defined in code that is shared by the master and agent. > While working on MESOS-4850, it became apparent that since our tests use the > same instance of libprocess for both master and agent, different default > authentication realms must be used for master/agent so that HTTP > authentication can be independently enabled/disabled for each. > We should establish a mechanism for making an endpoint authenticated that > allows us to: > 1) Install an endpoint like {{/files}}, whose code is shared by the master > and agent, with different authentication realms for the master and agent > 2) Avoid hard-coding a default authentication realm into libprocess, to > permit the use of different authentication realms for the master and agent > and to keep application-level concerns from leaking into libprocess > Another option would be to use a single default authentication realm and > always enable or disable HTTP authentication for *both* the master and agent > in tests. However, this wouldn't allow us to test scenarios where HTTP > authentication is enabled on one but disabled on the other. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
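A minimal sketch of the constructor-injection approach described above; the class shape and the realm strings are illustrative assumptions, not the actual Mesos {{Files}} implementation:
{code}
#include <string>
#include <utility>

// Sketch only: the authentication realm is fixed when the actor is
// constructed, so the master's and agent's main() (or the tests) can each
// pick a different realm for their /files endpoints.
class FilesProcess
{
public:
  explicit FilesProcess(std::string realm) : realm_(std::move(realm)) {}

  const std::string& realm() const { return realm_; }

private:
  const std::string realm_;
};

// Hypothetical usage:
//   FilesProcess masterFiles("mesos-master");  // in the master's main.cpp
//   FilesProcess agentFiles("mesos-agent");    // in the agent's main.cpp
{code}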
[jira] [Created] (MESOS-4994) Overloads for deferring member function invocations match too tightly
Benjamin Bannier created MESOS-4994: --- Summary: Overloads for deferring member function invocations match too tightly Key: MESOS-4994 URL: https://issues.apache.org/jira/browse/MESOS-4994 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Bannier Priority: Minor

libprocess' {{defer}} provides the following overloads
{code}
Deferred<void()> defer(const Process<T>& process, void (T::*method)())
Deferred<void()> defer(const Process<T>* process, void (T::*method)())
Deferred<Future<R>()> defer(const PID<T>& pid, Future<R> (T::*method)())
Deferred<Future<R>()> defer(const Process<T>& process, Future<R> (T::*method)())
Deferred<void()> defer(const PID<T>& pid, void (T::*method)())
Deferred<Future<R>()> defer(const PID<T>& pid, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>& process, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>* process, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>* process, Future<R> (T::*method)())
_Deferred<F> defer(const UPID& pid, F&& f)
{code}
Here all overloads but the last require that the first argument is (in an LSP sense) either a {{Process}} or a {{PID}}, and only member functions of {{T}} are supported. Consider the following setup,
{code}
struct Base : public Process<Base>
{
  double b() { return 0; } // non-const due to MESOS-
};

struct Derived : public Base
{
  double d() { return 0; } // non-const due to MESOS-
};
{code}
We can then {{defer}} for a {{Base}} like
{code}
defer(base, &Base::b);
{code}
which invokes an overload taking {{const Process<T>&}} and {{R (T::*method)()}}, but on the other hand use of {{Derived}} is more cumbersome,
{code}
defer(derived.self(), &Derived::d);
{code}
This effectively performs an explicit cast of the {{Derived}} to a {{PID<Derived>}} so that the overload taking {{const PID<T>&}} and {{R (T::*method)()}} can be taken. The overload used for the {{Base}} case cannot be taken here since, while {{Derived}} is a {{Process<Base>}}, there exists no {{&Base::d}} (as one might have expected). We should investigate ways to allow the same {{self}}-less {{defer}} syntax for both {{Base}} and {{Derived}}-like classes. This might be possible by decoupling the types {{T}} for the first and second arguments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
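To make the "decouple the types" idea from this ticket concrete, here is a small self-contained sketch in plain C++ (not libprocess code): by taking the object type and the type declaring the member function as separate template parameters, a {{Derived}} object can be used with {{&Base::b}} directly.
{code}
#include <iostream>
#include <type_traits>

struct Base            { double b() { return 0; } };
struct Derived : Base  { double d() { return 1; } };

// Sketch only: P (the object) and T (the class declaring the member) are
// decoupled; the static_assert keeps the template from matching unrelated
// types, similar to how a defer() overload could be constrained.
template <typename P, typename T, typename R>
R invokeMember(P& object, R (T::*method)())
{
  static_assert(std::is_base_of<T, P>::value,
                "the object must derive from the class declaring the method");
  return (object.*method)();
}

int main()
{
  Derived derived;
  std::cout << invokeMember(derived, &Base::b) << " "             // prints 0
            << invokeMember(derived, &Derived::d) << std::endl;   // prints 1
  return 0;
}
{code}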
[jira] [Created] (MESOS-4995) Make it possible to directly defer invocations of const member functions
Benjamin Bannier created MESOS-4995: --- Summary: Make it possible to directly defer invocations of const member functions Key: MESOS-4995 URL: https://issues.apache.org/jira/browse/MESOS-4995 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Bannier

libprocess' {{defer}} provides the following overloads
{code}
Deferred<void()> defer(const Process<T>& process, void (T::*method)())
Deferred<void()> defer(const Process<T>* process, void (T::*method)())
Deferred<Future<R>()> defer(const PID<T>& pid, Future<R> (T::*method)())
Deferred<Future<R>()> defer(const Process<T>& process, Future<R> (T::*method)())
Deferred<void()> defer(const PID<T>& pid, void (T::*method)())
Deferred<Future<R>()> defer(const PID<T>& pid, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>& process, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>* process, R (T::*method)())
Deferred<Future<R>()> defer(const Process<T>* process, Future<R> (T::*method)())
_Deferred<F> defer(const UPID& pid, F&& f)
{code}
Here all overloads but the last require that the first argument is (in an LSP sense) either a {{Process}} or a {{PID}}, and only member functions of {{T}} are supported. Consider the following setup,
{code}
struct Base : public Process<Base>
{
  double b() { return 0; } // non-const due to MESOS-
};

struct Derived : public Base
{
  double d() { return 0; } // non-const due to MESOS-
};
{code}
We can then {{defer}} for a {{Base}} like
{code}
defer(base, &Base::b);
{code}
which invokes an overload taking {{const Process<T>&}} and {{R (T::*method)()}}, but on the other hand use of {{Derived}} is more cumbersome,
{code}
defer(derived.self(), &Derived::d);
{code}
This effectively performs an explicit cast of the {{Derived}} to a {{PID<Derived>}} so that the overload taking {{const PID<T>&}} and {{R (T::*method)()}} can be taken. The overload used for the {{Base}} case cannot be taken here since, while {{Derived}} is a {{Process<Base>}}, there exists no {{&Base::d}} (as one might have expected). We should investigate ways to allow the same {{self}}-less {{defer}} syntax for both {{Base}} and {{Derived}}-like classes. This might be possible by decoupling the types {{T}} for the first and second arguments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4951) Enable actors to pass an authentication realm to libprocess
[ https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4951: - Description: To prepare for MESOS-4902, the Mesos master and agent need a way to pass the desired authentication realm to libprocess. Since some endpoints (like {{/profiler/*}}) get installed in libprocess, the master/agent should be able to specify during initialization what authentication realm the libprocess-level endpoints will be authenticated under. (was: To prepare for MESOS-4902, the Mesos master and agent need a way to set the authentication realm of an endpoint that has already been installed. Since some endpoints (like {{/profiler/*}}) get installed in libprocess, the master/agent should be able to specify during initialization what authentication realm the libprocess-level endpoints will be authenticated under.) > Enable actors to pass an authentication realm to libprocess > --- > > Key: MESOS-4951 > URL: https://issues.apache.org/jira/browse/MESOS-4951 > Project: Mesos > Issue Type: Bug > Components: libprocess, slave >Reporter: Greg Mann >Assignee: Greg Mann > Labels: authentication, http, mesosphere, security > > To prepare for MESOS-4902, the Mesos master and agent need a way to pass the > desired authentication realm to libprocess. Since some endpoints (like > {{/profiler/*}}) get installed in libprocess, the master/agent should be able > to specify during initialization what authentication realm the > libprocess-level endpoints will be authenticated under. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4951) Enable actors to pass an authentication realm to libprocess
[ https://issues.apache.org/jira/browse/MESOS-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4951: - Summary: Enable actors to pass an authentication realm to libprocess (was: Enable actors to set an existing endpoint's authentication realm) > Enable actors to pass an authentication realm to libprocess > --- > > Key: MESOS-4951 > URL: https://issues.apache.org/jira/browse/MESOS-4951 > Project: Mesos > Issue Type: Bug > Components: libprocess, slave >Reporter: Greg Mann >Assignee: Greg Mann > Labels: authentication, http, mesosphere, security > > To prepare for MESOS-4902, the Mesos master and agent need a way to set the > authentication realm of an endpoint that has already been installed. Since > some endpoints (like {{/profiler/*}}) get installed in libprocess, the > master/agent should be able to specify during initialization what > authentication realm the libprocess-level endpoints will be authenticated > under. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags
[ https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205292#comment-15205292 ] Vinod Kone commented on MESOS-3781: --- Yup. I think that's the plan. You should make sure to call out that the old flags are deprecated (in flags description, CHANGELOG, docs etc.). > Replace Master/Slave Terminology Phase I - Add duplicate agent flags > - > > Key: MESOS-3781 > URL: https://issues.apache.org/jira/browse/MESOS-3781 > Project: Mesos > Issue Type: Task >Reporter: Diana Arroyo >Assignee: Jay Guo > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205283#comment-15205283 ] Anand Mazumdar commented on MESOS-4981: --- Thanks for working on this. [~vinodkone] agreed to shepherd this. I can do a first pass review. [~fan.du] Can you add Vinod as a reviewer to the reviews too? > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. We should correctly be incrementing these counters for PID based > frameworks as was the case previously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4981: -- Description: The counters {{master/messages_register_framework}} and {{master/messages_reregister_framework}} are no longer being incremented after the scheduler driver started sending {{Call}} messages to the master in Mesos 0.23. We should correctly be incrementing these counters for PID based frameworks as was the case previously. (was: The counters {{master/messages_register_framework}} and {{master/messages_reregister_framework}} are no longer being incremented after the scheduler driver started sending {{Call}} messages to the master in Mesos 0.23. Either, we should think about adding new counter(s) for {{Subscribe}} calls to the master for both PID/HTTP frameworks or modify the existing code to correctly increment the counters.) > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. We should correctly be incrementing these counters for PID based > frameworks as was the case previously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4981: -- Shepherd: Vinod Kone > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. We should correctly be incrementing these counters for PID based > frameworks as was the case previously. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2372) Test suite for verifying compatibility between Mesos components
[ https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-2372: - Assignee: (was: Greg Mann) > Test suite for verifying compatibility between Mesos components > --- > > Key: MESOS-2372 > URL: https://issues.apache.org/jira/browse/MESOS-2372 > Project: Mesos > Issue Type: Epic >Reporter: Vinod Kone > Labels: mesosphere > > While our current unit/integration test suite catches functional bugs, it > doesn't catch compatibility bugs (e.g, MESOS-2371). This is really crucial to > provide operators the ability to do seamless upgrades on live clusters. > We should have a test suite / framework (ideally running on CI vetting each > review on RB) that tests upgrade paths between master, slave, scheduler and > executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2372) Test suite for verifying compatibility between Mesos components
[ https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-2372: Assignee: Greg Mann (was: Kapil Arya) > Test suite for verifying compatibility between Mesos components > --- > > Key: MESOS-2372 > URL: https://issues.apache.org/jira/browse/MESOS-2372 > Project: Mesos > Issue Type: Epic >Reporter: Vinod Kone >Assignee: Greg Mann > Labels: mesosphere > > While our current unit/integration test suite catches functional bugs, it > doesn't catch compatibility bugs (e.g, MESOS-2371). This is really crucial to > provide operators the ability to do seamless upgrades on live clusters. > We should have a test suite / framework (ideally running on CI vetting each > review on RB) that tests upgrade paths between master, slave, scheduler and > executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205178#comment-15205178 ] Geoffroy Jabouley commented on MESOS-4827: -- It is indeed not configured, so the default value is /tmp/mesos. Will change it ASAP. It is a normal folder (not tmpfs), part of a 50GB xfs RAID1 partition mounted on /, with 94% space available. Other apps are writing to this partition (mesos master, zookeeper, marathon). Docker has its own btrfs partition, still on the same physical disk. > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4993) FetcherTest.ExtractZipFile assumes `unzip` is installed
[ https://issues.apache.org/jira/browse/MESOS-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Janiszewski reassigned MESOS-4993: - Assignee: Tomasz Janiszewski > FetcherTest.ExtractZipFile assumes `unzip` is installed > --- > > Key: MESOS-4993 > URL: https://issues.apache.org/jira/browse/MESOS-4993 > Project: Mesos > Issue Type: Task > Components: fetcher, tests >Reporter: Neil Conway >Assignee: Tomasz Janiszewski > Labels: mesosphere > > {noformat} > [ RUN ] FetcherTest.ExtractZipFile > W0322 06:46:42.086458 3635 fetcher.cpp:805] Begin fetcher log (stderr in > sandbox) for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc from running > command: /home/vagrant/build-mesos-2/src/mesos-fetcher > I0322 06:46:41.895934 3653 fetcher.cpp:424] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/0OaVy1\/from\/yMUPVR.zip"}}],"sandbox_directory":"\/tmp\/0OaVy1"} > I0322 06:46:41.896709 3653 fetcher.cpp:379] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896719 3653 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0322 06:46:41.896729 3653 fetcher.cpp:187] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896738 3653 fetcher.cpp:167] Copying resource with command:cp > '/tmp/0OaVy1/from/yMUPVR.zip' '/tmp/0OaVy1/yMUPVR.zip' > I0322 06:46:41.899859 3653 fetcher.cpp:84] Extracting with command: unzip -o > -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' > sh: unzip: command not found > Failed to fetch '/tmp/0OaVy1/from/yMUPVR.zip': Failed to extract: command > unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' exited with status: 32512 > End fetcher log for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc > E0322 06:46:42.087045 3635 fetcher.cpp:520] Failed to run mesos-fetcher: > Failed to fetch all URIs for container 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' > with exit status: 256 > /mesos-2/src/tests/fetcher_tests.cpp:688: Failure > (fetch).failure(): Failed to fetch all URIs for container > 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 > [ FAILED ] FetcherTest.ExtractZipFile (227 ms) > [--] 1 test from FetcherTest (227 ms total) > {noformat} > Similarly: > {noformat} > [ FAILED ] FetcherTest.ExtractZipFile > [ FAILED ] FetcherTest.ExtractInvalidZipFile > [ FAILED ] FetcherTest.ExtractZipFileWithDuplicatedEntries > {noformat} > We should handle missing {{unzip}} more gracefully, e.g., skip the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Created] (MESOS-4993) FetcherTest.ExtractZipFile assumes `unzip` is installed
+jie, Tomasz Looks like this new test needs a filter. Can one of you follow up with a fix? On Mon, Mar 21, 2016 at 12:51 PM, Neil Conway (JIRA) wrote: > Neil Conway created MESOS-4993: > -- > > Summary: FetcherTest.ExtractZipFile assumes `unzip` is > installed > Key: MESOS-4993 > URL: https://issues.apache.org/jira/browse/MESOS-4993 > Project: Mesos > Issue Type: Task > Components: fetcher, tests > Reporter: Neil Conway > > > {noformat} > [ RUN ] FetcherTest.ExtractZipFile > W0322 06:46:42.086458 3635 fetcher.cpp:805] Begin fetcher log (stderr in > sandbox) for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc from running > command: /home/vagrant/build-mesos-2/src/mesos-fetcher > I0322 06:46:41.895934 3653 fetcher.cpp:424] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/0OaVy1\/from\/yMUPVR.zip"}}],"sandbox_directory":"\/tmp\/0OaVy1"} > I0322 06:46:41.896709 3653 fetcher.cpp:379] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896719 3653 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0322 06:46:41.896729 3653 fetcher.cpp:187] Fetching URI > '/tmp/0OaVy1/from/yMUPVR.zip' > I0322 06:46:41.896738 3653 fetcher.cpp:167] Copying resource with > command:cp '/tmp/0OaVy1/from/yMUPVR.zip' '/tmp/0OaVy1/yMUPVR.zip' > I0322 06:46:41.899859 3653 fetcher.cpp:84] Extracting with command: unzip > -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' > sh: unzip: command not found > Failed to fetch '/tmp/0OaVy1/from/yMUPVR.zip': Failed to extract: command > unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' exited with status: 32512 > > End fetcher log for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc > E0322 06:46:42.087045 3635 fetcher.cpp:520] Failed to run mesos-fetcher: > Failed to fetch all URIs for container > 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 > /mesos-2/src/tests/fetcher_tests.cpp:688: Failure > (fetch).failure(): Failed to fetch all URIs for container > 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 > [ FAILED ] FetcherTest.ExtractZipFile (227 ms) > [--] 1 test from FetcherTest (227 ms total) > {noformat} > > Similarly: > > {noformat} > [ FAILED ] FetcherTest.ExtractZipFile > [ FAILED ] FetcherTest.ExtractInvalidZipFile > [ FAILED ] FetcherTest.ExtractZipFileWithDuplicatedEntries > {noformat} > > We should handle missing {{unzip}} more gracefully, e.g., skip the test. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332) >
[jira] [Created] (MESOS-4993) FetcherTest.ExtractZipFile assumes `unzip` is installed
Neil Conway created MESOS-4993: -- Summary: FetcherTest.ExtractZipFile assumes `unzip` is installed Key: MESOS-4993 URL: https://issues.apache.org/jira/browse/MESOS-4993 Project: Mesos Issue Type: Task Components: fetcher, tests Reporter: Neil Conway {noformat} [ RUN ] FetcherTest.ExtractZipFile W0322 06:46:42.086458 3635 fetcher.cpp:805] Begin fetcher log (stderr in sandbox) for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc from running command: /home/vagrant/build-mesos-2/src/mesos-fetcher I0322 06:46:41.895934 3653 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/tmp\/0OaVy1\/from\/yMUPVR.zip"}}],"sandbox_directory":"\/tmp\/0OaVy1"} I0322 06:46:41.896709 3653 fetcher.cpp:379] Fetching URI '/tmp/0OaVy1/from/yMUPVR.zip' I0322 06:46:41.896719 3653 fetcher.cpp:250] Fetching directly into the sandbox directory I0322 06:46:41.896729 3653 fetcher.cpp:187] Fetching URI '/tmp/0OaVy1/from/yMUPVR.zip' I0322 06:46:41.896738 3653 fetcher.cpp:167] Copying resource with command:cp '/tmp/0OaVy1/from/yMUPVR.zip' '/tmp/0OaVy1/yMUPVR.zip' I0322 06:46:41.899859 3653 fetcher.cpp:84] Extracting with command: unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' sh: unzip: command not found Failed to fetch '/tmp/0OaVy1/from/yMUPVR.zip': Failed to extract: command unzip -o -d '/tmp/0OaVy1' '/tmp/0OaVy1/yMUPVR.zip' exited with status: 32512 End fetcher log for container b71f9a05-9561-402a-b33a-c9dc4f8b03cc E0322 06:46:42.087045 3635 fetcher.cpp:520] Failed to run mesos-fetcher: Failed to fetch all URIs for container 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 /mesos-2/src/tests/fetcher_tests.cpp:688: Failure (fetch).failure(): Failed to fetch all URIs for container 'b71f9a05-9561-402a-b33a-c9dc4f8b03cc' with exit status: 256 [ FAILED ] FetcherTest.ExtractZipFile (227 ms) [--] 1 test from FetcherTest (227 ms total) {noformat} Similarly: {noformat} [ FAILED ] FetcherTest.ExtractZipFile [ FAILED ] FetcherTest.ExtractInvalidZipFile [ FAILED ] FetcherTest.ExtractZipFileWithDuplicatedEntries {noformat} We should handle missing {{unzip}} more gracefully, e.g., skip the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
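One way the test could bail out early when the binary is missing is sketched below. This is a self-contained illustration; the real fix would more likely plug into the test environment's existing filter mechanism rather than call out to the shell from each test.
{code}
#include <cstdlib>
#include <iostream>

// Sketch only: `command -v` exits 0 iff `unzip` is found on the PATH.
static bool haveUnzip()
{
  return std::system("command -v unzip > /dev/null 2>&1") == 0;
}

// Inside the test body, before exercising the fetcher:
//   if (!haveUnzip()) {
//     std::cout << "Skipping test: 'unzip' is not installed" << std::endl;
//     return;
//   }
{code}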
[jira] [Commented] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules
[ https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204928#comment-15204928 ] ANURAG SINGH commented on MESOS-4610: - On Joseph's suggestion, I've requested that a new shepherd be assigned to this issue. It appears that Ben may not have the bandwidth to work on this. > MasterContender/MasterDetector should be loadable as modules > > > Key: MESOS-4610 > URL: https://issues.apache.org/jira/browse/MESOS-4610 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Mark Cavage >Assignee: Mark Cavage > > Currently mesos depends on Zookeeper for leader election and notification to > slaves, although there is a C++ hierarchy in the code to support alternatives > (e.g., unit tests use an in-memory implementation). From an operational > perspective, many organizations/users do not want to take a dependency on > Zookeeper, and use an alternative solution to implementing leader election. > Our organization in particular, very much wants this, and as a reference > there have been several requests from the community (see referenced tickets) > to replace with etcd/consul/etc. > This ticket will serve as the work effort to modularize the > MasterContender/MasterDetector APIs such that integrators can build a > pluggable solution of their choice; this ticket will not fold in any > implementations such as etcd et al., but simply move this hierarchy to be > fully pluggable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4070) numify() handles negative numbers inconsistently.
[ https://issues.apache.org/jira/browse/MESOS-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204889#comment-15204889 ] Yong Tang commented on MESOS-4070: -- Ping [~jieyu], any feedback? Could you shepherd this issue if possible? > numify() handles negative numbers inconsistently. > - > > Key: MESOS-4070 > URL: https://issues.apache.org/jira/browse/MESOS-4070 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Jie Yu >Assignee: Yong Tang > Labels: tech-debt > > As pointed out by [~neilc] in this review: > https://reviews.apache.org/r/40988 > {noformat} > Try<int> num2 = numify<int>("-10"); > EXPECT_SOME_EQ(-10, num2); > // TODO(neilc): This is inconsistent with the handling of non-hex numbers. > EXPECT_ERROR(numify<int>("-0x10")); > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4899) Mesos slave crash after killing docker container
[ https://issues.apache.org/jira/browse/MESOS-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4899: -- Priority: Blocker (was: Major) > Mesos slave crash after killing docker container > > > Key: MESOS-4899 > URL: https://issues.apache.org/jira/browse/MESOS-4899 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.1 >Reporter: Giulio D'Ippolito >Priority: Blocker > > I have experienced an issue where a Mesos slave crashed because a docker task > could not be killed properly. > I'm using marathon to launch the task. > The setup is the following: > OS verison: Centos 7.2 > Docker version: 1.8.2 > Mesos-slave: 0.27.1 > Mesos-master: 0.27.1 > Marathon 0.15.3 > The mesos slave crashed (which is not great). This is the log from the mesos > slave (both mesos slave had the same issue): > {code} > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.441756 > 30870 slave.cpp:1890] Asked to kill task > giulio.httpd.test.tag.a47c55be-db01-11e5-a92a-0242eb705eb2 of framework > ef1354df-7ecc-41ac-82d8 > -d7536e319ea2- > Mar 07 15:25:16 marathon_mesos-2 docker[13019]: > time="2016-03-07T15:25:16.553925302Z" level=info msg="POST > /v1.20/containers/mesos-d0f20f55-bc6e-43e8-babb-250b0176f5f6-S145.104860a3-4630-4cbb-8e68-87ecca13fcad/s > top?t=0" > Mar 07 15:25:16 marathon_mesos-2 docker[13019]: > time="2016-03-07T15:25:16.559932661Z" level=info msg="Container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a failed to > exit within 0 seconds of > SIGTERM - using the force" > Mar 07 15:25:16 marathon_mesos-2 systemd[1]: Stopped docker container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a. > Mar 07 15:25:16 marathon_mesos-2 systemd[1]: Stopping docker container > fe412634ec92bb641a18b4c48d399895f703af29492804b927943646bd81ab8a. > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > failed to find device 106 'vethd55899a' with udev > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > new Veth device (carrier: OFF, driver: 'veth', ifindex: 106) > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > link disconnected > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing address > record for fe80::b06a:2fff:fecc:87f1 on vetha772033. > Mar 07 15:25:16 marathon_mesos-2 kernel: device vetha772033 left promiscuous > mode > Mar 07 15:25:16 marathon_mesos-2 kernel: docker0: port 4(vetha772033) entered > disabled state > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing workstation > service for vethd55899a. > Mar 07 15:25:16 marathon_mesos-2 avahi-daemon[11714]: Withdrawing workstation > service for vetha772033. 
> Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vethd55899a): > failed to disable userspace IPv6LL address handling > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (docker0): > bridge port vetha772033 was detached > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > released from master docker0 > Mar 07 15:25:16 marathon_mesos-2 NetworkManager[788]: (vetha772033): > failed to disable userspace IPv6LL address handling > Mar 07 15:25:16 marathon_mesos-2 kernel: XFS (dm-8): Unmounting Filesystem > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.879240 > 30868 slave.cpp:3001] Handling status update TASK_KILLED (UUID: > 1b9446db-dcdc-47a7-aa05-0372e84e1b4d) for task giulio.httpd.test.tag.a47 > c55be-db01-11e5-a92a-0242eb705eb2 of framework > ef1354df-7ecc-41ac-82d8-d7536e319ea2- from > executor(1)@XXX.XXX.XXX.XXX:57543 > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: E0307 15:25:16.880144 > 30868 slave.cpp:3205] Failed to update resources for container > 104860a3-4630-4cbb-8e68-87ecca13fcad of executor 'giulio.httpd.test.tag.a > 47c55be-db01-11e5-a92a-0242eb705eb2' running task > giulio.httpd.test.tag.a47c55be-db01-11e5-a92a-0242eb705eb2 on status update > for terminal task, destroying container: Failed to determine cgroup for the > 'cpu' sub > system: Failed to read /proc/4390/cgroup: Failed to open file > '/proc/4390/cgroup': No such file or directory > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.880388 > 30868 status_update_manager.cpp:320] Received status update TASK_KILLED > (UUID: 1b9446db-dcdc-47a7-aa05-0372e84e1b4d) for task giulio.htt > pd.test.tag.a47c55be-db01-11e5-a92a-0242eb705eb2 of framework > ef1354df-7ecc-41ac-82d8-d7536e319ea2- > Mar 07 15:25:16 marathon_mesos-2 mesos-slave[30866]: I0307 15:25:16.880416 > 30870 docke
[jira] [Updated] (MESOS-2408) Slave should reclaim storage for destroyed persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-2408: --- Shepherd: Jie Yu (was: Vinod Kone) > Slave should reclaim storage for destroyed persistent volumes. > -- > > Key: MESOS-2408 > URL: https://issues.apache.org/jira/browse/MESOS-2408 > Project: Mesos > Issue Type: Task > Components: slave >Reporter: Jie Yu >Assignee: Neil Conway > Labels: mesosphere, persistent-volumes > > At present, destroying a persistent volume does not cleanup any filesystem > space that was used by the volume (it just removes the Mesos-level metadata > about the volume). This effectively leads to a storage leak, which is bad. > For task sandboxes, we do "garbage collection" to remove the sandbox at a > later time to facilitate debugging failed tasks; for volumes, because they > are explicitly deleted and are not tied to the lifecycle of a task, removing > the associated storage immediately seems best. > To implement this safely, we'll either need to ensure that libprocess > messages are delivered in-order, or else add some extra safe-guards to ensure > that out-of-order {{CheckpointResources}} messages don't lead to accidental > data loss. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204819#comment-15204819 ] Jie Yu commented on MESOS-4827: --- Looking at the log, the slave's work_dir is under /tmp. Is that expected? A typical production setting will set the slave's work_dir to /var/lib/mesos. Also, can you share your /tmp mount options? Is it a tmpfs, and how large is it? > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4827: Affects Version/s: 0.26.0 0.27.0 0.28.0 > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0, 0.26.0, 0.27.0, 0.28.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3368) Add device support in cgroups abstraction
[ https://issues.apache.org/jira/browse/MESOS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-3368: --- Labels: cgroups (was: ) > Add device support in cgroups abstraction > - > > Key: MESOS-3368 > URL: https://issues.apache.org/jira/browse/MESOS-3368 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Abhishek Dasgupta > Labels: cgroups > > Add support for [device > cgroups|https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt] to > aid isolators controlling access to devices. > In the future, we could think about how to numerate and control access to > devices as resource or task/container policy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-3368) Add device support in cgroups abstraction
[ https://issues.apache.org/jira/browse/MESOS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-3368: --- Comment: was deleted (was: https://reviews.apache.org/r/44974/ https://reviews.apache.org/r/44975/) > Add device support in cgroups abstraction > - > > Key: MESOS-3368 > URL: https://issues.apache.org/jira/browse/MESOS-3368 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Abhishek Dasgupta > > Add support for [device > cgroups|https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt] to > aid isolators controlling access to devices. > In the future, we could think about how to numerate and control access to > devices as resource or task/container policy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3368) Add device support in cgroups abstraction
[ https://issues.apache.org/jira/browse/MESOS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-3368: --- Shepherd: Benjamin Mahler Sprint: Mesosphere Sprint 31 https://reviews.apache.org/r/44974/ https://reviews.apache.org/r/44975/ > Add device support in cgroups abstraction > - > > Key: MESOS-3368 > URL: https://issues.apache.org/jira/browse/MESOS-3368 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Abhishek Dasgupta > > Add support for [device > cgroups|https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt] to > aid isolators controlling access to devices. > In the future, we could think about how to numerate and control access to > devices as resource or task/container policy -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4625) Implement Nvidia GPU isolation w/o filesystem isolation enabled.
[ https://issues.apache.org/jira/browse/MESOS-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-4625: -- Assignee: Kevin Klues (was: Robert Todd) > Implement Nvidia GPU isolation w/o filesystem isolation enabled. > > > Key: MESOS-4625 > URL: https://issues.apache.org/jira/browse/MESOS-4625 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > The Nvidia GPU isolator will need to use the device cgroup to restrict access > to GPU resources, and will need to recover this information after agent > failover. For now this will require that the operator specifies the GPU > devices via a flag. > To handle filesystem isolation requires that we provide mechanisms for > operators to inject volumes with the necessary libraries into all containers > using GPU resources, we'll tackle this in a separate ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4625) Implement Nvidia GPU isolation w/o filesystem isolation enabled.
[ https://issues.apache.org/jira/browse/MESOS-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204751#comment-15204751 ] Kevin Klues commented on MESOS-4625: https://reviews.apache.org/r/44979/ > Implement Nvidia GPU isolation w/o filesystem isolation enabled. > > > Key: MESOS-4625 > URL: https://issues.apache.org/jira/browse/MESOS-4625 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Robert Todd > > The Nvidia GPU isolator will need to use the device cgroup to restrict access > to GPU resources, and will need to recover this information after agent > failover. For now this will require that the operator specifies the GPU > devices via a flag. > To handle filesystem isolation requires that we provide mechanisms for > operators to inject volumes with the necessary libraries into all containers > using GPU resources, we'll tackle this in a separate ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4625) Implement Nvidia GPU isolation w/o filesystem isolation enabled.
[ https://issues.apache.org/jira/browse/MESOS-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-4625: --- Sprint: Mesosphere Sprint 31 > Implement Nvidia GPU isolation w/o filesystem isolation enabled. > > > Key: MESOS-4625 > URL: https://issues.apache.org/jira/browse/MESOS-4625 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Robert Todd > > The Nvidia GPU isolator will need to use the device cgroup to restrict access > to GPU resources, and will need to recover this information after agent > failover. For now this will require that the operator specifies the GPU > devices via a flag. > To handle filesystem isolation requires that we provide mechanisms for > operators to inject volumes with the necessary libraries into all containers > using GPU resources, we'll tackle this in a separate ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4827) Destroy Docker container crashes Mesos slave
[ https://issues.apache.org/jira/browse/MESOS-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204729#comment-15204729 ] Giulio D'Ippolito commented on MESOS-4827: -- This issue is limiting the use of Mesos in a production environment. Is there any workaround to mitigate it? > Destroy Docker container crashes Mesos slave > > > Key: MESOS-4827 > URL: https://issues.apache.org/jira/browse/MESOS-4827 > Project: Mesos > Issue Type: Bug > Components: docker, framework, slave >Affects Versions: 0.25.0 >Reporter: Zhenzhong Shi >Priority: Blocker > Fix For: 0.29.0 > > > The details of this issue originally [posted on > StackOverflow|http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave]. > > To be short, the problem is when we destroy/re-deploy a docker-containerized > task, the mesos-slave got killed from time to time. It happened on our > production environment and I cann't re-produce it. > Please refer to the post on StackOverflow about the error message I got and > details of environment info. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4521) We need to ensure two Disk resources do not have the same root path in DiskInfo::Source::Mount during agent initialization.
[ https://issues.apache.org/jira/browse/MESOS-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204709#comment-15204709 ] haosdent commented on MESOS-4521: - Thanks a lot. Let me update. > We need to ensure two Disk resources do not have the same root path in > DiskInfo::Source::Mount during agent initialization. > --- > > Key: MESOS-4521 > URL: https://issues.apache.org/jira/browse/MESOS-4521 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: haosdent > Labels: mesosphere > > See summary. MOUNT type disk is supposed to be exclusive. So having two MOUNT > type disk with the same root does not make sense. > Another interesting case is that the root of one MOUNT disk is contained in > the root of another MOUNT disk. Technically, we should check that as well > and disallow it because the user might be able to write to a disk that he/she > is not supposed to write. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4521) We need to ensure two Disk resources do not have the same root path in DiskInfo::Source::Mount during agent initialization.
[ https://issues.apache.org/jira/browse/MESOS-4521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204705#comment-15204705 ] Neil Conway commented on MESOS-4521: I can do a first-pass review (now done). > We need to ensure two Disk resources do not have the same root path in > DiskInfo::Source::Mount during agent initialization. > --- > > Key: MESOS-4521 > URL: https://issues.apache.org/jira/browse/MESOS-4521 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: haosdent > Labels: mesosphere > > See summary. MOUNT type disk is supposed to be exclusive. So having two MOUNT > type disk with the same root does not make sense. > Another interesting case is that the root of one MOUNT disk is contained in > the root of another MOUNT disk. Technically, we should check that as well > and disallow it because the user might be able to write to a disk that he/she > is not supposed to write. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
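A rough sketch of the duplicate-root check being discussed (illustrative names, not the actual Mesos validation code); as the ticket notes, a complete check would also reject one MOUNT root nested inside another:
{code}
#include <string>
#include <unordered_set>
#include <vector>

// Sketch only: reject agent resources whose MOUNT disk roots collide.
bool validateMountRoots(const std::vector<std::string>& roots)
{
  std::unordered_set<std::string> seen;

  for (const std::string& root : roots) {
    if (!seen.insert(root).second) {
      return false;  // Duplicate root: MOUNT disks are meant to be exclusive.
    }
  }

  return true;
}
{code}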
[jira] [Commented] (MESOS-4802) Update leveldb patch file to suport PowerPC LE
[ https://issues.apache.org/jira/browse/MESOS-4802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204686#comment-15204686 ] Vinod Kone commented on MESOS-4802: --- yup! > Update leveldb patch file to suport PowerPC LE > -- > > Key: MESOS-4802 > URL: https://issues.apache.org/jira/browse/MESOS-4802 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Chen Zhiwei > > See: https://github.com/google/leveldb/releases/tag/v1.18 for improvements / > bug fixes. > The motivation is that leveldb 1.18 has officially supported IBM Power > (ppc64le), so this is needed by > [MESOS-4312|https://issues.apache.org/jira/browse/MESOS-4312]. > Update: Since someone updated leveldb to 1.4, so I only update the patch file > to support PowerPC LE. Because I don't think upgrade 3rdparty library > frequently is a good thing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4992) sandbox uri does not work outisde mesos http server
[ https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated MESOS-4992: --- Description: The SandBox uri of a framework does not work if i just copy paste it to the browser. For example the following sandbox uri: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse should redirect to: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 yet it fails with the message: "Failed to find slaves. Navigate to the slave's sandbox via the Mesos UI." and redirects to: http://172.17.0.1:5050/#/ It is an issue for me because im working on expanding the mesos spark ui with sandbox uri, The other option is to get the slave info and parse the json file there and get executor paths not so straightforward or elegant though. Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess this is hidden info, this is the needed piece of info to re-write the uri withotu redirection. was: The SandBox uri of a framework does not work if i just copy paste it to the browser. For example the following sandbox uri: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse should redirect to: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 yet it fails with the message: "Failed to find slaves. Navigate to the slave's sandbox via the Mesos UI." and redirects to: http://172.17.0.1:5050/#/ It is an issue for me because im working on expanding the mesos spark ui with sandbox uris and the other option is to get the slave info and parse the json file there and get executor paths not so straightforward or elegant. Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess this is hidden info, this is the needed piece of info to re-write the uri withotu redirection. > sandbox uri does not work outisde mesos http server > --- > > Key: MESOS-4992 > URL: https://issues.apache.org/jira/browse/MESOS-4992 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 0.27.1 >Reporter: Stavros Kontopoulos > > The SandBox uri of a framework does not work if i just copy paste it to the > browser. > For example the following sandbox uri: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse > should redirect to: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 > yet it fails with the message: > "Failed to find slaves. > Navigate to the slave's sandbox via the Mesos UI." 
> and redirects to: > http://172.17.0.1:5050/#/ > It is an issue for me because im working on expanding the mesos spark ui with > sandbox uri, The other option is to get the slave info and parse the json > file there and get executor paths not so straightforward or elegant though. > Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess > this is hidden info, this is the needed piece of info to re-write the uri > withotu redirection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4992) sandbox uri does not work outisde mesos http server
[ https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated MESOS-4992: --- Description: The SandBox uri of a framework does not work if i just copy paste it to the browser. For example the following sandbox uri: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse should redirect to: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 yet it fails with the message: "Failed to find slaves. Navigate to the slave's sandbox via the Mesos UI." and redirects to: http://172.17.0.1:5050/#/ It is an issue for me because im working on expanding the mesos spark ui with sandbox uri, The other option is to get the slave info and parse the json file there and get executor paths not so straightforward or elegant though. Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess this is hidden info, this is the needed piece of info to re-write the uri without redirection. was: The SandBox uri of a framework does not work if i just copy paste it to the browser. For example the following sandbox uri: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse should redirect to: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 yet it fails with the message: "Failed to find slaves. Navigate to the slave's sandbox via the Mesos UI." and redirects to: http://172.17.0.1:5050/#/ It is an issue for me because im working on expanding the mesos spark ui with sandbox uri, The other option is to get the slave info and parse the json file there and get executor paths not so straightforward or elegant though. Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess this is hidden info, this is the needed piece of info to re-write the uri withotu redirection. > sandbox uri does not work outisde mesos http server > --- > > Key: MESOS-4992 > URL: https://issues.apache.org/jira/browse/MESOS-4992 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 0.27.1 >Reporter: Stavros Kontopoulos > > The SandBox uri of a framework does not work if i just copy paste it to the > browser. > For example the following sandbox uri: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse > should redirect to: > http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 > yet it fails with the message: > "Failed to find slaves. > Navigate to the slave's sandbox via the Mesos UI." 
> and redirects to: > http://172.17.0.1:5050/#/ > It is an issue for me because im working on expanding the mesos spark ui with > sandbox uri, The other option is to get the slave info and parse the json > file there and get executor paths not so straightforward or elegant though. > Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess > this is hidden info, this is the needed piece of info to re-write the uri > without redirection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4992) sandbox uri does not work outisde mesos http server
Stavros Kontopoulos created MESOS-4992: -- Summary: sandbox uri does not work outisde mesos http server Key: MESOS-4992 URL: https://issues.apache.org/jira/browse/MESOS-4992 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 0.27.1 Reporter: Stavros Kontopoulos The SandBox uri of a framework does not work if i just copy paste it to the browser. For example the following sandbox uri: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse should redirect to: http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80 yet it fails with the message: "Failed to find slaves. Navigate to the slave's sandbox via the Mesos UI." and redirects to: http://172.17.0.1:5050/#/ It is an issue for me because im working on expanding the mesos spark ui with sandbox uris and the other option is to get the slave info and parse the json file there and get executor paths not so straightforward or elegant. Moreover i dont see the runs/container_id in the Mesos Proto Api. I guess this is hidden info, this is the needed piece of info to re-write the uri withotu redirection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4991) network isolator build should use {{AC_ARG_ENABLE}}
James Peach created MESOS-4991: -- Summary: network isolator build should use {{AC_ARG_ENABLE}} Key: MESOS-4991 URL: https://issues.apache.org/jira/browse/MESOS-4991 Project: Mesos Issue Type: Improvement Components: build Reporter: James Peach Priority: Minor As per comments in [r44945](https://reviews.apache.org/r/44945/), the network isolator should be enabled in the build using {{AC_ARG_ENABLE}} rather than {{AC_ARG_WITH}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4991) network isolator build should use AC_ARG_ENABLE
[ https://issues.apache.org/jira/browse/MESOS-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach updated MESOS-4991: --- Summary: network isolator build should use AC_ARG_ENABLE (was: network isolator build should use {{AC_ARG_ENABLE}}) > network isolator build should use AC_ARG_ENABLE > --- > > Key: MESOS-4991 > URL: https://issues.apache.org/jira/browse/MESOS-4991 > Project: Mesos > Issue Type: Improvement > Components: build >Reporter: James Peach >Priority: Minor > > As per comments in [r44945](https://reviews.apache.org/r/44945/), the network > isolator should be enabled in the build using {{AC_ARG_ENABLE}} rather than > {{AC_ARG_WITH}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4990) Add unit test for MOUNT volumes + out-of-disk-space
Neil Conway created MESOS-4990: -- Summary: Add unit test for MOUNT volumes + out-of-disk-space Key: MESOS-4990 URL: https://issues.apache.org/jira/browse/MESOS-4990 Project: Mesos Issue Type: Task Components: tests Reporter: Neil Conway Tasks that write data to MOUNT volumes should be able to rely on receiving {{write(2)}} errors when they run out of disk space. This is important behavior for several use-cases of MOUNT volumes (e.g., databases), so we should have a unit test for this. Might need to be Linux-specific for now, because we currently only use {{tmpfs}} mounts on Linux for the unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
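The write(2) behavior this ticket wants covered can be illustrated outside the Mesos test suite with a small tmpfs mount. The sketch below is only a demonstration of the expected ENOSPC semantics (it assumes Linux and root privileges, and the paths and sizes are arbitrary); it is not the proposed unit test itself.

{code}
# Rough illustration of the behaviour the proposed unit test should assert:
# a writer backed by a size-limited tmpfs mount eventually gets ENOSPC from
# write(2) instead of silently succeeding. Requires Linux and root; this is
# not the Mesos test, just a sketch of the expected semantics.
import errno
import os
import subprocess
import tempfile

def demo_enospc_on_small_tmpfs(size="1m"):
    mount_point = tempfile.mkdtemp(prefix="mount-volume-")
    subprocess.check_call(
        ["mount", "-t", "tmpfs", "-o", "size=" + size, "tmpfs", mount_point])
    try:
        fd = os.open(os.path.join(mount_point, "data"),
                     os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            while True:
                os.write(fd, b"x" * 4096)
        except OSError as e:
            # This is the guarantee MOUNT-volume users (e.g. databases) rely on.
            assert e.errno == errno.ENOSPC, e
            print("got ENOSPC as expected")
        finally:
            os.close(fd)
    finally:
        subprocess.check_call(["umount", mount_point])
        os.rmdir(mount_point)

if __name__ == "__main__":
    demo_enospc_on_small_tmpfs()
{code}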
[jira] [Assigned] (MESOS-4842) Sudden framework crash may bring down mesos master
[ https://issues.apache.org/jira/browse/MESOS-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway reassigned MESOS-4842: -- Assignee: Neil Conway > Sudden framework crash may bring down mesos master > -- > > Key: MESOS-4842 > URL: https://issues.apache.org/jira/browse/MESOS-4842 > Project: Mesos > Issue Type: Bug > Components: framework, master >Affects Versions: 0.27.1 >Reporter: Guillermo Rodriguez >Assignee: Neil Conway > > Using: > swarm 1.1.3-rc1 > CoreOS 899.9 > Mesos 0.27.1 > Marathon 0.15.3 > When swarm is stopped/restarted it may crash the mesos-master. It doesn't > happens always but frequently enough to be a pain. > If for some reason the swarm service fails, marathon will try and restart the > service. When this happens mesos may crash. I would say it happens around 50% > of the times. > This looks like a swarm/mesos problem so I will report the same error to both > lists. > This is the final lines of mesos logs: > {code} > 0303 04:32:45.327628 8 master.cpp:5202] Framework failover timeout, removing > framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at > scheduler(1)@172.31.39.68:3375 > I0303 04:32:45.327651 8 master.cpp:5933] Removing framework > b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (swarm) at > scheduler(1)@172.31.39.68:3375 > I0303 04:32:45.327847 8 master.cpp:6445] Updating the state of task > trth_download-SES.1fba45934ce4 of framework > b4149972-942d-46bf-b886-644cd3d0c6f0-0004 (latest state: TASK_FAILED, status > update state: TASK_KILLED) > I0303 04:32:45.327879 8 master.cpp:6511] Removing task > trth_download-SES.1fba45934ce4 with resources cpus(*):0.3; mem(*):450 of > framework b4149972-942d-46bf-b886-644cd3d0c6f0-0004 on slave > 2ce3e7f3-1712-4a4b-8338-04077c371a67-S7 at slave(1)@172.31.33.80:5051 > (172.31.33.80) > F0303 04:32:45.328032 8 sorter.cpp:251] Check failed: > total_.resources.contains(slaveId) > *** Check failure stack trace: *** > E0303 04:32:45.328198 13 process.cpp:1966] Failed to shutdown socket with fd > 53: Transport endpoint is not connected > @ 0x7fc22173893d google::LogMessage::Fail() > @ 0x7fc22173a76d google::LogMessage::SendToLog() > @ 0x7fc22173852c google::LogMessage::Flush() > @ 0x7fc22173b069 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fc22105773b mesos::internal::master::allocator::DRFSorter::remove() > @ 0x7fc22104623e > mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework() > @ 0x7fc2216e5681 process::ProcessManager::resume() > @ 0x7fc2216e5987 > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7fc22022ea60 (unknown) > @ 0x7fc21fa4b182 start_thread > @ 0x7fc21f77847d (unknown) > *** Aborted at 1456979565 (unix time) try "date -d @1456979565" if you are > using GNU date *** > PC: @ 0x7fc21f6b8227 (unknown) > *** SIGSEGV (@0x0) received by PID 1 (TID 0x7fc2181a2700) from PID 0; stack > trace: *** > @ 0x7fc21fa53340 (unknown) > @ 0x7fc21f6b8227 (unknown) > @ 0x7fc221740be9 google::DumpStackTraceAndExit() > @ 0x7fc22173893d google::LogMessage::Fail() > @ 0x7fc22173a76d google::LogMessage::SendToLog() > @ 0x7fc22173852c google::LogMessage::Flush() > @ 0x7fc22173b069 google::LogMessageFatal::~LogMessageFatal() > @ 0x7fc22105773b mesos::internal::master::allocator::DRFSorter::remove() > @ 0x7fc22104623e > mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework() > @ 0x7fc2216e5681 
process::ProcessManager::resume() > @ 0x7fc2216e5987 > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv > @ 0x7fc22022ea60 (unknown) > @ 0x7fc21fa4b182 start_thread > @ 0x7fc21f77847d (unknown) > {code} > If you ask me, it looks like swarm is terminated and connections are lost, but > just at that moment there is a task that was running on swarm that was > finishing. Then mesos tries to inform the already deceased framework that the > task is finished and the resources need to be recovered, but the framework is > no longer there... so it crashes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4879) Update glog patch to support PowerPC LE
[ https://issues.apache.org/jira/browse/MESOS-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204463#comment-15204463 ] Abhishek Dasgupta commented on MESOS-4879: -- I have submitted a PR for this in Glog: https://github.com/google/glog/pull/91 > Update glog patch to support PowerPC LE > --- > > Key: MESOS-4879 > URL: https://issues.apache.org/jira/browse/MESOS-4879 > Project: Mesos > Issue Type: Improvement >Reporter: Chen Zhiwei >Assignee: Chen Zhiwei > > This is a part of PowerPC LE porting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204288#comment-15204288 ] Sergey Galkin edited comment on MESOS-4977 at 3/21/16 2:17 PM: --- Traffic dump on port 5050 during deploy one Marathon applications with 100 instances. In this dump, for example failed task b4ee1f97bf56980fbc0891a83e3652a4.b7b6bf11-ef5a-11e5-89d2-6805ca32e0f0 running task b4ee1f97bf56980fbc0891a83e3652a4.fd840243-ef5a-11e5-89d2-6805ca32e0f0 (mesos.pcap.xz attached) was (Author: sergeygals): Traffic dump on port 5050 during deploy one Marathon applications with 100 instances. In this dump, for example failed task b4ee1f97bf56980fbc0891a83e3652a4.b7b6bf11-ef5a-11e5-89d2-6805ca32e0f0 running task b4ee1f97bf56980fbc0891a83e3652a4.fd840243-ef5a-11e5-89d2-6805ca32e0f0 > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > Attachments: mesos.pcap.xz > > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Galkin updated MESOS-4977: - Attachment: mesos.pcap.xz > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > Attachments: mesos.pcap.xz > > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Galkin updated MESOS-4977: - Attachment: (was: mesos.pcap.xz) > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > Attachments: mesos.pcap.xz > > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204284#comment-15204284 ] Sergey Galkin edited comment on MESOS-4977 at 3/21/16 2:17 PM: --- During creating cluster in Marathon I dumped traffic on the 5050 port and did not find differences in the requests between was (Author: sergeygals): During creating cluster in Marathon I dumped traffic on the 5050 port and did not find differences in the requests between failed b4ee1f97bf56980fbc0891a83e3652a4.b7b6bf11-ef5a-11e5-89d2-6805ca32e0f0 and running b4ee1f97bf56980fbc0891a83e3652a4.fd840243-ef5a-11e5-89d2-6805ca32e0f0 tasks > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > Attachments: mesos.pcap.xz > > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Galkin updated MESOS-4977: - Attachment: mesos.pcap.xz Traffic dump on port 5050 during deploy one Marathon applications with 100 instances. In this dump, for example failed task b4ee1f97bf56980fbc0891a83e3652a4.b7b6bf11-ef5a-11e5-89d2-6805ca32e0f0 running task b4ee1f97bf56980fbc0891a83e3652a4.fd840243-ef5a-11e5-89d2-6805ca32e0f0 > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > Attachments: mesos.pcap.xz > > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204284#comment-15204284 ] Sergey Galkin commented on MESOS-4977: -- During creating cluster in Marathon I dumped traffic on the 5050 port and did not find differences in the requests between failed b4ee1f97bf56980fbc0891a83e3652a4.b7b6bf11-ef5a-11e5-89d2-6805ca32e0f0 and running b4ee1f97bf56980fbc0891a83e3652a4.fd840243-ef5a-11e5-89d2-6805ca32e0f0 tasks > Sometime Cmd":["-c","echo 'No such file or directory'] in task. > --- > > Key: MESOS-4977 > URL: https://issues.apache.org/jira/browse/MESOS-4977 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.2 > Environment: 189 mesos slaves on Ubuntu 14.04.3 LTS >Reporter: Sergey Galkin > > mesos - 0.27.0 > marathon - 0.15.2 > I am trying to launch 1 simple docker application with nginx with 500 > instances on cluster with 189 HW nodes through Marathon > {code} > ID /1f532267a08494e3081c1acb42d273b7 > Command Unspecified > Constraints Unspecified > Dependencies Unspecified > Labels Unspecified > Resource Roles Unspecified > Container > { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "nginx", > "network": "BRIDGE", > "portMappings": [ > { > "containerPort": 80, > "hostPort": 0, > "servicePort": 1, > "protocol": "tcp" > } > ], > "privileged": false, > "parameters": [], > "forcePullImage": false > } > } > CPUs 1 > Environment Unspecified > Executor Unspecified > Health Checks > [ > { > "path": "/", > "protocol": "HTTP", > "portIndex": 0, > "gracePeriodSeconds": 300, > "intervalSeconds": 60, > "timeoutSeconds": 20, > "maxConsecutiveFailures": 3, > "ignoreHttp1xx": false > } > ] > Instances 500 > IP Address Unspecified > Memory 256 MiB > Disk Space 50 MiB > Ports 1 > Backoff Factor 1.15 > Backoff 1 seconds > Max Launch Delay 3600 seconds > URIs Unspecified > User Unspecified > {code} > Deployment stopped on Delayed, only about 360-370 of 500 instances are > successful. In the stdout in the failed mesos tasks I see "No such file or > directory" > As I see in /var/log/upstarе/docker.log with enabled debug mesos sometimes > try to start containers with strange Cmd ("Cmd":["-c","echo 'No such file or > directory'; exit 1"]) and this task failed. Sometime everything is ok > "Cmd":null and task in RUNNING state > Part of the log available in http://paste.openstack.org/show/491122/ > I successfully started 700 nginx with docker applications with 10 instances > simultaneously in this cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4977) Sometime Cmd":["-c","echo 'No such file or directory'] in task.
[ https://issues.apache.org/jira/browse/MESOS-4977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201757#comment-15201757 ] Sergey Galkin edited comment on MESOS-4977 at 3/21/16 1:52 PM: --- Logs from mesos-master 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 - with "Cmd":["-c","echo 'No such file or directory'; exit 1"] (failed) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.224059 2638 master.hpp:176] Adding task 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[19743-19743] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.224105 2638 master.cpp:3621] Launching task 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- (marathon) at scheduler-f59022ec-3650-4212-beea-38f50ce6e427@172.20.9.50:56418 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[19743-19743] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:W0318 15:14:33.154769 2656 master.cpp:4885] *Ignoring unknown exited executor* '1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0' of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:33.156250 2639 master.cpp:4789] Status update TASK_FAILED (UUID: 7c90d238-fcc4-4ede-9238-200744693449) for task 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- from slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) 1f532267a08494e3081c1acb42d273b7.e2548d07-ed1b-11e5-89d2-6805ca32e0f0 - with "Cmd":null (running) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.223767 2638 master.hpp:176] Adding task 1f532267a08494e3081c1acb42d273b7.e2548d07-ed1b-11e5-89d2-6805ca32e0f0 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[9016-9016] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.223814 2638 master.cpp:3621] Launching task 1f532267a08494e3081c1acb42d273b7.e2548d07-ed1b-11e5-89d2-6805ca32e0f0 of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- (marathon) at scheduler-f59022ec-3650-4212-beea-38f50ce6e427@172.20.9.50:56418 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[9016-9016] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:33.200388 2648 master.cpp:4789] Status update TASK_RUNNING (UUID: 563864b0-8780-4fd3-a106-041600599e2e) for task 1f532267a08494e3081c1acb42d273b7.e2548d07-ed1b-11e5-89d2-6805ca32e0f0 of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- from slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) was (Author: sergeygals): Logs from mesos-master 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 - with "Cmd":["-c","echo 'No such file or 
directory'; exit 1"] (failed) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.224059 2638 master.hpp:176] Adding task 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[19743-19743] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:I0318 15:14:27.224105 2638 master.cpp:3621] Launching task 1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0 of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- (marathon) at scheduler-f59022ec-3650-4212-beea-38f50ce6e427@172.20.9.50:56418 with resources cpus(*):1; mem(*):256; disk(*):50; ports(*):[19743-19743] on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) mesos-master.729039-comp-disk-280.invalid-user.log.INFO.20160318-151426.2595:W0318 15:14:33.154769 2656 master.cpp:4885] Ignoring unknown exited executor '1f532267a08494e3081c1acb42d273b7.e25466eb-ed1b-11e5-89d2-6805ca32e0f0' of framework 5445dbdc-c58a-4f78-aef2-9ab129a640fa- on slave 5445dbdc-c58a-4f78-aef2-9ab129a640fa-S60 at slave(1)@172.20.9.205:5051 (172.20.9.205) mesos-master.729039-com
[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters
[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203930#comment-15203930 ] Guangya Liu commented on MESOS-3548: [~rncry] The "Docker Volume Driver Isolator" may help for your storage and data locality when migrating workloads, please refer to MESOS-4355 for detail. > Investigate federations of Mesos masters > > > Key: MESOS-3548 > URL: https://issues.apache.org/jira/browse/MESOS-3548 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: federation, mesosphere, multi-dc > > In a large Mesos installation, the operator might want to ensure that even if > the Mesos masters are inaccessible or failed, new tasks can still be > scheduled (across multiple different frameworks). HA masters are only a > partial solution here: the masters might still be inaccessible due to a > correlated failure (e.g., Zookeeper misconfiguration/human error). > To support this, we could support the notion of "hierarchies" or > "federations" of Mesos masters. In a Mesos installation with 10k machines, > the operator might configure 10 Mesos masters (each of which might be HA) to > manage 1k machines each. Then an additional "meta-Master" would manage the > allocation of cluster resources to the 10 masters. Hence, the failure of any > individual master would impact 1k machines at most. The meta-master might not > have a lot of work to do: e.g., it might be limited to occasionally > reallocating cluster resources among the 10 masters, or ensuring that newly > added cluster resources are allocated among the masters as appropriate. > Hence, the failure of the meta-master would not prevent any of the individual > masters from scheduling new tasks. A single framework instance probably > wouldn't be able to use more resources than have been assigned to a single > Master, but that seems like a reasonable restriction. > This feature might also be a good fit for a multi-datacenter deployment of > Mesos: each Mesos master instance would manage a single DC. Naturally, > reducing the traffic between frameworks and the meta-master would be > important for performance reasons in a configuration like this. > Operationally, this might be simpler if Mesos processes were self-hosting > ([MESOS-3547]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4989) Design document for docker volume driver
[ https://issues.apache.org/jira/browse/MESOS-4989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203927#comment-15203927 ] Guangya Liu commented on MESOS-4989: cc [~cantbewong] > Design document for docker volume driver > > > Key: MESOS-4989 > URL: https://issues.apache.org/jira/browse/MESOS-4989 > Project: Mesos > Issue Type: Task >Reporter: Guangya Liu >Assignee: Guangya Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4989) Design document for docker volume driver
Guangya Liu created MESOS-4989: -- Summary: Design document for docker volume driver Key: MESOS-4989 URL: https://issues.apache.org/jira/browse/MESOS-4989 Project: Mesos Issue Type: Task Reporter: Guangya Liu Assignee: Guangya Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4355) Implement isolator for Docker volume
[ https://issues.apache.org/jira/browse/MESOS-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-4355: --- Epic Name: Docker Volume Driver > Implement isolator for Docker volume > > > Key: MESOS-4355 > URL: https://issues.apache.org/jira/browse/MESOS-4355 > Project: Mesos > Issue Type: Epic > Components: docker, isolation >Reporter: Qian Zhang >Assignee: Guangya Liu > Labels: mesosphere > > In Docker, user can create a volume with Docker CLI, e.g., {{docker volume > create --name my-volume}}, we need to implement an isolator to make the > container launched by MesosContainerizer can use such volume. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4355) Implement isolator for Docker volume
[ https://issues.apache.org/jira/browse/MESOS-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-4355: --- Issue Type: Epic (was: Improvement) > Implement isolator for Docker volume > > > Key: MESOS-4355 > URL: https://issues.apache.org/jira/browse/MESOS-4355 > Project: Mesos > Issue Type: Epic > Components: docker, isolation >Reporter: Qian Zhang >Assignee: Guangya Liu > Labels: mesosphere > > In Docker, user can create a volume with Docker CLI, e.g., {{docker volume > create --name my-volume}}, we need to implement an isolator to make the > container launched by MesosContainerizer can use such volume. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3548) Investigate federations of Mesos masters
[ https://issues.apache.org/jira/browse/MESOS-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203909#comment-15203909 ] Aaron Carey commented on MESOS-3548: We're also very interested in this, we have datacentres distributed globally and are experimenting with ways to move workloads from one region to another during peak periods. This raises big questions with regards to storage and data locality for us, but having Mesos support multiple datacentres would be a huge step for us! > Investigate federations of Mesos masters > > > Key: MESOS-3548 > URL: https://issues.apache.org/jira/browse/MESOS-3548 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: federation, mesosphere, multi-dc > > In a large Mesos installation, the operator might want to ensure that even if > the Mesos masters are inaccessible or failed, new tasks can still be > scheduled (across multiple different frameworks). HA masters are only a > partial solution here: the masters might still be inaccessible due to a > correlated failure (e.g., Zookeeper misconfiguration/human error). > To support this, we could support the notion of "hierarchies" or > "federations" of Mesos masters. In a Mesos installation with 10k machines, > the operator might configure 10 Mesos masters (each of which might be HA) to > manage 1k machines each. Then an additional "meta-Master" would manage the > allocation of cluster resources to the 10 masters. Hence, the failure of any > individual master would impact 1k machines at most. The meta-master might not > have a lot of work to do: e.g., it might be limited to occasionally > reallocating cluster resources among the 10 masters, or ensuring that newly > added cluster resources are allocated among the masters as appropriate. > Hence, the failure of the meta-master would not prevent any of the individual > masters from scheduling new tasks. A single framework instance probably > wouldn't be able to use more resources than have been assigned to a single > Master, but that seems like a reasonable restriction. > This feature might also be a good fit for a multi-datacenter deployment of > Mesos: each Mesos master instance would manage a single DC. Naturally, > reducing the traffic between frameworks and the meta-master would be > important for performance reasons in a configuration like this. > Operationally, this might be simpler if Mesos processes were self-hosting > ([MESOS-3547]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203887#comment-15203887 ] Fan Du edited comment on MESOS-4981 at 3/21/16 8:34 AM: [~anandmazumdar] I happened to take a deep look at this; here is the fix that works on my env. Please review: https://reviews.apache.org/r/45096 https://reviews.apache.org/r/45097 was (Author: fan.du): [~anandmazumdar] I happened to look a deep look at this, here is the fix works on my env. Please review: https://reviews.apache.org/r/45094/ > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. Either, we should think about adding new counter(s) for > {{Subscribe}} calls to the master for both PID/HTTP frameworks or modify the > existing code to correctly increment the counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags
[ https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203895#comment-15203895 ] Jay Guo commented on MESOS-3781: As I pick up this ticket, I just want to confirm the actual requirements. Are we going to duplicate the following flags with the keyword {{agent}}? In {{src/master/flags.hpp}} * slave_reregister_timeout * recovery_slave_removal_limit * slave_removal_rate_limit * authenticate_slaves * slave_ping_timeout * max_slave_ping_timeouts * max_executors_per_slave In {{src/slave/flags.hpp}} * slave_subsystems [~darroyo] > Replace Master/Slave Terminology Phase I - Add duplicate agent flags > - > > Key: MESOS-3781 > URL: https://issues.apache.org/jira/browse/MESOS-3781 > Project: Mesos > Issue Type: Task >Reporter: Diana Arroyo >Assignee: Jay Guo > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203887#comment-15203887 ] Fan Du edited comment on MESOS-4981 at 3/21/16 8:19 AM: [~anandmazumdar] I happened to take a deep look at this; here is the fix that works on my env. Please review: https://reviews.apache.org/r/45094/ was (Author: fan.du): [~anandmazumdar] I happened to look a deep look at this, here is fix works on my env. Please review: https://reviews.apache.org/r/45094/ > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. Either, we should think about adding new counter(s) for > {{Subscribe}} calls to the master for both PID/HTTP frameworks or modify the > existing code to correctly increment the counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203887#comment-15203887 ] Fan Du commented on MESOS-4981: --- [~anandmazumdar] I happened to take a deep look at this; here is the fix that works on my env. Please review: https://reviews.apache.org/r/45094/ > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. Either, we should think about adding new counter(s) for > {{Subscribe}} calls to the master for both PID/HTTP frameworks or modify the > existing code to correctly increment the counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
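As a quick way to observe the bug (and later to verify a fix), the two counters can be read from the master's {{/metrics/snapshot}} endpoint before and after a driver-based framework registers. A minimal sketch, assuming a master at localhost:5050 and using the counter names quoted in the ticket, might look like this:

{code}
# Minimal sketch for checking the counters discussed in this ticket before
# and after registering a framework through the scheduler driver. The master
# address is an assumption; the counter names are taken from the ticket.
import json
import urllib.request

MASTER = "http://localhost:5050"
COUNTERS = [
    "master/messages_register_framework",
    "master/messages_reregister_framework",
]

def snapshot():
    with urllib.request.urlopen(MASTER + "/metrics/snapshot") as resp:
        return json.load(resp)

if __name__ == "__main__":
    metrics = snapshot()
    for name in COUNTERS:
        # With the bug present, these stay at 0 for PID-based (driver)
        # frameworks even after a successful (re-)registration.
        print(name, "=", metrics.get(name, "<missing>"))
{code}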
[jira] [Assigned] (MESOS-4981) Framework (re-)register metric counters broken for calls made via scheduler driver
[ https://issues.apache.org/jira/browse/MESOS-4981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Du reassigned MESOS-4981: - Assignee: Fan Du > Framework (re-)register metric counters broken for calls made via scheduler > driver > -- > > Key: MESOS-4981 > URL: https://issues.apache.org/jira/browse/MESOS-4981 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Anand Mazumdar >Assignee: Fan Du > Labels: mesosphere > > The counters {{master/messages_register_framework}} and > {{master/messages_reregister_framework}} are no longer being incremented > after the scheduler driver started sending {{Call}} messages to the master in > Mesos 0.23. Either, we should think about adding new counter(s) for > {{Subscribe}} calls to the master for both PID/HTTP frameworks or modify the > existing code to correctly increment the counters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags
[ https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Guo reassigned MESOS-3781: -- Assignee: Jay Guo > Replace Master/Slave Terminology Phase I - Add duplicate agent flags > - > > Key: MESOS-3781 > URL: https://issues.apache.org/jira/browse/MESOS-3781 > Project: Mesos > Issue Type: Task >Reporter: Diana Arroyo >Assignee: Jay Guo > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3778) Replace Master/Slave Terminology Phase I - Add duplicate HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203849#comment-15203849 ] zhou xing commented on MESOS-3778: -- We did some investigation on this ticket; changing the endpoints to use 'Agent' seems to involve changes in the following 3 aspects: - change path prefix autocompletion: e.g. the url {{/state}} is autocompleted as {{/slave(1)/state}}, which should be {{/agent(1)/state}}, where *agent* is inferred from the process pid - change endpoints that contain slave: e.g. the {{/master/slaves}} endpoint needs to be modified to {{/master/agents}} - change the fields of the endpoints that contain slave: e.g. {{/master/create-volumes}} requires a parameter named {{slaveId}}, which needs to be changed to {{agentId}} We are planning to submit 2 independent patches for #2 and #3, which duplicate the current endpoints/fields with no compatibility issue, whereas #1 *REPLACES* the current path prefix, which is going to compromise compatibility. > Replace Master/Slave Terminology Phase I - Add duplicate HTTP endpoints > --- > > Key: MESOS-3778 > URL: https://issues.apache.org/jira/browse/MESOS-3778 > Project: Mesos > Issue Type: Task >Reporter: Diana Arroyo >Assignee: zhou xing > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
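To illustrate why duplicating the endpoints (#2) is compatibility-friendly while replacing the path prefix (#1) is not: a client can probe the proposed agent-named endpoint and fall back to the existing slave-named one. In the sketch below, {{/master/agents}} and the {{agent_id}} field are the duplicates proposed in this ticket, i.e. assumptions rather than a shipped API, while {{/master/slaves}} exists today.

{code}
# Sketch of how a client could cope with the transition described above:
# try the proposed /master/agents endpoint first and fall back to the
# existing /master/slaves one, normalising the id field names. The new
# endpoint and field names are the ones proposed in this ticket, so treat
# them as assumptions, not a shipped API.
import json
import urllib.error
import urllib.request

MASTER = "http://localhost:5050"  # assumed master address

def list_agents():
    for path in ("/master/agents", "/master/slaves"):
        try:
            with urllib.request.urlopen(MASTER + path) as resp:
                data = json.load(resp)
        except urllib.error.HTTPError as e:
            if e.code == 404:
                continue  # endpoint not present on this master version
            raise
        # The top-level key mirrors the endpoint name ("slaves" today).
        entries = data.get("agents", data.get("slaves", []))
        return [
            {"id": entry.get("agent_id", entry.get("id")),
             "hostname": entry.get("hostname")}
            for entry in entries
        ]
    return []

if __name__ == "__main__":
    for agent in list_agents():
        print(agent["id"], agent["hostname"])
{code}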