[jira] [Commented] (MESOS-3286) Revocable metrics information are missed for slave node
[ https://issues.apache.org/jira/browse/MESOS-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700748#comment-14700748 ] Yong Qiao Wang commented on MESOS-3286: --- Appending the related review request: https://reviews.apache.org/r/37562/ Revocable metrics information are missed for slave node --- Key: MESOS-3286 URL: https://issues.apache.org/jira/browse/MESOS-3286 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor In MESOS-3278, revocable metrics information for the master node was added to the monitoring doc, but the same information is still missing for the slave node; this new patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3279) mesos doesn't compatibable docker 1.8.1
Stream Liu created MESOS-3279: - Summary: mesos doesn't compatibable docker 1.8.1 Key: MESOS-3279 URL: https://issues.apache.org/jira/browse/MESOS-3279 Project: Mesos Issue Type: Bug Reporter: Stream Liu The reported error is: Failed to create a containerizer: Could not create DockerContainerizer: Insufficient version of Docker! Please upgrade to >= 1.0.0. This happens with Docker 1.8.1, but it works with Docker 1.6.2. I am using Mesos 0.22.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
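The ticket does not identify a root cause, but the symptom (Docker 1.8.1 rejected against a required minimum of 1.0.0) is consistent with the version string failing to parse rather than the comparison itself being wrong. A minimal sketch of format-tolerant version parsing, in Python for brevity; `parse_docker_version` is a hypothetical name, not the actual Mesos parser:

```python
import re

def parse_docker_version(output):
    """Extract a (major, minor, patch) tuple from `docker version` output.

    Tolerant of layout changes: scans for the first x.y.z triple instead
    of assuming a fixed line format (hypothetical sketch, not the Mesos
    implementation).
    """
    match = re.search(r'(\d+)\.(\d+)\.(\d+)', output)
    if match is None:
        raise ValueError("could not find a version in: %r" % output)
    return tuple(int(part) for part in match.groups())

# Both an older single-line format and a newer multi-line format parse.
old_style = "Docker version 1.6.2, build 7c8fca2"
new_style = "Client:\n Version:      1.8.1\n API version:  1.20"
assert parse_docker_version(old_style) == (1, 6, 2)
assert parse_docker_version(new_style) == (1, 8, 1)
assert parse_docker_version(new_style) >= (1, 0, 0)  # not "insufficient"
```

Tuple comparison makes the minimum-version check correct without any string comparison pitfalls.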
[jira] [Commented] (MESOS-2516) Move allocation-related types to mesos::master namespace
[ https://issues.apache.org/jira/browse/MESOS-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699225#comment-14699225 ] Alexander Rukletsov commented on MESOS-2516: Yep, sending a mail to the list is a good option. Move allocation-related types to mesos::master namespace Key: MESOS-2516 URL: https://issues.apache.org/jira/browse/MESOS-2516 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Alexander Rukletsov Assignee: José Guilherme Vanz Priority: Minor Labels: easyfix, newbie {{Allocator}}, {{Sorter}} and {{Comparator}} types live in the {{master::allocator}} namespace. This is not consistent with the rest of the codebase: {{Isolator}}, {{Fetcher}} and {{Containerizer}} all live in the {{slave}} namespace. The {{allocator}} namespace should be killed for consistency. Since sorters are poorly named, they should be renamed (or namespaced) prior to this change in order not to pollute the {{master}} namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1791) Introduce Master / Offer Resource Reservations aka Quota
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1791: -- Assignee: Alexander Rukletsov Introduce Master / Offer Resource Reservations aka Quota Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Components: allocation, master, replicated log Reporter: Tom Arnfeld Assignee: Alexander Rukletsov Labels: mesosphere Currently Mesos supports the ability to reserve resources (for a given role) on a per-slave basis, as introduced in MESOS-505. This allows you to almost statically partition off a set of resources on a set of machines, to guarantee that certain types of frameworks get some resources. This is very useful, though it would also be valuable to control these reservations through the master (instead of per-slave) for the case where I don't care which nodes I get, as long as I get X CPU and Y RAM, or Z sets of (X, Y). I'm not sure what structure this could take, but apparently it has already been discussed. Would this be a CLI flag? Could there be an (authenticated) web interface to control these reservations? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3284) JSON representation of Protobuf should use base64 encoding for 'bytes' fields.
[ https://issues.apache.org/jira/browse/MESOS-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3284: --- Sprint: Twitter Mesos Q3 Sprint 3 Story Points: 3 JSON representation of Protobuf should use base64 encoding for 'bytes' fields. -- Key: MESOS-3284 URL: https://issues.apache.org/jira/browse/MESOS-3284 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: twitter Currently we encode 'bytes' fields as UTF-8 strings, which is lossy for binary data due to invalid byte sequences! In order to encode binary data in a lossless fashion, we can encode 'bytes' fields in base64. Note that this is also how proto3 does its encoding (see [here|https://developers.google.com/protocol-buffers/docs/proto3?hl=en#json]), so this would make migration easier as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
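Why UTF-8 is lossy here: arbitrary byte strings can contain sequences that are not valid UTF-8, and any replacement strategy destroys the original bytes, while base64 round-trips everything. A quick illustration (Python, since its standard library keeps the sketch short; stout itself is C++):

```python
import base64

# Arbitrary binary data, including byte sequences invalid in UTF-8.
payload = b'\x80\xff\x00binary'

# Decoding as UTF-8 with replacement loses information permanently.
lossy = payload.decode('utf-8', errors='replace').encode('utf-8')
assert lossy != payload

# Base64 is a pure ASCII encoding, so it round-trips any byte string.
encoded = base64.b64encode(payload).decode('ascii')
assert base64.b64decode(encoded) == payload
```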
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700428#comment-14700428 ] Vinod Kone commented on MESOS-3273: --- Review for the first problem: https://reviews.apache.org/r/37559/ EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 
26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 --credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file 
'/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status I0813 19:55:17.185972 26100 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix I0813
[jira] [Updated] (MESOS-3073) Introduce HTTP endpoints for Quota
[ https://issues.apache.org/jira/browse/MESOS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3073: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Introduce HTTP endpoints for Quota -- Key: MESOS-3073 URL: https://issues.apache.org/jira/browse/MESOS-3073 Project: Mesos Issue Type: Improvement Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We need to implement the HTTP endpoints for Quota as outlined in the Design Doc: (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2949) Design generalized Authorizer interface
[ https://issues.apache.org/jira/browse/MESOS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2949: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Design generalized Authorizer interface --- Key: MESOS-2949 URL: https://issues.apache.org/jira/browse/MESOS-2949 Project: Mesos Issue Type: Task Components: master, security Reporter: Alexander Rojas Assignee: Alexander Rojas Labels: acl, mesosphere, security As mentioned in MESOS-2948 the current {{mesos::Authorizer}} interface is rather inflexible if new _Actions_ or _Objects_ need to be added. A new API needs to be designed in a way that allows for arbitrary _Actions_ and _Objects_ to be added to the authorization mechanism without having to recompile mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3223) Implement token manager for docker registry
[ https://issues.apache.org/jira/browse/MESOS-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3223: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Implement token manager for docker registry --- Key: MESOS-3223 URL: https://issues.apache.org/jira/browse/MESOS-3223 Project: Mesos Issue Type: Task Components: containerization, docker Environment: linux Reporter: Jojy Varghese Assignee: Jojy Varghese Labels: mesosphere Implement the following:
- A component that fetches the JSON web authorization token from a given registry.
- Caches the token keyed on registry, service and scope.
- Validates the cache for expiry date.
Nice to have:
- Cache gets pruned as tokens age beyond their expiration time.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
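The listed behavior can be sketched as a small cache with an injected clock (Python sketch; `TokenCache` and its methods are hypothetical names, not the proposed Mesos component):

```python
import time

class TokenCache:
    """Sketch of a registry token cache keyed on (registry, service, scope),
    pruning entries past their expiry (hypothetical names, not a Mesos API)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._tokens = {}  # (registry, service, scope) -> (token, expires_at)

    def put(self, registry, service, scope, token, expires_at):
        self._tokens[(registry, service, scope)] = (token, expires_at)

    def get(self, registry, service, scope):
        """Return a cached token, or None if absent or expired.
        Expired entries are pruned on access."""
        key = (registry, service, scope)
        entry = self._tokens.get(key)
        if entry is None:
            return None
        token, expires_at = entry
        if expires_at <= self._clock():
            del self._tokens[key]  # prune the aged-out token
            return None
        return token

# An injected clock makes expiry deterministic.
now = [1000.0]
cache = TokenCache(clock=lambda: now[0])
cache.put("registry-1.docker.io", "registry.docker.io",
          "repository:lib/ubuntu:pull", "jwt-abc", expires_at=1060.0)
assert cache.get("registry-1.docker.io", "registry.docker.io",
                 "repository:lib/ubuntu:pull") == "jwt-abc"
now[0] = 2000.0  # past expiry: the entry is pruned
assert cache.get("registry-1.docker.io", "registry.docker.io",
                 "repository:lib/ubuntu:pull") is None
```

Injecting the clock is what makes the "validates the cache for expiry date" behavior unit-testable without real waiting.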
[jira] [Updated] (MESOS-2937) Initial design document for Quota support in Allocator.
[ https://issues.apache.org/jira/browse/MESOS-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2937: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Initial design document for Quota support in Allocator. --- Key: MESOS-2937 URL: https://issues.apache.org/jira/browse/MESOS-2937 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Labels: mesosphere Create a design document for the Quota feature support in the built-in Hierarchical DRF allocator to be shared with the Mesos community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3092) Configure Jenkins to run Docker tests
[ https://issues.apache.org/jira/browse/MESOS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3092: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Configure Jenkins to run Docker tests - Key: MESOS-3092 URL: https://issues.apache.org/jira/browse/MESOS-3092 Project: Mesos Issue Type: Improvement Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere Add a Jenkins job to run the Docker tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3021) Implement Docker Image Provisioner Reference Store
[ https://issues.apache.org/jira/browse/MESOS-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3021: --- Sprint: Mesosphere Sprint 14, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 14, Mesosphere Sprint 16) Implement Docker Image Provisioner Reference Store -- Key: MESOS-3021 URL: https://issues.apache.org/jira/browse/MESOS-3021 Project: Mesos Issue Type: Improvement Reporter: Lily Chen Assignee: Lily Chen Labels: mesosphere Create a comprehensive store to look up the image layer ID associated with an image and tag. Implement adding, removing, saving, and updating images and their associated tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3062) Add authorization for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3062: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add authorization for dynamic reservation - Key: MESOS-3062 URL: https://issues.apache.org/jira/browse/MESOS-3062 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Dynamic reservations should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Reserve}} and {{Unreserve}} into the ACL.
{code}
message Reserve {
  // Subjects.
  required Entity principals = 1;

  // Objects. MVP: Only possible values = ANY, NONE.
  required Entity resources = 2;
}

message Unreserve {
  // Subjects.
  required Entity principals = 1;

  // Objects.
  required Entity reserver_principals = 2;
}
{code}
When a framework/operator reserves resources, reserve ACLs are checked to see if the framework ({{FrameworkInfo.principal}}) or the operator ({{Credential.user}}) is authorized to reserve the specified resources. If not authorized, the reserve operation is rejected. When a framework/operator unreserves resources, unreserve ACLs are checked to see if the framework ({{FrameworkInfo.principal}}) or the operator ({{Credential.user}}) is authorized to unreserve the resources reserved by a framework or operator ({{Resource.ReservationInfo.principal}}). If not authorized, the unreserve operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
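The authorization flow described above, reduced to a sketch (Python for brevity; the real authorizer operates on the ACL protobufs, and these helper names are hypothetical):

```python
ANY, NONE = "ANY", "NONE"

def authorized_to_reserve(acls, principal):
    """First matching Reserve ACL wins; its resources entity is ANY or
    NONE in the MVP described above (sketch, not the Mesos authorizer)."""
    for acl in acls:
        if principal in acl["principals"]:
            return acl["resources"] == ANY
    return False

def authorized_to_unreserve(acls, principal, reserver_principal):
    """An Unreserve ACL lets `principal` release resources reserved by
    any of the listed reserver principals."""
    for acl in acls:
        if principal in acl["principals"]:
            return reserver_principal in acl["reserver_principals"]
    return False

reserve_acls = [{"principals": {"test-principal"}, "resources": ANY}]
unreserve_acls = [{"principals": {"ops"},
                   "reserver_principals": {"test-principal"}}]

assert authorized_to_reserve(reserve_acls, "test-principal")
assert not authorized_to_reserve(reserve_acls, "stranger")
assert authorized_to_unreserve(unreserve_acls, "ops", "test-principal")
assert not authorized_to_unreserve(unreserve_acls, "ops", "other")
```

The unreserve check is the interesting half: it compares the requester's principal against the principal recorded in the reservation, mirroring {{Resource.ReservationInfo.principal}} above.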
[jira] [Updated] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.
[ https://issues.apache.org/jira/browse/MESOS-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2066: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add optional 'Unavailability' to resource offers to provide maintenance awareness. -- Key: MESOS-2066 URL: https://issues.apache.org/jira/browse/MESOS-2066 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere In order to inform frameworks about upcoming maintenance on offered resources, per MESOS-1474, we'd like to add optional 'Unavailability' information to offers:
{code}
message Interval {
  optional double start = 1;     // Time, in seconds since the Epoch.
  optional double duration = 2;  // Time, in seconds.
}

message Offer {
  // Existing fields ...

  // Signifies that the resources in this Offer are part of a planned
  // maintenance schedule in the specified window. Any tasks launched
  // using these resources may be killed when the window arrives.
  // This field gives additional information about the maintenance.
  // The maintenance may not necessarily start exactly at this interval,
  // nor last for exactly the duration of this interval.
  optional Interval unavailability = 9;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
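How a framework might consume the proposed field: compare a task's expected completion time against the window's start. A hedged sketch (Python; `task_at_risk` is a hypothetical helper, and a real scheduler would also weigh the window's duration and the cost of restarting elsewhere):

```python
def task_at_risk(launch_time, expected_runtime, unavailability):
    """True if a task launched now may still be running when the
    maintenance window begins (sketch; field names mirror the proposed
    Interval message: start and duration, both in seconds)."""
    start = unavailability["start"]
    task_end = launch_time + expected_runtime
    return task_end > start

window = {"start": 5000.0, "duration": 600.0}

# A 300s task launched at t=4800 would still be running at t=5000.
assert task_at_risk(4800.0, 300.0, window)
# A 100s task finishes at t=4900, before the window starts.
assert not task_at_risk(4800.0, 100.0, window)
```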
[jira] [Updated] (MESOS-2061) Add InverseOffer protobuf message.
[ https://issues.apache.org/jira/browse/MESOS-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2061: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add InverseOffer protobuf message. -- Key: MESOS-2061 URL: https://issues.apache.org/jira/browse/MESOS-2061 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
{code}
/**
 * A request to return some resources occupied by a framework.
 */
message InverseOffer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;

  // A list of resources being requested back from the framework.
  repeated Resource resources = 3;

  // Specified if the resources need to be released from a particular slave.
  optional SlaveID slave_id = 4;

  // The resources in this InverseOffer are part of a planned maintenance
  // schedule in the specified window. Any tasks running using these
  // resources may be killed when the window arrives.
  optional Interval unavailability = 5;
}
{code}
This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2200) bogus docker images result in bad error message to scheduler
[ https://issues.apache.org/jira/browse/MESOS-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2200: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) bogus docker images result in bad error message to scheduler Key: MESOS-2200 URL: https://issues.apache.org/jira/browse/MESOS-2200 Project: Mesos Issue Type: Bug Components: containerization, docker Reporter: Jay Buffington Assignee: Joerg Schad Labels: mesosphere When a scheduler specifies a bogus image in ContainerInfo mesos doesn't tell the scheduler that the docker pull failed or why. This error is logged in the mesos-slave log, but it isn't given to the scheduler (as far as I can tell): {noformat} E1218 23:50:55.406230 8123 slave.cpp:2730] Container '8f70784c-3e40-4072-9ca2-9daed23f15ff' for executor 'thermos-1418946354013-xxx-xxx-curl-0-f500cc41-dd0a-4338-8cbc-d631cb588bb1' of framework '20140522-213145-1749004561-5050-29512-' failed to start: Failed to 'docker pull docker-registry.example.com/doesntexist/hello1.1:latest': exit status = exited with status 1 stderr = 2014/12/18 23:50:55 Error: image doesntexist/hello1.1 not found {noformat} If the docker image is not in the registry, the scheduler should give the user an error message. If docker pull failed because of networking issues, it should be retried. Mesos should give the scheduler enough information to be able to make that decision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3227) Implement image chroot support into command executor
[ https://issues.apache.org/jira/browse/MESOS-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3227: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Implement image chroot support into command executor Key: MESOS-3227 URL: https://issues.apache.org/jira/browse/MESOS-3227 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2455) Add operator endpoint to destroy persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2455: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add operator endpoint to destroy persistent volumes. Key: MESOS-2455 URL: https://issues.apache.org/jira/browse/MESOS-2455 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Michael Park Priority: Critical Labels: mesosphere Persistent volumes will not be released automatically. So we probably need an endpoint for operators to forcefully release persistent volumes. We probably need to add principal to Persistence struct and use ACLs to control who can release what. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3251) http::get API evaluates host wrongly
[ https://issues.apache.org/jira/browse/MESOS-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3251: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) http::get API evaluates host wrongly -- Key: MESOS-3251 URL: https://issues.apache.org/jira/browse/MESOS-3251 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Jojy Varghese Assignee: Jojy Varghese Labels: mesosphere Currently libprocess http API sets the Host header field from the peer socket address (IP:port). The problem is that socket address might not be right HTTP server and might be just a proxy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3042) Master/Allocator does not send InverseOffers to resources to be maintained
[ https://issues.apache.org/jira/browse/MESOS-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3042: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Master/Allocator does not send InverseOffers to resources to be maintained -- Key: MESOS-3042 URL: https://issues.apache.org/jira/browse/MESOS-3042 Project: Mesos Issue Type: Task Components: allocation, master Reporter: Joseph Wu Assignee: Joris Van Remoortere Labels: mesosphere Offers are currently sent from the master/allocator to frameworks via ResourceOffersMessage's. InverseOffers, which are roughly equivalent to negative Offers, can be sent in the same package. In src/messages/messages.proto:
{code}
message ResourceOffersMessage {
  repeated Offer offers = 1;
  repeated string pids = 2;

  // New field with InverseOffers.
  repeated InverseOffer inverseOffers = 3;
}
{code}
Sent InverseOffers can be tracked in the master's local state, i.e. in src/master/master.hpp:
{code}
struct Slave {
  ... // Existing fields.

  // Active InverseOffers on this slave.
  // Similar pattern to the offers field.
  hashset<InverseOffer*> inverseOffers;
}
{code}
One actor (master or allocator) should populate the new InverseOffers field.
* In the master (src/master/master.cpp):
** Master::offer is where the ResourceOffersMessage and Offer object is constructed.
** The same method could also check for maintenance and send InverseOffers.
* In the allocator (src/master/allocator/mesos/hierarchical.hpp):
** HierarchicalAllocatorProcess::allocate is where slave resources are aggregated and sent off to the frameworks.
** InverseOffers (i.e. negative resources) allocation could be calculated in this method.
** A change to Master::offer (i.e. the offerCallback) may be necessary to account for the negative resources.
Possible test(s):
* InverseOfferTest
** Start master, slave, framework.
** Accept resource offer, start task.
** Set maintenance schedule to the future.
** Check that InverseOffer(s) are sent to the framework.
** Decline InverseOffer.
** Check that more InverseOffer(s) are sent.
** Accept InverseOffer.
** Check that more InverseOffer(s) are sent.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
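The allocation step described above (emit an InverseOffer for every framework holding resources on a slave scheduled for maintenance) can be sketched as follows (Python, with plain dicts standing in for the protobufs; all names are hypothetical):

```python
def build_inverse_offers(allocations, maintenance_schedule):
    """For each framework holding resources on a slave scheduled for
    maintenance, emit an InverseOffer-like record asking those resources
    back (sketch of the allocator step described above, not Mesos code)."""
    inverse_offers = []
    for (framework_id, slave_id), resources in allocations.items():
        window = maintenance_schedule.get(slave_id)
        if window is not None:
            inverse_offers.append({
                "framework_id": framework_id,
                "slave_id": slave_id,
                "resources": resources,
                "unavailability": window,
            })
    return inverse_offers

# fw-1 holds resources on two slaves; only slave-A is scheduled for
# maintenance, so only slave-A's resources are asked back.
allocations = {
    ("fw-1", "slave-A"): {"cpus": 4, "mem": 1024},
    ("fw-1", "slave-B"): {"cpus": 2, "mem": 512},
}
schedule = {"slave-A": {"start": 5000.0, "duration": 600.0}}

offers = build_inverse_offers(allocations, schedule)
assert len(offers) == 1
assert offers[0]["slave_id"] == "slave-A"
```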
[jira] [Updated] (MESOS-3066) Replicated registry does not have a representation of maintenance schedules
[ https://issues.apache.org/jira/browse/MESOS-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3066: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Replicated registry does not have a representation of maintenance schedules --- Key: MESOS-3066 URL: https://issues.apache.org/jira/browse/MESOS-3066 Project: Mesos Issue Type: Task Components: master, replicated log Reporter: Joseph Wu Assignee: Joseph Wu Labels: mesosphere In order to persist maintenance schedules across failovers of the master, the schedule information must be kept in the replicated registry. This means adding an additional message in the Registry protobuf in src/master/registry.proto. The status of each individual slave's maintenance will also be persisted in this way.
{code}
message Maintenance {
  message HostStatus {
    required string hostname = 1;

    // True if the slave is deactivated for maintenance.
    // False if the slave is draining in preparation for maintenance.
    required bool is_down = 2;  // Or an enum.
  }

  message Schedule {
    // The set of affected slave(s).
    repeated HostStatus hosts = 1;

    // Interval in which this set of slaves is expected to be down for.
    optional Unavailability interval = 2;
  }

  message Schedules {
    repeated Schedule schedules = 1;
  }

  optional Schedules schedules = 1;
}
{code}
Note: There can be multiple SlaveID's attached to a single hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
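Recovering per-host status from the proposed Schedules layout would look roughly like this after registry recovery (Python sketch with dicts standing in for the protobuf messages; `host_statuses` is a hypothetical name):

```python
def host_statuses(schedules):
    """Flatten the proposed Schedules layout into a hostname -> status map
    (sketch; mirrors the HostStatus/Schedule structure above). Keying by
    hostname matches the note that multiple SlaveIDs can share a host."""
    statuses = {}
    for schedule in schedules:
        for host in schedule["hosts"]:
            statuses[host["hostname"]] = {
                "is_down": host["is_down"],
                "interval": schedule.get("interval"),
            }
    return statuses

schedules = [{
    "hosts": [{"hostname": "rack1-node7", "is_down": False}],  # draining
    "interval": {"start": 5000.0, "duration": 600.0},
}]
statuses = host_statuses(schedules)
assert statuses["rack1-node7"]["is_down"] is False
```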
[jira] [Updated] (MESOS-3064) Add 'principal' field to 'Resource.DiskInfo'
[ https://issues.apache.org/jira/browse/MESOS-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3064: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add 'principal' field to 'Resource.DiskInfo' Key: MESOS-3064 URL: https://issues.apache.org/jira/browse/MESOS-3064 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere In order to support authorization for persistent volumes, we should add the {{principal}} to {{Resource.DiskInfo}}, analogous to {{Resource.ReservationInfo.principal}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2964) libprocess io does not support peek()
[ https://issues.apache.org/jira/browse/MESOS-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2964: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) libprocess io does not support peek() - Key: MESOS-2964 URL: https://issues.apache.org/jira/browse/MESOS-2964 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Artem Harutyunyan Assignee: Artem Harutyunyan Priority: Minor Labels: beginner, mesosphere, newbie Finally, I so wish we could just do:
{code}
io::peek(request->socket, 6)
  .then([request](const string& data) {
    // Comment about the rules ...
    if (data.length() < 2) {
      // Rule 1.
    } else if (...) {
      // Rule 2.
    } else if (...) {
      // Rule 3.
    }

    if (ssl) {
      accept_SSL_callback(request);
    } else {
      ...;
    }
  });
{code}
from: https://reviews.apache.org/r/31207/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
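The peek semantics being requested already exist at the socket layer as MSG_PEEK: bytes are returned without being consumed, so the normal read path still sees them afterwards. That is what lets the snippet above sniff the first bytes to choose SSL vs. plain handling. A demonstration (Python, using a local socket pair):

```python
import socket

# peek(n) reads up to n bytes without consuming them: a later recv still
# sees the same bytes. This is the kernel-level MSG_PEEK behavior that an
# io::peek in libprocess would wrap.
a, b = socket.socketpair()
a.sendall(b"\x16\x03\x01ABC")  # 0x16 0x03 0x01: start of a TLS handshake record

peeked = b.recv(6, socket.MSG_PEEK)
assert peeked == b"\x16\x03\x01ABC"

# The data is still available for the normal read path.
consumed = b.recv(6)
assert consumed == peeked

a.close()
b.close()
```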
[jira] [Updated] (MESOS-3086) Create cgroups TasksKiller for non freeze subsystems.
[ https://issues.apache.org/jira/browse/MESOS-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3086: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Create cgroups TasksKiller for non freeze subsystems. - Key: MESOS-3086 URL: https://issues.apache.org/jira/browse/MESOS-3086 Project: Mesos Issue Type: Bug Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We see a number of test failures when cgroups cannot be removed (because related tasks are still running) on systems where the freezer subsystem is not available. In the current code (https://github.com/apache/mesos/blob/0.22.1/src/linux/cgroups.cpp#L1728) we fall back to a very simple mechanism of recursively trying to remove the cgroups, which fails if there are still tasks running. Therefore we need an additional (NonFreeze)TasksKiller which doesn't rely on the freezer subsystem. This problem caused issues when running 'sudo make check' during 0.23 release testing, where BenH already provided a better error message with b1a23d6a52c31b8c5c840ab01902dbe00cb1feef / https://reviews.apache.org/r/36604. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
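Without the freezer, killing is inherently racy: a task can fork between listing and killing, which is why freezer-less killers re-list and retry. A testable sketch with injected list/kill functions (Python; `kill_cgroup_tasks` is a hypothetical name, and a real implementation would read the cgroup's cgroup.procs file and send SIGKILL):

```python
def kill_cgroup_tasks(list_pids, kill, attempts=3):
    """Repeatedly kill every task in a cgroup until none remain, without
    relying on the freezer subsystem. Racy by construction: a task may
    fork before dying, so we re-list and retry (sketch; list_pids and
    kill are injected so the loop is testable)."""
    for _ in range(attempts):
        pids = list_pids()
        if not pids:
            return True
        for pid in pids:
            kill(pid)
    return not list_pids()

# Simulate a cgroup where one task forks a child before dying.
state = {"pids": {101, 102}, "forked": False}

def fake_list():
    return set(state["pids"])

def fake_kill(pid):
    state["pids"].discard(pid)
    if pid == 101 and not state["forked"]:
        state["pids"].add(103)  # a fork raced with the kill
        state["forked"] = True

assert kill_cgroup_tasks(fake_list, fake_kill) is True
assert state["pids"] == set()
```

The retry loop is exactly what the freezer makes unnecessary: freezing stops all tasks first, so no fork can race with the kill.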
[jira] [Updated] (MESOS-3074) Check satisfiability of quota requests in Master
[ https://issues.apache.org/jira/browse/MESOS-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3074: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Check satisfiability of quota requests in Master Key: MESOS-3074 URL: https://issues.apache.org/jira/browse/MESOS-3074 Project: Mesos Issue Type: Improvement Reporter: Joerg Schad Assignee: Alexander Rukletsov Labels: mesosphere We need to validate quota requests in the Mesos Master as outlined in the Design Doc: https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I This ticket aims to validate the satisfiability (in terms of available resources) of a quota request using a heuristic algorithm in the Mesos Master, rather than validating the syntax of the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
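The heuristic nature of the check is worth spelling out: summing free resources across agents ignores fragmentation, so passing the check does not guarantee the quota can actually be laid out on any single agent. A sketch (Python; `quota_satisfiable` is a hypothetical name, not the Mesos code):

```python
def quota_satisfiable(request, agents):
    """Heuristic satisfiability check: the quota request fits if, for each
    resource, the total free amount across all agents covers it. This
    deliberately ignores fragmentation (a request needing 4 CPUs on one
    agent may pass even if no single agent has 4 free CPUs), which is why
    it is a heuristic and not a guarantee."""
    totals = {}
    for agent in agents:
        for name, amount in agent.items():
            totals[name] = totals.get(name, 0) + amount
    return all(totals.get(name, 0) >= amount
               for name, amount in request.items())

agents = [{"cpus": 4, "mem": 8192}, {"cpus": 2, "mem": 4096}]

# 6 CPUs total across agents, so the aggregate check passes...
assert quota_satisfiable({"cpus": 6, "mem": 12288}, agents)
# ...but 8 CPUs exceed the cluster total and are rejected.
assert not quota_satisfiable({"cpus": 8}, agents)
```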
[jira] [Updated] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3050: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: 
false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or 
directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 [ FAILED ]
[jira] [Updated] (MESOS-3015) Add hooks for Slave exits
[ https://issues.apache.org/jira/browse/MESOS-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3015: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add hooks for Slave exits - Key: MESOS-3015 URL: https://issues.apache.org/jira/browse/MESOS-3015 Project: Mesos Issue Type: Task Reporter: Kapil Arya Assignee: Kapil Arya Labels: mesosphere The hook will be triggered on slave exits. A master hook module can use this to do Slave-specific cleanups. In our particular use case, the hook would trigger cleanup of IPs assigned to the given Slave (see the [design doc | https://docs.google.com/document/d/17mXtAmdAXcNBwp_JfrxmZcQrs7EO6ancSbejrqjLQ0g/edit#]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2600) Add /reserve and /unreserve endpoints on the master for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2600: --- Sprint: Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 15, Mesosphere Sprint 16) Add /reserve and /unreserve endpoints on the master for dynamic reservation --- Key: MESOS-2600 URL: https://issues.apache.org/jira/browse/MESOS-2600 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Priority: Critical Labels: mesosphere Enable operators to manage dynamic reservations by introducing the {{/reserve}} and {{/unreserve}} HTTP endpoints on the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3069) Registry operations do not exist for manipulating maintenance schedules
[ https://issues.apache.org/jira/browse/MESOS-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3069: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Registry operations do not exist for manipulating maintenance schedules --- Key: MESOS-3069 URL: https://issues.apache.org/jira/browse/MESOS-3069 Project: Mesos Issue Type: Task Components: master, replicated log Reporter: Joseph Wu Assignee: Joseph Wu Labels: mesosphere In order to modify the maintenance schedule in the replicated registry, we will need Operations (src/master/registrar.hpp). The operations will likely correspond to the HTTP API: * UpdateMaintenanceSchedule: Given a blob representing a maintenance schedule, perform some verification on the blob. Write the blob to the registry. * StartMaintenance: Given a set of machines, verify then transition machines from Draining to Deactivated. * StopMaintenance: Given a set of machines, verify then transition machines from Deactivated to Normal. Remove affected machines from the schedule(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3164) Introduce QuotaInfo message
[ https://issues.apache.org/jira/browse/MESOS-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3164: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Introduce QuotaInfo message --- Key: MESOS-3164 URL: https://issues.apache.org/jira/browse/MESOS-3164 Project: Mesos Issue Type: Task Components: master Reporter: Alexander Rukletsov Assignee: Joerg Schad Labels: mesosphere A {{QuotaInfo}} protobuf message is the internal representation for quota-related information (e.g. for persisting quota). The protobuf message should be extendable for future needs and allow for easy aggregation across roles and operator principals. It may also be used to pass quota information to allocators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2067) Add HTTP API to the master for maintenance operations.
[ https://issues.apache.org/jira/browse/MESOS-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700461#comment-14700461 ] Joseph Wu commented on MESOS-2067: -- Just realized it might be useful to have an endpoint which retrieves the latest accept/decline info of the maintenance schedule. (Updated description) Add HTTP API to the master for maintenance operations. -- Key: MESOS-2067 URL: https://issues.apache.org/jira/browse/MESOS-2067 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere, twitter Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2067) Add HTTP API to the master for maintenance operations.
[ https://issues.apache.org/jira/browse/MESOS-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-2067: - Description: Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. was: Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. 
{code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. Add HTTP API to the master for maintenance operations. -- Key: MESOS-2067 URL: https://issues.apache.org/jira/browse/MESOS-2067 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere, twitter Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2769) Metric for cpu scheduling latency from all components
[ https://issues.apache.org/jira/browse/MESOS-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700359#comment-14700359 ] Cong Wang commented on MESOS-2769: -- https://reviews.apache.org/r/37540/ https://reviews.apache.org/r/37541/ Metric for cpu scheduling latency from all components - Key: MESOS-2769 URL: https://issues.apache.org/jira/browse/MESOS-2769 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.1 Reporter: Ian Downes Assignee: Cong Wang Labels: twitter The metric will provide statistics on the scheduling latency for processes/threads in a container, i.e., statistics on the delay before application code can run. This will be the aggregate effect of the normal scheduling period, contention from other threads/processes, both in the container and on the system, and any effects from the CFS bandwidth control (if enabled) or other CPU isolation strategies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3065: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add authorization for persistent volume --- Key: MESOS-3065 URL: https://issues.apache.org/jira/browse/MESOS-3065 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Persistent volume should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Create}} and {{Destroy}} into the ACL. {code} message Create { // Subjects. required Entity principals = 1; // Objects? Perhaps the kind of volume? allowed permissions? } message Unreserve { // Subjects. required Entity principals = 1; // Objects. required Entity creator_principals = 2; } {code} When a framework/operator creates a persistent volume, create ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to create persistent volumes. If not authorized, the create operation is rejected. When a framework/operator destroys a persistent volume, destroy ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to destroy the persistent volume created by a framework or operator (Resource.DiskInfo.principal). If not authorized, the destroy operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3015) Add hooks for Slave exits
[ https://issues.apache.org/jira/browse/MESOS-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3015: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16 (was: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17) Add hooks for Slave exits - Key: MESOS-3015 URL: https://issues.apache.org/jira/browse/MESOS-3015 Project: Mesos Issue Type: Task Reporter: Kapil Arya Assignee: Kapil Arya Labels: mesosphere The hook will be triggered on slave exits. A master hook module can use this to do Slave-specific cleanups. In our particular use case, the hook would trigger cleanup of IPs assigned to the given Slave (see the [design doc | https://docs.google.com/document/d/17mXtAmdAXcNBwp_JfrxmZcQrs7EO6ancSbejrqjLQ0g/edit#]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3050: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: 
false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or 
directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 [ FAILED ]
[jira] [Commented] (MESOS-3217) Replace boost unordered_{set,map} and hash with std versions.
[ https://issues.apache.org/jira/browse/MESOS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700446#comment-14700446 ] Marco Massenzio commented on MESOS-3217: Has there been any progress on this one? There are 3x reviews out for 10 days now, any expected resolution? Replace boost unordered_{set,map} and hash with std versions. - Key: MESOS-3217 URL: https://issues.apache.org/jira/browse/MESOS-3217 Project: Mesos Issue Type: Task Components: stout Reporter: Michael Park Assignee: Jan Schlicht Labels: mesosphere As part of C++11 upgrade, we should replace boost {{unordered_\{set,map\}}} and {{hash}} with their standard counterparts. Aside from reducing the dependency on {{boost}} from Mesos internals, this is also beneficial in removing (at least, reducing) the dependency on {{boost}} in our public header files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3284) JSON representation of Protobuf should use base64 encoding for 'bytes' fields.
Benjamin Mahler created MESOS-3284: -- Summary: JSON representation of Protobuf should use base64 encoding for 'bytes' fields. Key: MESOS-3284 URL: https://issues.apache.org/jira/browse/MESOS-3284 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler Currently we encode 'bytes' fields as UTF-8 strings, which is lossy for binary data due to invalid byte sequences! In order to encode binary data in a lossless fashion, we can encode 'bytes' fields in base64. Note that this is also how proto3 does its encoding (see [here|https://developers.google.com/protocol-buffers/docs/proto3?hl=en#json]), so this would make migration easier as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700438#comment-14700438 ] Marco Massenzio commented on MESOS-3136: Just noticed this entirely randomly... I would strongly suggest to avoid skipping a version between 0.22 / 0.24 as the Leader Election would be terminally broken: we transitioned to JSON in ZK for {{MasterInfo}} and while the 0.22 -- 0.23 -- 0.24 chain all works just fine, skipping 0.23 would create no end of grief. (I'm almost sure other stuff around HTTP API would break, but not sure there). My 2c COMMAND health checks with Marathon 0.10.0 are broken - Key: MESOS-3136 URL: https://issues.apache.org/jira/browse/MESOS-3136 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Dr. Stefan Schimanski Assignee: haosdent Priority: Critical Labels: mesosphere When deploying Mesos 0.23rc4 with latest Marathon 0.10.0 RC3 command health check stop working. Rolling back to Mesos 0.22.1 fixes the problem. Containerizer is Docker. All packages are from official Mesosphere Ubuntu 14.04 sources. The issue must be analyzed further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700440#comment-14700440 ] Marco Massenzio commented on MESOS-3136: [~vinodkone] does this need to be fixed before 0.24 is out? COMMAND health checks with Marathon 0.10.0 are broken - Key: MESOS-3136 URL: https://issues.apache.org/jira/browse/MESOS-3136 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Dr. Stefan Schimanski Assignee: haosdent Priority: Critical Labels: mesosphere When deploying Mesos 0.23rc4 with latest Marathon 0.10.0 RC3 command health check stop working. Rolling back to Mesos 0.22.1 fixes the problem. Containerizer is Docker. All packages are from official Mesosphere Ubuntu 14.04 sources. The issue must be analyzed further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3227) Implement image chroot support into command executor
[ https://issues.apache.org/jira/browse/MESOS-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3227: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Implement image chroot support into command executor Key: MESOS-3227 URL: https://issues.apache.org/jira/browse/MESOS-3227 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3064) Add 'principal' field to 'Resource.DiskInfo'
[ https://issues.apache.org/jira/browse/MESOS-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3064: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add 'principal' field to 'Resource.DiskInfo' Key: MESOS-3064 URL: https://issues.apache.org/jira/browse/MESOS-3064 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere In order to support authorization for persistent volumes, we should add the {{principal}} to {{Resource.DiskInfo}}, analogous to {{Resource.ReservationInfo.principal}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2455) Add operator endpoint to destroy persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2455: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add operator endpoint to destroy persistent volumes. Key: MESOS-2455 URL: https://issues.apache.org/jira/browse/MESOS-2455 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Michael Park Priority: Critical Labels: mesosphere Persistent volumes will not be released automatically. So we probably need an endpoint for operators to forcefully release persistent volumes. We probably need to add principal to Persistence struct and use ACLs to control who can release what. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3199) Validate Quota Requests.
[ https://issues.apache.org/jira/browse/MESOS-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3199: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Validate Quota Requests. Key: MESOS-3199 URL: https://issues.apache.org/jira/browse/MESOS-3199 Project: Mesos Issue Type: Task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We need to validate quota requests in terms of syntax correctness, update Master bookkeeping structures, and persist quota requests in the {{Registry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2200) bogus docker images result in bad error message to scheduler
[ https://issues.apache.org/jira/browse/MESOS-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2200: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16 (was: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17) bogus docker images result in bad error message to scheduler Key: MESOS-2200 URL: https://issues.apache.org/jira/browse/MESOS-2200 Project: Mesos Issue Type: Bug Components: containerization, docker Reporter: Jay Buffington Assignee: Joerg Schad Labels: mesosphere When a scheduler specifies a bogus image in ContainerInfo mesos doesn't tell the scheduler that the docker pull failed or why. This error is logged in the mesos-slave log, but it isn't given to the scheduler (as far as I can tell): {noformat} E1218 23:50:55.406230 8123 slave.cpp:2730] Container '8f70784c-3e40-4072-9ca2-9daed23f15ff' for executor 'thermos-1418946354013-xxx-xxx-curl-0-f500cc41-dd0a-4338-8cbc-d631cb588bb1' of framework '20140522-213145-1749004561-5050-29512-' failed to start: Failed to 'docker pull docker-registry.example.com/doesntexist/hello1.1:latest': exit status = exited with status 1 stderr = 2014/12/18 23:50:55 Error: image doesntexist/hello1.1 not found {noformat} If the docker image is not in the registry, the scheduler should give the user an error message. If docker pull failed because of networking issues, it should be retried. Mesos should give the scheduler enough information to be able to make that decision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2937) Initial design document for Quota support in Allocator.
[ https://issues.apache.org/jira/browse/MESOS-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2937: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Initial design document for Quota support in Allocator. --- Key: MESOS-2937 URL: https://issues.apache.org/jira/browse/MESOS-2937 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Labels: mesosphere Create a design document for the Quota feature support in the built-in Hierarchical DRF allocator to be shared with the Mesos community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700521#comment-14700521 ] Vinod Kone commented on MESOS-3273: --- commit c532490bfcb4470d0614640031ff854af8876ef6 Author: Vinod Kone vinodk...@gmail.com Date: Mon Aug 17 15:45:51 2015 -0700 Fixed mutex deadlock issue in ~scheduler::Mesos(). Review: https://reviews.apache.org/r/37559 EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! 
Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 
--credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a
[jira] [Created] (MESOS-3285) Master should not accept /scheduler calls when not elected / recovered.
Benjamin Mahler created MESOS-3285: -- Summary: Master should not accept /scheduler calls when not elected / recovered. Key: MESOS-3285 URL: https://issues.apache.org/jira/browse/MESOS-3285 Project: Mesos Issue Type: Bug Components: master Reporter: Benjamin Mahler Priority: Blocker The master currently drops all MessageEvents when it is non-leading or hasn't finished recovering from the registrar (see [here|https://github.com/apache/mesos/blob/0.23.0/src/master/master.cpp#L1076]). The /scheduler HttpEvents should also be dropped in these cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
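A minimal sketch of the guard described in the ticket (the struct and member names here are assumed for illustration; this is not the actual Master implementation): the same "leading and recovered" condition that already causes MessageEvents to be dropped would also apply to /scheduler HttpEvents.

```cpp
#include <cassert>

// Sketch only -- assumed names, not the actual Mesos Master class.
struct Master {
  bool elected;    // Is this master the current leader?
  bool recovered;  // Has recovery from the registrar finished?

  // Returns true when an incoming /scheduler call should be dropped,
  // mirroring the existing guard for MessageEvents.
  bool shouldDropSchedulerCall() const {
    return !elected || !recovered;
  }
};
```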
[jira] [Updated] (MESOS-1010) Python extension build is broken if gflags-dev is installed
[ https://issues.apache.org/jira/browse/MESOS-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1010: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Python extension build is broken if gflags-dev is installed --- Key: MESOS-1010 URL: https://issues.apache.org/jira/browse/MESOS-1010 Project: Mesos Issue Type: Bug Components: build, python api Environment: Fedora 20, amd64, GCC: 4.8.2; OSX 10.10.4, Apple LLVM 6.1.0 (~LLVM 3.6.0) Reporter: Nikita Vetoshkin Assignee: Greg Mann Labels: flaky-test, mesosphere In my environment, a Mesos build from master results in a broken Python API module {{_mesos.so}}:
{noformat}
nekto0n@ya-darkstar ~/workspace/mesos/src/python $ PYTHONPATH=build/lib.linux-x86_64-2.7/ python -c "import _mesos"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /home/nekto0n/workspace/mesos/src/python/build/lib.linux-x86_64-2.7/_mesos.so: undefined symbol: _ZN6google14FlagRegistererC1EPKcS2_S2_S2_PvS3_
{noformat}
The unmangled version of the symbol looks like this:
{noformat}
google::FlagRegisterer::FlagRegisterer(char const*, char const*, char const*, char const*, void*, void*)
{noformat}
During the {{./configure}} step, {{glog}} finds {{gflags}} development files and starts using them, thus *implicitly* adding a dependency on {{libgflags.so}}. This breaks the Python extension module and can perhaps break other Mesos subsystems when moved to hosts without {{gflags}} installed. This task is done when the ExamplesTest.PythonFramework test passes on a system with gflags installed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3065: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add authorization for persistent volume --- Key: MESOS-3065 URL: https://issues.apache.org/jira/browse/MESOS-3065 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Persistent volumes should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Create}} and {{Destroy}} into the ACL.
{code}
message Create {
  // Subjects.
  required Entity principals = 1;

  // Objects? Perhaps the kind of volume? allowed permissions?
}

message Unreserve {
  // Subjects.
  required Entity principals = 1;

  // Objects.
  required Entity creator_principals = 2;
}
{code}
When a framework/operator creates a persistent volume, create ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to create persistent volumes. If not authorized, the create operation is rejected. When a framework/operator destroys a persistent volume, destroy ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to destroy the persistent volume created by a framework or operator (Resource.DiskInfo.principal). If not authorized, the destroy operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
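The proposed check could look roughly like the following sketch (the names are assumed for illustration; this is not the Mesos ACL implementation): the principal taken from FrameworkInfo.principal or Credential.user is matched against the Create ACL's subject principals, and an unmatched principal is rejected.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch of a Create ACL check (assumed names, not Mesos code).
struct CreateACL {
  std::set<std::string> principals;  // Subjects allowed to create volumes.

  // Returns true when the requesting principal may create a persistent
  // volume; a create operation from any other principal is rejected.
  bool authorized(const std::string& principal) const {
    return principals.count(principal) > 0;
  }
};
```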
[jira] [Commented] (MESOS-3230) Create a HTTP based Authentication design doc
[ https://issues.apache.org/jira/browse/MESOS-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700443#comment-14700443 ] Marco Massenzio commented on MESOS-3230: [~arojas] - does your comment mean this story can be Resolved? Create a HTTP based Authentication design doc - Key: MESOS-3230 URL: https://issues.apache.org/jira/browse/MESOS-3230 Project: Mesos Issue Type: Task Components: security Reporter: Alexander Rojas Assignee: Alexander Rojas Labels: mesosphere Since most of the communication between mesosphere components will happen through HTTP with the arrival of the [HTTP API|https://issues.apache.org/jira/browse/MESOS-2288], it makes sense to use HTTP standard mechanisms to authenticate this communication. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:38 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright... Also, when this is fixed, can we get a patch for the 0.22.1 RPM?
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
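For illustration, a check that reads only the leading digits of the version string sidesteps the three-component assumption discussed above. This is a hypothetical helper, not the Mesos or Docker code (the function name and return convention are invented here):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: extract the major version from a `docker --version`
// line such as "Docker version 1.8.1, build d12ea79". Parsing digits only
// up to the first non-digit avoids assuming exactly three dot-separated
// numeric components (distro-patched versions like "1.6.2.fc21" have four).
// Returns -1 when the output is unrecognized.
int parseDockerMajor(const std::string& output) {
  const std::string prefix = "Docker version ";
  if (output.compare(0, prefix.size(), prefix) != 0) {
    return -1;  // Unrecognized output.
  }
  const std::string rest = output.substr(prefix.size());
  char* end = NULL;
  long major = std::strtol(rest.c_str(), &end, 10);
  if (end == rest.c_str()) {
    return -1;  // No leading digits after the prefix.
  }
  return static_cast<int>(major);
}
```

A caller would then reject anything with a major version below 1 instead of failing to parse the string at all.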
[jira] [Commented] (MESOS-3037) Add a QUIESCE call to the scheduler
[ https://issues.apache.org/jira/browse/MESOS-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699548#comment-14699548 ] gyliu commented on MESOS-3037: -- QUIESCE call https://reviews.apache.org/r/37532/ Add a QUIESCE call to the scheduler --- Key: MESOS-3037 URL: https://issues.apache.org/jira/browse/MESOS-3037 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: gyliu The SUPPRESS call is the complement of the current REVIVE call, i.e., it informs Mesos to stop sending offers to the framework. For the scheduler driver to send only Call messages (MESOS-2913), DeactivateFrameworkMessage needs to be converted to Call(s). We can implement this by having the driver send a SUPPRESS call followed by a DECLINE call for outstanding offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
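The SUPPRESS-then-DECLINE sequence described above can be sketched as follows (the helper and its string encoding of calls are hypothetical; only the call names SUPPRESS and DECLINE come from the ticket):

```cpp
#include <string>
#include <vector>

// Sketch: converting DeactivateFrameworkMessage into Calls means sending
// one SUPPRESS call followed by a DECLINE for every outstanding offer.
// Calls are represented as strings here purely for illustration.
std::vector<std::string> deactivateCalls(
    const std::vector<std::string>& outstandingOffers) {
  std::vector<std::string> calls;
  calls.push_back("SUPPRESS");  // Stop receiving new offers.
  for (size_t i = 0; i < outstandingOffers.size(); i++) {
    calls.push_back("DECLINE " + outstandingOffers[i]);  // Return the offer.
  }
  return calls;
}
```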
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:23 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:22 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point do we still need a check? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699560#comment-14699560 ] Klaus Ma commented on MESOS-3070: - [~vinodkone], the draft code diff for option #4 was posted at https://reviews.apache.org/r/37531/ to show the overall idea. The code is not complete: there is no diff for the GUI (showing TaskTag in the GUI), no unit test case yet (I will update it later), and the other unit tests on the task_id check have not been updated. Also, adding a uid instead of a TaskTag may be better: users would not need any behavior changes, and no unit tests on the task_id check would break. If you have any comments, please let me know. Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under the following series of events, this can happen and keep crashing the master:
1) frameworkA launches task 'task_id_1' on slaveA
2) master fails over
3) slaveA has not re-registered yet
4) frameworkA re-registers and launches task 'task_id_1' on slaveB
5) slaveA re-registers and adds task 'task_id_1' to frameworkA
6) CHECK failure in addTask
{noformat}
I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname)
...
...
F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
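The failure mode above can be sketched in miniature (this is an illustrative sketch, not the Mesos master source; the struct and map are invented): the crash happens because addTask CHECK-fails on a duplicate task id, whereas a master that detects the duplicate can reject or reconcile the re-registered task instead of aborting the whole process.

```cpp
#include <map>
#include <string>

// Sketch only: a per-framework task table where addTask reports a
// duplicate task id instead of tripping a CHECK and crashing the master.
struct FrameworkTasks {
  std::map<std::string, std::string> tasks;  // task_id -> slave_id

  // Returns false on a duplicate task id; the caller can then reject
  // or reconcile the task reported by the re-registering slave.
  bool addTask(const std::string& taskId, const std::string& slaveId) {
    if (tasks.count(taskId) > 0) {
      return false;  // Same id reported again, e.g. after master failover.
    }
    tasks[taskId] = slaveId;
    return true;
  }
};
```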
[jira] [Updated] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-3273: -- Target Version/s: (was: 0.24.0) Neither benh nor I was able to repro this tonight after running 1K iterations each. It seems like a very rare deadlock. I'm removing this as a blocker for the 0.24.0 release but will keep the ticket open. EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! 
Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 
--credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status
[jira] [Created] (MESOS-3286) Revocable metrics information are missed for slave node
Yong Qiao Wang created MESOS-3286: - Summary: Revocable metrics information are missed for slave node Key: MESOS-3286 URL: https://issues.apache.org/jira/browse/MESOS-3286 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor In bug 3278, the revocable metrics information for the master node was added, but that information is also missing for the slave node in the monitoring doc; this new patch fixes it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3280) Master fails to access replicated log after network partition
Bernd Mathiske created MESOS-3280: - Summary: Master fails to access replicated log after network partition Key: MESOS-3280 URL: https://issues.apache.org/jira/browse/MESOS-3280 Project: Mesos Issue Type: Bug Components: master Reporter: Bernd Mathiske In a 5-node cluster with 3 masters and 2 slaves, and ZK on each node, when a network partition is forced, all the masters apparently lose access to their replicated log. The leading master halts, for unknown reasons, presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could have to do with the ZK setup, but it might also be a Mesos bug. This was observed in a Chronos test drive scenario described in detail here: https://github.com/mesos/chronos/issues/511 With setup instructions here: https://github.com/mesos/chronos/issues/508 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699881#comment-14699881 ] Jie Yu commented on MESOS-3050: --- Marco, can you re-run the failing tests using --verbose and paste the results here? Thanks Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: 
os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory 
‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699929#comment-14699929 ] Marco Massenzio commented on MESOS-3050: hrumpf... great investigation [~jieyu]! Is there an easy fix, or does this require 'introspecting' the system at runtime? Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy,
[jira] [Updated] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1554: --- Assignee: Michael Park (was: Marco Massenzio) Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Assignee: Michael Park Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1554: --- Labels: mesosphere twitter (was: twitter) Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699895#comment-14699895 ] Jie Yu commented on MESOS-3050: --- Looking at the logs of those filesystem isolator tests, the 'exec' fails after pivot_root. Since we're exec-ing a '/bin/sh' binary, one explanation might be that the binary (or some dependency of it) is not present in the test root filesystem. Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of:
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699911#comment-14699911 ] Jie Yu commented on MESOS-3050: --- OK, I think I know the problem. In CentOS 7.1, 'sh' is under '/usr/bin/sh', while on CentOS 6 (the system I've been using), 'sh' is under '/bin/sh'. Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ +
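Jie Yu's diagnosis above, that 'sh' lives at '/usr/bin/sh' on CentOS 7.1 but '/bin/sh' on CentOS 6, can be verified with a small standalone probe. This is an illustrative sketch only, not Mesos code; the class and method names are made up:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShLocator {
    // Probe the candidate shell locations in order: CentOS 6 ships a real
    // /bin/sh, while CentOS 7 merges /bin into /usr/bin (usually leaving
    // /bin as a symlink), so the file may only physically live under /usr/bin.
    static String resolveSh() {
        for (String candidate : new String[] {"/bin/sh", "/usr/bin/sh"}) {
            if (Files.exists(Paths.get(candidate))) {
                return candidate;  // first candidate present on this host
            }
        }
        return "";  // neither layout found
    }

    public static void main(String[] args) {
        System.out.println("sh resolved to: " + resolveSh());
    }
}
```

A test rootfs built by copying only '/bin/sh' would therefore miss the binary on CentOS 7 style layouts, which matches the observed exec failure after pivot_root.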
[jira] [Updated] (MESOS-3280) Master fails to access replicated log after network partition
[ https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-3280: -- Affects Version/s: 0.23.0 Environment: Zookeeper version 3.4.5--1 Master fails to access replicated log after network partition - Key: MESOS-3280 URL: https://issues.apache.org/jira/browse/MESOS-3280 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.23.0 Environment: Zookeeper version 3.4.5--1 Reporter: Bernd Mathiske Labels: mesosphere In a 5-node cluster with 3 masters and 2 slaves, and ZK on each node, when a network partition is forced, all the masters apparently lose access to their replicated log. The leading master halts, for unknown reasons, presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could have to do with the ZK setup, but it might also be a Mesos bug. This was observed in a Chronos test drive scenario described in detail here: https://github.com/mesos/chronos/issues/511 With setup instructions here: https://github.com/mesos/chronos/issues/508 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2769) Metric for cpu scheduling latency from all components
[ https://issues.apache.org/jira/browse/MESOS-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cong Wang updated MESOS-2769: - Sprint: Twitter Q2 Sprint 3, Twitter Mesos Q3 Sprint 3 (was: Twitter Q2 Sprint 3) Metric for cpu scheduling latency from all components - Key: MESOS-2769 URL: https://issues.apache.org/jira/browse/MESOS-2769 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.1 Reporter: Ian Downes Assignee: Cong Wang Labels: twitter The metric will provide statistics on the scheduling latency for processes/threads in a container, i.e., statistics on the delay before application code can run. This will be the aggregate effect of the normal scheduling period, contention from other threads/processes, both in the container and on the system, and any effects from the CFS bandwidth control (if enabled) or other CPU isolation strategies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
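As a rough illustration of what the MESOS-2769 metric captures (this is a standalone sketch, not the proposed Mesos implementation), the scheduling delay experienced by a thread can be approximated by measuring how far a short sleep overshoots its requested duration, since the overshoot includes the time spent waiting to be scheduled again:

```java
public class SchedLatencyProbe {
    public static void main(String[] args) throws InterruptedException {
        final long requestedMs = 10;
        long worstOvershootNs = 0;
        // Sample repeatedly: each iteration asks for a 10ms sleep and
        // records how long past the request the thread actually resumed.
        for (int i = 0; i < 20; i++) {
            long start = System.nanoTime();
            Thread.sleep(requestedMs);
            long overshootNs =
                (System.nanoTime() - start) - requestedMs * 1_000_000L;
            worstOvershootNs = Math.max(worstOvershootNs, overshootNs);
        }
        System.out.println("worst overshoot (us): " + worstOvershootNs / 1000);
    }
}
```

Under CFS bandwidth throttling or heavy contention, the worst-case overshoot grows well beyond the timer granularity, which is exactly the effect the proposed per-container statistics would expose.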
[jira] [Assigned] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio reassigned MESOS-1554: -- Assignee: Marco Massenzio Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Assignee: Marco Massenzio Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3264) JVM can exit prematurely following framework teardown
[ https://issues.apache.org/jira/browse/MESOS-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699967#comment-14699967 ] Greg Mann commented on MESOS-3264: -- Thanks for having a look at this, [~haosd...@gmail.com]! I had explored the option of using similar shutdown hooks previously, and unfortunately it doesn't do the trick, I assume because the order of the shutdown hooks is unspecified? And since they are run concurrently, perhaps the JVM will continue on to its post-shutdownHook GC while the hooks are still executing. In any case, the tests continue to fail with such shutdown hooks placed in the constructors of the SchedulerDriver and/or the ExecutorDriver. If we define the {{close()}} method as {{public}} and call it explicitly in the body of {{main()}}, the tests do pass reliably. However, there seems to be some conventional wisdom saying that defining/calling a method that calls {{finalize()}} in that way is A Bad Thing. Any thoughts? If we decide that it is acceptable to define a public {{close()}} method that calls {{finalize()}} for the SchedulerDriver, similar to the one in your patch, and call it explicitly just before we call {{System.exit()}}, then that would solve this issue. JVM can exit prematurely following framework teardown - Key: MESOS-3264 URL: https://issues.apache.org/jira/browse/MESOS-3264 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0, 0.24.0 Reporter: Greg Mann Priority: Minor Labels: java, tech-debt In Java frameworks, it is possible for the JVM to begin exiting the program - via {{System.exit()}}, for example - while teardown of native objects such as the SchedulerDriver and associated Executors is still in progress. 
{{SchedulerDriver::stop()}} will return after it has sent messages to other actors to begin their teardown, meanwhile the JVM is free to terminate the program and thus begin executing native object destructors while those objects are still in use, potentially leading to a segfault. This has manifested itself in flaky tests from the ExamplesTest suite (see MESOS-830 and MESOS-1013), as mutexes from glog are destroyed while the framework is still shutting down and attempting to log. Ideally, a mechanism would exist to block the Java code until a confirmation that framework teardown is complete has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
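The blocking mechanism the issue asks for could look roughly like the following sketch. Everything here is hypothetical: the real driver exposes no teardown-complete callback today, and the names {{teardownDone}} and {{onTeardownComplete}} are invented for illustration:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class TeardownSync {
    // Latch that main() blocks on until teardown reports completion,
    // instead of racing System.exit() against native destructors.
    static final CountDownLatch teardownDone = new CountDownLatch(1);

    // Hypothetical callback the native layer would invoke once all
    // actors have finished shutting down.
    static void onTeardownComplete() {
        teardownDone.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        new Thread(() -> {
            // Stand-in for driver.stop() finishing asynchronously.
            onTeardownComplete();
        }).start();

        // Wait (bounded) for teardown before letting the JVM exit.
        boolean completed = teardownDone.await(5, TimeUnit.SECONDS);
        System.out.println("teardown complete: " + completed);
    }
}
```

With such a confirmation in place, {{System.exit()}} would only run after the native objects are no longer in use.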
[jira] [Created] (MESOS-3281) Create a user doc for Scheduler HTTP API
Vinod Kone created MESOS-3281: - Summary: Create a user doc for Scheduler HTTP API Key: MESOS-3281 URL: https://issues.apache.org/jira/browse/MESOS-3281 Project: Mesos Issue Type: Documentation Reporter: Vinod Kone Assignee: Vinod Kone We need to convert the design doc into a user doc that we can add to our docs folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3264) JVM can exit prematurely following framework teardown
[ https://issues.apache.org/jira/browse/MESOS-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1468#comment-1468 ] haosdent commented on MESOS-3264: - {code} I had explored the option of using similar shutdown hooks previously, and unfortunately it doesn't do the trick, I assume because the order of the shutdown hooks is unspecified? {code} Very interesting. Could you try this code snippet in your JVM? https://ideone.com/48o7SG The output should be {code} Enter main Before System.exit(0); Call finalize() Call finalize() {code} Note that I wrap finalize() in close() and call close() from the ShutdownHook; calling finalize() directly in the ShutdownHook thread does not work. JVM can exit prematurely following framework teardown - Key: MESOS-3264 URL: https://issues.apache.org/jira/browse/MESOS-3264 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0, 0.24.0 Reporter: Greg Mann Priority: Minor Labels: java, tech-debt In Java frameworks, it is possible for the JVM to begin exiting the program - via {{System.exit()}}, for example - while teardown of native objects such as the SchedulerDriver and associated Executors is still in progress. {{SchedulerDriver::stop()}} will return after it has sent messages to other actors to begin their teardown, meanwhile the JVM is free to terminate the program and thus begin executing native object destructors while those objects are still in use, potentially leading to a segfault. This has manifested itself in flaky tests from the ExamplesTest suite (see MESOS-830 and MESOS-1013), as mutexes from glog are destroyed while the framework is still shutting down and attempting to log. Ideally, a mechanism would exist to block the Java code until a confirmation that framework teardown is complete has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
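A condensed version of the pattern haosdent describes (class and method names are illustrative, not the actual driver code): the shutdown hook invokes a public close() that forwards to finalize(), so teardown runs deterministically before the JVM exits rather than relying on the garbage collector:

```java
public class Driver {
    // Explicit teardown entry point: forwards to finalize() so callers
    // (including the shutdown hook) can trigger cleanup deterministically.
    public void close() {
        finalize();
    }

    @Override
    protected void finalize() {
        // Stand-in for releasing native resources in the real driver.
        System.out.println("Call finalize()");
    }

    public static void main(String[] args) {
        final Driver driver = new Driver();
        // Register close() as a shutdown hook; it runs when exit() is called.
        Runtime.getRuntime().addShutdownHook(new Thread(driver::close));
        System.out.println("Before System.exit(0)");
        System.exit(0);  // hook fires, close() -> finalize() executes
    }
}
```

This mirrors the ideone snippet's behavior for a single object; the two "Call finalize()" lines in the expected output above come from the snippet registering two such objects.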
[jira] [Updated] (MESOS-3276) Add Scrapinghub to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated MESOS-3276: - Description: Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) was: Hello! At [Scrapinghub|https://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) Add Scrapinghub to the Powered By Mesos page Key: MESOS-3276 URL: https://issues.apache.org/jira/browse/MESOS-3276 Project: Mesos Issue Type: Wish Components: documentation Reporter: Shuai Lin Priority: Trivial Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! 
Shuai (+ Scrapinghub's Scrapy Cloud team) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3276) Add Scrapinghub to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699121#comment-14699121 ] Shuai Lin commented on MESOS-3276: -- It should be http, I made a typo there. The RB patch should be correct. Thanks for catching this! Add Scrapinghub to the Powered By Mesos page Key: MESOS-3276 URL: https://issues.apache.org/jira/browse/MESOS-3276 Project: Mesos Issue Type: Wish Components: documentation Reporter: Shuai Lin Priority: Trivial Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3278) Add the revocable metrics information in monitoring doc
Yong Qiao Wang created MESOS-3278: - Summary: Add the revocable metrics information in monitoring doc Key: MESOS-3278 URL: https://issues.apache.org/jira/browse/MESOS-3278 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3278) Add the revocable metrics information in monitoring doc
[ https://issues.apache.org/jira/browse/MESOS-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699130#comment-14699130 ] Yong Qiao Wang commented on MESOS-3278: --- The related review request is: https://reviews.apache.org/r/37518/ Add the revocable metrics information in monitoring doc Key: MESOS-3278 URL: https://issues.apache.org/jira/browse/MESOS-3278 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-3273: - Assignee: Vinod Kone EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 
leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 --credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. 
It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status I0813 19:55:17.185972 26100 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix I0813 19:55:17.186058 26135 recover.cpp:566] Updating replica status to STARTING I0813
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700107#comment-14700107 ] Vinod Kone commented on MESOS-3273: --- While trying to repro this, observed another issue where the test hangs during termination. {code} I0817 19:34:33.166100 60834 master.cpp:3998] Received update of slave 20150817-193432-1828659978-51071-60794-S2 at slave(1)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) with total oversubscribed resources I0817 19:34:33.166316 60834 hierarchical.hpp:600] Slave 20150817-193432-1828659978-51071-60794-S2 (smfd-atr-11-sr1.devel.twitter.com) updated with oversubscribed resources (total: cpus(*):2; mem(*):10240; disk(*):454767; ports(*):[31000-32000], allocated: cpus(*):2; mem(*):10240; disk(*):454767; ports(*):[31000-32000]) Received an UPDATE event Task 4 is in state TASK_FINISHED I0817 19:34:33.167793 60816 master.cpp:860] Master terminating I0817 19:34:33.168092 60836 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S2 I0817 19:34:33.168654 60816 master.cpp:5673] Removing executor 'default' with resources of framework 20150817-193432-1828659978-51071-60794- on slave 20150817-193432-1828659978-51071-60794-S1 at slave(2)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.168725 60819 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S1 I0817 19:34:33.169075 60816 master.cpp:5644] Removing task 4 with resources cpus(*):1; mem(*):128 of framework 20150817-193432-1828659978-51071-60794- on slave 20150817-193432-1828659978-51071-60794-S0 at slave(3)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.169153 60818 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S0 I0817 19:34:33.169255 60816 master.cpp:5673] Removing executor 'default' with resources of framework 20150817-193432-1828659978-51071-60794- on slave 
20150817-193432-1828659978-51071-60794-S0 at slave(3)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.170186 60818 hierarchical.hpp:428] Removed framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.170919 60827 slave.cpp:3143] master@10.35.255.108:51071 exited I0817 19:34:33.170903 60817 slave.cpp:3143] master@10.35.255.108:51071 exited W0817 19:34:33.170959 60827 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected W0817 19:34:33.170976 60817 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected I0817 19:34:33.171083 60821 slave.cpp:3143] master@10.35.255.108:51071 exited W0817 19:34:33.171169 60821 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected I0817 19:34:33.172170 60817 slave.cpp:564] Slave terminating I0817 19:34:33.174253 60794 slave.cpp:564] Slave terminating I0817 19:34:33.174424 60794 slave.cpp:1959] Asked to shut down framework 20150817-193432-1828659978-51071-60794- by @0.0.0.0:0 I0817 19:34:33.174473 60794 slave.cpp:1984] Shutting down framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.174665 60794 slave.cpp:3710] Shutting down executor 'default' of framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.175500 60926 exec.cpp:380] Executor asked to shutdown I0817 19:34:33.176652 60794 slave.cpp:564] Slave terminating I0817 19:34:33.176762 60794 slave.cpp:1959] Asked to shut down framework 20150817-193432-1828659978-51071-60794- by @0.0.0.0:0 I0817 19:34:33.176909 60794 slave.cpp:1984] Shutting down framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.176954 60794 slave.cpp:3710] Shutting down executor 'default' of framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.177781 60879 exec.cpp:380] Executor asked to shutdown I0817 19:34:33.178567 60822 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 13.870729ms I0817 19:34:33.178649 60822 replica.cpp:679] Persisted action at 8 I0817 
19:34:33.179919 60815 replica.cpp:658] Replica received learned notice for position 8 I0817 19:34:33.195266 60815 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 15.299248ms I0817 19:34:33.195405 60815 leveldb.cpp:401] Deleting ~2 keys from leveldb took 29964ns I0817 19:34:33.195428 60815 replica.cpp:679] Persisted action at 8 I0817 19:34:33.195456 60815 replica.cpp:664] Replica learned TRUNCATE action at position 8 {code} gdb stack trace points to what looks like a deadlock {code} #0 0x7fd5df472019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7fd5e332293c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /home/vinod/mesos/build/src/.libs/libmesos-0.24.0.so #2 0x7fd5e248f3cc in synchronized_wait<std::condition_variable, std::mutex> () from /home/vinod/mesos/build/src/.libs/libmesos-0.24.0.so #3 0x7fd5e3191d35 in arrive () from
[jira] [Updated] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-2466: - Sprint: Mesosphere Sprint 16 Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-2466: - Sprint: (was: Mesosphere Sprint 16) Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-2466: Assignee: Greg Mann Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
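While the documentation is being written, a hedged usage sketch for the four variables listed in the ticket: the values, the duration format for the statistics window, and the claim about what the profiler flag gates are illustrative assumptions, not documented behaviour.

```shell
# Configure libprocess via environment variables before starting a
# libprocess-based binary such as mesos-master. Values are illustrative.
export LIBPROCESS_IP=192.168.1.10           # interface to bind the socket to
export LIBPROCESS_PORT=5051                 # port to listen on
export LIBPROCESS_ENABLE_PROFILER=1         # assumed to gate the profiler endpoint
export LIBPROCESS_STATISTICS_WINDOW=30secs  # retention window (format assumed)

# mesos-master --work_dir=/var/lib/mesos    # the binary would pick these up
echo "libprocess would bind to ${LIBPROCESS_IP}:${LIBPROCESS_PORT}"
```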
[jira] [Assigned] (MESOS-3158) Libprocess Process: Join runqueue workers during finalization
[ https://issues.apache.org/jira/browse/MESOS-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-3158: Assignee: Greg Mann Libprocess Process: Join runqueue workers during finalization - Key: MESOS-3158 URL: https://issues.apache.org/jira/browse/MESOS-3158 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Joris Van Remoortere Assignee: Greg Mann Labels: beginner, libprocess, mesosphere, newbie The lack of synchronization between ProcessManager destruction and the thread pool threads running the queued processes means that the shared state that is part of the ProcessManager gets destroyed prematurely. Synchronizing the ProcessManager destructor with draining the work queues and stopping the workers will allow us to not require leaking the shared state to avoid use beyond destruction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3283) Improve batch allocations performance especially with large number of slaves and frameworks.
Mandeep Chadha created MESOS-3283: - Summary: Improve batch allocations performance especially with large number of slaves and frameworks. Key: MESOS-3283 URL: https://issues.apache.org/jira/browse/MESOS-3283 Project: Mesos Issue Type: Improvement Components: allocation Affects Versions: 0.23.0 Reporter: Mandeep Chadha Improve batch allocations performance especially with large number of slaves and frameworks. e.g. these are the allocation timings for 10K slaves and varying number of frameworks. Using 1 slaves and 1 frameworks Added 1 slaves in 14.50836112secs Updated 1 slaves in 18.665093703secs [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12 (34983 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 Using 1 slaves and 50 frameworks Added 1 slaves in 51.534229549secs Updated 1 slaves in 57.131554303secs [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 (110449 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 Using 1 slaves and 100 frameworks Added 1 slaves in 1.5891310434mins Updated 1 slaves in 1.80562078148333mins [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 (205467 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15 Using 1 slaves and 200 frameworks Added 1 slaves in 3.0750647275mins Updated 1 slaves in 3.85846762096667mins -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700261#comment-14700261 ] Jie Yu commented on MESOS-3050: --- Pushed a fix. [~marco-mesos], let me know if the tests are still failing. commit 3ae937fb1c41bf858d7e37e5679da646fe93734b Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:53:08 2015 -0700 Included /usr/bin/sh in the test root filesystem. Review: https://reviews.apache.org/r/37555 commit bd4332c68aea3aaf8eac3ef3a15b72541084e0c4 Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:52 2015 -0700 Used execlp instead of execl to exec processes in Mesos. Review: https://reviews.apache.org/r/37547 commit d7d3b52122613f536bcffe41a5f26132e99728af Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:41 2015 -0700 Used execlp instead of execl to exec processes in libprocess. Review: https://reviews.apache.org/r/37546 commit e70493a8acd3c6848bb9dbe7f7a72e694fe6cf07 Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:31 2015 -0700 Used execlp instead of execl to exec processes in stout. Review: https://reviews.apache.org/r/37545 Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - +
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699096#comment-14699096 ] Klaus Ma commented on MESOS-3070: - Regarding #4, it also checks for a duplicated TaskTag; for killTask, all tasks with the same tag will be killed, and the user can also use the generated ID to kill a task. Regarding #3, it's similar to #4: #4 uses a UUID as the unique TaskID, while #3 uses slaveId + taskId + frameworkId as the unique TaskID. Personally, I prefer #4, and documentation on the new behaviour is necessary. Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under a certain series of events, this can happen and keep crashing the master. 1) frameworkA launches task 'task_id_1' on slaveA 2) master fails over 3) slaveA has not re-registered yet 4) frameworkA re-registers and launches task 'task_id_1' on slaveB 5) slaveA re-registers and adds task 'task_id_1' to frameworkA 6) CHECK failure in addTask {noformat} I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname) ... ... F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang edited comment on MESOS-3245 at 8/17/15 6:50 AM: RB link: https://reviews.apache.org/r/37289/ was (Author: qianzhang): RR link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3228) Some spelling error in slave help message
[ https://issues.apache.org/jira/browse/MESOS-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699083#comment-14699083 ] Yong Qiao Wang commented on MESOS-3228: --- The related ReviewBoard URL: https://reviews.apache.org/r/37208/ Some spelling error in slave help message - Key: MESOS-3228 URL: https://issues.apache.org/jira/browse/MESOS-3228 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.24.0 Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor Fix some spelling errors in the help message of the slave component. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang edited comment on MESOS-3245 at 8/17/15 6:50 AM: RR link: https://reviews.apache.org/r/37289/ was (Author: qianzhang): RB link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang commented on MESOS-3245: --- RR link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3277) Implement basic security isolators such as linux/apparmor or linux/seccomp
Stephan Erb created MESOS-3277: -- Summary: Implement basic security isolators such as linux/apparmor or linux/seccomp Key: MESOS-3277 URL: https://issues.apache.org/jira/browse/MESOS-3277 Project: Mesos Issue Type: Story Components: containerization, isolation Reporter: Stephan Erb As an operator of a Mesos cluster, I would like to gain some control over what is happening inside launched containers. Specifically, I want to make it a little bit more difficult for untrusted code to escape its container confinement (e.g., prevent access to certain kernel features, raw devices, ...) Inspired by [LXC | https://github.com/lxc/lxc], Mesos could offer two new isolators: * *linux/apparmor*: Isolator which applies an AppArmor security profile to containers. A cluster-wide default profile could be similar to the [default shipped by LXC|https://github.com/lxc/lxc/blob/master/config/apparmor/abstractions/container-base]. * *linux/seccomp*: Isolator based on the [seccomp syscall filter|https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt]. Seccomp is a mechanism for minimizing the exposed kernel surface by reducing the set of allowed syscalls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
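To make the linux/apparmor proposal concrete, a cluster-wide default profile might look like the fragment below. Every rule, and the profile name itself, is an illustrative assumption loosely modelled on LXC's container-base abstraction, not a tested or recommended policy.

```
# Hypothetical AppArmor profile sketch for a linux/apparmor isolator.
profile mesos-container-default flags=(attach_disconnected) {
  network,
  capability,
  file,

  # Deny access to raw devices and sensitive kernel interfaces.
  deny /dev/mem rwklx,
  deny /dev/kmem rwklx,
  deny @{PROC}/sys/kernel/** wklx,
  deny /sys/firmware/** rwklx,
}
```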
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700154#comment-14700154 ] gyliu commented on MESOS-3070: -- I think you should use the v1 API for the unit test now? Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under a certain series of events, this can happen and keep crashing the master. 1) frameworkA launches task 'task_id_1' on slaveA 2) master fails over 3) slaveA has not re-registered yet 4) frameworkA re-registers and launches task 'task_id_1' on slaveB 5) slaveA re-registers and adds task 'task_id_1' to frameworkA 6) CHECK failure in addTask {noformat} I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname) ... ... F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)