[jira] [Commented] (MESOS-3286) Revocable metrics information are missed for slave node
[ https://issues.apache.org/jira/browse/MESOS-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700748#comment-14700748 ] Yong Qiao Wang commented on MESOS-3286: --- Appending the related review request: https://reviews.apache.org/r/37562/ Revocable metrics information are missed for slave node --- Key: MESOS-3286 URL: https://issues.apache.org/jira/browse/MESOS-3286 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor In MESOS-3278, revocable metrics information for the master node was added to the monitoring doc, but the same information is still missing for the slave node; this new patch fixes that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3279) mesos doesn't compatibable docker 1.8.1
Stream Liu created MESOS-3279: - Summary: mesos doesn't compatibable docker 1.8.1 Key: MESOS-3279 URL: https://issues.apache.org/jira/browse/MESOS-3279 Project: Mesos Issue Type: Bug Reporter: Stream Liu The reported error is: Failed to create a containerizer: Could not create DockerContainerizer: Insufficient version of Docker! Please upgrade to >= 1.0.0. This happens with Docker 1.8.1, but it works with Docker 1.6.2. I am using Mesos 0.22.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
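The ticket does not identify a root cause, but the symptom (Docker 1.8.1 rejected against a required minimum of 1.0.0) is consistent with the version string failing to parse rather than the comparison itself being wrong. A minimal sketch of format-tolerant version parsing, in Python for brevity; `parse_docker_version` is a hypothetical name, not the actual Mesos parser:

```python
import re

def parse_docker_version(output):
    """Extract a (major, minor, patch) tuple from `docker version` output.

    Tolerant of layout changes: scans for the first x.y.z triple instead
    of assuming a fixed line format (hypothetical sketch, not the Mesos
    implementation).
    """
    match = re.search(r'(\d+)\.(\d+)\.(\d+)', output)
    if match is None:
        raise ValueError("could not find a version in: %r" % output)
    return tuple(int(part) for part in match.groups())

# Both an older single-line format and a newer multi-line format parse.
old_style = "Docker version 1.6.2, build 7c8fca2"
new_style = "Client:\n Version:      1.8.1\n API version:  1.20"
assert parse_docker_version(old_style) == (1, 6, 2)
assert parse_docker_version(new_style) == (1, 8, 1)
assert parse_docker_version(new_style) >= (1, 0, 0)  # not "insufficient"
```

Tuple comparison makes the minimum-version check correct without any string comparison pitfalls.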
[jira] [Commented] (MESOS-2516) Move allocation-related types to mesos::master namespace
[ https://issues.apache.org/jira/browse/MESOS-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699225#comment-14699225 ] Alexander Rukletsov commented on MESOS-2516: Yep, sending a mail to the list is a good option. Move allocation-related types to mesos::master namespace Key: MESOS-2516 URL: https://issues.apache.org/jira/browse/MESOS-2516 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Alexander Rukletsov Assignee: José Guilherme Vanz Priority: Minor Labels: easyfix, newbie {{Allocator}}, {{Sorter}} and {{Comparator}} types live in the {{master::allocator}} namespace. This is not consistent with the rest of the codebase: {{Isolator}}, {{Fetcher}} and {{Containerizer}} all live in the {{slave}} namespace. The {{allocator}} namespace should be killed for consistency. Since sorters are poorly named, they should be renamed (or namespaced) prior to this change in order not to pollute the {{master}} namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1791) Introduce Master / Offer Resource Reservations aka Quota
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-1791: -- Assignee: Alexander Rukletsov Introduce Master / Offer Resource Reservations aka Quota Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Components: allocation, master, replicated log Reporter: Tom Arnfeld Assignee: Alexander Rukletsov Labels: mesosphere Currently Mesos supports the ability to reserve resources (for a given role) on a per-slave basis, as introduced in MESOS-505. This allows you to almost statically partition off a set of resources on a set of machines, to guarantee that certain types of frameworks get some resources. This is very useful, though it would also be valuable to control these reservations through the master (instead of per-slave) for the case where I don't care which nodes I get, as long as I get X CPU and Y RAM, or Z sets of (X, Y). I'm not sure what structure this could take, but apparently it has already been discussed. Would this be a CLI flag? Could there be an (authenticated) web interface to control these reservations? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3284) JSON representation of Protobuf should use base64 encoding for 'bytes' fields.
[ https://issues.apache.org/jira/browse/MESOS-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3284: --- Sprint: Twitter Mesos Q3 Sprint 3 Story Points: 3 JSON representation of Protobuf should use base64 encoding for 'bytes' fields. -- Key: MESOS-3284 URL: https://issues.apache.org/jira/browse/MESOS-3284 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: twitter Currently we encode 'bytes' fields as UTF-8 strings, which is lossy for binary data due to invalid byte sequences! In order to encode binary data in a lossless fashion, we can encode 'bytes' fields in base64. Note that this is also how proto3 does its encoding (see [here|https://developers.google.com/protocol-buffers/docs/proto3?hl=en#json]), so this would make migration easier as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
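Why UTF-8 is lossy here: arbitrary byte strings can contain sequences that are not valid UTF-8, and any replacement strategy destroys the original bytes, while base64 round-trips everything. A quick illustration (Python, since its standard library keeps the sketch short; stout itself is C++):

```python
import base64

# Arbitrary binary data, including byte sequences invalid in UTF-8.
payload = b'\x80\xff\x00binary'

# Decoding as UTF-8 with replacement loses information permanently.
lossy = payload.decode('utf-8', errors='replace').encode('utf-8')
assert lossy != payload

# Base64 is a pure ASCII encoding, so it round-trips any byte string.
encoded = base64.b64encode(payload).decode('ascii')
assert base64.b64decode(encoded) == payload
```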
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700428#comment-14700428 ] Vinod Kone commented on MESOS-3273: --- Review for the first problem: https://reviews.apache.org/r/37559/ EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 
26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 --credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file 
'/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status I0813 19:55:17.185972 26100 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix I0813
[jira] [Updated] (MESOS-3073) Introduce HTTP endpoints for Quota
[ https://issues.apache.org/jira/browse/MESOS-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3073: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Introduce HTTP endpoints for Quota -- Key: MESOS-3073 URL: https://issues.apache.org/jira/browse/MESOS-3073 Project: Mesos Issue Type: Improvement Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We need to implement the HTTP endpoints for Quota as outlined in the Design Doc: (https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2949) Design generalized Authorizer interface
[ https://issues.apache.org/jira/browse/MESOS-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2949: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Design generalized Authorizer interface --- Key: MESOS-2949 URL: https://issues.apache.org/jira/browse/MESOS-2949 Project: Mesos Issue Type: Task Components: master, security Reporter: Alexander Rojas Assignee: Alexander Rojas Labels: acl, mesosphere, security As mentioned in MESOS-2948 the current {{mesos::Authorizer}} interface is rather inflexible if new _Actions_ or _Objects_ need to be added. A new API needs to be designed in a way that allows for arbitrary _Actions_ and _Objects_ to be added to the authorization mechanism without having to recompile mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3223) Implement token manager for docker registry
[ https://issues.apache.org/jira/browse/MESOS-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3223: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Implement token manager for docker registry --- Key: MESOS-3223 URL: https://issues.apache.org/jira/browse/MESOS-3223 Project: Mesos Issue Type: Task Components: containerization, docker Environment: linux Reporter: Jojy Varghese Assignee: Jojy Varghese Labels: mesosphere Implement the following:
- A component that fetches the JSON web authorization token from a given registry.
- Caches the token keyed on registry, service and scope.
- Validates the cache for expiry date.
Nice to have:
- Cache gets pruned as tokens age beyond their expiration time.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
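The listed behavior can be sketched as a small cache with an injected clock (Python sketch; `TokenCache` and its methods are hypothetical names, not the proposed Mesos component):

```python
import time

class TokenCache:
    """Sketch of a registry token cache keyed on (registry, service, scope),
    pruning entries past their expiry (hypothetical names, not a Mesos API)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._tokens = {}  # (registry, service, scope) -> (token, expires_at)

    def put(self, registry, service, scope, token, expires_at):
        self._tokens[(registry, service, scope)] = (token, expires_at)

    def get(self, registry, service, scope):
        """Return a cached token, or None if absent or expired.
        Expired entries are pruned on access."""
        key = (registry, service, scope)
        entry = self._tokens.get(key)
        if entry is None:
            return None
        token, expires_at = entry
        if expires_at <= self._clock():
            del self._tokens[key]  # prune the aged-out token
            return None
        return token

# An injected clock makes expiry deterministic.
now = [1000.0]
cache = TokenCache(clock=lambda: now[0])
cache.put("registry-1.docker.io", "registry.docker.io",
          "repository:lib/ubuntu:pull", "jwt-abc", expires_at=1060.0)
assert cache.get("registry-1.docker.io", "registry.docker.io",
                 "repository:lib/ubuntu:pull") == "jwt-abc"
now[0] = 2000.0  # past expiry: the entry is pruned
assert cache.get("registry-1.docker.io", "registry.docker.io",
                 "repository:lib/ubuntu:pull") is None
```

Injecting the clock is what makes the "validates the cache for expiry date" behavior unit-testable without real waiting.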
[jira] [Updated] (MESOS-2937) Initial design document for Quota support in Allocator.
[ https://issues.apache.org/jira/browse/MESOS-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2937: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Initial design document for Quota support in Allocator. --- Key: MESOS-2937 URL: https://issues.apache.org/jira/browse/MESOS-2937 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Labels: mesosphere Create a design document for the Quota feature support in the built-in Hierarchical DRF allocator to be shared with the Mesos community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3092) Configure Jenkins to run Docker tests
[ https://issues.apache.org/jira/browse/MESOS-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3092: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Configure Jenkins to run Docker tests - Key: MESOS-3092 URL: https://issues.apache.org/jira/browse/MESOS-3092 Project: Mesos Issue Type: Improvement Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere Add a Jenkins job to run the Docker tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3021) Implement Docker Image Provisioner Reference Store
[ https://issues.apache.org/jira/browse/MESOS-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3021: --- Sprint: Mesosphere Sprint 14, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 14, Mesosphere Sprint 16) Implement Docker Image Provisioner Reference Store -- Key: MESOS-3021 URL: https://issues.apache.org/jira/browse/MESOS-3021 Project: Mesos Issue Type: Improvement Reporter: Lily Chen Assignee: Lily Chen Labels: mesosphere Create a comprehensive store to look up the image layer ID associated with an image and tag. Implement adding, removing, saving, and updating images and their associated tags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3062) Add authorization for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3062: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add authorization for dynamic reservation - Key: MESOS-3062 URL: https://issues.apache.org/jira/browse/MESOS-3062 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Dynamic reservations should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Reserve}} and {{Unreserve}} into the ACL.
{code}
message Reserve {
  // Subjects.
  required Entity principals = 1;

  // Objects. MVP: Only possible values = ANY, NONE.
  required Entity resources = 2;
}

message Unreserve {
  // Subjects.
  required Entity principals = 1;

  // Objects.
  required Entity reserver_principals = 2;
}
{code}
When a framework/operator reserves resources, reserve ACLs are checked to see if the framework ({{FrameworkInfo.principal}}) or the operator ({{Credential.user}}) is authorized to reserve the specified resources. If not authorized, the reserve operation is rejected. When a framework/operator unreserves resources, unreserve ACLs are checked to see if the framework ({{FrameworkInfo.principal}}) or the operator ({{Credential.user}}) is authorized to unreserve the resources reserved by a framework or operator ({{Resource.ReservationInfo.principal}}). If not authorized, the unreserve operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
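The authorization flow described above, reduced to a sketch (Python for brevity; the real authorizer operates on the ACL protobufs, and these helper names are hypothetical):

```python
ANY, NONE = "ANY", "NONE"

def authorized_to_reserve(acls, principal):
    """First matching Reserve ACL wins; its resources entity is ANY or
    NONE in the MVP described above (sketch, not the Mesos authorizer)."""
    for acl in acls:
        if principal in acl["principals"]:
            return acl["resources"] == ANY
    return False

def authorized_to_unreserve(acls, principal, reserver_principal):
    """An Unreserve ACL lets `principal` release resources reserved by
    any of the listed reserver principals."""
    for acl in acls:
        if principal in acl["principals"]:
            return reserver_principal in acl["reserver_principals"]
    return False

reserve_acls = [{"principals": {"test-principal"}, "resources": ANY}]
unreserve_acls = [{"principals": {"ops"},
                   "reserver_principals": {"test-principal"}}]

assert authorized_to_reserve(reserve_acls, "test-principal")
assert not authorized_to_reserve(reserve_acls, "stranger")
assert authorized_to_unreserve(unreserve_acls, "ops", "test-principal")
assert not authorized_to_unreserve(unreserve_acls, "ops", "other")
```

The unreserve check is the interesting half: it compares the requester's principal against the principal recorded in the reservation, mirroring {{Resource.ReservationInfo.principal}} above.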
[jira] [Updated] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.
[ https://issues.apache.org/jira/browse/MESOS-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2066: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add optional 'Unavailability' to resource offers to provide maintenance awareness. -- Key: MESOS-2066 URL: https://issues.apache.org/jira/browse/MESOS-2066 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere In order to inform frameworks about upcoming maintenance on offered resources, per MESOS-1474, we'd like to add optional 'Unavailability' information to offers:
{code}
message Interval {
  optional double start = 1;     // Time, in seconds since the Epoch.
  optional double duration = 2;  // Time, in seconds.
}

message Offer {
  // Existing fields ...

  // Signifies that the resources in this Offer are part of a planned
  // maintenance schedule in the specified window. Any tasks launched
  // using these resources may be killed when the window arrives.
  // This field gives additional information about the maintenance.
  // The maintenance may not necessarily start exactly at this interval,
  // nor last for exactly the duration of this interval.
  optional Interval unavailability = 9;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
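How a framework might consume the proposed field: compare a task's expected completion time against the window's start. A hedged sketch (Python; `task_at_risk` is a hypothetical helper, and a real scheduler would also weigh the window's duration and the cost of restarting elsewhere):

```python
def task_at_risk(launch_time, expected_runtime, unavailability):
    """True if a task launched now may still be running when the
    maintenance window begins (sketch; field names mirror the proposed
    Interval message: start and duration, both in seconds)."""
    start = unavailability["start"]
    task_end = launch_time + expected_runtime
    return task_end > start

window = {"start": 5000.0, "duration": 600.0}

# A 300s task launched at t=4800 would still be running at t=5000.
assert task_at_risk(4800.0, 300.0, window)
# A 100s task finishes at t=4900, before the window starts.
assert not task_at_risk(4800.0, 100.0, window)
```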
[jira] [Updated] (MESOS-2061) Add InverseOffer protobuf message.
[ https://issues.apache.org/jira/browse/MESOS-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2061: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add InverseOffer protobuf message. -- Key: MESOS-2061 URL: https://issues.apache.org/jira/browse/MESOS-2061 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
{code}
/**
 * A request to return some resources occupied by a framework.
 */
message InverseOffer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;

  // A list of resources being requested back from the framework.
  repeated Resource resources = 3;

  // Specified if the resources need to be released from a particular slave.
  optional SlaveID slave_id = 4;

  // The resources in this InverseOffer are part of a planned maintenance
  // schedule in the specified window. Any tasks running using these
  // resources may be killed when the window arrives.
  optional Interval unavailability = 5;
}
{code}
This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2200) bogus docker images result in bad error message to scheduler
[ https://issues.apache.org/jira/browse/MESOS-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2200: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) bogus docker images result in bad error message to scheduler Key: MESOS-2200 URL: https://issues.apache.org/jira/browse/MESOS-2200 Project: Mesos Issue Type: Bug Components: containerization, docker Reporter: Jay Buffington Assignee: Joerg Schad Labels: mesosphere When a scheduler specifies a bogus image in ContainerInfo mesos doesn't tell the scheduler that the docker pull failed or why. This error is logged in the mesos-slave log, but it isn't given to the scheduler (as far as I can tell): {noformat} E1218 23:50:55.406230 8123 slave.cpp:2730] Container '8f70784c-3e40-4072-9ca2-9daed23f15ff' for executor 'thermos-1418946354013-xxx-xxx-curl-0-f500cc41-dd0a-4338-8cbc-d631cb588bb1' of framework '20140522-213145-1749004561-5050-29512-' failed to start: Failed to 'docker pull docker-registry.example.com/doesntexist/hello1.1:latest': exit status = exited with status 1 stderr = 2014/12/18 23:50:55 Error: image doesntexist/hello1.1 not found {noformat} If the docker image is not in the registry, the scheduler should give the user an error message. If docker pull failed because of networking issues, it should be retried. Mesos should give the scheduler enough information to be able to make that decision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3227) Implement image chroot support into command executor
[ https://issues.apache.org/jira/browse/MESOS-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3227: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Implement image chroot support into command executor Key: MESOS-3227 URL: https://issues.apache.org/jira/browse/MESOS-3227 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2455) Add operator endpoint to destroy persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2455: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add operator endpoint to destroy persistent volumes. Key: MESOS-2455 URL: https://issues.apache.org/jira/browse/MESOS-2455 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Michael Park Priority: Critical Labels: mesosphere Persistent volumes will not be released automatically. So we probably need an endpoint for operators to forcefully release persistent volumes. We probably need to add principal to Persistence struct and use ACLs to control who can release what. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3251) http::get API evaluates host wrongly
[ https://issues.apache.org/jira/browse/MESOS-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3251: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) http::get API evaluates host wrongly -- Key: MESOS-3251 URL: https://issues.apache.org/jira/browse/MESOS-3251 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Jojy Varghese Assignee: Jojy Varghese Labels: mesosphere Currently libprocess http API sets the Host header field from the peer socket address (IP:port). The problem is that socket address might not be right HTTP server and might be just a proxy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3042) Master/Allocator does not send InverseOffers to resources to be maintained
[ https://issues.apache.org/jira/browse/MESOS-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3042: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Master/Allocator does not send InverseOffers to resources to be maintained -- Key: MESOS-3042 URL: https://issues.apache.org/jira/browse/MESOS-3042 Project: Mesos Issue Type: Task Components: allocation, master Reporter: Joseph Wu Assignee: Joris Van Remoortere Labels: mesosphere Offers are currently sent from the master/allocator to frameworks via ResourceOffersMessage's. InverseOffers, which are roughly equivalent to negative Offers, can be sent in the same package. In src/messages/messages.proto:
{code}
message ResourceOffersMessage {
  repeated Offer offers = 1;
  repeated string pids = 2;

  // New field with InverseOffers.
  repeated InverseOffer inverseOffers = 3;
}
{code}
Sent InverseOffers can be tracked in the master's local state, i.e. in src/master/master.hpp:
{code}
struct Slave {
  ... // Existing fields.

  // Active InverseOffers on this slave.
  // Similar pattern to the offers field.
  hashset<InverseOffer*> inverseOffers;
}
{code}
One actor (master or allocator) should populate the new InverseOffers field.
* In the master (src/master/master.cpp):
** Master::offer is where the ResourceOffersMessage and Offer object is constructed.
** The same method could also check for maintenance and send InverseOffers.
* In the allocator (src/master/allocator/mesos/hierarchical.hpp):
** HierarchicalAllocatorProcess::allocate is where slave resources are aggregated and sent off to the frameworks.
** InverseOffers (i.e. negative resources) allocation could be calculated in this method.
** A change to Master::offer (i.e. the offerCallback) may be necessary to account for the negative resources.
Possible test(s):
* InverseOfferTest
** Start master, slave, framework.
** Accept resource offer, start task.
** Set maintenance schedule to the future.
** Check that InverseOffer(s) are sent to the framework.
** Decline InverseOffer.
** Check that more InverseOffer(s) are sent.
** Accept InverseOffer.
** Check that more InverseOffer(s) are sent.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
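The allocation step described above (emit an InverseOffer for every framework holding resources on a slave scheduled for maintenance) can be sketched as follows (Python, with plain dicts standing in for the protobufs; all names are hypothetical):

```python
def build_inverse_offers(allocations, maintenance_schedule):
    """For each framework holding resources on a slave scheduled for
    maintenance, emit an InverseOffer-like record asking those resources
    back (sketch of the allocator step described above, not Mesos code)."""
    inverse_offers = []
    for (framework_id, slave_id), resources in allocations.items():
        window = maintenance_schedule.get(slave_id)
        if window is not None:
            inverse_offers.append({
                "framework_id": framework_id,
                "slave_id": slave_id,
                "resources": resources,
                "unavailability": window,
            })
    return inverse_offers

# fw-1 holds resources on two slaves; only slave-A is scheduled for
# maintenance, so only slave-A's resources are asked back.
allocations = {
    ("fw-1", "slave-A"): {"cpus": 4, "mem": 1024},
    ("fw-1", "slave-B"): {"cpus": 2, "mem": 512},
}
schedule = {"slave-A": {"start": 5000.0, "duration": 600.0}}

offers = build_inverse_offers(allocations, schedule)
assert len(offers) == 1
assert offers[0]["slave_id"] == "slave-A"
```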
[jira] [Updated] (MESOS-3066) Replicated registry does not have a representation of maintenance schedules
[ https://issues.apache.org/jira/browse/MESOS-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3066: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Replicated registry does not have a representation of maintenance schedules --- Key: MESOS-3066 URL: https://issues.apache.org/jira/browse/MESOS-3066 Project: Mesos Issue Type: Task Components: master, replicated log Reporter: Joseph Wu Assignee: Joseph Wu Labels: mesosphere In order to persist maintenance schedules across failovers of the master, the schedule information must be kept in the replicated registry. This means adding an additional message in the Registry protobuf in src/master/registry.proto. The status of each individual slave's maintenance will also be persisted in this way.
{code}
message Maintenance {
  message HostStatus {
    required string hostname = 1;

    // True if the slave is deactivated for maintenance.
    // False if the slave is draining in preparation for maintenance.
    required bool is_down = 2;  // Or an enum.
  }

  message Schedule {
    // The set of affected slave(s).
    repeated HostStatus hosts = 1;

    // Interval in which this set of slaves is expected to be down for.
    optional Unavailability interval = 2;
  }

  message Schedules {
    repeated Schedule schedules = 1;
  }

  optional Schedules schedules = 1;
}
{code}
Note: There can be multiple SlaveID's attached to a single hostname. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
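Recovering per-host status from the proposed Schedules layout would look roughly like this after registry recovery (Python sketch with dicts standing in for the protobuf messages; `host_statuses` is a hypothetical name):

```python
def host_statuses(schedules):
    """Flatten the proposed Schedules layout into a hostname -> status map
    (sketch; mirrors the HostStatus/Schedule structure above). Keying by
    hostname matches the note that multiple SlaveIDs can share a host."""
    statuses = {}
    for schedule in schedules:
        for host in schedule["hosts"]:
            statuses[host["hostname"]] = {
                "is_down": host["is_down"],
                "interval": schedule.get("interval"),
            }
    return statuses

schedules = [{
    "hosts": [{"hostname": "rack1-node7", "is_down": False}],  # draining
    "interval": {"start": 5000.0, "duration": 600.0},
}]
statuses = host_statuses(schedules)
assert statuses["rack1-node7"]["is_down"] is False
```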
[jira] [Updated] (MESOS-3064) Add 'principal' field to 'Resource.DiskInfo'
[ https://issues.apache.org/jira/browse/MESOS-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3064: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add 'principal' field to 'Resource.DiskInfo' Key: MESOS-3064 URL: https://issues.apache.org/jira/browse/MESOS-3064 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere In order to support authorization for persistent volumes, we should add the {{principal}} to {{Resource.DiskInfo}}, analogous to {{Resource.ReservationInfo.principal}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2964) libprocess io does not support peek()
[ https://issues.apache.org/jira/browse/MESOS-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2964: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) libprocess io does not support peek() - Key: MESOS-2964 URL: https://issues.apache.org/jira/browse/MESOS-2964 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Artem Harutyunyan Assignee: Artem Harutyunyan Priority: Minor Labels: beginner, mesosphere, newbie Finally, I so wish we could just do:
{code}
io::peek(request->socket, 6)
  .then([request](const string& data) {
    // Comment about the rules ...
    if (data.length() < 2) {
      // Rule 1.
    } else if (...) {
      // Rule 2.
    } else if (...) {
      // Rule 3.
    }

    if (ssl) {
      accept_SSL_callback(request);
    } else {
      ...;
    }
  });
{code}
from: https://reviews.apache.org/r/31207/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
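The peek semantics being requested already exist at the socket layer as MSG_PEEK: bytes are returned without being consumed, so the normal read path still sees them afterwards. That is what lets the snippet above sniff the first bytes to choose SSL vs. plain handling. A demonstration (Python, using a local socket pair):

```python
import socket

# peek(n) reads up to n bytes without consuming them: a later recv still
# sees the same bytes. This is the kernel-level MSG_PEEK behavior that an
# io::peek in libprocess would wrap.
a, b = socket.socketpair()
a.sendall(b"\x16\x03\x01ABC")  # 0x16 0x03 0x01: start of a TLS handshake record

peeked = b.recv(6, socket.MSG_PEEK)
assert peeked == b"\x16\x03\x01ABC"

# The data is still available for the normal read path.
consumed = b.recv(6)
assert consumed == peeked

a.close()
b.close()
```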
[jira] [Updated] (MESOS-3086) Create cgroups TasksKiller for non freeze subsystems.
[ https://issues.apache.org/jira/browse/MESOS-3086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3086: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Create cgroups TasksKiller for non freeze subsystems. - Key: MESOS-3086 URL: https://issues.apache.org/jira/browse/MESOS-3086 Project: Mesos Issue Type: Bug Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We see a number of test failures when cgroups cannot be removed (because related tasks are still running) on systems where the freezer subsystem is not available. In the current code (https://github.com/apache/mesos/blob/0.22.1/src/linux/cgroups.cpp#L1728) we fall back to a very simple mechanism of recursively trying to remove the cgroups, which fails if there are still tasks running. Therefore we need an additional (NonFreeze)TasksKiller which doesn't rely on the freezer subsystem. This problem caused issues when running 'sudo make check' during 0.23 release testing, where BenH already provided a better error message with b1a23d6a52c31b8c5c840ab01902dbe00cb1feef / https://reviews.apache.org/r/36604. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
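Without the freezer, killing is inherently racy: a task can fork between listing and killing, which is why freezer-less killers re-list and retry. A testable sketch with injected list/kill functions (Python; `kill_cgroup_tasks` is a hypothetical name, and a real implementation would read the cgroup's cgroup.procs file and send SIGKILL):

```python
def kill_cgroup_tasks(list_pids, kill, attempts=3):
    """Repeatedly kill every task in a cgroup until none remain, without
    relying on the freezer subsystem. Racy by construction: a task may
    fork before dying, so we re-list and retry (sketch; list_pids and
    kill are injected so the loop is testable)."""
    for _ in range(attempts):
        pids = list_pids()
        if not pids:
            return True
        for pid in pids:
            kill(pid)
    return not list_pids()

# Simulate a cgroup where one task forks a child before dying.
state = {"pids": {101, 102}, "forked": False}

def fake_list():
    return set(state["pids"])

def fake_kill(pid):
    state["pids"].discard(pid)
    if pid == 101 and not state["forked"]:
        state["pids"].add(103)  # a fork raced with the kill
        state["forked"] = True

assert kill_cgroup_tasks(fake_list, fake_kill) is True
assert state["pids"] == set()
```

The retry loop is exactly what the freezer makes unnecessary: freezing stops all tasks first, so no fork can race with the kill.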
[jira] [Updated] (MESOS-3074) Check satisfiability of quota requests in Master
[ https://issues.apache.org/jira/browse/MESOS-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3074: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Check satisfiability of quota requests in Master Key: MESOS-3074 URL: https://issues.apache.org/jira/browse/MESOS-3074 Project: Mesos Issue Type: Improvement Reporter: Joerg Schad Assignee: Alexander Rukletsov Labels: mesosphere We need to validate quota requests in the Mesos Master as outlined in the Design Doc: https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I This ticket aims to validate the satisfiability (in terms of available resources) of a quota request using a heuristic algorithm in the Mesos Master, rather than validating the syntax of the request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
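The heuristic nature of the check is worth spelling out: summing free resources across agents ignores fragmentation, so passing the check does not guarantee the quota can actually be laid out on any single agent. A sketch (Python; `quota_satisfiable` is a hypothetical name, not the Mesos code):

```python
def quota_satisfiable(request, agents):
    """Heuristic satisfiability check: the quota request fits if, for each
    resource, the total free amount across all agents covers it. This
    deliberately ignores fragmentation (a request needing 4 CPUs on one
    agent may pass even if no single agent has 4 free CPUs), which is why
    it is a heuristic and not a guarantee."""
    totals = {}
    for agent in agents:
        for name, amount in agent.items():
            totals[name] = totals.get(name, 0) + amount
    return all(totals.get(name, 0) >= amount
               for name, amount in request.items())

agents = [{"cpus": 4, "mem": 8192}, {"cpus": 2, "mem": 4096}]

# 6 CPUs total across agents, so the aggregate check passes...
assert quota_satisfiable({"cpus": 6, "mem": 12288}, agents)
# ...but 8 CPUs exceed the cluster total and are rejected.
assert not quota_satisfiable({"cpus": 8}, agents)
```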
[jira] [Updated] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3050: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: 
false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or 
directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 [ FAILED ]
[jira] [Updated] (MESOS-3015) Add hooks for Slave exits
[ https://issues.apache.org/jira/browse/MESOS-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3015: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Add hooks for Slave exits - Key: MESOS-3015 URL: https://issues.apache.org/jira/browse/MESOS-3015 Project: Mesos Issue Type: Task Reporter: Kapil Arya Assignee: Kapil Arya Labels: mesosphere The hook will be triggered on slave exits. A master hook module can use this to do Slave-specific cleanups. In our particular use case, the hook would trigger cleanup of IPs assigned to the given Slave (see the [design doc | https://docs.google.com/document/d/17mXtAmdAXcNBwp_JfrxmZcQrs7EO6ancSbejrqjLQ0g/edit#]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2600) Add /reserve and /unreserve endpoints on the master for dynamic reservation
[ https://issues.apache.org/jira/browse/MESOS-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2600: --- Sprint: Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 10, Mesosphere Sprint 11, Mesosphere Sprint 15, Mesosphere Sprint 16) Add /reserve and /unreserve endpoints on the master for dynamic reservation --- Key: MESOS-2600 URL: https://issues.apache.org/jira/browse/MESOS-2600 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Priority: Critical Labels: mesosphere Enable operators to manage dynamic reservations by introducing the {{/reserve}} and {{/unreserve}} HTTP endpoints on the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3069) Registry operations do not exist for manipulating maintenance schedules
[ https://issues.apache.org/jira/browse/MESOS-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3069: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Registry operations do not exist for manipulating maintenance schedules --- Key: MESOS-3069 URL: https://issues.apache.org/jira/browse/MESOS-3069 Project: Mesos Issue Type: Task Components: master, replicated log Reporter: Joseph Wu Assignee: Joseph Wu Labels: mesosphere In order to modify the maintenance schedule in the replicated registry, we will need Operations (src/master/registrar.hpp). The operations will likely correspond to the HTTP API: * UpdateMaintenanceSchedule: Given a blob representing a maintenance schedule, perform some verification on the blob. Write the blob to the registry. * StartMaintenance: Given a set of machines, verify then transition machines from Draining to Deactivated. * StopMaintenance: Given a set of machines, verify then transition machines from Deactivated to Normal. Remove affected machines from the schedule(s). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3164) Introduce QuotaInfo message
[ https://issues.apache.org/jira/browse/MESOS-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3164: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 15, Mesosphere Sprint 16) Introduce QuotaInfo message --- Key: MESOS-3164 URL: https://issues.apache.org/jira/browse/MESOS-3164 Project: Mesos Issue Type: Task Components: master Reporter: Alexander Rukletsov Assignee: Joerg Schad Labels: mesosphere A {{QuotaInfo}} protobuf message is the internal representation for quota-related information (e.g. for persisting quota). The protobuf message should be extendable for future needs and allow for easy aggregation across roles and operator principals. It may also be used to pass quota information to allocators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2067) Add HTTP API to the master for maintenance operations.
[ https://issues.apache.org/jira/browse/MESOS-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700461#comment-14700461 ] Joseph Wu commented on MESOS-2067: -- Just realized it might be useful to have an endpoint which retrieves the latest accept/decline info of the maintenance schedule. (Updated description) Add HTTP API to the master for maintenance operations. -- Key: MESOS-2067 URL: https://issues.apache.org/jira/browse/MESOS-2067 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere, twitter Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2067) Add HTTP API to the master for maintenance operations.
[ https://issues.apache.org/jira/browse/MESOS-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-2067: - Description: Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. was: Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. 
{code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. Add HTTP API to the master for maintenance operations. -- Key: MESOS-2067 URL: https://issues.apache.org/jira/browse/MESOS-2067 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Joseph Wu Labels: mesosphere, twitter Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos. For the MVP, we'll want something like this for manipulating the schedule: {code} /maintenance/schedule GET - returns the schedule, which will include the various maintenance windows. POST - create or update the schedule with a JSON blob (see below). /maintenance/status GET - returns a list of machines and their maintenance mode. /maintenance/start POST - Transition a set of machines from Draining into Deactivated mode. /maintenance/stop POST - Transition a set of machines from Deactivated into Normal mode. /maintenance/consensus - (Not sure what the right name is. matrix? acceptance?) GET - Returns the latest info on which frameworks have accepted or declined the maintenance schedule. {code} (Note: The slashes in URLs might not be supported yet.) A schedule might look like: {code} { windows : [ { machines : [ { ip : 192.168.0.1 }, { hostname : localhost }, ... ], unavailability : { start : 12345, // Epoch seconds. duration : 1000 // Seconds. } }, ... ] } {code} There should be firewall settings such that only those with access to master can use these endpoints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2769) Metric for cpu scheduling latency from all components
[ https://issues.apache.org/jira/browse/MESOS-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700359#comment-14700359 ] Cong Wang commented on MESOS-2769: -- https://reviews.apache.org/r/37540/ https://reviews.apache.org/r/37541/ Metric for cpu scheduling latency from all components - Key: MESOS-2769 URL: https://issues.apache.org/jira/browse/MESOS-2769 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.1 Reporter: Ian Downes Assignee: Cong Wang Labels: twitter The metric will provide statistics on the scheduling latency for processes/threads in a container, i.e., statistics on the delay before application code can run. This will be the aggregate effect of the normal scheduling period, contention from other threads/processes, both in the container and on the system, and any effects from the CFS bandwidth control (if enabled) or other CPU isolation strategies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3065: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add authorization for persistent volume --- Key: MESOS-3065 URL: https://issues.apache.org/jira/browse/MESOS-3065 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Persistent volume should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Create}} and {{Destroy}} into the ACL. {code} message Create { // Subjects. required Entity principals = 1; // Objects? Perhaps the kind of volume? allowed permissions? } message Unreserve { // Subjects. required Entity principals = 1; // Objects. required Entity creator_principals = 2; } {code} When a framework/operator creates a persistent volume, create ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to create persistent volumes. If not authorized, the create operation is rejected. When a framework/operator destroys a persistent volume, destroy ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to destroy the persistent volume created by a framework or operator (Resource.DiskInfo.principal). If not authorized, the destroy operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3015) Add hooks for Slave exits
[ https://issues.apache.org/jira/browse/MESOS-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3015: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16 (was: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17) Add hooks for Slave exits - Key: MESOS-3015 URL: https://issues.apache.org/jira/browse/MESOS-3015 Project: Mesos Issue Type: Task Reporter: Kapil Arya Assignee: Kapil Arya Labels: mesosphere The hook will be triggered on slave exits. A master hook module can use this to do Slave-specific cleanups. In our particular use case, the hook would trigger cleanup of IPs assigned to the given Slave (see the [design doc | https://docs.google.com/document/d/17mXtAmdAXcNBwp_JfrxmZcQrs7EO6ancSbejrqjLQ0g/edit#]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3050: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: 
false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or 
directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 [ FAILED ]
[jira] [Commented] (MESOS-3217) Replace boost unordered_{set,map} and hash with std versions.
[ https://issues.apache.org/jira/browse/MESOS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700446#comment-14700446 ] Marco Massenzio commented on MESOS-3217: Has there been any progress on this one? There are 3x reviews out for 10 days now, any expected resolution? Replace boost unordered_{set,map} and hash with std versions. - Key: MESOS-3217 URL: https://issues.apache.org/jira/browse/MESOS-3217 Project: Mesos Issue Type: Task Components: stout Reporter: Michael Park Assignee: Jan Schlicht Labels: mesosphere As part of C++11 upgrade, we should replace boost {{unordered_\{set,map\}}} and {{hash}} with their standard counterparts. Aside from reducing the dependency on {{boost}} from Mesos internals, this is also beneficial in removing (at least, reducing) the dependency on {{boost}} in our public header files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3284) JSON representation of Protobuf should use base64 encoding for 'bytes' fields.
Benjamin Mahler created MESOS-3284: -- Summary: JSON representation of Protobuf should use base64 encoding for 'bytes' fields. Key: MESOS-3284 URL: https://issues.apache.org/jira/browse/MESOS-3284 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler Currently we encode 'bytes' fields as UTF-8 strings, which is lossy for binary data due to invalid byte sequences! In order to encode binary data in a lossless fashion, we can encode 'bytes' fields in base64. Note that this is also how proto3 does its encoding (see [here|https://developers.google.com/protocol-buffers/docs/proto3?hl=en#json]), so this would make migration easier as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700438#comment-14700438 ] Marco Massenzio commented on MESOS-3136: Just noticed this entirely randomly... I would strongly suggest to avoid skipping a version between 0.22 / 0.24 as the Leader Election would be terminally broken: we transitioned to JSON in ZK for {{MasterInfo}} and while the 0.22 -- 0.23 -- 0.24 chain all works just fine, skipping 0.23 would create no end of grief. (I'm almost sure other stuff around HTTP API would break, but not sure there). My 2c COMMAND health checks with Marathon 0.10.0 are broken - Key: MESOS-3136 URL: https://issues.apache.org/jira/browse/MESOS-3136 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Dr. Stefan Schimanski Assignee: haosdent Priority: Critical Labels: mesosphere When deploying Mesos 0.23rc4 with latest Marathon 0.10.0 RC3 command health check stop working. Rolling back to Mesos 0.22.1 fixes the problem. Containerizer is Docker. All packages are from official Mesosphere Ubuntu 14.04 sources. The issue must be analyzed further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700440#comment-14700440 ] Marco Massenzio commented on MESOS-3136: [~vinodkone] does this need to be fixed before 0.24 is out? COMMAND health checks with Marathon 0.10.0 are broken - Key: MESOS-3136 URL: https://issues.apache.org/jira/browse/MESOS-3136 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Dr. Stefan Schimanski Assignee: haosdent Priority: Critical Labels: mesosphere When deploying Mesos 0.23rc4 with latest Marathon 0.10.0 RC3 command health check stop working. Rolling back to Mesos 0.22.1 fixes the problem. Containerizer is Docker. All packages are from official Mesosphere Ubuntu 14.04 sources. The issue must be analyzed further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3227) Implement image chroot support into command executor
[ https://issues.apache.org/jira/browse/MESOS-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3227: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Implement image chroot support into command executor Key: MESOS-3227 URL: https://issues.apache.org/jira/browse/MESOS-3227 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3064) Add 'principal' field to 'Resource.DiskInfo'
[ https://issues.apache.org/jira/browse/MESOS-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3064: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add 'principal' field to 'Resource.DiskInfo' Key: MESOS-3064 URL: https://issues.apache.org/jira/browse/MESOS-3064 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere In order to support authorization for persistent volumes, we should add the {{principal}} to {{Resource.DiskInfo}}, analogous to {{Resource.ReservationInfo.principal}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2455) Add operator endpoint to destroy persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2455: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Add operator endpoint to destroy persistent volumes. Key: MESOS-2455 URL: https://issues.apache.org/jira/browse/MESOS-2455 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Michael Park Priority: Critical Labels: mesosphere Persistent volumes will not be released automatically. So we probably need an endpoint for operators to forcefully release persistent volumes. We probably need to add principal to Persistence struct and use ACLs to control who can release what. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3199) Validate Quota Requests.
[ https://issues.apache.org/jira/browse/MESOS-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3199: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Validate Quota Requests. Key: MESOS-3199 URL: https://issues.apache.org/jira/browse/MESOS-3199 Project: Mesos Issue Type: Task Reporter: Joerg Schad Assignee: Joerg Schad Labels: mesosphere We need to validate quota requests in terms of syntax correctness, update Master bookkeeping structures, and persist quota requests in the {{Registry}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2200) bogus docker images result in bad error message to scheduler
[ https://issues.apache.org/jira/browse/MESOS-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2200: --- Sprint: Mesosphere Sprint 15, Mesosphere Sprint 16 (was: Mesosphere Sprint 15, Mesosphere Sprint 16, Mesosphere Sprint 17) bogus docker images result in bad error message to scheduler Key: MESOS-2200 URL: https://issues.apache.org/jira/browse/MESOS-2200 Project: Mesos Issue Type: Bug Components: containerization, docker Reporter: Jay Buffington Assignee: Joerg Schad Labels: mesosphere When a scheduler specifies a bogus image in ContainerInfo mesos doesn't tell the scheduler that the docker pull failed or why. This error is logged in the mesos-slave log, but it isn't given to the scheduler (as far as I can tell): {noformat} E1218 23:50:55.406230 8123 slave.cpp:2730] Container '8f70784c-3e40-4072-9ca2-9daed23f15ff' for executor 'thermos-1418946354013-xxx-xxx-curl-0-f500cc41-dd0a-4338-8cbc-d631cb588bb1' of framework '20140522-213145-1749004561-5050-29512-' failed to start: Failed to 'docker pull docker-registry.example.com/doesntexist/hello1.1:latest': exit status = exited with status 1 stderr = 2014/12/18 23:50:55 Error: image doesntexist/hello1.1 not found {noformat} If the docker image is not in the registry, the scheduler should give the user an error message. If docker pull failed because of networking issues, it should be retried. Mesos should give the scheduler enough information to be able to make that decision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2937) Initial design document for Quota support in Allocator.
[ https://issues.apache.org/jira/browse/MESOS-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-2937: --- Sprint: Mesosphere Sprint 16 (was: Mesosphere Sprint 16, Mesosphere Sprint 17) Initial design document for Quota support in Allocator. --- Key: MESOS-2937 URL: https://issues.apache.org/jira/browse/MESOS-2937 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Alexander Rukletsov Assignee: Alexander Rukletsov Labels: mesosphere Create a design document for the Quota feature support in the built-in Hierarchical DRF allocator to be shared with the Mesos community. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700521#comment-14700521 ] Vinod Kone commented on MESOS-3273: --- commit c532490bfcb4470d0614640031ff854af8876ef6 Author: Vinod Kone vinodk...@gmail.com Date: Mon Aug 17 15:45:51 2015 -0700 Fixed mutex deadlock issue in ~scheduler::Mesos(). Review: https://reviews.apache.org/r/37559 EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! 
Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 
--credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a
[jira] [Created] (MESOS-3285) Master should not accept /scheduler calls when not elected / recovered.
Benjamin Mahler created MESOS-3285: -- Summary: Master should not accept /scheduler calls when not elected / recovered. Key: MESOS-3285 URL: https://issues.apache.org/jira/browse/MESOS-3285 Project: Mesos Issue Type: Bug Components: master Reporter: Benjamin Mahler Priority: Blocker The master currently drops all MessageEvents when it is non-leading or hasn't finished recovering from the registrar (see [here|https://github.com/apache/mesos/blob/0.23.0/src/master/master.cpp#L1076]). The /scheduler HttpEvents should also be dropped in these cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
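A minimal sketch of the guard described in the ticket (the struct and member names here are assumed for illustration; this is not the actual Master implementation): the same "leading and recovered" condition that already causes MessageEvents to be dropped would also apply to /scheduler HttpEvents.

```cpp
#include <cassert>

// Sketch only -- assumed names, not the actual Mesos Master class.
struct Master {
  bool elected;    // Is this master the current leader?
  bool recovered;  // Has recovery from the registrar finished?

  // Returns true when an incoming /scheduler call should be dropped,
  // mirroring the existing guard for MessageEvents.
  bool shouldDropSchedulerCall() const {
    return !elected || !recovered;
  }
};
```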
[jira] [Updated] (MESOS-1010) Python extension build is broken if gflags-dev is installed
[ https://issues.apache.org/jira/browse/MESOS-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1010: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Python extension build is broken if gflags-dev is installed --- Key: MESOS-1010 URL: https://issues.apache.org/jira/browse/MESOS-1010 Project: Mesos Issue Type: Bug Components: build, python api Environment: Fedora 20, amd64, GCC: 4.8.2; OSX 10.10.4, Apple LLVM 6.1.0 (~LLVM 3.6.0) Reporter: Nikita Vetoshkin Assignee: Greg Mann Labels: flaky-test, mesosphere In my environment, a Mesos build from master results in a broken Python API module {{_mesos.so}}:
{noformat}
nekto0n@ya-darkstar ~/workspace/mesos/src/python $ PYTHONPATH=build/lib.linux-x86_64-2.7/ python -c "import _mesos"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /home/nekto0n/workspace/mesos/src/python/build/lib.linux-x86_64-2.7/_mesos.so: undefined symbol: _ZN6google14FlagRegistererC1EPKcS2_S2_S2_PvS3_
{noformat}
The unmangled version of the symbol looks like this:
{noformat}
google::FlagRegisterer::FlagRegisterer(char const*, char const*, char const*, char const*, void*, void*)
{noformat}
During the {{./configure}} step, {{glog}} finds {{gflags}} development files and starts using them, thus *implicitly* adding a dependency on {{libgflags.so}}. This breaks the Python extension module and can perhaps break other Mesos subsystems when moved to hosts without {{gflags}} installed. This task is done when the ExamplesTest.PythonFramework test passes on a system with gflags installed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-3065: --- Sprint: Mesosphere Sprint 16, Mesosphere Sprint 17 (was: Mesosphere Sprint 16) Add authorization for persistent volume --- Key: MESOS-3065 URL: https://issues.apache.org/jira/browse/MESOS-3065 Project: Mesos Issue Type: Task Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Persistent volumes should be authorized with the {{principal}} of the reserving entity (framework or master). The idea is to introduce {{Create}} and {{Destroy}} into the ACL.
{code}
message Create {
  // Subjects.
  required Entity principals = 1;

  // Objects? Perhaps the kind of volume? allowed permissions?
}

message Unreserve {
  // Subjects.
  required Entity principals = 1;

  // Objects.
  required Entity creator_principals = 2;
}
{code}
When a framework/operator creates a persistent volume, create ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to create persistent volumes. If not authorized, the create operation is rejected. When a framework/operator destroys a persistent volume, destroy ACLs are checked to see if the framework (FrameworkInfo.principal) or the operator (Credential.user) is authorized to destroy the persistent volume created by a framework or operator (Resource.DiskInfo.principal). If not authorized, the destroy operation is rejected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
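The proposed check could look roughly like the following sketch (the names are assumed for illustration; this is not the Mesos ACL implementation): the principal taken from FrameworkInfo.principal or Credential.user is matched against the Create ACL's subject principals, and an unmatched principal is rejected.

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch of a Create ACL check (assumed names, not Mesos code).
struct CreateACL {
  std::set<std::string> principals;  // Subjects allowed to create volumes.

  // Returns true when the requesting principal may create a persistent
  // volume; a create operation from any other principal is rejected.
  bool authorized(const std::string& principal) const {
    return principals.count(principal) > 0;
  }
};
```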
[jira] [Commented] (MESOS-3230) Create a HTTP based Authentication design doc
[ https://issues.apache.org/jira/browse/MESOS-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700443#comment-14700443 ] Marco Massenzio commented on MESOS-3230: [~arojas] - does your comment mean this story can be Resolved? Create a HTTP based Authentication design doc - Key: MESOS-3230 URL: https://issues.apache.org/jira/browse/MESOS-3230 Project: Mesos Issue Type: Task Components: security Reporter: Alexander Rojas Assignee: Alexander Rojas Labels: mesosphere Since most of the communication between mesosphere components will happen through HTTP with the arrival of the [HTTP API|https://issues.apache.org/jira/browse/MESOS-2288], it makes sense to use HTTP standard mechanisms to authenticate this communication. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:38 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright... Also, when this is fixed, can we get a patch for the 0.22.1 RPM?
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
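For illustration, a check that reads only the leading digits of the version string sidesteps the three-component assumption discussed above. This is a hypothetical helper, not the Mesos or Docker code (the function name and return convention are invented here):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: extract the major version from a `docker --version`
// line such as "Docker version 1.8.1, build d12ea79". Parsing digits only
// up to the first non-digit avoids assuming exactly three dot-separated
// numeric components (distro-patched versions like "1.6.2.fc21" have four).
// Returns -1 when the output is unrecognized.
int parseDockerMajor(const std::string& output) {
  const std::string prefix = "Docker version ";
  if (output.compare(0, prefix.size(), prefix) != 0) {
    return -1;  // Unrecognized output.
  }
  const std::string rest = output.substr(prefix.size());
  char* end = NULL;
  long major = std::strtol(rest.c_str(), &end, 10);
  if (end == rest.c_str()) {
    return -1;  // No leading digits after the prefix.
  }
  return static_cast<int>(major);
}
```

A caller would then reject anything with a major version below 1 instead of failing to parse the string at all.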
[jira] [Commented] (MESOS-3037) Add a QUIESCE call to the scheduler
[ https://issues.apache.org/jira/browse/MESOS-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699548#comment-14699548 ] gyliu commented on MESOS-3037: -- QUIESCE call https://reviews.apache.org/r/37532/ Add a QUIESCE call to the scheduler --- Key: MESOS-3037 URL: https://issues.apache.org/jira/browse/MESOS-3037 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: gyliu The SUPPRESS call is the complement of the current REVIVE call, i.e., it informs Mesos to stop sending offers to the framework. For the scheduler driver to send only Call messages (MESOS-2913), DeactivateFrameworkMessage needs to be converted to Call(s). We can implement this by having the driver send a SUPPRESS call followed by a DECLINE call for outstanding offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
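The SUPPRESS-then-DECLINE sequence described above can be sketched as follows (the helper and its string encoding of calls are hypothetical; only the call names SUPPRESS and DECLINE come from the ticket):

```cpp
#include <string>
#include <vector>

// Sketch: converting DeactivateFrameworkMessage into Calls means sending
// one SUPPRESS call followed by a DECLINE for every outstanding offer.
// Calls are represented as strings here purely for illustration.
std::vector<std::string> deactivateCalls(
    const std::vector<std::string>& outstandingOffers) {
  std::vector<std::string> calls;
  calls.push_back("SUPPRESS");  // Stop receiving new offers.
  for (size_t i = 0; i < outstandingOffers.size(); i++) {
    calls.push_back("DECLINE " + outstandingOffers[i]);  // Return the offer.
  }
  return calls;
}
```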
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:23 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numeric components -- which clearly not everyone follows (as in the FC case):
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2986) Docker version output is not compatible with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698534#comment-14698534 ] Steve Hoffman edited comment on MESOS-2986 at 8/17/15 1:22 PM: --- Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point in time do we still need a check here? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
was (Author: hoffman60613): Yeah, the 0.22.1 version of this code, while ugly, just checked the major version number rather than creating a Version class that assumes there are exactly three numbers -- which clearly not everyone follows.
{code}
foreach (string line, strings::split(output.get(), "\n")) {
  line = strings::trim(line);
  if (strings::startsWith(line, "Client version: ")) {
    line = line.substr(strlen("Client version: "));
    vector<string> version = strings::split(line, ".");
    if (version.size() < 1) {
      return Error("Failed to parse Docker version '" + line + "'");
    }
    Try<int> major = numify<int>(version[0]);
    if (major.isError()) {
      return Error("Failed to parse Docker major version '" + version[0] + "'");
    } else if (major.get() < 1) {
      break;
    }
    return new Docker(path);
  }
}
{code}
At this point do we still need a check? Would anybody be using pre-1.0 Docker with Mesos? You could just dump the check outright...
Docker version output is not compatible with Mesos -- Key: MESOS-2986 URL: https://issues.apache.org/jira/browse/MESOS-2986 Project: Mesos Issue Type: Bug Components: docker Reporter: Isabel Jimenez Assignee: Isabel Jimenez Labels: mesosphere Fix For: 0.23.0 We currently use {{docker version}} to get the Docker version; in the Docker master branch, and soon in Docker 1.8 [1], the output of this command changes. The solution for now will be to use the unchanged {{docker --version}} output; in the long term we should consider no longer using the CLI and using the API instead. [1] https://github.com/docker/docker/pull/14047 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699560#comment-14699560 ] Klaus Ma commented on MESOS-3070: - [~vinodkone], the draft code diff for option #4 was posted at https://reviews.apache.org/r/37531/ to show the overall idea. The code is not complete: there is no diff for the GUI (showing TaskTag in the GUI), no unit test case yet (I will update it later), and the other unit tests on the task_id check have not been updated. Also, adding a uid instead of a TaskTag may be better: users would not need any behavior changes, and no unit tests on the task_id check would break. If you have any comments, please let me know. Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under the following series of events, this can happen and keep crashing the master:
1) frameworkA launches task 'task_id_1' on slaveA
2) master fails over
3) slaveA has not re-registered yet
4) frameworkA re-registers and launches task 'task_id_1' on slaveB
5) slaveA re-registers and adds task 'task_id_1' to frameworkA
6) CHECK failure in addTask
{noformat}
I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname)
...
...
F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
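The failure mode above can be sketched in miniature (this is an illustrative sketch, not the Mesos master source; the struct and map are invented): the crash happens because addTask CHECK-fails on a duplicate task id, whereas a master that detects the duplicate can reject or reconcile the re-registered task instead of aborting the whole process.

```cpp
#include <map>
#include <string>

// Sketch only: a per-framework task table where addTask reports a
// duplicate task id instead of tripping a CHECK and crashing the master.
struct FrameworkTasks {
  std::map<std::string, std::string> tasks;  // task_id -> slave_id

  // Returns false on a duplicate task id; the caller can then reject
  // or reconcile the task reported by the re-registering slave.
  bool addTask(const std::string& taskId, const std::string& slaveId) {
    if (tasks.count(taskId) > 0) {
      return false;  // Same id reported again, e.g. after master failover.
    }
    tasks[taskId] = slaveId;
    return true;
  }
};
```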
[jira] [Updated] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-3273: -- Target Version/s: (was: 0.24.0) Neither benh nor I was able to repro this tonight after running 1K iterations each. It seems like a very rare deadlock. I'm removing this as a blocker for the 0.24.0 release but will keep the ticket open. EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! 
Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 
--credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status
[jira] [Created] (MESOS-3286) Revocable metrics information are missed for slave node
Yong Qiao Wang created MESOS-3286: - Summary: Revocable metrics information are missed for slave node Key: MESOS-3286 URL: https://issues.apache.org/jira/browse/MESOS-3286 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor In bug 3278, the revocable metrics information for the master node was added, but that information is also missing for the slave node in the monitoring doc; this new patch fixes it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3280) Master fails to access replicated log after network partition
Bernd Mathiske created MESOS-3280: - Summary: Master fails to access replicated log after network partition Key: MESOS-3280 URL: https://issues.apache.org/jira/browse/MESOS-3280 Project: Mesos Issue Type: Bug Components: master Reporter: Bernd Mathiske In a 5-node cluster with 3 masters and 2 slaves, and ZK on each node, when a network partition is forced, all the masters apparently lose access to their replicated log. The leading master halts, for unknown reasons, presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could have to do with the ZK setup, but it might also be a Mesos bug. This was observed in a Chronos test drive scenario described in detail here: https://github.com/mesos/chronos/issues/511 With setup instructions here: https://github.com/mesos/chronos/issues/508 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699881#comment-14699881 ] Jie Yu commented on MESOS-3050: --- Marco, can you re-run the failing tests using --verbose and paste the results here? Thanks Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... {code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: 
os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... {code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory 
‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699929#comment-14699929 ] Marco Massenzio commented on MESOS-3050: hrumpf... great investigation [~jieyu]! Is there an easy fix, or does this require 'introspecting' the system at runtime? Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy,
[jira] [Updated] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1554: --- Assignee: Michael Park (was: Marco Massenzio) Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Assignee: Michael Park Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio updated MESOS-1554: --- Labels: mesosphere twitter (was: twitter) Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699895#comment-14699895 ] Jie Yu commented on MESOS-3050: --- Looking at the logs of those filesystem isolator tests, the 'exec' fails after pivot_root. Since we're exec-ing a '/bin/sh' binary, one explanation might be that the binary (or some dependency of it) is not present in the test root filesystem. Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of:
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699911#comment-14699911 ] Jie Yu commented on MESOS-3050: --- OK, I think I know the problem. In CentOS 7.1, 'sh' is under '/usr/bin/sh', while on CentOS 6 (the system I've been using), 'sh' is under '/bin/sh'. Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + 
UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/name=systemd/user.slice/user-2004.slice/session-3865.scope/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ +
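Jie Yu's diagnosis above, that 'sh' lives at '/usr/bin/sh' on CentOS 7.1 but '/bin/sh' on CentOS 6, can be verified with a small standalone probe. This is an illustrative sketch only, not Mesos code; the class and method names are made up:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShLocator {
    // Probe the candidate shell locations in order: CentOS 6 ships a real
    // /bin/sh, while CentOS 7 merges /bin into /usr/bin (usually leaving
    // /bin as a symlink), so the file may only physically live under /usr/bin.
    static String resolveSh() {
        for (String candidate : new String[] {"/bin/sh", "/usr/bin/sh"}) {
            if (Files.exists(Paths.get(candidate))) {
                return candidate;  // first candidate present on this host
            }
        }
        return "";  // neither layout found
    }

    public static void main(String[] args) {
        System.out.println("sh resolved to: " + resolveSh());
    }
}
```

A test rootfs built by copying only '/bin/sh' would therefore miss the binary on CentOS 7 style layouts, which matches the observed exec failure after pivot_root.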
[jira] [Updated] (MESOS-3280) Master fails to access replicated log after network partition
[ https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-3280: -- Affects Version/s: 0.23.0 Environment: Zookeeper version 3.4.5--1 Master fails to access replicated log after network partition - Key: MESOS-3280 URL: https://issues.apache.org/jira/browse/MESOS-3280 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.23.0 Environment: Zookeeper version 3.4.5--1 Reporter: Bernd Mathiske Labels: mesosphere In a 5-node cluster with 3 masters and 2 slaves, and ZK on each node, when a network partition is forced, all the masters apparently lose access to their replicated log. The leading master halts, for unknown reasons, presumably related to replicated log access. The other masters fail to recover from the replicated log, also for unknown reasons. This could have to do with the ZK setup, but it might also be a Mesos bug. This was observed in a Chronos test drive scenario described in detail here: https://github.com/mesos/chronos/issues/511 With setup instructions here: https://github.com/mesos/chronos/issues/508 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2769) Metric for cpu scheduling latency from all components
[ https://issues.apache.org/jira/browse/MESOS-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cong Wang updated MESOS-2769: - Sprint: Twitter Q2 Sprint 3, Twitter Mesos Q3 Sprint 3 (was: Twitter Q2 Sprint 3) Metric for cpu scheduling latency from all components - Key: MESOS-2769 URL: https://issues.apache.org/jira/browse/MESOS-2769 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.1 Reporter: Ian Downes Assignee: Cong Wang Labels: twitter The metric will provide statistics on the scheduling latency for processes/threads in a container, i.e., statistics on the delay before application code can run. This will be the aggregate effect of the normal scheduling period, contention from other threads/processes, both in the container and on the system, and any effects from the CFS bandwidth control (if enabled) or other CPU isolation strategies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
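As a rough illustration of what the MESOS-2769 metric captures (this is a standalone sketch, not the proposed Mesos implementation), the scheduling delay experienced by a thread can be approximated by measuring how far a short sleep overshoots its requested duration, since the overshoot includes the time spent waiting to be scheduled again:

```java
public class SchedLatencyProbe {
    public static void main(String[] args) throws InterruptedException {
        final long requestedMs = 10;
        long worstOvershootNs = 0;
        // Sample repeatedly: each iteration asks for a 10ms sleep and
        // records how long past the request the thread actually resumed.
        for (int i = 0; i < 20; i++) {
            long start = System.nanoTime();
            Thread.sleep(requestedMs);
            long overshootNs =
                (System.nanoTime() - start) - requestedMs * 1_000_000L;
            worstOvershootNs = Math.max(worstOvershootNs, overshootNs);
        }
        System.out.println("worst overshoot (us): " + worstOvershootNs / 1000);
    }
}
```

Under CFS bandwidth throttling or heavy contention, the worst-case overshoot grows well beyond the timer granularity, which is exactly the effect the proposed per-container statistics would expose.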
[jira] [Assigned] (MESOS-1554) Persistent resources support for storage-like services
[ https://issues.apache.org/jira/browse/MESOS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Massenzio reassigned MESOS-1554: -- Assignee: Marco Massenzio Persistent resources support for storage-like services -- Key: MESOS-1554 URL: https://issues.apache.org/jira/browse/MESOS-1554 Project: Mesos Issue Type: Epic Components: general, hadoop Reporter: Nikita Vetoshkin Assignee: Marco Massenzio Priority: Critical Labels: mesosphere, twitter This question came up in [dev mailing list|http://mail-archives.apache.org/mod_mbox/mesos-dev/201406.mbox/%3CCAK8jAgNDs9Fe011Sq1jeNr0h%3DE-tDD9rak6hAsap3PqHx1y%3DKQ%40mail.gmail.com%3E]. It seems reasonable for storage-like services (e.g. HDFS or Cassandra) to use Mesos to manage its instances. But right now if we'd like to restart an instance (e.g. to spin up a new version), the previous instance's sandbox filesystem resources will be recycled by the slave's garbage collector. At the moment filesystem resources can be managed out of band - i.e. instances can save their data in some database-specific place that various instances can share (e.g. {{/var/lib/cassandra}}). [~benjaminhindman] suggested an idea in the mailing list (though it still needs some fleshing out): {quote} The idea originally came about because, even today, if we allocate some file system space to a task/executor, and then that task/executor terminates, we haven't officially freed those file system resources until after we garbage collect the task/executor sandbox! (We keep the sandbox around so a user/operator can get the stdout/stderr or anything else left around from their task/executor.) To solve this problem we wanted to be able to let a task/executor terminate but not *give up* all of its resources, hence: persistent resources. Pushing this concept even further you could imagine always reallocating resources to a framework that had already been allocated those resources for a previous task/executor. 
Looked at from another perspective, these are late-binding, or lazy, resource reservations. At one point in time we had considered just doing 'right-of-first-refusal' for allocations after a task/executor terminates. But this is really insufficient for supporting storage-like frameworks well (and likely even harder to reliably implement than 'persistent resources' IMHO). There are a ton of things that need to get worked out in this model, including (but not limited to): how should a file system (or disk) be exposed in order to be made persistent? How should persistent resources be returned to a master? How many persistent resources can a framework get allocated? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3264) JVM can exit prematurely following framework teardown
[ https://issues.apache.org/jira/browse/MESOS-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699967#comment-14699967 ] Greg Mann commented on MESOS-3264: -- Thanks for having a look at this, [~haosd...@gmail.com]! I had explored the option of using similar shutdown hooks previously, and unfortunately it doesn't do the trick, I assume because the order of the shutdown hooks is unspecified? And since they are run concurrently, perhaps the JVM will continue on to its post-shutdownHook GC while the hooks are still executing. In any case, the tests continue to fail with such shutdown hooks placed in the constructors of the SchedulerDriver and/or the ExecutorDriver. If we define the {{close()}} method as {{public}} and call it explicitly in the body of {{main()}}, the tests do pass reliably. However, there seems to be some conventional wisdom saying that defining/calling a method that calls {{finalize()}} in that way is A Bad Thing. Any thoughts? If we decide that it is acceptable to define a public {{close()}} method that calls {{finalize()}} for the SchedulerDriver, similar to the one in your patch, and call it explicitly just before we call {{System.exit()}}, then that would solve this issue. JVM can exit prematurely following framework teardown - Key: MESOS-3264 URL: https://issues.apache.org/jira/browse/MESOS-3264 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0, 0.24.0 Reporter: Greg Mann Priority: Minor Labels: java, tech-debt In Java frameworks, it is possible for the JVM to begin exiting the program - via {{System.exit()}}, for example - while teardown of native objects such as the SchedulerDriver and associated Executors is still in progress. 
{{SchedulerDriver::stop()}} will return after it has sent messages to other actors to begin their teardown, meanwhile the JVM is free to terminate the program and thus begin executing native object destructors while those objects are still in use, potentially leading to a segfault. This has manifested itself in flaky tests from the ExamplesTest suite (see MESOS-830 and MESOS-1013), as mutexes from glog are destroyed while the framework is still shutting down and attempting to log. Ideally, a mechanism would exist to block the Java code until a confirmation that framework teardown is complete has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
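The blocking mechanism the issue asks for could look roughly like the following sketch. Everything here is hypothetical: the real driver exposes no teardown-complete callback today, and the names {{teardownDone}} and {{onTeardownComplete}} are invented for illustration:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class TeardownSync {
    // Latch that main() blocks on until teardown reports completion,
    // instead of racing System.exit() against native destructors.
    static final CountDownLatch teardownDone = new CountDownLatch(1);

    // Hypothetical callback the native layer would invoke once all
    // actors have finished shutting down.
    static void onTeardownComplete() {
        teardownDone.countDown();
    }

    public static void main(String[] args) throws InterruptedException {
        new Thread(() -> {
            // Stand-in for driver.stop() finishing asynchronously.
            onTeardownComplete();
        }).start();

        // Wait (bounded) for teardown before letting the JVM exit.
        boolean completed = teardownDone.await(5, TimeUnit.SECONDS);
        System.out.println("teardown complete: " + completed);
    }
}
```

With such a confirmation in place, {{System.exit()}} would only run after the native objects are no longer in use.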
[jira] [Created] (MESOS-3281) Create a user doc for Scheduler HTTP API
Vinod Kone created MESOS-3281: - Summary: Create a user doc for Scheduler HTTP API Key: MESOS-3281 URL: https://issues.apache.org/jira/browse/MESOS-3281 Project: Mesos Issue Type: Documentation Reporter: Vinod Kone Assignee: Vinod Kone We need to convert the design doc into a user doc that we can add to our docs folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3264) JVM can exit prematurely following framework teardown
[ https://issues.apache.org/jira/browse/MESOS-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1468#comment-1468 ] haosdent commented on MESOS-3264: - {code} I had explored the option of using similar shutdown hooks previously, and unfortunately it doesn't do the trick, I assume because the order of the shutdown hooks is unspecified? {code} Very interesting. Could you try this code snippet in your JVM? https://ideone.com/48o7SG The output should be {code} Enter main Before System.exit(0); Call finalize() Call finalize() {code} Note that I wrap finalize() in close() and call close() from the ShutdownHook; calling finalize() directly in the ShutdownHook thread does not work. JVM can exit prematurely following framework teardown - Key: MESOS-3264 URL: https://issues.apache.org/jira/browse/MESOS-3264 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.23.0, 0.24.0 Reporter: Greg Mann Priority: Minor Labels: java, tech-debt In Java frameworks, it is possible for the JVM to begin exiting the program - via {{System.exit()}}, for example - while teardown of native objects such as the SchedulerDriver and associated Executors is still in progress. {{SchedulerDriver::stop()}} will return after it has sent messages to other actors to begin their teardown, meanwhile the JVM is free to terminate the program and thus begin executing native object destructors while those objects are still in use, potentially leading to a segfault. This has manifested itself in flaky tests from the ExamplesTest suite (see MESOS-830 and MESOS-1013), as mutexes from glog are destroyed while the framework is still shutting down and attempting to log. Ideally, a mechanism would exist to block the Java code until a confirmation that framework teardown is complete has been received. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
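A condensed version of the pattern haosdent describes (class and method names are illustrative, not the actual driver code): the shutdown hook invokes a public close() that forwards to finalize(), so teardown runs deterministically before the JVM exits rather than relying on the garbage collector:

```java
public class Driver {
    // Explicit teardown entry point: forwards to finalize() so callers
    // (including the shutdown hook) can trigger cleanup deterministically.
    public void close() {
        finalize();
    }

    @Override
    protected void finalize() {
        // Stand-in for releasing native resources in the real driver.
        System.out.println("Call finalize()");
    }

    public static void main(String[] args) {
        final Driver driver = new Driver();
        // Register close() as a shutdown hook; it runs when exit() is called.
        Runtime.getRuntime().addShutdownHook(new Thread(driver::close));
        System.out.println("Before System.exit(0)");
        System.exit(0);  // hook fires, close() -> finalize() executes
    }
}
```

This mirrors the ideone snippet's behavior for a single object; the two "Call finalize()" lines in the expected output above come from the snippet registering two such objects.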
[jira] [Updated] (MESOS-3276) Add Scrapinghub to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated MESOS-3276: - Description: Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) was: Hello! At [Scrapinghub|https://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) Add Scrapinghub to the Powered By Mesos page Key: MESOS-3276 URL: https://issues.apache.org/jira/browse/MESOS-3276 Project: Mesos Issue Type: Wish Components: documentation Reporter: Shuai Lin Priority: Trivial Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! 
Shuai (+ Scrapinghub's Scrapy Cloud team) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3276) Add Scrapinghub to the Powered By Mesos page
[ https://issues.apache.org/jira/browse/MESOS-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699121#comment-14699121 ] Shuai Lin commented on MESOS-3276: -- It should be http, I made a typo there. The RB patch should be correct. Thanks for catching this! Add Scrapinghub to the Powered By Mesos page Key: MESOS-3276 URL: https://issues.apache.org/jira/browse/MESOS-3276 Project: Mesos Issue Type: Wish Components: documentation Reporter: Shuai Lin Priority: Trivial Hello! At [Scrapinghub|http://scrapinghub.com/] we have been using Mesos to run our core services in production for one year. Mesos is awesome and we really love it! We'd like to add our organization to the Powered By Mesos page. I've created a RB patch here: https://reviews.apache.org/r/37513/ Scrapinghub is the leading platform for deploying, running and scaling web crawlers. We've redesigned a large part of our PaaS based on Mesos, and we plan to open source part of it in the future! Thanks! Shuai (+ Scrapinghub's Scrapy Cloud team) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3278) Add the revocable metrics information in monitoring doc
Yong Qiao Wang created MESOS-3278: - Summary: Add the revocable metrics information in monitoring doc Key: MESOS-3278 URL: https://issues.apache.org/jira/browse/MESOS-3278 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3278) Add the revocable metrics information in monitoring doc
[ https://issues.apache.org/jira/browse/MESOS-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699130#comment-14699130 ] Yong Qiao Wang commented on MESOS-3278: --- The related review request is: https://reviews.apache.org/r/37518/ Add the revocable metrics information in monitoring doc Key: MESOS-3278 URL: https://issues.apache.org/jira/browse/MESOS-3278 Project: Mesos Issue Type: Documentation Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-3273: - Assignee: Vinod Kone EventCall Test Framework is flaky - Key: MESOS-3273 URL: https://issues.apache.org/jira/browse/MESOS-3273 Project: Mesos Issue Type: Bug Affects Versions: 0.24.0 Environment: https://builds.apache.org/job/Mesos/705/COMPILER=clang,CONFIGURATION=--verbose,OS=ubuntu:14.04,label_exp=docker%7C%7CHadoop/consoleFull Reporter: Vinod Kone Assignee: Vinod Kone Observed this on ASF CI. h/t [~haosd...@gmail.com] Looks like the HTTP scheduler never sent a SUBSCRIBE request to the master. {code} [ RUN ] ExamplesTest.EventCallFramework Using temporary directory '/tmp/ExamplesTest_EventCallFramework_k4vXkx' I0813 19:55:15.643579 26085 exec.cpp:443] Ignoring exited event because the driver is aborted! Shutting down Sending SIGTERM to process tree at pid 26061 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26062 Shutting down Killing the following process trees: [ ] Sending SIGTERM to process tree at pid 26063 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26098 Killing the following process trees: [ ] Shutting down Sending SIGTERM to process tree at pid 26099 Killing the following process trees: [ ] WARNING: Logging before InitGoogleLogging() is written to STDERR I0813 19:55:17.161726 26100 process.cpp:1012] libprocess is initialized on 172.17.2.10:60249 for 16 cpus I0813 19:55:17.161888 26100 logging.cpp:177] Logging to STDERR I0813 19:55:17.163625 26100 scheduler.cpp:157] Version: 0.24.0 I0813 19:55:17.175302 26100 leveldb.cpp:176] Opened db in 3.167446ms I0813 19:55:17.176393 26100 leveldb.cpp:183] Compacted db in 1.047996ms I0813 19:55:17.176496 26100 leveldb.cpp:198] Created db iterator in 77155ns I0813 19:55:17.176518 26100 leveldb.cpp:204] Seeked to beginning of db in 8429ns I0813 19:55:17.176527 26100 
leveldb.cpp:273] Iterated through 0 keys in the db in 4219ns I0813 19:55:17.176708 26100 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0813 19:55:17.178951 26136 recover.cpp:449] Starting replica recovery I0813 19:55:17.179934 26136 recover.cpp:475] Replica is in EMPTY status I0813 19:55:17.181970 26126 master.cpp:378] Master 20150813-195517-167907756-60249-26100 (297daca2d01a) started on 172.17.2.10:60249 I0813 19:55:17.182317 26126 master.cpp:380] Flags at startup: --acls=permissive: false register_frameworks { principals { type: SOME values: test-principal } roles { type: SOME values: * } } run_tasks { principals { type: SOME values: test-principal } users { type: SOME values: mesos } } --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=false --authenticate_slaves=false --authenticators=crammd5 --credentials=/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --max_slave_ping_timeouts=5 --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=5secs --registry_strict=false --root_submissions=true --slave_ping_timeout=15secs --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.24.0/src/webui --work_dir=/tmp/mesos-II8Gua --zk_session_timeout=10secs I0813 19:55:17.183475 26126 master.cpp:427] Master allowing unauthenticated frameworks to register I0813 19:55:17.183536 26126 master.cpp:432] Master allowing unauthenticated slaves to register I0813 19:55:17.183615 26126 credentials.hpp:37] Loading credentials for authentication from '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' W0813 19:55:17.183859 26126 credentials.hpp:52] Permissions on credentials file '/tmp/ExamplesTest_EventCallFramework_k4vXkx/credentials' are too open. 
It is recommended that your credentials file is NOT accessible by others. I0813 19:55:17.183969 26123 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0813 19:55:17.184306 26126 master.cpp:469] Using default 'crammd5' authenticator I0813 19:55:17.184661 26126 authenticator.cpp:512] Initializing server SASL I0813 19:55:17.185104 26138 recover.cpp:195] Received a recover response from a replica in EMPTY status I0813 19:55:17.185972 26100 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix I0813 19:55:17.186058 26135 recover.cpp:566] Updating replica status to STARTING I0813
[jira] [Commented] (MESOS-3273) EventCall Test Framework is flaky
[ https://issues.apache.org/jira/browse/MESOS-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700107#comment-14700107 ] Vinod Kone commented on MESOS-3273: --- While trying to repro this, observed another issue where the test hangs during termination. {code} I0817 19:34:33.166100 60834 master.cpp:3998] Received update of slave 20150817-193432-1828659978-51071-60794-S2 at slave(1)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) with total oversubscribed resources I0817 19:34:33.166316 60834 hierarchical.hpp:600] Slave 20150817-193432-1828659978-51071-60794-S2 (smfd-atr-11-sr1.devel.twitter.com) updated with oversubscribed resources (total: cpus(*):2; mem(*):10240; disk(*):454767; ports(*):[31000-32000], allocated: cpus(*):2; mem(*):10240; disk(*):454767; ports(*):[31000-32000]) Received an UPDATE event Task 4 is in state TASK_FINISHED I0817 19:34:33.167793 60816 master.cpp:860] Master terminating I0817 19:34:33.168092 60836 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S2 I0817 19:34:33.168654 60816 master.cpp:5673] Removing executor 'default' with resources of framework 20150817-193432-1828659978-51071-60794- on slave 20150817-193432-1828659978-51071-60794-S1 at slave(2)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.168725 60819 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S1 I0817 19:34:33.169075 60816 master.cpp:5644] Removing task 4 with resources cpus(*):1; mem(*):128 of framework 20150817-193432-1828659978-51071-60794- on slave 20150817-193432-1828659978-51071-60794-S0 at slave(3)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.169153 60818 hierarchical.hpp:571] Removed slave 20150817-193432-1828659978-51071-60794-S0 I0817 19:34:33.169255 60816 master.cpp:5673] Removing executor 'default' with resources of framework 20150817-193432-1828659978-51071-60794- on slave 
20150817-193432-1828659978-51071-60794-S0 at slave(3)@10.35.255.108:51071 (smfd-atr-11-sr1.devel.twitter.com) I0817 19:34:33.170186 60818 hierarchical.hpp:428] Removed framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.170919 60827 slave.cpp:3143] master@10.35.255.108:51071 exited I0817 19:34:33.170903 60817 slave.cpp:3143] master@10.35.255.108:51071 exited W0817 19:34:33.170959 60827 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected W0817 19:34:33.170976 60817 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected I0817 19:34:33.171083 60821 slave.cpp:3143] master@10.35.255.108:51071 exited W0817 19:34:33.171169 60821 slave.cpp:3146] Master disconnected! Waiting for a new master to be elected I0817 19:34:33.172170 60817 slave.cpp:564] Slave terminating I0817 19:34:33.174253 60794 slave.cpp:564] Slave terminating I0817 19:34:33.174424 60794 slave.cpp:1959] Asked to shut down framework 20150817-193432-1828659978-51071-60794- by @0.0.0.0:0 I0817 19:34:33.174473 60794 slave.cpp:1984] Shutting down framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.174665 60794 slave.cpp:3710] Shutting down executor 'default' of framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.175500 60926 exec.cpp:380] Executor asked to shutdown I0817 19:34:33.176652 60794 slave.cpp:564] Slave terminating I0817 19:34:33.176762 60794 slave.cpp:1959] Asked to shut down framework 20150817-193432-1828659978-51071-60794- by @0.0.0.0:0 I0817 19:34:33.176909 60794 slave.cpp:1984] Shutting down framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.176954 60794 slave.cpp:3710] Shutting down executor 'default' of framework 20150817-193432-1828659978-51071-60794- I0817 19:34:33.177781 60879 exec.cpp:380] Executor asked to shutdown I0817 19:34:33.178567 60822 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 13.870729ms I0817 19:34:33.178649 60822 replica.cpp:679] Persisted action at 8 I0817 
19:34:33.179919 60815 replica.cpp:658] Replica received learned notice for position 8 I0817 19:34:33.195266 60815 leveldb.cpp:343] Persisting action (18 bytes) to leveldb took 15.299248ms I0817 19:34:33.195405 60815 leveldb.cpp:401] Deleting ~2 keys from leveldb took 29964ns I0817 19:34:33.195428 60815 replica.cpp:679] Persisted action at 8 I0817 19:34:33.195456 60815 replica.cpp:664] Replica learned TRUNCATE action at position 8 {code} gdb stack trace points to what looks like a deadlock {code} #0 0x7fd5df472019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7fd5e332293c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /home/vinod/mesos/build/src/.libs/libmesos-0.24.0.so #2 0x7fd5e248f3cc in synchronized_wait<std::condition_variable, std::mutex> () from /home/vinod/mesos/build/src/.libs/libmesos-0.24.0.so #3 0x7fd5e3191d35 in arrive () from
[jira] [Updated] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-2466: - Sprint: Mesosphere Sprint 16 Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-2466: - Sprint: (was: Mesosphere Sprint 16) Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
[ https://issues.apache.org/jira/browse/MESOS-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-2466: Assignee: Greg Mann Write documentation for all the LIBPROCESS_* environment variables. --- Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas Assignee: Greg Mann Labels: documentation, mesosphere libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should be. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching the code, these are the variables that need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
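While the documentation is being written, a hedged usage sketch for the four variables listed in the ticket: the values, the duration format for the statistics window, and the claim about what the profiler flag gates are illustrative assumptions, not documented behaviour.

```shell
# Configure libprocess via environment variables before starting a
# libprocess-based binary such as mesos-master. Values are illustrative.
export LIBPROCESS_IP=192.168.1.10           # interface to bind the socket to
export LIBPROCESS_PORT=5051                 # port to listen on
export LIBPROCESS_ENABLE_PROFILER=1         # assumed to gate the profiler endpoint
export LIBPROCESS_STATISTICS_WINDOW=30secs  # retention window (format assumed)

# mesos-master --work_dir=/var/lib/mesos    # the binary would pick these up
echo "libprocess would bind to ${LIBPROCESS_IP}:${LIBPROCESS_PORT}"
```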
[jira] [Assigned] (MESOS-3158) Libprocess Process: Join runqueue workers during finalization
[ https://issues.apache.org/jira/browse/MESOS-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann reassigned MESOS-3158: Assignee: Greg Mann Libprocess Process: Join runqueue workers during finalization - Key: MESOS-3158 URL: https://issues.apache.org/jira/browse/MESOS-3158 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Joris Van Remoortere Assignee: Greg Mann Labels: beginner, libprocess, mesosphere, newbie The lack of synchronization between ProcessManager destruction and the thread pool threads running the queued processes means that the shared state that is part of the ProcessManager gets destroyed prematurely. Synchronizing the ProcessManager destructor with draining the work queues and stopping the workers will allow us to not require leaking the shared state to avoid use beyond destruction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3283) Improve batch allocations performance especially with large number of slaves and frameworks.
Mandeep Chadha created MESOS-3283: - Summary: Improve batch allocations performance especially with large number of slaves and frameworks. Key: MESOS-3283 URL: https://issues.apache.org/jira/browse/MESOS-3283 Project: Mesos Issue Type: Improvement Components: allocation Affects Versions: 0.23.0 Reporter: Mandeep Chadha Improve batch allocations performance especially with large number of slaves and frameworks. e.g. these are the allocation timings for 10K slaves and varying number of frameworks. Using 1 slaves and 1 frameworks Added 1 slaves in 14.50836112secs Updated 1 slaves in 18.665093703secs [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12 (34983 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 Using 1 slaves and 50 frameworks Added 1 slaves in 51.534229549secs Updated 1 slaves in 57.131554303secs [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13 (110449 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 Using 1 slaves and 100 frameworks Added 1 slaves in 1.5891310434mins Updated 1 slaves in 1.80562078148333mins [ OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14 (205467 ms) [ RUN ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15 Using 1 slaves and 200 frameworks Added 1 slaves in 3.0750647275mins Updated 1 slaves in 3.85846762096667mins -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3050) Failing ROOT_ tests on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700261#comment-14700261 ] Jie Yu commented on MESOS-3050: --- Pushed a fix. [~marco-mesos], let me know if the tests are still failing. commit 3ae937fb1c41bf858d7e37e5679da646fe93734b Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:53:08 2015 -0700 Included /usr/bin/sh in the test root filesystem. Review: https://reviews.apache.org/r/37555 commit bd4332c68aea3aaf8eac3ef3a15b72541084e0c4 Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:52 2015 -0700 Used execlp instead of execl to exec processes in Mesos. Review: https://reviews.apache.org/r/37547 commit d7d3b52122613f536bcffe41a5f26132e99728af Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:41 2015 -0700 Used execlp instead of execl to exec processes in libprocess. Review: https://reviews.apache.org/r/37546 commit e70493a8acd3c6848bb9dbe7f7a72e694fe6cf07 Author: Jie Yu yujie@gmail.com Date: Mon Aug 17 12:47:31 2015 -0700 Used execlp instead of execl to exec processes in stout. Review: https://reviews.apache.org/r/37545 Failing ROOT_ tests on CentOS 7.1 - Key: MESOS-3050 URL: https://issues.apache.org/jira/browse/MESOS-3050 Project: Mesos Issue Type: Bug Components: containerization, docker, test Affects Versions: 0.23.0 Environment: CentOS Linux release 7.1.1503 0.24.0 Reporter: Adam B Assignee: Timothy Chen Priority: Blocker Labels: mesosphere Attachments: ROOT_tests.log Running `sudo make check` on CentOS 7.1 for Mesos 0.23.0-rc3 causes several several failures/errors: {code} [ RUN ] DockerTest.ROOT_DOCKER_CheckPortResource ../../src/tests/docker_tests.cpp:303: Failure (run).failure(): Container exited on error: exited with status 1 [ FAILED ] DockerTest.ROOT_DOCKER_CheckPortResource (709 ms) {code} ... 
{code} [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample ../../src/tests/isolator_tests.cpp:837: Failure isolator: Failed to create PerfEvent isolator, invalid events: { cycles, task-clock } [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (9 ms) [--] 1 test from PerfEventIsolatorTest (9 ms total) [--] 2 tests from SharedFilesystemIsolatorTest [ RUN ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_RelativeVolume_4yTEAC/var/tmp /var/tmp + touch /var/tmp/492407e1-5dec-4b34-8f2f-130430f41aac ../../src/tests/isolator_tests.cpp:1001: Failure Value of: os::exists(file) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_RelativeVolume (92 ms) [ RUN ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume + mount -n --bind /tmp/SharedFilesystemIsolatorTest_ROOT_AbsoluteVolume_OwYrXK /var/tmp + touch /var/tmp/7de712aa-52eb-4976-b0f9-32b6a006418d ../../src/tests/isolator_tests.cpp:1086: Failure Value of: os::exists(path::join(containerPath, filename)) Actual: true Expected: false [ FAILED ] SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume (100 ms) {code} ... 
{code} [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess userdel: user 'mesos.test.unprivileged.user' does not exist [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup -bash: /sys/fs/cgroup/blkio/user.slice/cgroup.procs: Permission denied mkdir: cannot create directory ‘/sys/fs/cgroup/blkio/user.slice/user’: Permission denied ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/blkio/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'echo $$ + path::join(flags.cgroups_hierarchy, userCgroup, cgroup.procs) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/memory/mesos/bbf8c8f0-3d67-40df-a269-b3dc6a9597aa/cgroup.procs: Permission denied -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/cgroup.procs: No such file or directory mkdir: cannot create directory ‘/sys/fs/cgroup/cpuacct,cpu/user.slice/user’: No such file or directory ../../src/tests/isolator_tests.cpp:1274: Failure Value of: os::system( su - + UNPRIVILEGED_USERNAME + -c 'mkdir + path::join(flags.cgroups_hierarchy, userCgroup) + ') Actual: 256 Expected: 0 -bash: /sys/fs/cgroup/cpuacct,cpu/user.slice/user/cgroup.procs: No such file or directory ../../src/tests/isolator_tests.cpp:1283: Failure Value of: os::system( su - +
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699096#comment-14699096 ] Klaus Ma commented on MESOS-3070: - Regarding #4, it also checks for a duplicated TaskTag; for killTask, all tasks with the same tag will be killed, and the user can also use the generated ID to kill a task. Regarding #3, it's similar to #4: #4 uses a UUID as the unique TaskID, while #3 uses slaveId + taskId + frameworkId as the unique TaskID. Personally, I prefer #4, and documentation on the new behaviour is necessary. Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under a certain series of events, this can happen and keep crashing the master. 1) frameworkA launches task 'task_id_1' on slaveA 2) master fails over 3) slaveA has not re-registered yet 4) frameworkA re-registers and launches task 'task_id_1' on slaveB 5) slaveA re-registers and adds task 'task_id_1' to frameworkA 6) CHECK failure in addTask {noformat} I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname) ... ... F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang edited comment on MESOS-3245 at 8/17/15 6:50 AM: RB link: https://reviews.apache.org/r/37289/ was (Author: qianzhang): RR link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3228) Some spelling error in slave help message
[ https://issues.apache.org/jira/browse/MESOS-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699083#comment-14699083 ] Yong Qiao Wang commented on MESOS-3228: --- The related ReviewBoard URL: https://reviews.apache.org/r/37208/ Some spelling error in slave help message - Key: MESOS-3228 URL: https://issues.apache.org/jira/browse/MESOS-3228 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.24.0 Reporter: Yong Qiao Wang Assignee: Yong Qiao Wang Priority: Minor Fix some spelling errors in the help message of the slave component. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang edited comment on MESOS-3245 at 8/17/15 6:50 AM: RR link: https://reviews.apache.org/r/37289/ was (Author: qianzhang): RB link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3245) The comments of DRFSorter::dirty is not correct
[ https://issues.apache.org/jira/browse/MESOS-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699084#comment-14699084 ] Qian Zhang commented on MESOS-3245: --- RR link: https://reviews.apache.org/r/37289/ The comments of DRFSorter::dirty is not correct --- Key: MESOS-3245 URL: https://issues.apache.org/jira/browse/MESOS-3245 Project: Mesos Issue Type: Bug Components: allocation Reporter: Qian Zhang Assignee: Qian Zhang Priority: Minor The comment is: {code} // If true, start() will recalculate all shares. bool dirty; {code} But there is actually no start() method in class DRFSorter, I think the comment should be: {code} // If true, sort() will recalculate all shares. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3277) Implement basic security isolators such as linux/apparmor or linux/seccomp
Stephan Erb created MESOS-3277: -- Summary: Implement basic security isolators such as linux/apparmor or linux/seccomp Key: MESOS-3277 URL: https://issues.apache.org/jira/browse/MESOS-3277 Project: Mesos Issue Type: Story Components: containerization, isolation Reporter: Stephan Erb As an operator of a Mesos cluster, I would like to gain some control over what is happening inside launched containers. Specifically, I want to make it a little bit more difficult for untrusted code to escape its container confinement (e.g., prevent access to certain kernel features, raw devices, ...) Inspired by [LXC | https://github.com/lxc/lxc], Mesos could offer two new isolators: * *linux/apparmor*: Isolator which applies an AppArmor security profile to containers. A cluster-wide default profile could be similar to the [default shipped by LXC|https://github.com/lxc/lxc/blob/master/config/apparmor/abstractions/container-base]. * *linux/seccomp*: Isolator based on the [seccomp syscall filter|https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt]. Seccomp is a mechanism for minimizing the exposed kernel surface by reducing the set of allowed syscalls. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
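To make the linux/apparmor proposal concrete, a cluster-wide default profile might look like the fragment below. Every rule, and the profile name itself, is an illustrative assumption loosely modelled on LXC's container-base abstraction, not a tested or recommended policy.

```
# Hypothetical AppArmor profile sketch for a linux/apparmor isolator.
profile mesos-container-default flags=(attach_disconnected) {
  network,
  capability,
  file,

  # Deny access to raw devices and sensitive kernel interfaces.
  deny /dev/mem rwklx,
  deny /dev/kmem rwklx,
  deny @{PROC}/sys/kernel/** wklx,
  deny /sys/firmware/** rwklx,
}
```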
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700154#comment-14700154 ] gyliu commented on MESOS-3070: -- I think you should use the v1 API for the unit test now? Master CHECK failure if a framework uses duplicated task id. Key: MESOS-3070 URL: https://issues.apache.org/jira/browse/MESOS-3070 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.22.1 Reporter: Jie Yu Assignee: Klaus Ma We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under a certain series of events, this can happen and keep crashing the master. 1) frameworkA launches task 'task_id_1' on slaveA 2) master fails over 3) slaveA has not re-registered yet 4) frameworkA re-registers and launches task 'task_id_1' on slaveB 5) slaveA re-registers and adds task 'task_id_1' to frameworkA 6) CHECK failure in addTask {noformat} I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname) ... ... F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)