
[jira] [Commented] (MESOS-2484) libprocess Clock messages delivered

2015-03-12 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359108#comment-14359108
 ] 

Dominic Hamon commented on MESOS-2484:
--

see also https://issues.apache.org/jira/browse/MESOS-1456 regarding metrics.

there's been some discussion about enforcing unique PID ids for every process 
but I can't find the relevant JIRA ticket.

> libprocess Clock messages delivered 
> 
>
> Key: MESOS-2484
> URL: https://issues.apache.org/jira/browse/MESOS-2484
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Cody Maloney
>
> Found in / discussed at: https://reviews.apache.org/r/30587/#rc118737-72676
> When a process is terminated, any outstanding delay() calls destined for that 
> process are not cancelled, meaning they arrive whenever the clock happens to 
> get there. With uniquely named processes this isn't an issue, but with names 
> that are reused (e.g. master), it could lead to odd test flakiness, 
> and it means artifacts carry over between tests when they shouldn't.
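A minimal sketch of the behaviour the issue asks for, assuming a simulated clock rather than the real libprocess one: pending timers are dropped when their target terminates, and each incarnation of a reused name (such as `master`) gets a generation number, so a timer aimed at an old incarnation is never delivered to a new one. `TimerQueue` and its methods are hypothetical names for illustration, not the libprocess `Clock` or `delay()` API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical simulated clock for illustration; this is NOT the
// libprocess Clock/delay() API.
class TimerQueue {
 public:
  // Starts a new incarnation of `name` (possibly a reused name such as
  // "master") and returns its generation number.
  uint64_t spawn(const std::string& name) { return ++generation_[name]; }

  // Cancels every timer still pending for `name` and bumps its generation,
  // so a timer scheduled against the dead incarnation cannot reach the
  // next process that reuses the name.
  void terminate(const std::string& name) {
    for (auto it = timers_.begin(); it != timers_.end();) {
      if (it->second.target == name) {
        it = timers_.erase(it);
      } else {
        ++it;
      }
    }
    ++generation_[name];
  }

  // Schedules `message` for the current incarnation of `name` at `fireAt`.
  void schedule(const std::string& name, int64_t fireAt,
                const std::string& message) {
    timers_.emplace(fireAt, Timer{name, generation_[name], message});
  }

  // Advances the clock to `now` and returns "target:message" for every
  // timer that fired and still matches its target's current generation.
  std::vector<std::string> advance(int64_t now) {
    std::vector<std::string> delivered;
    auto end = timers_.upper_bound(now);
    for (auto it = timers_.begin(); it != end; ++it) {
      const Timer& timer = it->second;
      if (timer.generation == generation_[timer.target]) {
        delivered.push_back(timer.target + ":" + timer.message);
      }
    }
    timers_.erase(timers_.begin(), end);
    return delivered;
  }

 private:
  struct Timer {
    std::string target;
    uint64_t generation;
    std::string message;
  };

  std::multimap<int64_t, Timer> timers_;  // Ordered by fire time.
  std::map<std::string, uint64_t> generation_;
};
```

The generation check is a second line of defence: even a timer scheduled after `terminate()` but before the name is reused is discarded instead of reaching the new process.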



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-2216) The "configure" phase breaks with the IBM JVM.

2015-03-12 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359101#comment-14359101
 ] 

Dominic Hamon commented on MESOS-2216:
--

see also https://issues.apache.org/jira/browse/HADOOP-9435

so a change in include path or explicit linking against libdl should help.

> The "configure" phase breaks with the IBM JVM.
> --
>
> Key: MESOS-2216
> URL: https://issues.apache.org/jira/browse/MESOS-2216
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0, 0.20.1
> Environment: Ubuntu / x86_64
>Reporter: Tony Reix
>Priority: Blocker
>
> ./configure does not work with the IBM JVM, since it looks for a directory:
>   /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server        (x86_64)
>   /usr/lib/jvm/ibm-java-ppc64le-71/jre/lib/ppc64le/server     (PPC64 LE)
> that does not exist for the IBM JVM.
> This directory does exist for the Oracle JVM and OpenJDK:
>   /usr/lib/jvm/jdk1.7.0_71/jre/lib/amd64/server               (Oracle JVM)
>   /usr/lib/jvm/java-1.7.0-openjdk-amd64/jre/lib/amd64/server  (OpenJDK)
> However, the files:
>   libjsig.so
>   libjvm.so   (3 versions)
> do exist for IBM JVM.
> Anyway, creating the server directory and copying the files (tried with the 3 
> versions of libjvm.so) does not fix the issue:
> checking whether or not we can build with JNI... 
> /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined 
> reference to `dlopen'
> /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined 
> reference to `dlclose'
> /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined 
> reference to `dlerror'
> /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined 
> reference to `dlsym'
> /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined 
> reference to `dladdr'
> Something (dlopen, dlclose, dlerror, dlsym, dladdr) is missing in the IBM JVM.
> So, either the configure step relies on a feature that is not in the Java 
> standard but only in the Oracle JVM and OpenJDK, or the IBM JVM lacks part of 
> the Java standard.
> I'm not an expert on this, so I'd like the Mesos developers to experiment with 
> the IBM JVM. Maybe there is another solution for this step of the Mesos configure 
> that would work with all 3 JVMs.




[jira] [Commented] (MESOS-1023) Replace all static/global variables with non-POD type

2015-03-09 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353464#comment-14353464
 ] 

Dominic Hamon commented on MESOS-1023:
--

commit f780f67717fe0aa25b6870baedd55c43a7017edb (HEAD, origin/master, 
origin/HEAD, master)
Author: Dominic Hamon 
Commit: Dominic Hamon 

Remove static strings from process and split out some source.

Review: https://reviews.apache.org/r/30841


> Replace all static/global variables with non-POD type
> -
>
> Key: MESOS-1023
> URL: https://issues.apache.org/jira/browse/MESOS-1023
> Project: Mesos
>  Issue Type: Bug
>  Components: general, technical debt
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>  Labels: c++
>
> See 
> http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Static_and_Global_Variables
>  for the background.
> Real bugs have been seen. For example, in process::ID::generate we have a 
> static map that can be accessed within the function after exit has been 
> called. That is, we can try to access the map after it has been destroyed but 
> before exit has completed.
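One common workaround for this destruction-order problem, along the lines of the style guide linked above, is to heap-allocate the static on first use and never free it, so it outlives every static destructor that might still call into it. The sketch applies that to a hypothetical `generateId()` modelled loosely on `process::ID::generate`; the `prefix(N)` format and the counter starting at 1 are assumptions, not the libprocess implementation.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical ID generator in the spirit of process::ID::generate.
// The map and mutex are allocated on first use and intentionally never
// deleted, so they remain valid even if this function runs while static
// destructors are executing during exit().
std::string generateId(const std::string& prefix) {
  // Function-local static initialization is thread-safe since C++11.
  static std::map<std::string, int>* counters = new std::map<std::string, int>();
  static std::mutex* mutex = new std::mutex();

  std::lock_guard<std::mutex> lock(*mutex);
  int n = ++(*counters)[prefix];
  return prefix + "(" + std::to_string(n) + ")";
}
```

The "leak" is deliberate and bounded: the objects live for the whole process lifetime, and the OS reclaims the memory at exit.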




[jira] [Updated] (MESOS-2294) Implement the Events endpoint on master

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2294:
-
Labels: twitter  (was: )

> Implement the Events endpoint on master
> ---
>
> Key: MESOS-2294
> URL: https://issues.apache.org/jira/browse/MESOS-2294
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: twitter
>





[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2277:
-
Labels: documentation newbie starter twitter  (was: documentation newbie 
starter)

> Document undocumented HTTP endpoints
> 
>
> Key: MESOS-2277
> URL: https://issues.apache.org/jira/browse/MESOS-2277
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Priority: Minor
>  Labels: documentation, newbie, starter, twitter
>
> Did a quick scan and we are missing documentation for a few endpoints:
> {code}
> files/browse.json
> files/read.json
> files/download.json
> files/debug.json
> master/roles.json
> master/state.json
> master/stats.json
> slave/state.json
> slave/stats.json
> {code}




[jira] [Assigned] (MESOS-1127) Implement the protobufs for the scheduler API

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1127:


Assignee: Vinod Kone  (was: Benjamin Hindman)

> Implement the protobufs for the scheduler API
> -
>
> Key: MESOS-1127
> URL: https://issues.apache.org/jira/browse/MESOS-1127
> Project: Mesos
>  Issue Type: Task
>  Components: framework
>Reporter: Benjamin Hindman
>Assignee: Vinod Kone
>  Labels: twitter
>
> The default scheduler/executor interface and implementation in Mesos have a 
> few drawbacks:
> (1) The interface is fairly high-level which makes it hard to do certain 
> things, for example, handle events (callbacks) in batch. This can have a big 
> impact on the performance of schedulers (for example, writing task updates 
> that need to be persisted).
> (2) The implementation requires writing a lot of boilerplate JNI and native 
> Python wrappers when adding additional API components.
> The plan is to provide a lower-level API that can easily be used to implement 
> the higher-level API that is currently provided. This will also open the door 
> to more easily building native-language Mesos libraries (i.e., not needing 
> the C++ shim layer) and building new higher-level abstractions on top of the 
> lower-level API.




[jira] [Updated] (MESOS-2293) Implement the Call endpoint on master

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2293:
-
Labels: twitter  (was: )

> Implement the Call endpoint on master
> -
>
> Key: MESOS-2293
> URL: https://issues.apache.org/jira/browse/MESOS-2293
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: twitter
>





[jira] [Updated] (MESOS-1988) Scheduler driver should not generate TASK_LOST when disconnected from master

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1988:
-
Labels: twitter  (was: )

> Scheduler driver should not generate TASK_LOST when disconnected from master
> 
>
> Key: MESOS-1988
> URL: https://issues.apache.org/jira/browse/MESOS-1988
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>  Labels: twitter
>
> Currently, the driver replies to launchTasks() with TASK_LOST if it detects 
> that it is disconnected from the master. After MESOS-1972 lands, this will be 
> the only place where the driver generates TASK_LOST. See MESOS-1972 for more 
> context.
> This fix is targeted for 0.22.0 to give frameworks time to implement 
> reconciliation.




[jira] [Updated] (MESOS-94) Master and Slave HTTP handlers should have unit tests

2015-03-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-94:
---
Labels: http json test twitter  (was: http json test)

> Master and Slave HTTP handlers should have unit tests
> -
>
> Key: MESOS-94
> URL: https://issues.apache.org/jira/browse/MESOS-94
> Project: Mesos
>  Issue Type: Improvement
>  Components: json api, master, slave, test
>Reporter: Charles Reiss
>  Labels: http, json, test, twitter
>
> The Master and Slave have HTTP handlers which serve their state (mainly for 
> the webui to use). There should be unit tests of these.




[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.

2015-03-09 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353182#comment-14353182
 ] 

Dominic Hamon commented on MESOS-2467:
--

rather than relying on the first character (which can also be '{' in valid JSON), 
perhaps we can:

- try JSON parsing, catch failure
- fall back to old parsing

This also means we can deprecate the old parsing behaviour more easily. 

> Allow --resources flag to take JSON.
> 
>
> Key: MESOS-2467
> URL: https://issues.apache.org/jira/browse/MESOS-2467
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>
> Currently, we use a customized format for the --resources flag. As we introduce 
> more and more features (e.g., persistence, reservation) in the Resource object, we 
> need a more generic way to specify --resources.
> For backward compatibility, we can scan the first character. If it is '[', 
> then we invoke the JSON parser. Otherwise, we use the existing parser.
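A rough sketch of the try-then-fall-back approach suggested in the comment above. The JSON parser is passed in as a callable because the sketch does not assume any particular JSON library; `Resources` is simplified to a name-to-scalar map, and `parseLegacy()` handles only `name:value;name:value` scalars. All of these names are hypothetical.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical, simplified model: resource name -> scalar amount.
using Resources = std::map<std::string, double>;

// Any parser that returns std::nullopt on malformed input.
using Parser = std::function<std::optional<Resources>(const std::string&)>;

// Minimal legacy parser for "name:value;name:value" (scalars only).
std::optional<Resources> parseLegacy(const std::string& text) {
  Resources resources;
  std::istringstream stream(text);
  std::string token;
  while (std::getline(stream, token, ';')) {
    std::string::size_type colon = token.find(':');
    if (colon == std::string::npos || colon == 0) {
      return std::nullopt;
    }
    std::string value = token.substr(colon + 1);
    try {
      std::size_t used = 0;
      double amount = std::stod(value, &used);
      if (used != value.size()) {
        return std::nullopt;  // Trailing garbage such as "2x".
      }
      resources[token.substr(0, colon)] = amount;
    } catch (const std::exception&) {  // std::stod could not convert.
      return std::nullopt;
    }
  }
  if (resources.empty()) {
    return std::nullopt;
  }
  return resources;
}

// Attempt JSON first and fall back to the legacy format only if that
// fails, instead of guessing from the first character ('[' or '{').
std::optional<Resources> parseResources(const std::string& text,
                                        const Parser& parseJson) {
  if (std::optional<Resources> parsed = parseJson(text)) {
    return parsed;
  }
  return parseLegacy(text);
}
```

Once every deployment passes JSON, dropping the fallback is a one-line change, which is what makes deprecating the old format easier.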




[jira] [Commented] (MESOS-2457) Update post-reviews to rbtools in 'submit your patch' of developer's guide

2015-03-06 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350579#comment-14350579
 ] 

Dominic Hamon commented on MESOS-2457:
--

Install RBTools, yes. But we still want people to run support/post-reviews.py 
as it wraps rbt and avoids users having to set parent branches and manage diff 
chains.

> Update post-reviews to rbtools in 'submit your patch' of developer's guide 
> ---
>
> Key: MESOS-2457
> URL: https://issues.apache.org/jira/browse/MESOS-2457
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation, project website
>Reporter: Nancy Ko
>Priority: Minor
>  Labels: documentation, newbie
>
> In developer's guide 
> (http://mesos.apache.org/documentation/latest/mesos-developers-guide/) 
> post-reviews should be changed to review board tools. Specifically: 
> List item 3: 
> "First, install post-review. See Instructions"
> See Instructions link should also redirect to: 
> https://www.reviewboard.org/docs/rbtools/dev/ 
> instead of:
> https://www.reviewboard.org/docs/manual/dev/users/tools/post-review/
> AND
> List item 5:
> "From your local branch run support/post-reviews.py."
> The run command should be changed to: rbt post




[jira] [Commented] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation

2015-03-03 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345863#comment-14345863
 ] 

Dominic Hamon commented on MESOS-2422:
--

https://reviews.apache.org/r/31502/
https://reviews.apache.org/r/31503/
https://reviews.apache.org/r/31504/
https://reviews.apache.org/r/31505/

> Use fq_codel qdisc for egress network traffic isolation
> ---
>
> Key: MESOS-2422
> URL: https://issues.apache.org/jira/browse/MESOS-2422
> Project: Mesos
>  Issue Type: Task
>Reporter: Cong Wang
>Assignee: Cong Wang
>  Labels: twitter
>





[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2422:
-
Labels: twitter  (was: )






[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2058:
-
Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 4  (was: Twitter 
Mesos Q1 Sprint 1)

> Deprecate stats.json endpoints for Master and Slave
> ---
>
> Key: MESOS-2058
> URL: https://issues.apache.org/jira/browse/MESOS-2058
> Project: Mesos
>  Issue Type: Task
>  Components: master, slave
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>  Labels: twitter
> Fix For: 0.23.0
>
>
> With the introduction of the libprocess {{/metrics/snapshot}} endpoint, 
> metrics are now duplicated in the Master and Slave between this and 
> {{stats.json}}. We should deprecate the {{stats.json}} endpoints.
> Manual inspection of {{stats.json}} shows that all metrics are now covered by 
> the new endpoint for Master and Slave.




[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2422:
-
Sprint: Twitter Mesos Q1 Sprint 4






[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2332:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos 
Q1 Sprint 4  (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3)

> Report per-container metrics for network bandwidth throttling
> -
>
> Key: MESOS-2332
> URL: https://issues.apache.org/jira/browse/MESOS-2332
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Paul Brett
>Assignee: Paul Brett
>  Labels: features, twitter
>
> Export metrics from the network isolator to identify the scope and duration of 
> container throttling.  
> Packet loss can be identified from the overlimits and requeues fields of the 
> htb qdisc report for the virtual interface, e.g.
> {noformat}
> $ tc -s -d qdisc show dev mesos19223
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc ingress : parent :fff1 
>  Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> {noformat}
> Note that since a packet can be examined multiple times before transmission, 
> overlimits can exceed total packets sent.  
> Add these to the port_mapping isolator usage() and the container statistics 
> protobuf. Carefully consider the naming (esp. tx/rx) and commenting of the 
> protobuf fields so it's clear what these represent and how they differ from 
> the existing dropped packet counts from the network stack.
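As a hedged illustration of collecting these counters, the sketch below extracts them from a `tc -s qdisc show` "Sent" record with `std::regex`. `QdiscStats` and `parseSentLine()` are hypothetical, and the parser expects the whole "Sent ... requeues N)" record on one line. Parsing tc's text output is only one option; reading the same counters over netlink would avoid depending on tc's output format.

```cpp
#include <cstdint>
#include <optional>
#include <regex>
#include <string>

// Hypothetical holder for the counters on a `tc -s qdisc show` "Sent" line.
struct QdiscStats {
  uint64_t bytes = 0;
  uint64_t packets = 0;
  uint64_t dropped = 0;
  uint64_t overlimits = 0;
  uint64_t requeues = 0;
};

// Extracts the counters from a line such as
//   " Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)"
// and returns std::nullopt if the line has a different shape.
std::optional<QdiscStats> parseSentLine(const std::string& line) {
  static const std::regex pattern(
      R"(Sent (\d+) bytes (\d+) pkt \(dropped (\d+), overlimits (\d+) requeues (\d+)\))");
  std::smatch match;
  if (!std::regex_search(line, match, pattern)) {
    return std::nullopt;
  }
  QdiscStats stats;
  stats.bytes = std::stoull(match[1].str());
  stats.packets = std::stoull(match[2].str());
  stats.dropped = std::stoull(match[3].str());
  stats.overlimits = std::stoull(match[4].str());
  stats.requeues = std::stoull(match[5].str());
  return stats;
}
```

Because a packet can be counted as over-limit several times, `overlimits` should be exported as its own counter rather than compared against `packets`.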




[jira] [Updated] (MESOS-2403) MasterAllocatorTest/0.FrameworkReregistersFirst is flaky

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2403:
-
Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4  (was: Twitter 
Mesos Q1 Sprint 3)

> MasterAllocatorTest/0.FrameworkReregistersFirst is flaky
> 
>
> Key: MESOS-2403
> URL: https://issues.apache.org/jira/browse/MESOS-2403
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.23.0
> Environment: ASF CI (Ubuntu) 
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> {code}
> [ RUN  ] MasterAllocatorTest/0.FrameworkReregistersFirst
> Using temporary directory 
> '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml'
> I0224 23:22:31.681670 30589 leveldb.cpp:176] Opened db in 2.943518ms
> I0224 23:22:31.682152 30619 process.cpp:2117] Dropped / Lost event for PID: 
> slave(65)@67.195.81.187:38391
> I0224 23:22:31.682732 30589 leveldb.cpp:183] Compacted db in 1.029469ms
> I0224 23:22:31.682777 30589 leveldb.cpp:198] Created db iterator in 15460ns
> I0224 23:22:31.682792 30589 leveldb.cpp:204] Seeked to beginning of db in 
> 1832ns
> I0224 23:22:31.682802 30589 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 319ns
> I0224 23:22:31.682833 30589 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0224 23:22:31.683228 30605 recover.cpp:449] Starting replica recovery
> I0224 23:22:31.683537 30605 recover.cpp:475] Replica is in 4 status
> I0224 23:22:31.684624 30615 replica.cpp:641] Replica in 4 status received a 
> broadcasted recover request
> I0224 23:22:31.684978 30616 recover.cpp:195] Received a recover response from 
> a replica in 4 status
> I0224 23:22:31.685405 30610 recover.cpp:566] Updating replica status to 3
> I0224 23:22:31.686249 30609 master.cpp:349] Master 
> 20150224-232231-3142697795-38391-30589 (pomona.apache.org) started on 
> 67.195.81.187:38391
> I0224 23:22:31.686265 30617 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 717897ns
> I0224 23:22:31.686319 30617 replica.cpp:323] Persisted replica status to 3
> I0224 23:22:31.686336 30609 master.cpp:395] Master only allowing 
> authenticated frameworks to register
> I0224 23:22:31.686357 30609 master.cpp:400] Master only allowing 
> authenticated slaves to register
> I0224 23:22:31.686390 30609 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml/credentials'
> I0224 23:22:31.686511 30606 recover.cpp:475] Replica is in 3 status
> I0224 23:22:31.686563 30609 master.cpp:442] Authorization enabled
> I0224 23:22:31.686929 30607 whitelist_watcher.cpp:79] No whitelist given
> I0224 23:22:31.686954 30603 hierarchical.hpp:287] Initialized hierarchical 
> allocator process
> I0224 23:22:31.687134 30605 replica.cpp:641] Replica in 3 status received a 
> broadcasted recover request
> I0224 23:22:31.687731 30609 master.cpp:1356] The newly elected leader is 
> master@67.195.81.187:38391 with id 20150224-232231-3142697795-38391-30589
> I0224 23:22:31.839818 30609 master.cpp:1369] Elected as the leading master!
> I0224 23:22:31.839834 30609 master.cpp:1187] Recovering from registrar
> I0224 23:22:31.839926 30605 registrar.cpp:313] Recovering registrar
> I0224 23:22:31.84 30613 recover.cpp:195] Received a recover response from 
> a replica in 3 status
> I0224 23:22:31.840504 30606 recover.cpp:566] Updating replica status to 1
> I0224 23:22:31.841599 30611 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 990330ns
> I0224 23:22:31.841627 30611 replica.cpp:323] Persisted replica status to 1
> I0224 23:22:31.841743 30611 recover.cpp:580] Successfully joined the Paxos 
> group
> I0224 23:22:31.841904 30611 recover.cpp:464] Recover process terminated
> I0224 23:22:31.842366 30608 log.cpp:660] Attempting to start the writer
> I0224 23:22:31.843557 30607 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0224 23:22:31.844312 30607 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 722368ns
> I0224 23:22:31.844337 30607 replica.cpp:345] Persisted promised to 1
> I0224 23:22:31.844889 30615 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0224 23:22:31.846043 30614 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0224 23:22:31.846729 30614 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 660024ns
> I0224 23:22:31.846746 30614 replica.cpp:679] Persisted action at 0
> I0224 23:22:31.847671 30611 replica.cpp:511] Replica received write request 
> for position 0
> I0224 23:22:31.847723 30611 leveldb.cpp:438] Reading position from leveldb 
> took 27349ns
> I0224 23:22:31.848429 30611 leveldb.cpp:34

[jira] [Updated] (MESOS-2289) Design doc for the HTTP API

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2289:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos 
Q1 Sprint 4  (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3)

> Design doc for the HTTP API
> ---
>
> Key: MESOS-2289
> URL: https://issues.apache.org/jira/browse/MESOS-2289
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> This tracks the design of the HTTP API.




[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2350:
-
Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4  (was: Twitter 
Mesos Q1 Sprint 3)

> Add support for MesosContainerizerLaunch to chroot to a specified path
> --
>
> Key: MESOS-2350
> URL: https://issues.apache.org/jira/browse/MESOS-2350
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.22.0, 0.21.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>  Labels: twitter
>
> In preparation for the MesosContainerizer to support a filesystem isolator, 
> the MesosContainerizerLauncher must support chrooting. Optionally, it should 
> also configure the chroot environment by (re-)mounting special filesystems 
> such as /proc and /sys and making device nodes such as /dev/zero, etc., such 
> that the chroot environment is functional.




[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2136:
-
Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos 
Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter 
Mesos Q1 Sprint 4  (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, 
Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3)

> Expose per-cgroup memory pressure
> -
>
> Key: MESOS-2136
> URL: https://issues.apache.org/jira/browse/MESOS-2136
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Ian Downes
>Assignee: Chi Zhang
>  Labels: twitter
>
> The cgroup memory controller can provide information on the memory pressure 
> of a cgroup. This is in the form of an event-based notification where events 
> at one of three levels (low, medium, critical) are generated when the kernel takes 
> specific actions to allocate memory. This signal is probably more informative than 
> comparing memory usage to memory limit.




[jira] [Updated] (MESOS-2103) Expose number of processes and threads in a container

2015-03-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2103:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos 
Q1 Sprint 4  (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3)

> Expose number of processes and threads in a container
> -
>
> Key: MESOS-2103
> URL: https://issues.apache.org/jira/browse/MESOS-2103
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Chi Zhang
>  Labels: twitter
>
> The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, 
> cpus_throttled_time) are difficult to interpret.
> 1) nr_throttled is the number of intervals where *any* throttling occurred
> 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in 
> the Linux sense).
> For example, in a typical 60 second sampling interval: nr_periods = 600, 
> nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be 
> much higher than (60/600) * 60 = 6 seconds if there is more than one task 
> that is runnable but throttled. *Each* throttled task contributes to the 
> total throttled time.
> Small test to demonstrate throttled_time > nr_periods * quota_interval:
> 5 x {{'openssl speed'}} running with quota=100ms:
> {noformat}
> cat cpu.stat && sleep 1 && cat cpu.stat
> nr_periods 3228
> nr_throttled 1276
> throttled_time 528843772540
> nr_periods 3238
> nr_throttled 1286
> throttled_time 531668964667
> {noformat}
> All 10 intervals were throttled (100%) for a total time of 2.8 seconds in 1 second 
> ("more than 100%" of the time interval).
> It would be helpful to expose the number of processes and tasks in the 
> container cgroup. This would be at a very coarse granularity but would give 
> some guidance.
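Working through the sample above: between the two snapshots, nr_periods and nr_throttled each grew by 10, so every period was throttled, while throttled_time (reported in nanoseconds) grew by 531668964667 - 528843772540 = 2825192127 ns, about 2.83 s within roughly one second of wall-clock time. The hypothetical helpers below compute those deltas from two cpu.stat snapshots of the same cgroup.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parses cgroup v1 cpu.stat text ("key value" per line) into a map.
std::map<std::string, uint64_t> parseCpuStat(const std::string& text) {
  std::map<std::string, uint64_t> stats;
  std::istringstream in(text);
  std::string key;
  uint64_t value = 0;
  while (in >> key >> value) {
    stats[key] = value;
  }
  return stats;
}

struct ThrottleDelta {
  uint64_t periods;           // CFS enforcement periods elapsed.
  uint64_t throttledPeriods;  // Periods in which any throttling occurred.
  double throttledSeconds;    // Growth of throttled_time, converted from ns.
};

// Differences between two snapshots taken from the same cgroup.
ThrottleDelta throttleDelta(const std::map<std::string, uint64_t>& before,
                            const std::map<std::string, uint64_t>& after) {
  ThrottleDelta delta;
  delta.periods = after.at("nr_periods") - before.at("nr_periods");
  delta.throttledPeriods = after.at("nr_throttled") - before.at("nr_throttled");
  delta.throttledSeconds =
      static_cast<double>(after.at("throttled_time") - before.at("throttled_time")) / 1e9;
  return delta;
}
```

A throttledSeconds value larger than the sampling interval is exactly the "more than 100%" case described above, which is why a count of processes and threads would help interpret it.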




[jira] [Commented] (MESOS-2418) Remove raw pointers from stout/os.hpp

2015-02-27 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340367#comment-14340367
 ] 

Dominic Hamon commented on MESOS-2418:
--

no more boost please.

there's also {{std::array}} if it's available on our compiler/platform 
suite as it is more similar to the existing fixed-size buffer usage.

{{std::vector}} has the benefit of working in C++03 compatible compilers.

given we can't reach consensus on any use of {{std::unique_ptr}} i doubt it's a 
good fit here.

> Remove raw pointers from stout/os.hpp
> -
>
> Key: MESOS-2418
> URL: https://issues.apache.org/jira/browse/MESOS-2418
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Joerg Schad
>Priority: Minor
>
> In MESOS-2412 a memory leak was found because of a missing {{delete}}. 
> Forgetting to free memory is a common error while manually managing memory. 
> In order to prevent this issue from happening again, another strategy should 
> be used to handle buffers.
> Among the options there are {{std::vector}}, 
> {{std::unique_ptr}}, or {{boost::scoped_array}}.
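A small sketch of the two standard-library options discussed above, using `snprintf` as a stand-in for a C API that writes into a caller-supplied buffer. `formatPid()` and `formatLabel()` are hypothetical; `std::array` suits a compile-time bound, while `std::vector` handles sizes known only at run time. Under C++03, `&buffer[0]` would replace `buffer.data()`.

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical example with a compile-time upper bound: std::array keeps the
// buffer on the stack, so there is nothing to free.
std::string formatPid(int pid) {
  std::array<char, 32> buffer{};  // Enough for "pid(" + any 32-bit int + ")".
  std::snprintf(buffer.data(), buffer.size(), "pid(%d)", pid);
  return std::string(buffer.data());
}

// Hypothetical example whose size is only known at run time: std::vector
// owns the heap buffer and releases it on every return path, replacing
// new[]/delete[].
std::string formatLabel(const std::string& name, int id) {
  // With a null buffer and size 0, snprintf only reports the needed length.
  int needed = std::snprintf(nullptr, 0, "%s(%d)", name.c_str(), id);
  if (needed < 0) {
    return std::string();  // Encoding error.
  }
  std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
  std::snprintf(buffer.data(), buffer.size(), "%s(%d)", name.c_str(), id);
  return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```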




[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2412:
-
Sprint:   (was: Twitter Mesos Q1 Sprint 3)

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: coverity, twitter
>
> Coverity picked up this potential memleak in os.hpp where we do not delete 
> the buffer in the else case. The exact same pattern occurs in getuid(const 
> Option<std::string>& user = None()).
> The corresponding CID 1230371 and 1230371.
> {code}
> inline Result<gid_t> getgid(const Option<std::string>& user = None())
> ...
>   while (true) {
>     char* buffer = new char[size];
>     if (getpwnam_r(user.get().c_str(), &passwd, buffer, size, &result) == 0) {
>       ...
>       delete[] buffer;
>       return gid;
>     } else {
>       // RHEL7 (and possibly other systems) will return non-zero and
>       // set one of the following errors for "The given name or uid
>       // was not found." See 'man getpwnam_r'. We only check for the
>       // errors explicitly listed, and do not consider the ellipsis.
>       if (errno == ENOENT ||
>           errno == ESRCH ||
>           errno == EBADF ||
>           errno == EPERM) {
>         return None();
>         // HERE WE DO NOT DELETE BUFFER.
>       }
>       ...
>       // getpwnam_r set ERANGE so try again with a larger buffer.
>       size *= 2;
>       delete[] buffer;
>     }
>   }
> {code}
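A minimal sketch of the same lookup with the buffer owned by a `std::vector<char>`, so no return path can leak it. `lookupGid()` is a hypothetical stand-in for stout's `os::getgid()`: it returns `std::nullopt` where the original returns `None()` and throws `std::system_error` for other failures. It also checks the return value of `getpwnam_r`, which is where POSIX reports the error number, rather than `errno`.

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper, not the stout API: returns the primary gid of
// `user`, std::nullopt if the user does not exist, and throws
// std::system_error for any other failure.
std::optional<gid_t> lookupGid(const std::string& user) {
  long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
  std::size_t size = hint > 0 ? static_cast<std::size_t>(hint) : 1024;

  while (true) {
    // The vector owns the scratch buffer, so every return path (and any
    // exception) releases it; there is no delete[] to forget.
    std::vector<char> buffer(size);
    struct passwd entry;
    struct passwd* result = nullptr;

    int rc = getpwnam_r(user.c_str(), &entry, buffer.data(), buffer.size(), &result);
    if (rc == 0) {
      if (result == nullptr) {
        return std::nullopt;  // No such user.
      }
      return entry.pw_gid;
    }

    // Some systems (e.g. RHEL7) report "not found" with one of these codes.
    if (rc == ENOENT || rc == ESRCH || rc == EBADF || rc == EPERM) {
      return std::nullopt;
    }

    if (rc != ERANGE) {
      throw std::system_error(rc, std::generic_category(), "getpwnam_r");
    }

    // ERANGE: the buffer was too small, so retry with twice the space.
    size *= 2;
  }
}
```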




[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2412:
-
Assignee: Joerg Schad  (was: Dominic Hamon)





[jira] [Commented] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339350#comment-14339350
 ] 

Dominic Hamon commented on MESOS-2412:
--

Discarded my review; more discussion is apparently needed.

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: coverity, twitter
>




[jira] [Commented] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338796#comment-14338796
 ] 

Dominic Hamon commented on MESOS-2412:
--

https://reviews.apache.org/r/31489/

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Dominic Hamon
>  Labels: coverity, twitter
>




[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2412:
-
Labels: coverity twitter  (was: coverity)

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Dominic Hamon
>  Labels: coverity, twitter
>




[jira] [Assigned] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-2412:


Assignee: Dominic Hamon

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Dominic Hamon
>  Labels: coverity, twitter
>




[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp

2015-02-26 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2412:
-
Sprint: Twitter Mesos Q1 Sprint 3

> Potential memleak(s) in stout/os.hpp
> 
>
> Key: MESOS-2412
> URL: https://issues.apache.org/jira/browse/MESOS-2412
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Joerg Schad
>Assignee: Dominic Hamon
>  Labels: coverity, twitter
>




[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2058:
-
Fix Version/s: 0.23.0

> Deprecate stats.json endpoints for Master and Slave
> ---
>
> Key: MESOS-2058
> URL: https://issues.apache.org/jira/browse/MESOS-2058
> Project: Mesos
>  Issue Type: Task
>  Components: master, slave
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>  Labels: twitter
> Fix For: 0.23.0
>
>
> With the introduction of the libprocess {{/metrics/snapshot}} endpoint, 
> metrics are now duplicated in the Master and Slave between this and 
> {{stats.json}}. We should deprecate the {{stats.json}} endpoints.
> Manual inspection of {{stats.json}} shows that all metrics are now covered by 
> the new endpoint for Master and Slave.
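The manual inspection could in principle be scripted. A minimal sketch, assuming the top-level key names of both JSON responses have already been extracted (parsing not shown) and that a hand-written `renames` table maps old {{stats.json}} names to their {{/metrics/snapshot}} equivalents; both inputs and the helper name `uncoveredStats` are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns every stats.json key that has no counterpart
// among the /metrics/snapshot keys. Keys missing from `renames` are assumed
// to keep the same name in both endpoints.
std::vector<std::string> uncoveredStats(
    const std::set<std::string>& statsKeys,
    const std::set<std::string>& snapshotKeys,
    const std::map<std::string, std::string>& renames)
{
  std::vector<std::string> missing;
  for (const std::string& key : statsKeys) {
    auto it = renames.find(key);
    const std::string& target = (it != renames.end()) ? it->second : key;
    if (snapshotKeys.count(target) == 0) {
      missing.push_back(key);
    }
  }
  return missing; // Sorted, because std::set iterates in order.
}
```

An empty result would back up the claim that the new endpoint covers everything, but only as far as the `renames` table is itself correct.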




[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2144:
-
Sprint: Twitter Mesos Q1 Sprint 2  (was: Twitter Mesos Q1 Sprint 2, Twitter 
Mesos Q1 Sprint 3)

> Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
> ---
>
> Key: MESOS-2144
> URL: https://issues.apache.org/jira/browse/MESOS-2144
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Cody Maloney
>Assignee: Yan Xu
>Priority: Minor
>  Labels: flaky, twitter
>
> Occurred on the review bot's review of: 
> https://reviews.apache.org/r/28262/#review62333
> The review doesn't touch code related to the test (and doesn't break 
> libprocess in general).
> [ RUN  ] ExamplesTest.LowLevelSchedulerPthread
> ../../src/tests/script.cpp:83: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
> The test 




[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2366:
-
Story Points: 1

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Dominic Hamon
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
> {code}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileLostTask
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF'
> I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms
> I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns
> I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns
> I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 
> 2370ns
> I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 348ns
> I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery
> I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status
> I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a 
> broadcasted recover request
> I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from 
> a replica in 4 status
> I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3
> I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 733223ns
> I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3
> I0218 01:53:26.886714 13943 master.cpp:347] Master 
> 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 
> 67.195.81.187:57268
> I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing 
> authenticated frameworks to register
> I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing 
> authenticated slaves to register
> I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials'
> I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status
> I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled
> I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a 
> broadcasted recover request
> I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given
> I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical 
> allocator process
> I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from 
> a replica in 3 status
> I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is 
> master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918
> I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(183)@67.195.81.187:57268
> I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master!
> I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar
> I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: 
> scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268
> I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar
> I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(179)@67.195.81.187:57268
> I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to 
> allocate!
> I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 
> slaves in 60316ns
> I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1
> I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 921045ns
> I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1
> I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos 
> group
> I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated
> I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer
> I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 534122ns
> I0218 01:53:27.916967 13937 replica.cpp:344] Persisted promised to 1
> I0218 01:53:27.917795 13936 coordinator.cpp:229] Coordinator attemping to 
> fill missing position
> I0218 01:53:27.919147 13941 replica.cpp:377] Re

[jira] [Comment Edited] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-23 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333701#comment-14333701
 ] 

Dominic Hamon edited comment on MESOS-2366 at 2/23/15 8:10 PM:
---

Looks like waiting for the status update acknowledgement message should be 
enough.

The master updates the metrics in {{updateTask}}, which is called from 
{{statusUpdate}}. It's possible that the StatusUpdate message has been sent 
(which we check for) but not yet acted on by the Master, so the metrics have 
not been updated. Waiting for the explicit acknowledgement serves as a proxy 
signal that the Master has updated the metrics.
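The ordering argument can be shown with a small standalone model. This is a hypothetical illustration, not Mesos test code: `FakeMaster`, `statusUpdate()` and `tasksLost()` are invented names, and `std::async` stands in for the master's actor. The returned future becomes ready only after the metric has been bumped, so waiting on it (the "acknowledgement") guarantees the new value is visible, whereas reading the metric right after sending the update would race.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical model of the race: statusUpdate() plays the master handling a
// StatusUpdate message. It bumps a metric (as updateTask() does) and only then
// "acknowledges" by completing the returned future.
class FakeMaster
{
public:
  std::future<void> statusUpdate()
  {
    return std::async(std::launch::async, [this] {
      // Simulated delay between the message being sent and being handled.
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      tasksLost_.fetch_add(1);
      // Returning here makes the future ready: this is the acknowledgement.
    });
  }

  int tasksLost() const { return tasksLost_.load(); }

private:
  std::atomic<int> tasksLost_{0};
};
```

In a test, `ack.wait()` on the returned future must come before any check of `tasksLost()`; checking immediately after calling `statusUpdate()` is the flaky pattern.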


was (Author: dhamon):
looks like waiting for the status update acknowledgement message should be 
enough.

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Dominic Hamon
>  Labels: flaky-test
>

[jira] [Updated] (MESOS-886) Slave should wait until resources are isolated before launching tasks

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-886:

Labels: twitter  (was: )

> Slave should wait until resources are isolated before launching tasks
> -
>
> Key: MESOS-886
> URL: https://issues.apache.org/jira/browse/MESOS-886
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, slave
>Affects Versions: 0.14.0
>Reporter: Ian Downes
>Assignee: Yifan Gu
>Priority: Minor
>  Labels: twitter
>
> The slave dispatches to the isolator to update resources and then sends 
> RunTaskMessage to the executor without waiting for the update to complete. 
> This race could, for example, lead to the task using too much RAM (including 
> file cache) and then being OOM killed whenever the resource update completes.
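A minimal sketch of the ordering the title asks for, under stated assumptions: the isolator update is modelled as a `std::future<void>`, the RunTaskMessage send as a callback, and `launchAfterUpdate` is a hypothetical name. It blocks for simplicity; in libprocess the natural form would be a continuation chained on the update's `Future` instead.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>

// Hypothetical sketch, not Mesos code: `update` is the pending isolator
// resource update and `launch` stands in for sending RunTaskMessage.
// Returns true if the task was launched.
bool launchAfterUpdate(std::future<void> update,
                       const std::function<void()>& launch)
{
  try {
    // Wait until the update has been applied; get() rethrows if it failed.
    update.get();
  } catch (const std::exception&) {
    // Never launch under limits that were not applied.
    return false;
  }
  launch();
  return true;
}
```

Handling the failure case explicitly matters as much as the ordering: a failed update should not fall through to a launch.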




[jira] [Updated] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2386:
-
Labels: twitter  (was: )

> Provide full filesystem isolation as a native mesos isolator
> 
>
> Key: MESOS-2386
> URL: https://issues.apache.org/jira/browse/MESOS-2386
> Project: Mesos
>  Issue Type: Epic
>Reporter: Dominic Hamon
>  Labels: twitter
>





[jira] [Commented] (MESOS-886) Slave should wait until resources are isolated before launching tasks

2015-02-23 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333741#comment-14333741
 ] 

Dominic Hamon commented on MESOS-886:
-

Is this still relevant?

> Slave should wait until resources are isolated before launching tasks
> -
>
> Key: MESOS-886
> URL: https://issues.apache.org/jira/browse/MESOS-886
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, slave
>Affects Versions: 0.14.0
>Reporter: Ian Downes
>Assignee: Yifan Gu
>Priority: Minor
>  Labels: twitter
>




[jira] [Updated] (MESOS-1397) Rename ResourceStatistics for containers

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1397:
-
Labels: twitter  (was: )

> Rename ResourceStatistics for containers
> 
>
> Key: MESOS-1397
> URL: https://issues.apache.org/jira/browse/MESOS-1397
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Ian Downes
>  Labels: twitter
>
> Rename ContainerStatistics which includes optional ResourceStatistics and 
> optional PerfStatistics.




[jira] [Updated] (MESOS-1282) Support unprivileged access to cgroups

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1282:
-
Labels: twitter  (was: )

> Support unprivileged access to cgroups
> --
>
> Key: MESOS-1282
> URL: https://issues.apache.org/jira/browse/MESOS-1282
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.19.0
>Reporter: Ian Downes
>Priority: Minor
>  Labels: twitter
> Attachments: MESOS-1282.patch
>
>
> Supporting this would allow running tests with cgroup isolators on CI 
> machines where sudo access is unavailable.
> This could be achieved by having the subsystems mounted and the mesos (or 
> mesos_test) cgroup created and owned by the unprivileged user.
> {noformat}
> [vagrant@mesos cpu]$ cat /proc/mounts | grep cgroup
> tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
> cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,clone_children 0 0
> cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,clone_children 0 0
> cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,clone_children 0 0
> cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,clone_children 0 0
> cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,clone_children 0 0
> cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,clone_children 0 0
> cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls,clone_children 0 0
> cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,clone_children 0 0
> [vagrant@mesos cpu]$ pwd
> /sys/fs/cgroup/cpu
> [vagrant@mesos cpu]$ ls -la
> total 0
> drwxr-xr-x  2 root root   0 May  1 22:11 .
> drwxrwxrwt 10 root root 200 Apr 30 23:09 ..
> -rw-r--r--  1 root root   0 Apr 30 23:14 cgroup.clone_children
> --w--w--w-  1 root root   0 Apr 30 23:09 cgroup.event_control
> -rw-r--r--  1 root root   0 Apr 30 23:09 cgroup.procs
> -rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_period_us
> -rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_quota_us
> -rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_period_us
> -rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_runtime_us
> -rw-r--r--  1 root root   0 Apr 30 23:09 cpu.shares
> -r--r--r--  1 root root   0 Apr 30 23:09 cpu.stat
> -rw-r--r--  1 root root   0 Apr 30 23:09 notify_on_release
> -rw-r--r--  1 root root   0 Apr 30 23:09 release_agent
> -rw-r--r--  1 root root   0 Apr 30 23:09 tasks
> {noformat}
> User is unprivileged:
> {noformat}
> [vagrant@mesos cpu]$ id
> uid=500(vagrant) gid=500(vagrant) groups=500(vagrant),10(wheel)
> [vagrant@mesos cpu]$ mkdir mesos
> mkdir: cannot create directory `mesos': Permission denied
> {noformat}
> Create a cgroup and chown to the unprivileged user.
> {noformat}
> [vagrant@mesos cpu]$ sudo mkdir mesos && sudo chown -R vagrant:vagrant mesos
> [vagrant@mesos cpu]$ ls -la
> total 0
> drwxr-xr-x  3 root    root      0 May  1 22:11 .
> drwxrwxrwt 10 root    root    200 Apr 30 23:09 ..
> -rw-r--r--  1 root    root      0 Apr 30 23:14 cgroup.clone_children
> --w--w--w-  1 root    root      0 Apr 30 23:09 cgroup.event_control
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cgroup.procs
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_period_us
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_quota_us
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_period_us
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_runtime_us
> -rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.shares
> -r--r--r--  1 root    root      0 Apr 30 23:09 cpu.stat
> drwxr-xr-x  2 vagrant vagrant   0 May  1 22:12 mesos
> -rw-r--r--  1 root    root      0 Apr 30 23:09 notify_on_release
> -rw-r--r--  1 root    root      0 Apr 30 23:09 release_agent
> -rw-r--r--  1 root    root      0 Apr 30 23:09 tasks
> {noformat}
> The unprivileged user can now create nested cgroups and move processes 
> into/out of cgroups it owns.
> {noformat}
> [vagrant@mesos cpu]$ echo $$
> 2877
> [vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
> [vagrant@mesos cpu]$ cat mesos/cgroup.procs
> 2877
> 2957
> [vagrant@mesos cpu]$ mkdir mesos/test
> [vagrant@mesos cpu]$ echo $$ > mesos/test/cgroup.procs
> [vagrant@mesos cpu]$ cat mesos/test/cgroup.procs
> 2877
> 2960
> [vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
> [vagrant@mesos cpu]$ cat mesos/cgroup.procs
> 2877
> 2977
> {noformat}
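The `echo $$ > mesos/cgroup.procs` step above is just a write of a pid into the cgroup's `cgroup.procs` file, so code running as the unprivileged user could do the same. A minimal sketch; `assignToCgroup` is a hypothetical helper, not a Mesos or stout API, and it assumes the caller already has write access to the target cgroup, as arranged by the `chown` above:

```cpp
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: move process `pid` into the cgroup directory
// `cgroupDir` by writing its pid to `cgroup.procs`, mirroring
// `echo $$ > mesos/cgroup.procs` in the transcript above.
void assignToCgroup(const std::string& cgroupDir, pid_t pid)
{
  const std::string path = cgroupDir + "/cgroup.procs";
  std::ofstream procs(path);
  if (!procs) {
    throw std::runtime_error("cannot open " + path);
  }
  procs << pid << '\n';
  // cgroupfs rejects invalid moves at write time, so flush and check.
  procs.flush();
  if (!procs) {
    throw std::runtime_error("failed to write pid to " + path);
  }
}
```

Nested cgroups such as `mesos/test` work the same way: create the directory, then call the helper with its path.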




[jira] [Created] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator

2015-02-23 Thread Dominic Hamon (JIRA)

Dominic Hamon created MESOS-2386:


 Summary: Provide full filesystem isolation as a native mesos 
isolator
 Key: MESOS-2386
 URL: https://issues.apache.org/jira/browse/MESOS-2386
 Project: Mesos
  Issue Type: Epic
Reporter: Dominic Hamon







[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-23 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333727#comment-14333727
 ] 

Dominic Hamon commented on MESOS-2366:
--

https://reviews.apache.org/r/31315/

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Dominic Hamon
>  Labels: flaky-test
>

[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-23 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333701#comment-14333701
 ] 

Dominic Hamon commented on MESOS-2366:
--

It looks like waiting for the status update acknowledgement message should be 
enough.
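The fix suggested above amounts to synchronizing on an event instead of asserting right away. Mesos tests usually express this with libprocess futures. The sketch below shows only the idea, using standard C++ threads. `FakeMaster`, its members and `metricAfterAcknowledgement` are hypothetical names, not Mesos code.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for the master (not Mesos code): it counts processed
// status update acknowledgements, which plays the role of the metric, and
// signals a promise when one is handled.
class FakeMaster {
 public:
  // Hand out the future before any acknowledgement can arrive.
  std::future<void> acknowledgementProcessed() { return processed_.get_future(); }

  // Runs on another thread, like an actor handling the incoming message.
  // Must be called at most once, since a promise can only be satisfied once.
  void handleAcknowledgement() {
    ++acknowledged_;
    processed_.set_value();  // publishes the counter update to the waiter
  }

  int acknowledged() const { return acknowledged_; }

 private:
  int acknowledged_ = 0;
  std::promise<void> processed_;
};

// Returns the metric as seen by a test that first waits for the
// acknowledgement, or -1 if it did not arrive within `timeout`.
int metricAfterAcknowledgement(std::chrono::milliseconds timeout) {
  FakeMaster master;
  std::future<void> processed = master.acknowledgementProcessed();

  // The acknowledgement arrives asynchronously, after an arbitrary delay.
  std::thread sender([&master] {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    master.handleAcknowledgement();
  });

  // Synchronize on the event instead of checking the metric immediately.
  const bool ready = processed.wait_for(timeout) == std::future_status::ready;
  sender.join();
  return ready ? master.acknowledged() : -1;
}
```

Reading the counter immediately after starting the sender would race with the delivery. Waiting on the future makes the check deterministic.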

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Dominic Hamon
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
> {code}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileLostTask
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF'
> I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms
> I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns
> I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns
> I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 
> 2370ns
> I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 348ns
> I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery
> I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status
> I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a 
> broadcasted recover request
> I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from 
> a replica in 4 status
> I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3
> I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 733223ns
> I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3
> I0218 01:53:26.886714 13943 master.cpp:347] Master 
> 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 
> 67.195.81.187:57268
> I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing 
> authenticated frameworks to register
> I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing 
> authenticated slaves to register
> I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials'
> I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status
> I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled
> I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a 
> broadcasted recover request
> I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given
> I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical 
> allocator process
> I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from 
> a replica in 3 status
> I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is 
> master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918
> I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(183)@67.195.81.187:57268
> I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master!
> I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar
> I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: 
> scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268
> I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar
> I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(179)@67.195.81.187:57268
> I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to 
> allocate!
> I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 
> slaves in 60316ns
> I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1
> I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 921045ns
> I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1
> I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos 
> group
> I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated
> I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer
> I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 534122ns
> I0218 01:53:27.916967 13937 replica.cpp:344] Persisted promised to 1
> I0218 01:53:27.917795 1393

[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2350:
-
Labels: twitter  (was: )

> Add support for MesosContainerizerLaunch to chroot to a specified path
> --
>
> Key: MESOS-2350
> URL: https://issues.apache.org/jira/browse/MESOS-2350
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.22.0, 0.21.1
>Reporter: Ian Downes
>Assignee: Ian Downes
>  Labels: twitter
>
> In preparation for the MesosContainerizer to support a filesystem isolator, 
> the MesosContainerizerLauncher must support chrooting. Optionally, it should 
> also configure the chroot environment by (re-)mounting special filesystems 
> such as /proc and /sys and creating device nodes such as /dev/zero, so 
> that the chroot environment is functional.
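The following is a minimal sketch of the chroot steps described above, assuming a Linux host and a root directory that already contains empty proc/, sys/ and dev/ directories. `enterChroot` and `describe` are hypothetical helpers, not the MesosContainerizerLaunch implementation. The calls themselves (chroot, chdir, mount, mknod) are the standard Linux system calls, and they require root privileges.

```cpp
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <string>

// Formats a failure; `err` is passed in so errno is captured right after the call.
static std::string describe(const char* what, int err) {
  return std::string(what) + ": " + std::strerror(err);
}

// Hypothetical helper: switch the calling process into `root` and make a
// minimal environment usable there. Returns an empty string on success,
// otherwise a description of the first failing step.
std::string enterChroot(const std::string& root) {
  if (::chroot(root.c_str()) != 0) return describe("chroot", errno);

  // Without this the working directory may still point outside the new root.
  if (::chdir("/") != 0) return describe("chdir /", errno);

  if (::mount("proc", "/proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC,
              nullptr) != 0) {
    return describe("mount /proc", errno);
  }
  if (::mount("sysfs", "/sys", "sysfs",
              MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC, nullptr) != 0) {
    return describe("mount /sys", errno);
  }

  // /dev/zero is character device 1:5 on Linux; an existing node is fine.
  if (::mknod("/dev/zero", S_IFCHR | 0666, makedev(1, 5)) != 0 &&
      errno != EEXIST) {
    return describe("mknod /dev/zero", errno);
  }
  return "";
}
```

Mounting after the chroot keeps every path relative to the new root. A real launcher would more likely prepare the mounts, possibly as bind mounts, inside a private mount namespace before switching root.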



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2366:
-
Sprint: Twitter Mesos Q1 Sprint 3

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>Assignee: Dominic Hamon
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
> {code}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileLostTask
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF'
> I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms
> I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns
> I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns
> I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 
> 2370ns
> I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 348ns
> I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery
> I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status
> I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a 
> broadcasted recover request
> I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from 
> a replica in 4 status
> I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3
> I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 733223ns
> I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3
> I0218 01:53:26.886714 13943 master.cpp:347] Master 
> 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 
> 67.195.81.187:57268
> I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing 
> authenticated frameworks to register
> I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing 
> authenticated slaves to register
> I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials'
> I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status
> I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled
> I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a 
> broadcasted recover request
> I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given
> I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical 
> allocator process
> I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from 
> a replica in 3 status
> I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is 
> master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918
> I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(183)@67.195.81.187:57268
> I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master!
> I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar
> I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: 
> scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268
> I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar
> I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(179)@67.195.81.187:57268
> I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to 
> allocate!
> I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 
> slaves in 60316ns
> I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1
> I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 921045ns
> I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1
> I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos 
> group
> I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated
> I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer
> I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 534122ns
> I0218 01:53:27.916967 13937 replica.cpp:344] Persisted promised to 1
> I0218 01:53:27.917795 13936 coordinator.cpp:229] Coordinator attemping to 
> fill missing position
> I0218 01:53:27.919147 13941 r

[jira] [Updated] (MESOS-2359) Expose slave's memory and cpu cgroup metrics

2015-02-23 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2359:
-
Component/s: twitter

> Expose slave's memory and cpu cgroup metrics
> 
>
> Key: MESOS-2359
> URL: https://issues.apache.org/jira/browse/MESOS-2359
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, twitter
>Reporter: Ian Downes
>Priority: Minor
>
> The slave can optionally be placed into its own cgroups (--slave_cgroups=). 
> If this is enabled, we should export the relevant metrics, either in 
> preference to or in addition to the process-based metrics.
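With cgroups v1, the memory and CPU figures mentioned above come from plain control files such as memory.usage_in_bytes and cpuacct.usage. Here is a minimal sketch of reading one. The mount point and the cgroup name "mesos_slave" are assumptions for illustration, and `cgroupFile`/`readCgroupCounter` are hypothetical helpers, not existing Mesos functions.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Builds the path of a control file, e.g.
// cgroupFile("/sys/fs/cgroup/memory", "mesos_slave", "memory.usage_in_bytes").
// The hierarchy mount point and cgroup name are deployment-specific assumptions.
std::string cgroupFile(const std::string& hierarchy, const std::string& cgroup,
                       const std::string& control) {
  return hierarchy + "/" + cgroup + "/" + control;
}

// Reads a single unsigned counter from a cgroup v1 control file such as
// memory.usage_in_bytes (bytes) or cpuacct.usage (nanoseconds of CPU time).
// Returns std::nullopt if the file cannot be opened or does not hold a number.
std::optional<std::uint64_t> readCgroupCounter(const std::string& path) {
  std::ifstream in(path);
  std::uint64_t value = 0;
  if (!(in >> value)) {
    return std::nullopt;
  }
  return value;
}
```

A gauge for each such file could then be exported next to the existing process-based metrics.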



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky

2015-02-18 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326659#comment-14326659
 ] 

Dominic Hamon commented on MESOS-2366:
--

That's curious. That suggests that the status update is not being received at 
the master, but I see it in the log.

We could remove the metrics from the test temporarily, but this suggests that 
a wait or a check is missing from the test itself.

> MasterSlaveReconciliationTest.ReconcileLostTask is flaky
> 
>
> Key: MESOS-2366
> URL: https://issues.apache.org/jira/browse/MESOS-2366
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>  Labels: flaky-test
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
> {code}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileLostTask
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF'
> I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms
> I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns
> I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns
> I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 
> 2370ns
> I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 348ns
> I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery
> I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status
> I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a 
> broadcasted recover request
> I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from 
> a replica in 4 status
> I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3
> I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 733223ns
> I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3
> I0218 01:53:26.886714 13943 master.cpp:347] Master 
> 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 
> 67.195.81.187:57268
> I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing 
> authenticated frameworks to register
> I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing 
> authenticated slaves to register
> I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials'
> I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status
> I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled
> I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a 
> broadcasted recover request
> I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given
> I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical 
> allocator process
> I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from 
> a replica in 3 status
> I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is 
> master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918
> I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(183)@67.195.81.187:57268
> I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master!
> I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar
> I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: 
> scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268
> I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar
> I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(179)@67.195.81.187:57268
> I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to 
> allocate!
> I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 
> slaves in 60316ns
> I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1
> I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 921045ns
> I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1
> I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos 
> group
> I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated
> I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer
> I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting

[jira] [Commented] (MESOS-2277) Document undocumented HTTP endpoints

2015-02-18 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326214#comment-14326214
 ] 

Dominic Hamon commented on MESOS-2277:
--

The idea is that metrics/snapshot will hold anything that uses the metrics 
part of libprocess: counters, gauges, timers, etc. state.json should reflect 
the state of the component. This isn't static data, given that the slave, 
framework, and task info is in there.
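To illustrate the distinction above, the sketch below uses a hypothetical, much simplified registry. It is not the libprocess metrics API, and the metric names in the usage are made up. Counters and gauges are flattened into a single name-to-value map, unlike the nested component state that state.json carries. Timers are left out to keep it short.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical registry: counters are values owned and bumped by the
// component, gauges are evaluated lazily each time a snapshot is taken.
class MetricsRegistry {
 public:
  void increment(const std::string& name, double by = 1) { counters_[name] += by; }

  void addGauge(const std::string& name, std::function<double()> read) {
    gauges_[name] = std::move(read);
  }

  // Flattens every metric into name -> value.
  std::map<std::string, double> snapshot() const {
    std::map<std::string, double> out(counters_.begin(), counters_.end());
    for (const auto& [name, read] : gauges_) {
      out[name] = read();  // gauges reflect the value at snapshot time
    }
    return out;
  }

 private:
  std::map<std::string, double> counters_;
  std::map<std::string, std::function<double()>> gauges_;
};
```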

> Document undocumented HTTP endpoints
> 
>
> Key: MESOS-2277
> URL: https://issues.apache.org/jira/browse/MESOS-2277
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Priority: Minor
>  Labels: documentation, newbie, starter
>
> Did a quick scan and we are missing documentation for a few endpoints:
> {code}
> files/browse.json
> files/read.json
> files/download.json
> files/debug.json
> master/roles.json
> master/state.json
> master/stats.json
> slave/state.json
> slave/stats.json
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (MESOS-1708) Using the wrong resource "name" should report a better error.

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-1708.
--
   Resolution: Fixed
Fix Version/s: 0.22.0

> Using the wrong resource "name" should report a better error.
> -
>
> Key: MESOS-1708
> URL: https://issues.apache.org/jira/browse/MESOS-1708
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>  Labels: newbie, twitter
> Fix For: 0.22.0
>
>
> If a scheduler launches a task using resources the master doesn't know about, 
> the task validator causes the task to fail, but the error message is not very 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (MESOS-2344) segfaults running make check from ev integration

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-2344.
--
   Resolution: Fixed
Fix Version/s: 0.22.0
 Assignee: Dominic Hamon  (was: Joris Van Remoortere)

I haven't been able to reproduce this since making these changes. We can open 
another issue, or reopen this one, if it's seen again.

> segfaults running make check from ev integration
> 
>
> Key: MESOS-2344
> URL: https://issues.apache.org/jira/browse/MESOS-2344
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>Priority: Blocker
> Fix For: 0.22.0
>
>
> Running make check on Ubuntu under gdb, I've seen a number of segfaults from 
> the {{process::EventLoop}}. Stack traces and debugging sessions below:
> {noformat}
> (gdb) bt
> #0  0x00789c71 in std::move&> (__t=...) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102
> #1  0x76821148 in std::_Tuple_impl<1, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f7273>) (
> this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270
> #2  0x768210a4 in std::_Tuple_impl<0, Duration, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f71f7>) (this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271
> #3  0x76821068 in std::tuple::tuple( type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4>) (
> this=0x7fffe00228d8) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542
> #4  0x76821014 in std::_Bind (*(Duration, 
> void (*)()))(const Duration &, void (*)())>::_Bind( build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>) 
> (this=0x7fffe00228d0, __b= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342
> #5  0x76820f86 in 
> std::_Function_base::_Base_manager 
> (*(Duration, void (*)()))(Duration const&, void (*)())> 
> >::_M_init_functor(std::_Any_data&, std::_Bind 
> (*(Duration, void (*)()))(Duration const&, void (*)())>&&, 
> std::integral_constant) (__functor=..., __f= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987
> #6  0x76820ab0 in 
> std::_Function_base::_Base_manager 
> (*(Duration, void (*)()))(Duration const&, void (*)())> 
> >::_M_init_functor(std::_Any_data&, std::_Bind 
> (*(Duration, void (*)()))(Duration const&, void (*)())>&&) (__functor=..., 
> __f= DIE 0x27f7115>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958
> #7  0x768208e6 in std::function 
> ()>::function (*(Duration, void 
> (*)()))(const Duration &, void (*)())>, 
> void>(std::_Bind (*(Duration, void (*)()))(const 
> Duration &, void (*)())>) (this=0x7fffe85ca9d0, __f=...)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451
> #8  0x7681fe55 in process::EventLoop::delay (duration=..., 
> function=0x76729580 ) at 
> ../../../3rdparty/libprocess/src/libev.cpp:98
> #9  0x7672a151 in process::tick () at 
> ../../../3rdparty/libprocess/src/clock.cpp:125
> #10 0x7681fcb2 in process::internal::handle_delay 
> (loop=0x77dd91f0 , timer=0x7fffe00279b0, revents=256)
> at ../../../3rdparty/libprocess/src/libev.cpp:64
> #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 
> ) at ev.c:2994
> #12 0x76860803 in ev_run (loop=0x77dd91f0 , 
> flags=) at ev.c:3394
> #13 0x7681fffb in ev_loop (loop=0x77dd91f0 , 
> flags=0) at 3rdparty/libev-4.15/ev.h:826
> #14 0x7681ff49 in process::EventLoop::run () at 
> ../../../3rdparty/libprocess/src/libev.cpp:114
> #15 0x721d2182 in start_thread (arg=0x7fffe85cb700) at 
> pthread_create.c:312
> #16 0x71eff00d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) frame 8
> #8  0x7681fe55 in process::EventLoop::delay (duration=..., 
> function=0x76729580 ) at 
> ../../../3rdparty/libprocess/src/libev.cpp:98
> 98    run_in_event_loop(
> (gdb) list
> 93  } // namespace internal {
> 94
> 95
> 96  void EventLoop::delay(const Duration& duration, void(*function)(void))
> 97  {
> 98    run_in_event_loop(
> 99      lambda::bind(&internal::delay, duration, function));
> 100 }
> 101
> 102
> (gdb)

[jira] [Commented] (MESOS-2244) RoutingTestINETSockets fails

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325013#comment-14325013
 ] 

Dominic Hamon commented on MESOS-2244:
--

I'll ask again: should we have libnl in-tree? cc [~idownes] [~vinodkone]

> RoutingTestINETSockets fails 
> -
>
> Key: MESOS-2244
> URL: https://issues.apache.org/jira/browse/MESOS-2244
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.10, libnl 3.2.25
> kernel version:
> 3.16.0-23-generic #31-Ubuntu SMP x86_64
>Reporter: Evelina Dumitrescu
>Assignee: Chi Zhang
>
> [ RUN  ] RoutingTest.INETSockets
> *** stack smashing detected ***: 
> /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated
> *** Aborted at 1421895912 (unix time) try "date -d @1421895912" if you are 
> using GNU date ***
> PC: @ 0x7f3566460d27 (unknown)
> *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from 
> PID 5683; stack trace: ***
> @ 0x7f35667fec90 (unknown)
> @ 0x7f3566460d27 (unknown)
> @ 0x7f3566462418 (unknown)
> @ 0x7f35664a29f4 (unknown)
> @ 0x7f35665365cc (unknown)
> @ 0x7f3566536570 (unknown)
> @ 0x7f3566226753 idiagnl_msg_parse
> @ 0x7f356622678b idiagnl_msg_parser
> @ 0x7f3565dac4c9 nl_cache_parse
> @ 0x7f3565dac51b update_msg_parser
> @ 0x7f3565db1fbf nl_recvmsgs_report
> @ 0x7f3565db2329 nl_recvmsgs
> @ 0x7f3565dab9c9 __cache_pickup
> @ 0x7f3565dac43d nl_cache_pickup
> @ 0x7f3565dac66e nl_cache_refill
> @ 0x7f3566226024 idiagnl_msg_alloc_cache
> @ 0x7f356a95f455 routing::diagnosis::socket::infos()
> @  0x114da90 RoutingTest_INETSockets_Test::TestBody()
> @  0x11e6957 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x11e151d 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x11c7adb testing::Test::Run()
> @  0x11c8253 testing::TestInfo::Run()
> @  0x11c87f6 testing::TestCase::Run()
> @  0x11cd987 testing::internal::UnitTestImpl::RunAllTests()
> @  0x11e7905 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x11e2304 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x11cc74a testing::UnitTest::Run()
> @   0xd7a4ad main
> @ 0x7f356644bec5 (unknown)
> @   0x91ccb9 (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-2361) Add metrics to status update manager to expose number of outstanding (un-ack'ed) status updates

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324975#comment-14324975
 ] 

Dominic Hamon commented on MESOS-2361:
--

The queue length can easily be exposed, as it already is on the Master and the 
scheduler driver.

See src/master/metrics.hpp:157-159.
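For the status update manager, a gauge along the lines described above could report the total number of un-acknowledged updates across all per-task streams. The sketch below is hypothetical: `StatusUpdateStreams` and its methods model the idea and are not the actual status update manager code.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>

// Hypothetical model of per-task status update streams: each stream holds
// updates that have been forwarded but not yet acknowledged.
class StatusUpdateStreams {
 public:
  void enqueue(const std::string& taskId, const std::string& update) {
    streams_[taskId].push_back(update);
  }

  // An acknowledgement retires the oldest pending update for the task.
  bool acknowledge(const std::string& taskId) {
    auto it = streams_.find(taskId);
    if (it == streams_.end() || it->second.empty()) {
      return false;
    }
    it->second.pop_front();
    if (it->second.empty()) {
      streams_.erase(it);  // drop empty streams so the stream count stays accurate
    }
    return true;
  }

  // Value a gauge could expose: total un-acknowledged updates.
  std::size_t outstanding() const {
    std::size_t total = 0;
    for (const auto& entry : streams_) {
      total += entry.second.size();
    }
    return total;
  }

  // Number of tasks that still have pending updates.
  std::size_t streams() const { return streams_.size(); }

 private:
  std::map<std::string, std::deque<std::string>> streams_;
};
```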

> Add metrics to status update manager to expose number of outstanding 
> (un-ack'ed) status updates
> ---
>
> Key: MESOS-2361
> URL: https://issues.apache.org/jira/browse/MESOS-2361
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>
> We have seen custom executors with a high volume of status updates cause 
> congestion on the slave due to framework unavailability (either from being 
> disconnected or from not processing status updates fast enough). As a first 
> step, it would be helpful to expose the status update stream/queue depths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-1708) Using the wrong resource "name" should report a better error.

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324890#comment-14324890
 ] 

Dominic Hamon commented on MESOS-1708:
--

I think the spirit of the issue is that we should differentiate between a 
request for a resource type that isn't known (due to a typo, or something) and 
a request for a resource type that is known but the amount of which can't be 
satisfied.

The former check doesn't exist, but I think it could be added by iterating 
through the resources in 'total' and checking that each name exists in the 
'offered' Resources.

Having said that, I think the existing 'Task uses more resources' error, which 
contains the stringified resources for both offered and total, should be 
enough to debug the issue.

Do you think it's worth the extra validation check?
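The extra validation described above could look roughly like the sketch below. It uses a hypothetical, simplified `Resource` struct rather than `mesos::Resources`, and `unknownResourceNames` is an illustrative name. It only reports names that are absent from the offer. A shortfall in a known resource is still left to the existing "Task uses more resources" error.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified resource type (not mesos::Resource).
struct Resource {
  std::string name;
  double value;
};

// Returns the names requested by the task that do not appear in the offer
// at all, e.g. a typo such as "cpu" instead of "cpus".
std::vector<std::string> unknownResourceNames(const std::vector<Resource>& total,
                                              const std::vector<Resource>& offered) {
  std::set<std::string> known;
  for (const Resource& r : offered) {
    known.insert(r.name);
  }

  std::vector<std::string> unknown;
  for (const Resource& r : total) {
    if (known.count(r.name) == 0) {
      unknown.push_back(r.name);
    }
  }
  return unknown;
}
```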

> Using the wrong resource "name" should report a better error.
> -
>
> Key: MESOS-1708
> URL: https://issues.apache.org/jira/browse/MESOS-1708
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>  Labels: newbie, twitter
>
> If a scheduler launches a task using resources the master doesn't know about, 
> the task validator causes the task to fail, but the error message is not very 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-1807) Disallow executors with cpu only or memory only resources

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324880#comment-14324880
 ] 

Dominic Hamon commented on MESOS-1807:
--

Can you update the TODO at src/master/validation.cpp:322, which still references 
0.22?

> Disallow executors with cpu only or memory only resources
> -
>
> Key: MESOS-1807
> URL: https://issues.apache.org/jira/browse/MESOS-1807
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>  Labels: newbie
>
> Currently the master allows executors to be launched with either only cpus or 
> only memory, but we shouldn't allow that.
> This is because an executor is an actual unix process that is launched by the 
> slave. If an executor doesn't specify cpus, what should the cpu limits be 
> for that executor when there are no tasks running on it? If no cpu limits are 
> set, then it might starve other executors/tasks on the slave, violating 
> isolation guarantees. The same goes for memory. Moreover, the current 
> containerizer/isolator code will throw failures when using such an executor, 
> e.g., when the last task on the executor finishes and Containerizer::update() 
> is called with 0 cpus or 0 mem.
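A master-side check for the rule above might be sketched as follows. `Resource` and `validateExecutorResources` are hypothetical and simplified, not the code in src/master/validation.cpp. As an assumption beyond the issue text, the sketch also rejects an executor that specifies neither cpus nor mem.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified resource type (not mesos::Resource).
struct Resource {
  std::string name;
  double value;
};

// Returns an error message unless the executor specifies both a positive
// amount of "cpus" and a positive amount of "mem"; otherwise returns "".
std::string validateExecutorResources(const std::vector<Resource>& resources) {
  double cpus = 0;
  double mem = 0;
  for (const Resource& r : resources) {
    if (r.name == "cpus") {
      cpus += r.value;
    } else if (r.name == "mem") {
      mem += r.value;
    }
  }

  if (cpus <= 0 || mem <= 0) {
    return "Executor must specify both 'cpus' and 'mem' (got cpus=" +
           std::to_string(cpus) + ", mem=" + std::to_string(mem) + ")";
  }
  return "";
}
```

Rejecting this at launch time avoids the later failure in Containerizer::update() when the last task finishes and the executor is left with 0 cpus or 0 mem.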



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-1708) Using the wrong resource "name" should report a better error.

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324866#comment-14324866
 ] 

Dominic Hamon commented on MESOS-1708:
--

I think this was fixed as part of b22d7add. [~jieyu], do you agree?

> Using the wrong resource "name" should report a better error.
> -
>
> Key: MESOS-1708
> URL: https://issues.apache.org/jira/browse/MESOS-1708
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>  Labels: newbie, twitter
>
> If a scheduler launches a task using resources the master doesn't know about, 
> the task validator causes the task to fail, but the error message is not very 
> helpful.




[jira] [Resolved] (MESOS-2185) slave state endpoint does not contain all resources in the resources field

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-2185.
--
   Resolution: Fixed
Fix Version/s: 0.22.0

commit 73ddc21f44e65499d4179bb15edf97243c8ab18c (HEAD, origin/master, 
origin/HEAD, master)
Author: Joerg Schad 
Commit: Dominic Hamon 

Included all resources in state endpoint.

Review: https://reviews.apache.org/r/31082

> slave state endpoint does not contain all resources in the resources field
> --
>
> Key: MESOS-2185
> URL: https://issues.apache.org/jira/browse/MESOS-2185
> Project: Mesos
>  Issue Type: Bug
>  Components: json api, slave
>Affects Versions: 0.21.0
> Environment: Centos 6.5 / Centos 6.6
>Reporter: Henning Schmiedehausen
>Assignee: Joerg Schad
>  Labels: mesosphere
> Fix For: 0.22.0
>
>
> Fetching the status for a slave from /state.json yields:
> {noformat}
>   "resources": {
>     "ports": "[31000-32000]",
>     "mem": 512,
>     "disk": 33659,
>     "cpus": 1
>   }
> {noformat}
> but the flags section lists:
> {noformat}
>   "flags": {
>     "resources": "cpus:1;mem:512;ports:[31000-32000];set:{label_a,label_b,label_c,label_d};range:[0-1000];scalar:108;numbers:{4,8,15,16,23,42}",
>   }
> {noformat}
> So there are additional resources. These resources show up in the offers sent 
> from that slave to the frameworks, and the frameworks can use and 
> consume them.
> This may just be a reporting issue with the state.json endpoint.
> https://gist.github.com/hgschmie/0dc4f599bb0ff2e815ed is the full response.
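To see which flag-declared resources are missing from the endpoint, the `--resources` string can be split into names and compared with the keys of the `resources` object. A minimal sketch (the function name is hypothetical) that handles only the plain `name:value;name:value` form shown in this report, without roles or JSON input:

```cpp
#include <map>
#include <sstream>
#include <string>

// Splits a "name:value;name:value" resources flag into name -> raw value text.
// Only the plain form shown in the report is handled.
std::map<std::string, std::string> parseResourcesFlag(const std::string& flag) {
  std::map<std::string, std::string> result;
  std::istringstream in(flag);
  std::string token;
  while (std::getline(in, token, ';')) {
    const auto colon = token.find(':');
    if (colon == std::string::npos) continue;  // skip malformed entries
    result[token.substr(0, colon)] = token.substr(colon + 1);
  }
  return result;
}
```

Applied to the flag above, this yields seven names; `set`, `range`, `scalar` and `numbers` are the ones absent from the endpoint's `resources` object.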




[jira] [Commented] (MESOS-2244) RoutingTestINETSockets fails

2015-02-17 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324677#comment-14324677
 ] 

Dominic Hamon commented on MESOS-2244:
--

Is this an instance of a mismatched libnl header/kernel?

> RoutingTestINETSockets fails 
> -
>
> Key: MESOS-2244
> URL: https://issues.apache.org/jira/browse/MESOS-2244
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.10, libnl 3.2.25
>Reporter: Evelina Dumitrescu
>Assignee: Chi Zhang
>
> [ RUN  ] RoutingTest.INETSockets
> *** stack smashing detected ***: 
> /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated
> *** Aborted at 1421895912 (unix time) try "date -d @1421895912" if you are 
> using GNU date ***
> PC: @ 0x7f3566460d27 (unknown)
> *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from 
> PID 5683; stack trace: ***
> @ 0x7f35667fec90 (unknown)
> @ 0x7f3566460d27 (unknown)
> @ 0x7f3566462418 (unknown)
> @ 0x7f35664a29f4 (unknown)
> @ 0x7f35665365cc (unknown)
> @ 0x7f3566536570 (unknown)
> @ 0x7f3566226753 idiagnl_msg_parse
> @ 0x7f356622678b idiagnl_msg_parser
> @ 0x7f3565dac4c9 nl_cache_parse
> @ 0x7f3565dac51b update_msg_parser
> @ 0x7f3565db1fbf nl_recvmsgs_report
> @ 0x7f3565db2329 nl_recvmsgs
> @ 0x7f3565dab9c9 __cache_pickup
> @ 0x7f3565dac43d nl_cache_pickup
> @ 0x7f3565dac66e nl_cache_refill
> @ 0x7f3566226024 idiagnl_msg_alloc_cache
> @ 0x7f356a95f455 routing::diagnosis::socket::infos()
> @  0x114da90 RoutingTest_INETSockets_Test::TestBody()
> @  0x11e6957 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x11e151d 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x11c7adb testing::Test::Run()
> @  0x11c8253 testing::TestInfo::Run()
> @  0x11c87f6 testing::TestCase::Run()
> @  0x11cd987 testing::internal::UnitTestImpl::RunAllTests()
> @  0x11e7905 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x11e2304 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x11cc74a testing::UnitTest::Run()
> @   0xd7a4ad main
> @ 0x7f356644bec5 (unknown)
> @   0x91ccb9 (unknown)




[jira] [Updated] (MESOS-2244) RoutingTestINETSockets fails

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2244:
-
Assignee: Chi Zhang

> RoutingTestINETSockets fails 
> -
>
> Key: MESOS-2244
> URL: https://issues.apache.org/jira/browse/MESOS-2244
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 14.10, libnl 3.2.25
>Reporter: Evelina Dumitrescu
>Assignee: Chi Zhang
>




[jira] [Updated] (MESOS-2289) Design doc for the HTTP API

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2289:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Design doc for the HTTP API
> ---
>
> Key: MESOS-2289
> URL: https://issues.apache.org/jira/browse/MESOS-2289
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> This tracks the design of the HTTP API.




[jira] [Updated] (MESOS-1708) Using the wrong resource "name" should report a better error.

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1708:
-
Sprint: Twitter Mesos Q1 Sprint 3

> Using the wrong resource "name" should report a better error.
> -
>
> Key: MESOS-1708
> URL: https://issues.apache.org/jira/browse/MESOS-1708
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>  Labels: newbie, twitter
>
> If a scheduler launches a task using resources the master doesn't know about, 
> the task validator causes the task to fail, but the error message is not very 
> helpful.




[jira] [Updated] (MESOS-998) Slave should wait until Containerizer::update() completes successfully

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-998:

Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Slave should wait until Containerizer::update() completes successfully
> --
>
> Key: MESOS-998
> URL: https://issues.apache.org/jira/browse/MESOS-998
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.18.0, 0.19.0, 0.20.0, 0.21.0, 0.19.1, 0.20.1, 0.21.1
>Reporter: Ian Downes
>Assignee: Jie Yu
>
> Container resources are updated in several places in the slave, and we don't 
> check that the update was successful or even wait until it completes.
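A minimal sketch of "wait and check" using standard-library futures as a stand-in for libprocess futures. `updateContainer` is hypothetical and only simulates an isolator rejecting a zero-sized update; it is not the real `Containerizer::update()` signature.

```cpp
#include <chrono>
#include <future>
#include <string>

// Hypothetical stand-in for an asynchronous Containerizer::update(); the real
// Mesos code uses libprocess futures rather than std::future.
std::future<bool> updateContainer(double cpus, double mem) {
  return std::async(std::launch::async, [cpus, mem] {
    // Simulate isolators that reject a zero-sized update.
    return cpus > 0.0 && mem > 0.0;
  });
}

// Waits (up to a timeout) for the update and reports whether it succeeded,
// instead of firing the update and moving on. Note that a std::async future
// still blocks in its destructor if the task has not finished.
std::string applyUpdate(double cpus, double mem) {
  std::future<bool> result = updateContainer(cpus, mem);
  if (result.wait_for(std::chrono::seconds(5)) != std::future_status::ready) {
    return "update timed out";
  }
  return result.get() ? "update succeeded" : "update failed";
}
```

On older toolchains this may need to be built with `-pthread`.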




[jira] [Updated] (MESOS-2103) Expose number and state of threads in a container

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2103:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Expose number and state of threads in a container
> -
>
> Key: MESOS-2103
> URL: https://issues.apache.org/jira/browse/MESOS-2103
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Chi Zhang
>  Labels: twitter
>
> The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, 
> cpus_throttled_time) are difficult to interpret.
> 1) nr_throttled is the number of intervals where *any* throttling occurred
> 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in 
> the Linux sense).
> For example, in a typical 60 second sampling interval: nr_periods = 600, 
> nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be 
> much higher than (60/600) * 60 = 6 seconds if there is more than one task 
> that is runnable but throttled. *Each* throttled task contributes to the 
> total throttled time.
> Small test to demonstrate throttled_time > nr_periods * quota_interval:
> 5 x {{'openssl speed'}} running with quota=100ms:
> {noformat}
> cat cpu.stat && sleep 1 && cat cpu.stat
> nr_periods 3228
> nr_throttled 1276
> throttled_time 528843772540
> nr_periods 3238
> nr_throttled 1286
> throttled_time 531668964667
> {noformat}
> All 10 intervals were throttled (100%), for a total throttled time of 2.8 seconds 
> within 1 second ("more than 100%" of the time interval).
> It would be helpful to expose the number and state of tasks in the container 
> cgroup. This would be at a very coarse granularity but would give some 
> guidance.
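The arithmetic in the small test above can be reproduced by differencing two cpu.stat samples. A minimal sketch (function and struct names are hypothetical) that parses the `key value` lines and converts `throttled_time`, which is in nanoseconds, to seconds:

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parses cgroup cpu.stat text ("key value" per line) into a map.
std::map<std::string, uint64_t> parseCpuStat(const std::string& text) {
  std::map<std::string, uint64_t> stats;
  std::istringstream in(text);
  std::string key;
  uint64_t value;
  while (in >> key >> value) {
    stats[key] = value;
  }
  return stats;
}

struct ThrottleDelta {
  uint64_t periods;         // CFS periods elapsed between the samples
  uint64_t throttled;       // periods in which any throttling occurred
  double throttledSeconds;  // summed across all throttled tasks
};

// Differences two samples; throttled_time is reported in nanoseconds.
ThrottleDelta diff(const std::map<std::string, uint64_t>& before,
                   const std::map<std::string, uint64_t>& after) {
  return {after.at("nr_periods") - before.at("nr_periods"),
          after.at("nr_throttled") - before.at("nr_throttled"),
          (after.at("throttled_time") - before.at("throttled_time")) / 1e9};
}
```

For the samples above this gives 10 periods, 10 throttled periods and about 2.825 s of throttled time, i.e. more throttled time than the roughly 1 s of wall-clock time covered by those 10 periods.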




[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2136:
-
Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos 
Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: 
Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 
1, Twitter Mesos Q1 Sprint 2)

> Expose per-cgroup memory pressure
> -
>
> Key: MESOS-2136
> URL: https://issues.apache.org/jira/browse/MESOS-2136
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Ian Downes
>Assignee: Chi Zhang
>  Labels: twitter
>
> The cgroup memory controller can provide information on the memory pressure 
> of a cgroup. This takes the form of event-based notifications, where events 
> at three levels (low, medium, critical) are generated when the kernel takes 
> specific actions to allocate memory. This signal is probably more informative 
> than comparing memory usage to the memory limit.
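For reference, the cgroup v1 memory controller exposes these notifications through an eventfd registered via `cgroup.event_control` against `memory.pressure_level`. A minimal sketch of that registration, assuming a cgroup v1 hierarchy; the struct and function names are hypothetical and this is not how the Mesos isolator is implemented:

```cpp
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

#include <string>

// Descriptors for one registration; both are -1 if registration failed.
struct PressureListener {
  int eventFd = -1;     // read 8 bytes from here to get the event count
  int pressureFd = -1;  // memory.pressure_level, kept open alongside eventFd
};

// Text written to cgroup.event_control: "<event_fd> <pressure_level_fd> <level>".
std::string eventControlLine(int eventFd, int pressureFd, const std::string& level) {
  return std::to_string(eventFd) + " " + std::to_string(pressureFd) + " " + level;
}

// Registers for `level` ("low", "medium" or "critical") notifications on a
// cgroup v1 memory cgroup directory.
PressureListener registerPressureListener(const std::string& cgroupDir,
                                          const std::string& level) {
  PressureListener l;
  l.pressureFd = open((cgroupDir + "/memory.pressure_level").c_str(), O_RDONLY);
  l.eventFd = eventfd(0, 0);
  const int controlFd =
      open((cgroupDir + "/cgroup.event_control").c_str(), O_WRONLY);

  bool ok = l.pressureFd >= 0 && l.eventFd >= 0 && controlFd >= 0;
  if (ok) {
    const std::string line = eventControlLine(l.eventFd, l.pressureFd, level);
    ok = write(controlFd, line.c_str(), line.size()) ==
         static_cast<ssize_t>(line.size());
  }
  if (controlFd >= 0) close(controlFd);
  if (!ok) {
    if (l.eventFd >= 0) close(l.eventFd);
    if (l.pressureFd >= 0) close(l.pressureFd);
    return PressureListener{};
  }
  return l;
}
```

A caller would then block on (or poll) `eventFd`; each 8-byte `read()` returns the number of notifications since the previous read.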




[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2332:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Report per-container metrics for network bandwidth throttling
> -
>
> Key: MESOS-2332
> URL: https://issues.apache.org/jira/browse/MESOS-2332
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Paul Brett
>Assignee: Paul Brett
>  Labels: features, twitter
>
> Export metrics from the network isolator to identify the scope and duration of 
> container throttling.  
> Packet loss can be identified from the overlimits and requeues fields of the 
> htb qdisc report for the virtual interface, e.g.
> {noformat}
> $ tc -s -d qdisc show dev mesos19223
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 
> 1 1 1
>  Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc ingress : parent :fff1 
>  Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 
> requeues 0)
>  backlog 0b 0p requeues 0
> {noformat}
> Note that since a packet can be examined multiple times before transmission, 
> overlimits can exceed total packets sent.  
> Add these to the port_mapping isolator usage() and the container statistics 
> protobuf. Carefully consider the naming (especially tx/rx) and commenting of the 
> protobuf fields so it's clear what these represent and how they differ 
> from the existing dropped packet counts from the network stack.
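The counters above could be scraped from one `Sent ...` line of `tc -s qdisc show` output. A minimal sketch, assuming each `Sent` line is on a single line as tc prints it (the report's copy is wrapped); the struct and function names are hypothetical, not the port_mapping isolator's code:

```cpp
#include <cstdio>
#include <optional>
#include <string>

// Counters from one "Sent ..." line of `tc -s qdisc show` output.
struct QdiscStats {
  unsigned long long bytes = 0;
  unsigned long long packets = 0;
  unsigned long long dropped = 0;
  unsigned long long overlimits = 0;
  unsigned long long requeues = 0;
};

// Parses a line such as
// " Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)".
std::optional<QdiscStats> parseSentLine(const std::string& line) {
  QdiscStats s;
  const int matched = std::sscanf(
      line.c_str(),
      " Sent %llu bytes %llu pkt (dropped %llu, overlimits %llu requeues %llu)",
      &s.bytes, &s.packets, &s.dropped, &s.overlimits, &s.requeues);
  if (matched != 5) return std::nullopt;  // not a "Sent" line
  return s;
}
```

For the ingress qdisc above this recovers `dropped 2044879`, while lines such as `backlog 0b 0p requeues 0` are rejected.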




[jira] [Updated] (MESOS-2123) Document changes in C++ Resources API in CHANGELOG.

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2123:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Document changes in C++ Resources API in CHANGELOG.
> ---
>
> Key: MESOS-2123
> URL: https://issues.apache.org/jira/browse/MESOS-2123
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>  Labels: twitter
>
> With the refactor introduced in MESOS-1974, we need to document those API 
> changes in CHANGELOG.




[jira] [Updated] (MESOS-1690) Expose metric for container destroy failures

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1690:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Expose metric for container destroy failures
> 
>
> Key: MESOS-1690
> URL: https://issues.apache.org/jira/browse/MESOS-1690
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.20.0
>Reporter: Ian Downes
>Assignee: Vinod Kone
>
> Increment counter when container destroy fails.




[jira] [Updated] (MESOS-2031) Manage persistent directories on slave.

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2031:
-
Sprint: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos 
Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter 
Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter Mesos Q4 Sprint 3, 
Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 
6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2)

> Manage persistent directories on slave.
> ---
>
> Key: MESOS-2031
> URL: https://issues.apache.org/jira/browse/MESOS-2031
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> Whenever a slave sees a persistent disk resource (in ExecutorInfo or 
> TaskInfo) that is new to it, it will create a persistent directory in which 
> tasks can store persistent data.
> After creating the directory, the slave needs to:
> 1) symlink it into the executor sandbox so that tasks/executors can see it
> 2) garbage collect it once it is released by the framework
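Step 1 can be sketched with plain POSIX calls. The directory layout, link name and `exposePersistentDir` function are hypothetical (not the slave's real paths), `mkdir` creates only the last path component, and garbage collection (step 2) is not covered:

```cpp
#include <sys/stat.h>
#include <unistd.h>

#include <cerrno>
#include <string>

// Creates `persistentDir` if it does not exist yet and links it into the
// executor sandbox as `sandboxDir/linkName`, so tasks see it at a stable
// relative path. Returns true on success.
bool exposePersistentDir(const std::string& persistentDir,
                         const std::string& sandboxDir,
                         const std::string& linkName) {
  if (mkdir(persistentDir.c_str(), 0755) != 0 && errno != EEXIST) {
    return false;
  }
  const std::string link = sandboxDir + "/" + linkName;
  return symlink(persistentDir.c_str(), link.c_str()) == 0;
}
```

Keeping the data outside the sandbox and exposing it through a symlink means the sandbox can be cleaned up independently of the persistent directory.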




[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-02-17 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2144:
-
Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3  (was: Twitter 
Mesos Q1 Sprint 2)

> Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
> ---
>
> Key: MESOS-2144
> URL: https://issues.apache.org/jira/browse/MESOS-2144
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Cody Maloney
>Assignee: Yan Xu
>Priority: Minor
>  Labels: flaky, twitter
>
> Occurred during the review bot's review of: 
> https://reviews.apache.org/r/28262/#review62333
> The review doesn't touch code related to the test (and doesn't break 
> libprocess in general).
> [ RUN  ] ExamplesTest.LowLevelSchedulerPthread
> ../../src/tests/script.cpp:83: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
> The test 




[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling

2015-02-11 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2332:
-
Story Points: 5

> Report per-container metrics for network bandwidth throttling
> -
>
> Key: MESOS-2332
> URL: https://issues.apache.org/jira/browse/MESOS-2332
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Reporter: Paul Brett
>Assignee: Paul Brett
>  Labels: features, twitter
>




[jira] [Commented] (MESOS-2344) segfaults running make check from ev integration

2015-02-11 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316896#comment-14316896
 ] 

Dominic Hamon commented on MESOS-2344:
--

commit 95c448f77731034114183fc5f5bf6e040d4c0f5d (HEAD, origin/master, 
origin/HEAD, nonpod.clock, master)
Author: Dominic Hamon 
Commit: Dominic Hamon 

Remove more non-pod statics from clock

Review: https://reviews.apache.org/r/30886



> segfaults running make check from ev integration
> 
>
> Key: MESOS-2344
> URL: https://issues.apache.org/jira/browse/MESOS-2344
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Dominic Hamon
>Assignee: Joris Van Remoortere
>Priority: Blocker
>
> Running make check on Ubuntu under gdb, I've seen a number of segfaults from 
> the {{process::EventLoop}}. Stack traces and debugging sessions below:
> {noformat}
> (gdb) bt
> #0  0x00789c71 in std::move&> (__t=...) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102
> #1  0x76821148 in std::_Tuple_impl<1, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f7273>) (
> this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270
> #2  0x768210a4 in std::_Tuple_impl<0, Duration, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f71f7>) (this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271
> #3  0x76821068 in std::tuple::tuple( type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4>) (
> this=0x7fffe00228d8) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542
> #4  0x76821014 in std::_Bind (*(Duration, 
> void (*)()))(const Duration &, void (*)())>::_Bind( build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>) 
> (this=0x7fffe00228d0, __b= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342
> #5  0x76820f86 in 
> std::_Function_base::_Base_manager 
> (*(Duration, void (*)()))(Duration const&, void (*)())> 
> >::_M_init_functor(std::_Any_data&, std::_Bind 
> (*(Duration, void (*)()))(Duration const&, void (*)())>&&, 
> std::integral_constant) (__functor=..., __f= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987
> #6  0x76820ab0 in 
> std::_Function_base::_Base_manager 
> (*(Duration, void (*)()))(Duration const&, void (*)())> 
> >::_M_init_functor(std::_Any_data&, std::_Bind 
> (*(Duration, void (*)()))(Duration const&, void (*)())>&&) (__functor=..., 
> __f= DIE 0x27f7115>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958
> #7  0x768208e6 in std::function 
> ()>::function (*(Duration, void 
> (*)()))(const Duration &, void (*)())>, 
> void>(std::_Bind (*(Duration, void (*)()))(const 
> Duration &, void (*)())>) (this=0x7fffe85ca9d0, __f=...)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451
> #8  0x7681fe55 in process::EventLoop::delay (duration=..., 
> function=0x76729580 ) at 
> ../../../3rdparty/libprocess/src/libev.cpp:98
> #9  0x7672a151 in process::tick () at 
> ../../../3rdparty/libprocess/src/clock.cpp:125
> #10 0x7681fcb2 in process::internal::handle_delay 
> (loop=0x77dd91f0 , timer=0x7fffe00279b0, revents=256)
> at ../../../3rdparty/libprocess/src/libev.cpp:64
> #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 
> ) at ev.c:2994
> #12 0x76860803 in ev_run (loop=0x77dd91f0 , 
> flags=) at ev.c:3394
> #13 0x7681fffb in ev_loop (loop=0x77dd91f0 , 
> flags=0) at 3rdparty/libev-4.15/ev.h:826
> #14 0x7681ff49 in process::EventLoop::run () at 
> ../../../3rdparty/libprocess/src/libev.cpp:114
> #15 0x721d2182 in start_thread (arg=0x7fffe85cb700) at 
> pthread_create.c:312
> #16 0x71eff00d in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) frame 8
> #8  0x7681fe55 in process::EventLoop::delay (duration=..., 
> function=0x76729580 ) at 
> ../../../3rdparty/libprocess/src/libev.cpp:98
> 98run_in_event_loop(
> (gdb) list
> 93  } // namespace internal {
> 94
> 95
> 96  void EventLoop::delay(const Duration& duration, void(*function)(void))
> 97  {
> 98run_in_event_loop(
> 99lambda::bind(&internal::delay, dura

[jira] [Comment Edited] (MESOS-2344) segfaults running make check from ev integration

2015-02-11 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316853#comment-14316853
 ] 

Dominic Hamon edited comment on MESOS-2344 at 2/11/15 7:45 PM:
---

a different one now:

{noformat}
(gdb) bt
#0  boost::range_detail::range_end > > > const> (c=...) at 
3rdparty/boost-1.53.0/boost/range/end.hpp:44
#1  0x7680c665 in boost::range_adl_barrier::end > > > > (r=...) at 
3rdparty/boost-1.53.0/boost/range/end.hpp:113
#2  0x7680c4f5 in boost::foreach_detail_::end > > >, mpl_::bool_ > 
(col=...) at 3rdparty/boost-1.53.0/boost/foreach.hpp:714
#3  0x768096ae in multihashmap > > >::keys (this=0x7fffdc009878) 
at ../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp:74
#4  0x7680911e in process::ReaperProcess::wait (this=0x7fffdc009870) at 
../../../3rdparty/libprocess/src/reap.cpp:82
#5  0x7680a968 in operator() (this=0x7fffe0004180, 
process=0x7fffdc0098a8) at 
../../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:78
#6  0x7680a612 in std::_Function_handler(process::PID 
const&, void 
(process::ReaperProcess::*)())::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) (__functor=..., 
__args=0x7fffdc0098a8) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2071
#7  0x767b4388 in std::function::operator()(process::ProcessBase*) const 
(this=0x7fffe0029f00, __args=0x7fffdc0098a8) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2464
#8  0x767a31b4 in process::ProcessBase::visit (this=0x7fffdc0098a8, 
event=...) at ../../../3rdparty/libprocess/src/process.cpp:2764
#9  0x767ece5e in process::DispatchEvent::visit (this=0x7fffe0010a90, 
visitor=0x7fffdc0098a8) at 
../../../3rdparty/libprocess/include/process/event.hpp:141
#10 0x008cb061 in process::ProcessBase::serve (this=0x7fffdc0098a8, 
event=...) at ../../3rdparty/libprocess/include/process/process.hpp:39
#11 0x7679355d in process::ProcessManager::resume (this=0x3334bb0, 
process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/src/process.cpp:2238
#12 0x76792d8e in process::schedule (arg=0x0) at 
../../../3rdparty/libprocess/src/process.cpp:655
#13 0x721b5182 in start_thread (arg=0x7fffe9dce700) at 
pthread_create.c:312
#14 0x71ee200d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

Though this might be related, if the DispatchEvent holds a function object 
that has been destroyed. Are there any other non-POD static function objects 
around the place?
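For illustration, one common C++ way to take a non-POD static function object out of static destruction is "construct on first use, never destroy". This is a generic sketch of the idiom with a hypothetical `describe` function, not the change made in commit 95c448f:

```cpp
#include <functional>
#include <string>

// Risky pattern: a namespace-scope non-POD static. Its destructor runs during
// static destruction at exit, while a background thread (such as an event
// loop) may still call through it:
//   static std::function<std::string()> describe = [] { return "tick"; };

// Safer pattern: construct on first use and never destroy. Only the pointer
// has static storage, and it is trivially destructible, so nothing is torn
// down at exit. Function-local static initialization is thread-safe in C++11.
const std::function<std::string()>& describe() {
  static const auto* fn = new std::function<std::string()>([] {
    return std::string("tick");
  });
  return *fn;
}
```

The heap object is intentionally leaked; the trade-off is a one-time allocation that leak checkers may report, in exchange for never running its destructor while other threads may still use it.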



[jira] [Commented] (MESOS-2344) segfaults running make check from ev integration

2015-02-11 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316853#comment-14316853
 ] 

Dominic Hamon commented on MESOS-2344:
--

a different one now:

{noformat}
(gdb) bt
#0  boost::range_detail::range_end > > > const> (c=...) at 
3rdparty/boost-1.53.0/boost/range/end.hpp:44
#1  0x7680c665 in boost::range_adl_barrier::end > > > > (r=...) at 
3rdparty/boost-1.53.0/boost/range/end.hpp:113
#2  0x7680c4f5 in boost::foreach_detail_::end > > >, mpl_::bool_ > 
(col=...) at 3rdparty/boost-1.53.0/boost/foreach.hpp:714
#3  0x768096ae in multihashmap > > >::keys (this=0x7fffdc009878) 
at ../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp:74
#4  0x7680911e in process::ReaperProcess::wait (this=0x7fffdc009870) at 
../../../3rdparty/libprocess/src/reap.cpp:82
#5  0x7680a968 in operator() (this=0x7fffe0004180, 
process=0x7fffdc0098a8) at 
../../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:78
#6  0x7680a612 in std::_Function_handler(process::PID 
const&, void 
(process::ReaperProcess::*)())::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) (__functor=..., 
__args=0x7fffdc0098a8) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2071
#7  0x767b4388 in std::function::operator()(process::ProcessBase*) const 
(this=0x7fffe0029f00, __args=0x7fffdc0098a8) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2464
#8  0x767a31b4 in process::ProcessBase::visit (this=0x7fffdc0098a8, 
event=...) at ../../../3rdparty/libprocess/src/process.cpp:2764
#9  0x767ece5e in process::DispatchEvent::visit (this=0x7fffe0010a90, 
visitor=0x7fffdc0098a8) at 
../../../3rdparty/libprocess/include/process/event.hpp:141
#10 0x008cb061 in process::ProcessBase::serve (this=0x7fffdc0098a8, 
event=...) at ../../3rdparty/libprocess/include/process/process.hpp:39
#11 0x7679355d in process::ProcessManager::resume (this=0x3334bb0, 
process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/src/process.cpp:2238
#12 0x76792d8e in process::schedule (arg=0x0) at 
../../../3rdparty/libprocess/src/process.cpp:655
#13 0x721b5182 in start_thread (arg=0x7fffe9dce700) at 
pthread_create.c:312
#14 0x71ee200d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

> segfaults running make check from ev integration
> 
>
> Key: MESOS-2344
> URL: https://issues.apache.org/jira/browse/MESOS-2344
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Dominic Hamon
>Assignee: Joris Van Remoortere
>Priority: Blocker
>
> Running make check on Ubuntu under gdb, I've seen a number of segfaults from 
> the {{process::EventLoop}}. Stack traces and debugging sessions below:
> {noformat}
> (gdb) bt
> #0  0x00789c71 in std::move&> (__t=...) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102
> #1  0x76821148 in std::_Tuple_impl<1, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f7273>) (
> this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270
> #2  0x768210a4 in std::_Tuple_impl<0, Duration, void 
> (*)()>::_Tuple_impl( 0x27e516d, DIE 0x27f71f7>) (this=0x7fffe00228d8, __in= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271
> #3  0x76821068 in std::tuple::tuple( type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4>) (
> this=0x7fffe00228d8) at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542
> #4  0x76821014 in std::_Bind (*(Duration, 
> void (*)()))(const Duration &, void (*)())>::_Bind( build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>) 
> (this=0x7fffe00228d0, __b= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342
> #5  0x76820f86 in 
> std::_Function_base::_Base_manager 
> (*(Duration, void (*)()))(Duration const&, void (*)())> 
> >::_M_init_functor(std::_Any_data&, std::_Bind 
> (*(Duration, void (*)()))(Duration const&, void (*)())>&&, 
> std::integral_constant) (__functor=..., __f= build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b>)
> at 
> /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987
> #6  0x76820ab0 in 
> std::_Function_base::_Base_manager 
> (*(Duration, v

[jira] [Commented] (MESOS-2344) segfaults running make check from ev integration

2015-02-10 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315292#comment-14315292
 ] 

Dominic Hamon commented on MESOS-2344:
--

cc [~jvanremoortere] [~benjaminhindman]

> segfaults running make check from ev integration
> 
>
> Key: MESOS-2344
> URL: https://issues.apache.org/jira/browse/MESOS-2344
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Dominic Hamon
>

[jira] [Commented] (MESOS-1403) Segfault when starting a slave locally.

2015-02-10 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315291#comment-14315291
 ] 

Dominic Hamon commented on MESOS-1403:
--

cc [~benjaminhindman] [~jvanremoortere]

> Segfault when starting a slave locally.
> ---
>
> Key: MESOS-1403
> URL: https://issues.apache.org/jira/browse/MESOS-1403
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>
> This is from the build directory on a CentOS machine.
> {noformat}
> [bmahler@foobar build]$ sudo ./bin/mesos-slave.sh --master=localhost:5050
> [sudo] password for bmahler:
> I0522 01:01:02.639114  4605 main.cpp:126] Build: 2014-05-06 22:08:34 by root
> I0522 01:01:02.639277  4605 main.cpp:128] Version: 0.19.0
> I0522 01:01:02.639312  4605 mesos_containerizer.cpp:124] Using isolation: 
> posix/cpu,posix/mem
> I0522 01:01:02.642699  4605 main.cpp:149] Starting Mesos slave
> I0522 01:01:02.644693  4631 slave.cpp:143] Slave started on 1)@IP:5051
> I0522 01:01:02.645560  4631 slave.cpp:255] Slave resources: cpus(*):24; 
> mem(*):71322; disk(*):454895; ports(*):[31000-32000]
> I0522 01:01:02.647763  4631 slave.cpp:283] Slave hostname: foobar
> I0522 01:01:02.647790  4631 slave.cpp:284] Slave checkpoint: true
> I0522 01:01:02.651803  4625 state.cpp:33] Recovering state from 
> '/tmp/mesos/meta'
> I0522 01:01:02.653393  4625 status_update_manager.cpp:193] Recovering status 
> update manager
> I0522 01:01:02.654024  4643 mesos_containerizer.cpp:281] Recovering 
> containerizer
> I0522 01:01:02.655377  4639 slave.cpp:2988] Finished recovery
> I0522 01:01:02.656368  4639 slave.cpp:536] New master detected at 
> master@127.0.0.1:5050
> I0522 01:01:02.656682  4639 slave.cpp:572] No credentials provided. 
> Attempting to register without authentication
> I0522 01:01:02.656744  4629 status_update_manager.cpp:167] New master 
> detected at master@127.0.0.1:5050
> I0522 01:01:02.656754  4639 slave.cpp:585] Detecting new master
> *** Aborted at 1400720462 (unix time) try "date -d @1400720462" if you are 
> using GNU date ***
> I0522 01:01:02.656982  4639 slave.cpp:2194] master@127.0.0.1:5050 exited
> W0522 01:01:02.657004  4639 slave.cpp:2197] Master disconnected! Waiting for 
> a new master to be elected
> PC: @ 0x7f4a9e3faff6 std::_Deque_base<>::_M_destroy_nodes()
> *** SIGSEGV (@0x31) received by PID 4605 (TID 0x7f4a8c1d0940) from PID 49; 
> stack trace: ***
> @ 0x7f4a9baefca0 (unknown)
> @ 0x7f4a9e3faff6 std::_Deque_base<>::_M_destroy_nodes()
> @ 0x7f4a9e3ecdaf std::_Deque_base<>::~_Deque_base()
> @ 0x7f4a9e3e2bd5 std::deque<>::~deque()
> @ 0x7f4a9e3dfe10 process::DataDecoder::~DataDecoder()
> @ 0x7f4a9e3ba9bc process::receiving_connect()
> @ 0x7f4a9e506bc5 ev_invoke_pending
> @ 0x7f4a9e509af5 ev_run
> @ 0x7f4a9e3b5928 ev_loop
> @ 0x7f4a9e3bb2d9 process::serve()
> @ 0x7f4a9bae783d start_thread
> @ 0x7f4a9a84f26d clone
> /var/tmp/scltOMGb3: line 8:  4605 Segmentation fault  
> './bin/mesos-slave.sh' '--master=localhost:5050'
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Issue Comment Deleted] (MESOS-1403) Segfault when starting a slave locally.

2015-02-10 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1403:
-
Comment: was deleted

(was: cc [~benjaminhindman] [~jvanremoortere])

> Segfault when starting a slave locally.
> ---
>
> Key: MESOS-1403
> URL: https://issues.apache.org/jira/browse/MESOS-1403
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.19.0
>Reporter: Benjamin Mahler
>




[jira] [Created] (MESOS-2344) segfaults running make check from ev integration

2015-02-10 Thread Dominic Hamon (JIRA)

Dominic Hamon created MESOS-2344:


 Summary: segfaults running make check from ev integration
 Key: MESOS-2344
 URL: https://issues.apache.org/jira/browse/MESOS-2344
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Dominic Hamon


Running make check on Ubuntu under gdb, I've seen a number of segfaults from 
the {{process::EventLoop}}. Stack traces and debugging sessions below:

{noformat}
(gdb) bt
#0  0x00789c71 in std::move&> (__t=...) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102
#1  0x76821148 in std::_Tuple_impl<1, void (*)()>::_Tuple_impl() (
this=0x7fffe00228d8, __in=)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270
#2  0x768210a4 in std::_Tuple_impl<0, Duration, void 
(*)()>::_Tuple_impl() (this=0x7fffe00228d8, __in=)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271
#3  0x76821068 in std::tuple::tuple() (
this=0x7fffe00228d8) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542
#4  0x76821014 in std::_Bind (*(Duration, void 
(*)()))(const Duration &, void (*)())>::_Bind() 
(this=0x7fffe00228d0, __b=)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342
#5  0x76820f86 in 
std::_Function_base::_Base_manager 
(*(Duration, void (*)()))(Duration const&, void (*)())> 
>::_M_init_functor(std::_Any_data&, std::_Bind 
(*(Duration, void (*)()))(Duration const&, void (*)())>&&, 
std::integral_constant) (__functor=..., __f=)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987
#6  0x76820ab0 in 
std::_Function_base::_Base_manager 
(*(Duration, void (*)()))(Duration const&, void (*)())> 
>::_M_init_functor(std::_Any_data&, std::_Bind 
(*(Duration, void (*)()))(Duration const&, void (*)())>&&) (__functor=..., 
__f=)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958
#7  0x768208e6 in std::function 
()>::function (*(Duration, void 
(*)()))(const Duration &, void (*)())>, 
void>(std::_Bind (*(Duration, void (*)()))(const 
Duration &, void (*)())>) (this=0x7fffe85ca9d0, __f=...)
at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451
#8  0x7681fe55 in process::EventLoop::delay (duration=..., 
function=0x76729580 ) at 
../../../3rdparty/libprocess/src/libev.cpp:98
#9  0x7672a151 in process::tick () at 
../../../3rdparty/libprocess/src/clock.cpp:125
#10 0x7681fcb2 in process::internal::handle_delay (loop=0x77dd91f0 
, timer=0x7fffe00279b0, revents=256)
at ../../../3rdparty/libprocess/src/libev.cpp:64
#11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 
) at ev.c:2994
#12 0x76860803 in ev_run (loop=0x77dd91f0 , 
flags=) at ev.c:3394
#13 0x7681fffb in ev_loop (loop=0x77dd91f0 , 
flags=0) at 3rdparty/libev-4.15/ev.h:826
#14 0x7681ff49 in process::EventLoop::run () at 
../../../3rdparty/libprocess/src/libev.cpp:114
#15 0x721d2182 in start_thread (arg=0x7fffe85cb700) at 
pthread_create.c:312
#16 0x71eff00d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 8
#8  0x7681fe55 in process::EventLoop::delay (duration=..., 
function=0x76729580 ) at 
../../../3rdparty/libprocess/src/libev.cpp:98
98    run_in_event_loop(
(gdb) list
93  } // namespace internal {
94
95
96  void EventLoop::delay(const Duration& duration, void(*function)(void))
97  {
98    run_in_event_loop(
99        lambda::bind(&internal::delay, duration, function));
100 }
101
102
(gdb) p duration
$1 = (const Duration &) @0x7fffe000da90: {static NANOSECONDS = 1, static 
MICROSECONDS = 1000, static MILLISECONDS = 100, static SECONDS = 
10, 
  static MINUTES = 600, static HOURS = 36000, static DAYS = 
864000, static WEEKS = 6048000, nanos = 91569920}
(gdb) p function
$2 = (void (*)(void)) 0x76729580 
{noformat}

{noformat}
(gdb) bt
#0  std::map, 
std::allocator > >::end (
this=0x7fffe0022620) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_map.h:339
#1  0x767c29dc in std::map, std::allocator > >::operator[] (this=0x32f6478, __k=...) at 
/usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_map.h:463
#2  0x767a476a in process::ProcessManager::use (this=0x32f6470, 
pid=...) at ../../../3rdparty/libprocess/src/process.cpp:1944
#3  0x767b1288 in process::ProcessManager::deliver (this=0x32f6470, 
to=..., event=0x7fffe00155c0, sender=0x0)
at ../../../3rdparty/libprocess/src/process.cpp:2113
#4  0x767b5a0f in process::internal::dispatch(process::UPID const&,

[jira] [Commented] (MESOS-1956) Add IPv6 & ICMPv6 libnl traffic control U32 filters

2015-02-10 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314529#comment-14314529
 ] 

Dominic Hamon commented on MESOS-1956:
--

We wrote the port mapping isolator to deal with the constraint of not having 
enough IP addresses. If IPv6 is available, we should be able to give each 
container its own IP address, so the port mapping isolator shouldn't be needed.

When we initialize the port mapping isolator, can we check whether we're in an 
IPv4 or IPv6 world? I'm OK with the port mapping isolator only working with 
IPv4. [~idownes] do you agree?
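A minimal sketch of the check suggested above, assuming the isolator is handed the agent's IP address as a string. `AddressFamily`, `classify`, and `portMappingSupported` are hypothetical names, not Mesos APIs; the sketch relies only on POSIX `inet_pton`, which returns 1 when the string parses as an address of the requested family.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>

// Hypothetical classification of the agent's configured IP address.
enum class AddressFamily { IPv4, IPv6, Invalid };

AddressFamily classify(const std::string& ip) {
  in_addr v4;
  if (inet_pton(AF_INET, ip.c_str(), &v4) == 1) {
    return AddressFamily::IPv4;
  }
  in6_addr v6;
  if (inet_pton(AF_INET6, ip.c_str(), &v6) == 1) {
    return AddressFamily::IPv6;
  }
  return AddressFamily::Invalid;
}

// Hypothetical guard evaluated when the port mapping isolator is created:
// only IPv4 agents get port mapping; IPv6 agents could instead give each
// container its own address.
bool portMappingSupported(const std::string& ip) {
  AddressFamily family = classify(ip);
  assert(family != AddressFamily::Invalid && "expected a valid IP address");
  return family == AddressFamily::IPv4;
}
```

With this split, the isolator could refuse to initialize (or simply be skipped) on an IPv6 agent, which matches keeping port mapping IPv4-only.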

> Add IPv6 & ICMPv6 libnl traffic control U32 filters
> ---
>
> Key: MESOS-1956
> URL: https://issues.apache.org/jira/browse/MESOS-1956
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Evelina Dumitrescu
>Assignee: Evelina Dumitrescu
>
> For IPv6, the filtering should be done by source and destination ports, 
> destination IP, and destination MAC.
> For ICMPv6, the filtering should be done by protocol and destination IP.
> IPv6 and IPv4 traffic could be distinguished by the source/destination IP 
> type in the classifier.
> IPv4 packets with options in the header are currently ignored due to a bug in 
> libnl. We should investigate whether the problem also occurs with IPv6.




[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling

2015-02-10 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2332:
-
Assignee: Paul Brett

> Report per-container metrics for network bandwidth throttling
> -
>
> Key: MESOS-2332
> URL: https://issues.apache.org/jira/browse/MESOS-2332
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, twitter
>Reporter: Paul Brett
>Assignee: Paul Brett
>  Labels: features
>
> Export metrics from the network isolator to identify the scope and duration 
> of container throttling.
> Packet loss can be identified from the overlimits and requeues fields of the 
> htb qdisc report for the virtual interface, e.g.
> {noformat}
> $ tc -s -d qdisc show dev mesos19223
> qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc ingress : parent :fff1 
>  Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> {noformat}
> Note that since a packet can be examined multiple times before transmission, 
> overlimits can exceed the total number of packets sent.
> Add these counters to the port_mapping isolator's usage() and to the container 
> statistics protobuf. Carefully consider the naming (especially tx/rx) and 
> commenting of the protobuf fields so it's clear what they represent and how 
> they differ from the existing dropped packet counts from the network stack.
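A sketch of extracting these counters from a `Sent ...` line of the `tc -s` output above, assuming exactly the layout shown. `QdiscStats` and `parseSentLine` are hypothetical; an isolator may well read qdisc statistics through netlink rather than by scraping `tc` text, so treat this only as an illustration of which fields are involved.

```cpp
#include <cassert>
#include <cstdio>
#include <optional>
#include <string>

// Counters from one "Sent ..." line of `tc -s qdisc show` output.
struct QdiscStats {
  unsigned long long bytes;
  unsigned long long packets;
  unsigned long long dropped;
  unsigned long long overlimits;  // can exceed packets: a packet may be
                                  // examined several times before sending
  unsigned long long requeues;
};

// Parses a line laid out like
//   " Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)"
// and returns std::nullopt if the line does not have that layout.
std::optional<QdiscStats> parseSentLine(const std::string& line) {
  QdiscStats stats{};
  int matched = std::sscanf(
      line.c_str(),
      " Sent %llu bytes %llu pkt (dropped %llu, overlimits %llu requeues %llu",
      &stats.bytes, &stats.packets, &stats.dropped,
      &stats.overlimits, &stats.requeues);
  if (matched != 5) {
    return std::nullopt;
  }
  return stats;
}
```

Lines such as ` backlog 0b 0p requeues 0` do not match and yield `std::nullopt`, so a caller can feed every output line through the parser and keep only the statistics lines.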




[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling

2015-02-10 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2332:
-
Component/s: twitter
 Sprint: Twitter Mesos Q1 Sprint 2

> Report per-container metrics for network bandwidth throttling
> -
>
> Key: MESOS-2332
> URL: https://issues.apache.org/jira/browse/MESOS-2332
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation, twitter
>Reporter: Paul Brett
>  Labels: features
>




[jira] [Assigned] (MESOS-1708) Using the wrong resource "name" should report a better error.

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1708:


Assignee: Dominic Hamon

> Using the wrong resource "name" should report a better error.
> -
>
> Key: MESOS-1708
> URL: https://issues.apache.org/jira/browse/MESOS-1708
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, master
>Reporter: Benjamin Hindman
>Assignee: Dominic Hamon
>  Labels: newbie, twitter
>
> If a scheduler launches a task using resources the master doesn't know about, 
> the task validator causes the task to fail, but the error message is not very 
> helpful.
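One way a validator could produce a more helpful error is to name each unknown resource and list the known ones. This is a sketch under assumptions: `Resource`, `validateResourceNames`, and the set of known names are hypothetical and far simpler than Mesos's real resource types and validation code.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical: a requested resource reduced to its name and scalar value.
struct Resource {
  std::string name;
  double value;
};

// Returns an empty string when every requested name is known; otherwise an
// error message naming each unknown resource and listing the known names.
std::string validateResourceNames(const std::vector<Resource>& requested,
                                  const std::set<std::string>& known) {
  std::string unknown;
  for (const Resource& r : requested) {
    if (known.count(r.name) == 0) {
      unknown += (unknown.empty() ? "" : ", ") + ("'" + r.name + "'");
    }
  }
  if (unknown.empty()) {
    return "";
  }
  std::string knownList;
  for (const std::string& n : known) {  // std::set iterates in sorted order
    knownList += (knownList.empty() ? "" : ", ") + n;
  }
  return "Task uses unknown resource(s) " + unknown +
         "; known resources are: " + knownList;
}
```

A message like this points directly at a typo such as `cpu` versus `cpus`, which is the usual cause of the failure described above.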




[jira] [Assigned] (MESOS-1251) Slave should make sure that the containerizer::launch returned Future is ready

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1251:


Assignee: Ian Downes

Please resolve if there's nothing to do here.

> Slave should make sure that the containerizer::launch returned Future is ready
> --
>
> Key: MESOS-1251
> URL: https://issues.apache.org/jira/browse/MESOS-1251
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Till Toenshoff
>Assignee: Ian Downes
>Priority: Minor
>  Labels: concurrency, containerizer, order, slave, twitter
>
> Currently the slave does not wait for the {{Future}} returned by 
> {{Containerizer::Launch}} before sending out more command events.
> Is there a reason for this behavior?
> This issue only becomes apparent when a launch command implementation is 
> relatively "expensive".
> Here is the chain of events I see, along a vertical time axis:
> {noformat}
> Launch
>|
>|  Wait
>||
>|| Update 
>|||
>--Launch Future became ready
> {noformat}
> What I would like to see is:
> {noformat}
> Launch
>|
>|
>|
>|
>|
>--Launch Future became ready
>   Wait
> |
> | Update 
> ||
> {noformat}
> As we are currently pushing the former behavior into the containerizer 
> implementation, things quickly get rather complicated on that side. Hence 
> I would like to understand whether this is something we really want / need, 
> or whether we should fix this within the slave in the longer run.
> So far, I have only observed this to be a challenge for {{Launch}}, but other 
> events might also be worth considering for enforced chaining instead of 
> concurrent invocation.
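The ordering the reporter asks for (Wait and Update start only after the Launch future is ready) can be sketched with standard futures. This is an illustration only: `ContainerSequencer` and `after_launch` are hypothetical names, and inside the slave the same effect would more naturally come from chaining on libprocess's `Future` (for example with `then`) than from `std::async` threads.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <utility>

// Hypothetical per-container sequencer: later events (e.g. Wait, Update)
// run only after the Launch future has become ready.
class ContainerSequencer {
 public:
  // Starts the (possibly expensive) launch asynchronously and keeps a
  // shared future so that later events can chain on it.
  void launch(std::function<void()> launchFn) {
    launched_ = std::async(std::launch::async, std::move(launchFn)).share();
  }

  // Runs `event` once the launch has completed. get() rethrows a launch
  // failure, in which case `event` is skipped.
  std::future<void> after_launch(std::function<void()> event) {
    assert(launched_.valid() && "launch() must be called first");
    std::shared_future<void> launched = launched_;
    return std::async(std::launch::async, [launched, event]() {
      launched.get();  // block until Launch is ready
      event();
    });
  }

 private:
  std::shared_future<void> launched_;
};
```

Because every dependent event holds a copy of the same `shared_future`, Wait and Update can still run concurrently with each other, but neither can overtake Launch.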




[jira] [Resolved] (MESOS-808) The scheduler driver should queue messages when disconnected or return delivery status.

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-808.
-
Resolution: Won't Fix

No longer an issue under the HTTP API.

> The scheduler driver should queue messages when disconnected or return 
> delivery status.
> ---
>
> Key: MESOS-808
> URL: https://issues.apache.org/jira/browse/MESOS-808
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: twitter, v1_api
>
> Currently, when a scheduler tries to take an action while the driver is 
> disconnected (i.e. after Scheduler::disconnected has been called), the 
> driver drops the request.
> In the case of launching a task, we reply with TASK_LOST directly in the 
> driver. However, for calls like killTask, we simply drop the kill task 
> request.
> This behavior is unfriendly to schedulers, as they need to queue any 
> operations themselves until Scheduler::registered or Scheduler::reregistered 
> is called. We should consider queueing in the driver instead.
> The implementation could consist of a queue holding the messages that were 
> constructed while disconnected (!connected). Once we reconnect, we simply run 
> through this queue, sending all messages.
> However, without state in the driver, schedulers will have to live with the 
> possibility of dropped messages anyway (i.e. if they fail while disconnected, 
> any queued messages will be lost).
> Therefore, an alternative becomes possible once we add a v1 API: if we return 
> a Future or another form of status, we can indicate whether the message was 
> sent. This is simpler and more reliable than queueing.
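The queue-while-disconnected idea above can be sketched as a small outbox. `QueueingDriver` is a hypothetical class, not the real scheduler driver, and it keeps messages only in memory, so, as the ticket notes, anything queued is lost if the scheduler fails while disconnected. The ticket itself was resolved as Won't Fix, since the problem no longer applies under the HTTP API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical driver-side outbox: messages produced while disconnected are
// queued in memory and flushed, oldest first, once the driver reconnects.
class QueueingDriver {
 public:
  explicit QueueingDriver(std::function<void(const std::string&)> transport)
    : transport_(std::move(transport)) {
    assert(transport_ && "transport must be callable");
  }

  void send(const std::string& message) {
    if (connected_) {
      transport_(message);
    } else {
      pending_.push_back(message);  // e.g. a killTask request
    }
  }

  void disconnected() { connected_ = false; }

  void connected() {
    connected_ = true;
    // Drain everything queued while !connected, in arrival order.
    while (!pending_.empty()) {
      transport_(pending_.front());
      pending_.pop_front();
    }
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  std::function<void(const std::string&)> transport_;
  std::deque<std::string> pending_;
  bool connected_ = false;  // starts disconnected until first registration
};
```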




[jira] [Resolved] (MESOS-621) HierarchicalAllocator::slaveRemoved doesn't properly handle framework allocations/resources

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-621.
-
Resolution: Won't Fix

> HierarchicalAllocator::slaveRemoved doesn't properly handle framework 
> allocations/resources
> ---
>
> Key: MESOS-621
> URL: https://issues.apache.org/jira/browse/MESOS-621
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, technical debt
>Reporter: Vinod Kone
>  Labels: twitter
>
> Currently, slaveRemoved() simply removes the slave from the 'slaves' map and 
> the slave's resources from 'roleSorter'. Judging by resourcesRecovered(), more 
> needs to be done when a slave is removed (e.g., unallocating the frameworks' 
> resources on that slave).
> It would be nice to fix this and add a test for it.
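A toy model of the missing bookkeeping, assuming scalar CPU amounts only. `ToyAllocator`, `allocated`, and `frameworkTotals` are hypothetical stand-ins for the allocator's 'slaves' map and 'roleSorter', not the real HierarchicalAllocator code; the point is only that removing a slave must also unwind each framework's allocation on it.

```cpp
#include <cassert>
#include <map>
#include <string>

using SlaveID = std::string;
using FrameworkID = std::string;

// Hypothetical bookkeeping: total cpus per slave, plus per-framework
// allocations broken down by slave and summed (what a sorter would see).
struct ToyAllocator {
  std::map<SlaveID, double> slaves;
  std::map<FrameworkID, std::map<SlaveID, double>> allocated;
  std::map<FrameworkID, double> frameworkTotals;

  void allocate(const FrameworkID& f, const SlaveID& s, double cpus) {
    assert(slaves.count(s) == 1 && "unknown slave");
    allocated[f][s] += cpus;
    frameworkTotals[f] += cpus;
  }

  void slaveRemoved(const SlaveID& s) {
    slaves.erase(s);
    // The step the ticket says is missing: also unallocate whatever each
    // framework held on the removed slave.
    for (auto& entry : allocated) {
      auto it = entry.second.find(s);
      if (it != entry.second.end()) {
        frameworkTotals[entry.first] -= it->second;
        entry.second.erase(it);
      }
    }
  }
};
```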




[jira] [Updated] (MESOS-2255) SlaveRecoveryTest/0.MasterFailover is flaky

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2255:
-
Labels: flaky twitter  (was: flaky)

> SlaveRecoveryTest/0.MasterFailover is flaky
> ---
>
> Key: MESOS-2255
> URL: https://issues.apache.org/jira/browse/MESOS-2255
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Yan Xu
>  Labels: flaky, twitter
>
> {noformat:title=}
> [ RUN  ] SlaveRecoveryTest/0.MasterFailover
> Using temporary directory '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0'
> I0123 07:45:49.818686 17634 leveldb.cpp:176] Opened db in 31.195549ms
> I0123 07:45:49.821962 17634 leveldb.cpp:183] Compacted db in 3.190936ms
> I0123 07:45:49.822049 17634 leveldb.cpp:198] Created db iterator in 47324ns
> I0123 07:45:49.822069 17634 leveldb.cpp:204] Seeked to beginning of db in 
> 2038ns
> I0123 07:45:49.822084 17634 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 484ns
> I0123 07:45:49.822160 17634 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0123 07:45:49.824241 17660 recover.cpp:449] Starting replica recovery
> I0123 07:45:49.825217 17660 recover.cpp:475] Replica is in EMPTY status
> I0123 07:45:49.827020 17660 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0123 07:45:49.827453 17659 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0123 07:45:49.828047 17659 recover.cpp:566] Updating replica status to 
> STARTING
> I0123 07:45:49.838543 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 10.24963ms
> I0123 07:45:49.838580 17659 replica.cpp:323] Persisted replica status to 
> STARTING
> I0123 07:45:49.848836 17659 recover.cpp:475] Replica is in STARTING status
> I0123 07:45:49.850039 17659 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0123 07:45:49.850286 17659 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0123 07:45:49.850754 17659 recover.cpp:566] Updating replica status to VOTING
> I0123 07:45:49.853698 17655 master.cpp:262] Master 
> 20150123-074549-16842879-44955-17634 (utopic) started on 127.0.1.1:44955
> I0123 07:45:49.853981 17655 master.cpp:308] Master only allowing 
> authenticated frameworks to register
> I0123 07:45:49.853997 17655 master.cpp:313] Master only allowing 
> authenticated slaves to register
> I0123 07:45:49.854038 17655 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0/credentials'
> I0123 07:45:49.854557 17655 master.cpp:357] Authorization enabled
> I0123 07:45:49.859633 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.742923ms
> I0123 07:45:49.859853 17659 replica.cpp:323] Persisted replica status to 
> VOTING
> I0123 07:45:49.860327 17658 recover.cpp:580] Successfully joined the Paxos 
> group
> I0123 07:45:49.860703 17654 recover.cpp:464] Recover process terminated
> I0123 07:45:49.859591 17655 master.cpp:1219] The newly elected leader is 
> master@127.0.1.1:44955 with id 20150123-074549-16842879-44955-17634
> I0123 07:45:49.864702 17655 master.cpp:1232] Elected as the leading master!
> I0123 07:45:49.864904 17655 master.cpp:1050] Recovering from registrar
> I0123 07:45:49.865406 17660 registrar.cpp:313] Recovering registrar
> I0123 07:45:49.866576 17660 log.cpp:660] Attempting to start the writer
> I0123 07:45:49.868638 17658 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0123 07:45:49.872521 17658 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 3.848859ms
> I0123 07:45:49.872555 17658 replica.cpp:345] Persisted promised to 1
> I0123 07:45:49.873769 17661 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0123 07:45:49.875474 17658 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0123 07:45:49.880878 17658 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 5.364021ms
> I0123 07:45:49.880913 17658 replica.cpp:679] Persisted action at 0
> I0123 07:45:49.882619 17657 replica.cpp:511] Replica received write request 
> for position 0
> I0123 07:45:49.882998 17657 leveldb.cpp:438] Reading position from leveldb 
> took 150092ns
> I0123 07:45:49.886488 17657 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 3.269189ms
> I0123 07:45:49.886536 17657 replica.cpp:679] Persisted action at 0
> I0123 07:45:49.887181 17657 replica.cpp:658] Replica received learned notice 
> for position 0
> I0123 07:45:49.892900 17657 leveldb.cpp:343] Persisting action (16 bytes) to 
> leveldb took 5.690093ms
> I0123 07:45:49.892935 17657 replica.cpp:679] Persisted

[jira] [Updated] (MESOS-2300) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2300:
-
Labels: cgroups test twitter  (was: cgroups test)

> Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23
> ---
>
> Key: MESOS-2300
> URL: https://issues.apache.org/jira/browse/MESOS-2300
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.1
> Environment: (Although the hostname of this box is {{docker1}}, the tests 
> are not running in a Docker container. This box is vanilla hardware that 
> also happens to be used as a Docker server, though not when I ran the 
> offending tests.)
> {code}
> huitseeker@docker1:~$  lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 14.10
> Release:  14.10
> Codename: utopic
> {code}
> {code}
> huitseeker@docker1:~$ uname -a
> Linux docker1 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:56:17 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
> Mesos was retrieved from {{http://git-wip-us.apache.org/repos/asf/mesos.git}} 
> and compiled from git tag {{0.21.1}} (currently resolves to 
> {{2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8}}). The box is a clean, 
> ansible-generated Ubuntu install with cgmanager disabled and the following 
> packages installed on top of the usual Mesos dependencies:
> - cgroup-lite (service is enabled and started)
> - linux-tools-common
> - linux-tools-generic
> - linux-cloud-tools-generic
> - linux-tools-3.16.0-23-generic
> - linux-cloud-tools-3.16.0-23-generic
>Reporter: François Garillot
>  Labels: cgroups, test, twitter
>
> During make check:
> {code}
> [--] Global test environment tear-down
> [==] 503 tests from 89 test cases ran. (387352 ms total)
> [  PASSED  ] 499 tests.
> [  FAILED  ] 4 tests, listed below:
> [  FAILED  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get
> [  FAILED  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups
> [  FAILED  ] NsTest.ROOT_setns
> [  FAILED  ] PerfTest.ROOT_SampleInit
> {code}
> Details:
> {code}
> [ RUN  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get
> ../../src/tests/cgroups_tests.cpp:364: Failure
> Value of: "mesos_test2"
> Expected: cgroups.get()[0]
> Which is: "mesos"
> [  FAILED  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get (10 ms)
> [ RUN  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups
> ../../src/tests/cgroups_tests.cpp:392: Failure
> Value of: path::join(TEST_CGROUPS_ROOT, "2")
>   Actual: "mesos_test/2"
> Expected: cgroups.get()[0]
> Which is: "mesos_test/1"
> ../../src/tests/cgroups_tests.cpp:393: Failure
> Value of: path::join(TEST_CGROUPS_ROOT, "1")
>   Actual: "mesos_test/1"
> Expected: cgroups.get()[1]
> Which is: "mesos_test/2"
> [  FAILED  ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups (12 ms)
> {code}
> {code}
> [ RUN  ] NsTest.ROOT_setns
> ../../src/tests/ns_tests.cpp:123: Failure
> Value of: status.get().get()
>   Actual: 256
> Expected: 0
> [  FAILED  ] NsTest.ROOT_setns (93 ms)
> {code}
> {code}
> [ RUN  ] PerfTest.ROOT_SampleInit
> ../../src/tests/perf_tests.cpp:143: Failure
> Expected: (0u) < (statistics.get().cycles()), actual: 0 vs 0
> ../../src/tests/perf_tests.cpp:146: Failure
> Expected: (0.0) < (statistics.get().task_clock()), actual: 0 vs 0
> [  FAILED  ] PerfTest.ROOT_SampleInit (1078 ms)
> {code}
> These tests were run both in parallel (-j 8) and sequentially (-j 1), with 
> no difference.
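
A note on the NsTest.ROOT_setns output: assuming the 256 it reports is a raw wait status of the kind waitpid(2) fills in (the test compares it against 0), it decodes to a normal exit with exit code 1 rather than a crash. The helper below is a small, hypothetical C++ sketch that uses only the POSIX <sys/wait.h> macros to decode such a value.

```cpp
#include <string>
#include <sys/wait.h>

// Hypothetical helper: turn a raw wait status (the int that waitpid(2)
// fills in) into readable text. Assumes the value from the test failure
// is such a raw status.
std::string describeWaitStatus(int status) {
  if (WIFEXITED(status)) {
    // Normal exit: the exit code is carried in bits 8-15 of the status.
    return "exited with code " + std::to_string(WEXITSTATUS(status));
  }
  if (WIFSIGNALED(status)) {
    // Abnormal exit: the terminating signal number is in the low 7 bits.
    return "killed by signal " + std::to_string(WTERMSIG(status));
  }
  // Stopped/continued statuses are not expected here.
  return "other status " + std::to_string(status);
}
```

Under that assumption, describeWaitStatus(256) returns "exited with code 1", so the child process in that test apparently failed with a non-zero exit code instead of being killed by a signal.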



[jira] [Updated] (MESOS-2289) Design doc for the HTTP API

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2289:
-
Sprint: Twitter Mesos Q1 Sprint 2

> Design doc for the HTTP API
> ---
>
> Key: MESOS-2289
> URL: https://issues.apache.org/jira/browse/MESOS-2289
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> This tracks the design of the HTTP API.



[jira] [Updated] (MESOS-2289) Design doc for the HTTP API

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2289:
-
Assignee: Vinod Kone

[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos

2015-02-09 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2288:
-
Labels: twitter  (was: )

> HTTP API for interacting with Mesos
> ---
>
> Key: MESOS-2288
> URL: https://issues.apache.org/jira/browse/MESOS-2288
> Project: Mesos
>  Issue Type: Epic
>Reporter: Vinod Kone
>  Labels: twitter
>
> Currently, Mesos frameworks (schedulers and executors) interact with Mesos 
> (masters and slaves) via drivers provided by Mesos. While the driver has 
> helped provide some common functionality for all frameworks (master 
> detection, authentication, validation, etc.), it has several drawbacks:
> --> Frameworks need to depend on a native library, which makes their 
> build/deploy process cumbersome.
> --> Pure-language frameworks cannot use off-the-shelf libraries to interact 
> with the undocumented API used by the driver.
> --> It makes it hard for developers to implement new APIs (a lot of 
> boilerplate code to write).
> This proposal is for Mesos to provide a well-documented public HTTP API that 
> frameworks (and maybe operators) can use to interact with Mesos.



[jira] [Resolved] (MESOS-181) Virtual Machine Isolation Module

2015-02-04 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon resolved MESOS-181.
-
Resolution: Won't Fix

Sadly, our isolation efforts have diverged from the initial effort here. If we 
do ever provide VM isolation, we'll need to carefully determine requirements 
first and then develop a solution.

> Virtual Machine Isolation Module
> 
>
> Key: MESOS-181
> URL: https://issues.apache.org/jira/browse/MESOS-181
> Project: Mesos
>  Issue Type: Story
>  Components: isolation, slave
> Environment: Ubuntu 11.04, Ubuntu 11.10
>Reporter: Charles Earl
>Priority: Minor
>  Labels: virtualiztion
>
> Earlier in the year I implemented a virtual machine isolation module. This 
> module uses libvirt to launch and manage virtual machine containers. The 
> code is still rough, and I have done basic testing with the Spark example. 
> This code works with the KVM (http://www.linux-kvm.org/page/Main_Page) 
> virtual machine manager. I've placed the relevant code in a branch called 
> mesos-vm, for now located at https://github.com/charlescearl/VirtualMesos. 
> The code is based upon the Mesos LXC isolation module located in 
> src/slave/lxc_isolation_module.cpp/.hpp. My code is based on the Mesos 
> master branch dated Wed Nov 23 12:02:07 2011 -0800, commit 
> 059aabb2ec5bd7b20ed08ab9c439531a352ba3ec. I'll generate a patch for this 
> soon. Suggestions are appreciated on whether this is the appropriate 
> branch/commit to patch against.
> Most of the implementation is contained in vm_isolation_module.cpp and 
> vm_isolation_module.hpp, and there are some minor additions in the launcher 
> to handle setting up the environment for the virtual machine. I use the 
> libvirt (http://libvirt.org/) library to manage the virtual machine 
> container in which the jobs are executed.
> Dependencies
> The code has been tested on Ubuntu 11.04 and 11.10 and depends on 
> libpython2.6 and libvirt0.
> Configuration of the virtual machine container
> The virtual machine invocation depends upon a few configuration assumptions:
> 1. SSH public keys are installed on the container. I assume that the 
> container is set up to allow password-less secure access.
> 2. The directory structure on the container matches the servant machine. 
> For example, when invoking a Spark executor, I assume that the paths match 
> the setup on the container host.
> Running it
> In the $MESOS_HOME/conf/mesos.conf file, add the line
>    isolation=vm
> to use virtual machine isolation.
> The Mesos slave is invoked with the isolation parameter set to vm. For example:
>  sudo bin/mesos-slave -m mesos://master@mesos-host:5050 -w 9839 --isolation=vm
> Rough description of how it works
> The `vm_isolation_module` class forks a process that in turn launches a 
> virtual machine. A routine in bin called find_addr.pl is responsible for 
> figuring out the IP address of the launched virtual machine. This is 
> probably not portable, since it explicitly looks for an entry in the 
> virbr0 network.
> A script in bin called vmLauncherTemplate.sh assists the vmLauncher method 
> in setting up the environment for launching tasks inside the virtual 
> machine. The vmLauncher method uses vmLauncherTemplate.sh to create a 
> task-specific shell script, vmLauncherTemplate-.sh, which is copied to the 
> running guest and used to run the executor inside the VM. The executor 
> communicates with the slave on the host.
> Comments and suggestions on improvements and next directions are appreciated!



[jira] [Commented] (MESOS-181) Virtual Machine Isolation Module

2015-02-04 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305765#comment-14305765
 ] 

Dominic Hamon commented on MESOS-181:
-

It doesn't seem likely that we're going to integrate this any time soon. Shall 
we close out the issue?

[jira] [Updated] (MESOS-2238) Use Owned<> for Process pointers in wrapper classes

2015-02-04 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2238:
-
Labels: easyfix newbie  (was: easyfix)

> Use Owned<> for Process pointers in wrapper classes
> ---
>
> Key: MESOS-2238
> URL: https://issues.apache.org/jira/browse/MESOS-2238
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>  Labels: easyfix, newbie
>
> A common pattern in our code (see e.g. {{Isolator}}, {{DockerContainerizer}}, 
> {{Allocator}}) is to wrap a Process-based class in a non-Process one. 
> However, our code base is inconsistent about how we store the pointer to the 
> underlying class: in some places we wrap it in {{Owned<>}} (see e.g. 
> {{Isolator}}, {{DockerContainerizer}}), in others it is a raw pointer (see 
> e.g. {{Allocator}}, {{ExternalContainerizer}}).
> Using {{Owned<>}} is preferable in this case, since it makes the correct 
> semantics and intent clear to the reader. For consistency, sweep through the 
> code base and replace raw pointers with their {{Owned<>}} counterparts.
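
A minimal sketch of the pattern described above, using hypothetical CounterProcess and Counter classes. To keep it self-contained it does not use libprocess: std::unique_ptr stands in for Owned<> as the owning pointer, and the spawn/dispatch/terminate calls a real wrapper would make are omitted.

```cpp
#include <memory>

// Hypothetical implementation class; in Mesos this would typically derive
// from a libprocess Process and be driven via dispatch().
class CounterProcess {
public:
  void increment() { ++value_; }
  int value() const { return value_; }

private:
  int value_ = 0;
};

// Hypothetical non-Process wrapper. The owning pointer (std::unique_ptr,
// standing in for Owned<>) states in the member's type that the wrapper
// owns its process, and frees it automatically when the wrapper is destroyed.
class Counter {
public:
  void increment() { process_->increment(); }
  int value() const { return process_->value(); }

private:
  std::unique_ptr<CounterProcess> process_{new CounterProcess()};
};
```

Because ownership is visible in the member's type and no destructor with a manual delete is needed, a reader does not have to hunt for where the raw pointer is freed, which is the readability argument the ticket makes for Owned<>.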



[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints

2015-02-04 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2277:
-
Labels: documentation newbie starter  (was: starter)

> Document undocumented HTTP endpoints
> 
>
> Key: MESOS-2277
> URL: https://issues.apache.org/jira/browse/MESOS-2277
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Priority: Minor
>  Labels: documentation, newbie, starter
>
> Did a quick scan and we are missing documentation for a few endpoints:
> {code}
> files/browse.json
> files/read.json
> files/download.json
> files/debug.json
> master/roles.json
> master/state.json
> master/stats.json
> slave/state.json
> slave/stats.json
> {code}



[jira] [Comment Edited] (MESOS-2314) remove unnecessary constants

2015-02-04 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302334#comment-14302334
 ] 

Dominic Hamon edited comment on MESOS-2314 at 2/4/15 5:26 PM:
--

code refactor: https://reviews.apache.org/r/30531
test refactor: https://reviews.apache.org/r/30624/


was (Author: dhamon):
https://reviews.apache.org/r/30531

> remove unnecessary constants
> 
>
> Key: MESOS-2314
> URL: https://issues.apache.org/jira/browse/MESOS-2314
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave, technical debt
>Reporter: Dominic Hamon
>Assignee: Dominic Hamon
>Priority: Minor
>  Labels: newbie
>
> In {{src/slave/paths.cpp}} a number of string constants are defined to 
> describe the formats of various paths. However, given that there is a 1:1 
> mapping between each string constant and the function that builds its path, 
> the code would be more readable if the format strings were inline in the 
> functions.
> In the cases where one constant depends on another (see the 
> {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} 
> chain, for example) the function calls can just be chained together.
> This will have the added benefit of removing some statically constructed 
> string constants, which are dangerous.
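
A sketch of the proposed style, with inline formats and chained calls instead of namespace-scope constants. The function names, parameters, and directory layout here are hypothetical illustrations rather than the actual contents of src/slave/paths.cpp, and plain std::string concatenation stands in for whatever formatting helper the real code uses. One well-known reason namespace-scope std::string constants are risky is that their initialization order across translation units is unspecified; building the strings inside functions sidesteps that.

```cpp
#include <string>

// Hypothetical path builders. Each function keeps its format inline and
// builds on its parent, mirroring the ROOT -> SLAVE -> FRAMEWORK ->
// EXECUTOR -> EXECUTOR_INFO chain, with no namespace-scope string constants.
std::string getRootPath(const std::string& rootDir) {
  return rootDir + "/meta";
}

std::string getSlavePath(const std::string& rootDir,
                         const std::string& slaveId) {
  return getRootPath(rootDir) + "/slaves/" + slaveId;
}

std::string getFrameworkPath(const std::string& rootDir,
                             const std::string& slaveId,
                             const std::string& frameworkId) {
  return getSlavePath(rootDir, slaveId) + "/frameworks/" + frameworkId;
}

std::string getExecutorPath(const std::string& rootDir,
                            const std::string& slaveId,
                            const std::string& frameworkId,
                            const std::string& executorId) {
  return getFrameworkPath(rootDir, slaveId, frameworkId) +
         "/executors/" + executorId;
}

std::string getExecutorInfoPath(const std::string& rootDir,
                                const std::string& slaveId,
                                const std::string& frameworkId,
                                const std::string& executorId) {
  return getExecutorPath(rootDir, slaveId, frameworkId, executorId) +
         "/executor.info";
}
```

For example, getExecutorInfoPath("/var/lib/mesos", "S1", "F1", "E1") produces "/var/lib/mesos/meta/slaves/S1/frameworks/F1/executors/E1/executor.info".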



[jira] [Updated] (MESOS-2314) remove unnecessary constants

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2314:
-
Story Points: 2

[jira] [Updated] (MESOS-2314) remove unnecessary constants

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2314:
-
Sprint: Twitter Mesos Q1 Sprint 2

[jira] [Commented] (MESOS-2314) remove unnecessary constants

2015-02-02 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302334#comment-14302334
 ] 

Dominic Hamon commented on MESOS-2314:
--

https://reviews.apache.org/r/30531

[jira] [Assigned] (MESOS-2314) remove unnecessary constants

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-2314:


Assignee: Dominic Hamon

[jira] [Reopened] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reopened MESOS-2138:
--
  Assignee: Michael Park  (was: Benjamin Mahler)

>  Add an Offer::Operation message for Dynamic Reservations
> -
>
> Key: MESOS-2138
> URL: https://issues.apache.org/jira/browse/MESOS-2138
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Michael Park
>Assignee: Michael Park
>  Labels: protobuf
> Fix For: 0.22.0
>
>
> A framework now has a notion of *accepting* offers that it was given (via 
> {{acceptOffers}}) and is able to specify a sequence of operations to perform 
> (via a sequence of {{Offer::Operation}}s). {{Launch}} is one of the possible 
> {{Offer::Operation}}s, which means {{LaunchTasks}} is an alias for a 
> sequence of {{Offer::Operation}}s consisting only of {{Launch}}.
> The goal of this ticket is to add {{Reserve}} and {{Unreserve}} messages as 
> possible {{Offer::Operation}}s.
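
A simplified model of the relationship described above, written with plain C++ types instead of the actual protobuf messages. The OperationType, Operation, and AcceptCall types and the acceptOffers/launchTasks signatures below are hypothetical; the sketch only shows launchTasks expressed as an accept whose operation sequence holds a single LAUNCH, with RESERVE and UNRESERVE as the additional operation types this ticket proposes.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Offer::Operation type tag.
enum class OperationType { LAUNCH, RESERVE, UNRESERVE };

// Hypothetical operation: only the fields relevant to its type are used.
struct Operation {
  OperationType type;
  std::vector<std::string> taskIds;  // used by LAUNCH
  std::string resources;             // used by RESERVE and UNRESERVE
};

// Hypothetical record of an "accept offers" request.
struct AcceptCall {
  std::vector<std::string> offerIds;
  std::vector<Operation> operations;  // ordered sequence of operations
};

// Accepting offers carries an ordered sequence of operations.
AcceptCall acceptOffers(const std::vector<std::string>& offerIds,
                        const std::vector<Operation>& operations) {
  return AcceptCall{offerIds, operations};
}

// launchTasks as an alias: accept with a sequence containing one LAUNCH.
AcceptCall launchTasks(const std::vector<std::string>& offerIds,
                       const std::vector<std::string>& taskIds) {
  return acceptOffers(
      offerIds, {Operation{OperationType::LAUNCH, taskIds, std::string()}});
}
```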



[jira] [Closed] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon closed MESOS-2138.

Resolution: Fixed

[jira] [Updated] (MESOS-2308) Task reconciliation API should support data partitioning

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2308:
-
Shepherd: Vinod Kone
Story Points: 8

> Task reconciliation API should support data partitioning
> 
>
> Key: MESOS-2308
> URL: https://issues.apache.org/jira/browse/MESOS-2308
> Project: Mesos
>  Issue Type: Story
>Reporter: Bill Farner
>Assignee: Benjamin Mahler
>  Labels: twitter
>
> The {{reconcileTasks}} API call requires the caller to specify a collection 
> of {{TaskStatus}}es, with the option to provide an empty collection to 
> retrieve the master's entire state. Retrieving the entire state is the only 
> mechanism for the scheduler to learn about running tasks it does not know 
> about; however, this call does not allow incremental querying. As a result, 
> the master may need to send many thousands of status updates, which the 
> scheduler then has to handle. Ideally, the scheduler would have a means to 
> partition these requests so it can control the pace of these status updates.



[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2144:
-
Shepherd: Jie Yu  (was: Vinod Kone)

> Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
> ---
>
> Key: MESOS-2144
> URL: https://issues.apache.org/jira/browse/MESOS-2144
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
>Reporter: Cody Maloney
>Assignee: Yan Xu
>Priority: Minor
>  Labels: flaky, twitter
>
> Occurred during the review bot's review of: 
> https://reviews.apache.org/r/28262/#review62333
> The review doesn't touch code related to the test (and doesn't break 
> libprocess in general).
> [ RUN  ] ExamplesTest.LowLevelSchedulerPthread
> ../../src/tests/script.cpp:83: Failure
> Failed
> low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
> The test 



[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-02-02 Thread Dominic Hamon (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-2144:
-
Shepherd: Vinod Kone
Story Points: 8
