[jira] [Created] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Dario Rexin (JIRA)
Dario Rexin created MESOS-1797:
--

 Summary: Packaged Zookeeper does not compile on OSX Yosemite
 Key: MESOS-1797
 URL: https://issues.apache.org/jira/browse/MESOS-1797
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.19.1, 0.20.0, 0.21.0
Reporter: Dario Rexin
Priority: Minor


I have been struggling with this for some time (due to my lack of knowledge 
about C compiler error messages) and finally found a way to make it compile. 
The problem is that Zookeeper defines a function `htonll` that is a builtin 
function on Yosemite. For me it worked to just remove this function, but as it 
needs to keep working on other systems as well, we would need some check for 
the OS version, or for whether the function is already defined.

Here are the links to the source:

https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97
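For illustration, a preprocessor guard of roughly this shape could keep the local definition on systems that lack `htonll` while skipping it where the platform already supplies one. This is a hedged sketch only, not the actual ZooKeeper fix; it assumes the platform exposes `htonll` as a macro (as BSD-derived headers do), otherwise a configure-time check such as the hypothetical HAVE_HTONLL below would be needed:

{code}
/* Sketch of a guard for recordio.h; HAVE_HTONLL would come from configure. */
#include <stdint.h>
#include <arpa/inet.h>   /* brings in the platform byte-order definitions */

#if !defined(htonll) && !defined(HAVE_HTONLL)
int64_t htonll(int64_t v);   /* fall back to ZooKeeper's own definition */
#endif
{code}

The same conditional would wrap the definition in recordio.c.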






[jira] [Closed] (MESOS-1798) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Dario Rexin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dario Rexin closed MESOS-1798.
--
Resolution: Duplicate

> Packaged Zookeeper does not compile on OSX Yosemite
> ---
>
> Key: MESOS-1798
> URL: https://issues.apache.org/jira/browse/MESOS-1798
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.20.0, 0.21.0, 0.19.1
>Reporter: Dario Rexin
>Priority: Minor
>
> I have been struggling with this for some time (due to my lack of knowledge 
> about C compiler error messages) and finally found a way to make it compile. 
> The problem is that Zookeeper defines a function `htonll` that is a builtin 
> function on Yosemite. For me it worked to just remove this function, but as 
> it needs to keep working on other systems as well, we would need some check 
> for the OS version, or for whether the function is already defined.
> Here are the links to the source:
> https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97





[jira] [Created] (MESOS-1798) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Dario Rexin (JIRA)
Dario Rexin created MESOS-1798:
--

 Summary: Packaged Zookeeper does not compile on OSX Yosemite
 Key: MESOS-1798
 URL: https://issues.apache.org/jira/browse/MESOS-1798
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.19.1, 0.20.0, 0.21.0
Reporter: Dario Rexin
Priority: Minor


I have been struggling with this for some time (due to my lack of knowledge 
about C compiler error messages) and finally found a way to make it compile. 
The problem is that Zookeeper defines a function `htonll` that is a builtin 
function on Yosemite. For me it worked to just remove this function, but as it 
needs to keep working on other systems as well, we would need some check for 
the OS version, or for whether the function is already defined.

Here are the links to the source:

https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97






[jira] [Resolved] (MESOS-1764) Build Fixes from 0.20 release

2014-09-16 Thread Timothy St. Clair (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy St. Clair resolved MESOS-1764.
--
Resolution: Fixed

> Build Fixes from 0.20 release
> -
>
> Key: MESOS-1764
> URL: https://issues.apache.org/jira/browse/MESOS-1764
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
> Fix For: 0.20.1
>
>
> This ticket is a catch-all for minor issues caught during a rebase and 
> testing.
> + Add package configuration file to deployment
> + Update deploy_dir from localstatedir to sysconfdir





[jira] [Comment Edited] (MESOS-1764) Build Fixes from 0.20 release

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130604#comment-14130604
 ] 

Timothy St. Clair edited comment on MESOS-1764 at 9/16/14 2:32 PM:
---

Punting last update to https://issues.apache.org/jira/browse/MESOS-1675


was (Author: tstclair):
add initial -version-info for shared library
http://reviews.apache.org/r/25551/

> Build Fixes from 0.20 release
> -
>
> Key: MESOS-1764
> URL: https://issues.apache.org/jira/browse/MESOS-1764
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.20.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
> Fix For: 0.20.1
>
>
> This ticket is a catch-all for minor issues caught during a rebase and 
> testing.
> + Add package configuration file to deployment
> + Update deploy_dir from localstatedir to sysconfdir





[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135526#comment-14135526
 ] 

Timothy St. Clair commented on MESOS-1675:
--

[~vinodkone] Did you want to elaborate on your thoughts here? 

> Decouple version of the mesos library from the package release version
> --
>
> Key: MESOS-1675
> URL: https://issues.apache.org/jira/browse/MESOS-1675
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> This discussion should be rolled into the larger discussion around how to 
> version Mesos (APIs, packages, libraries etc).
> Some notes from libtool docs.
> http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
> http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers





[jira] [Commented] (MESOS-1621) Docker run networking should be configurable and support bridge network

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135531#comment-14135531
 ] 

Timothy St. Clair commented on MESOS-1621:
--

I'll open up a separate ticket to discuss the API + override conversation. 

> Docker run networking should be configurable and support bridge network
> ---
>
> Key: MESOS-1621
> URL: https://issues.apache.org/jira/browse/MESOS-1621
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: Docker
> Fix For: 0.20.1
>
>
> Currently, to easily support running executors in a Docker image, we hardcode 
> --net=host into Docker run so that the slave and executor can reuse the same 
> mechanism to communicate, which is to pass the slave IP/PORT for the framework 
> to respond with its own hostname and port information back to set up the tunnel.
> We want to see how to abstract this or even get rid of host networking 
> altogether if we have a good way to not rely on it.





[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-16 Thread Timothy St. Clair (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy St. Clair updated MESOS-1195:
-
Target Version/s: 0.21.0

reviews.apache.org/r/25695/

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails creating the isolator: 
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd





[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-16 Thread Timothy St. Clair (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy St. Clair updated MESOS-1195:
-
Target Version/s: 0.20.1  (was: 0.21.0)

[~bhuvan] please eval for 0.20.1 inclusion.  

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails creating the isolator: 
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd





[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135698#comment-14135698
 ] 

Vinod Kone commented on MESOS-1675:
---

If adding version info is backwards compatible, i.e., the new lib can be a 
drop-in replacement for the old lib, then that should be fine.

{quote}
However the release wrangler will need to add a step to their punch-list prior 
to adoption.
{quote}

Not sure what this means?

> Decouple version of the mesos library from the package release version
> --
>
> Key: MESOS-1675
> URL: https://issues.apache.org/jira/browse/MESOS-1675
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> This discussion should be rolled into the larger discussion around how to 
> version Mesos (APIs, packages, libraries etc).
> Some notes from libtool docs.
> http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
> http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers





[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135714#comment-14135714
 ] 

Benjamin Mahler commented on MESOS-1797:


Is there a ZooKeeper ticket related to this?

> Packaged Zookeeper does not compile on OSX Yosemite
> ---
>
> Key: MESOS-1797
> URL: https://issues.apache.org/jira/browse/MESOS-1797
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.20.0, 0.21.0, 0.19.1
>Reporter: Dario Rexin
>Priority: Minor
>
> I have been struggling with this for some time (due to my lack of knowledge 
> about C compiler error messages) and finally found a way to make it compile. 
> The problem is that Zookeeper defines a function `htonll` that is a builtin 
> function on Yosemite. For me it worked to just remove this function, but as 
> it needs to keep working on other systems as well, we would need some check 
> for the OS version, or for whether the function is already defined.
> Here are the links to the source:
> https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97





[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Dario Rexin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135719#comment-14135719
 ] 

Dario Rexin commented on MESOS-1797:


I didn't find one.

> Packaged Zookeeper does not compile on OSX Yosemite
> ---
>
> Key: MESOS-1797
> URL: https://issues.apache.org/jira/browse/MESOS-1797
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.20.0, 0.21.0, 0.19.1
>Reporter: Dario Rexin
>Priority: Minor
>
> I have been struggling with this for some time (due to my lack of knowledge 
> about C compiler error messages) and finally found a way to make it compile. 
> The problem is that Zookeeper defines a function `htonll` that is a builtin 
> function on Yosemite. For me it worked to just remove this function, but as 
> it needs to keep working on other systems as well, we would need some check 
> for the OS version, or for whether the function is already defined.
> Here are the links to the source:
> https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
> https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97





[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135816#comment-14135816
 ] 

Timothy St. Clair commented on MESOS-1675:
--

Folks will need to check compatibility and update the revision in 
src/Makefile.am as outlined here:  
http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
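For reference, the update rules from the linked libtool page can be transcribed into a small function (illustrative only; the names are mine, and the actual bookkeeping is the -version-info current:revision:age triple in src/Makefile.am):

{code}
// Libtool's -version-info update rules (current:revision:age), per the
// linked docs, expressed as a function. Names here are illustrative.
struct VersionInfo { int current; int revision; int age; };

VersionInfo update(VersionInfo v,
                   bool sourceChanged,      // any library code changed at all
                   bool interfacesChanged,  // added, removed, or changed
                   bool interfacesAdded,    // strictly additive changes
                   bool interfacesRemoved)  // removed or changed
{
  if (sourceChanged)     { v.revision++; }
  if (interfacesChanged) { v.current++; v.revision = 0; }
  if (interfacesAdded)   { v.age++; }
  if (interfacesRemoved) { v.age = 0; }
  return v;
}
{code}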

> Decouple version of the mesos library from the package release version
> --
>
> Key: MESOS-1675
> URL: https://issues.apache.org/jira/browse/MESOS-1675
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> This discussion should be rolled into the larger discussion around how to 
> version Mesos (APIs, packages, libraries etc).
> Some notes from libtool docs.
> http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
> http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers





[jira] [Commented] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.

2014-09-16 Thread Kevin Sweeney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135880#comment-14135880
 ] 

Kevin Sweeney commented on MESOS-444:
-

Any activity here? I'd like to simplify this flag: 
https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/DriverFactory.java#L75-L101

> Remove --checkpoint flag in the slave once checkpointing is stable.
> ---
>
> Key: MESOS-444
> URL: https://issues.apache.org/jira/browse/MESOS-444
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>  Labels: newbie
>
> While slave recovery is being worked on (see MESOS-110), we've added a 
> --checkpoint flag to the slave to enable or disable the feature.
> Prior to releasing this feature, we need to remove this flag so that all 
> slaves have checkpointing available, and frameworks can choose to use it. 
> There's no need to keep this flag around and add configuration complexity.





[jira] [Commented] (MESOS-1746) clear TaskStatus data to avoid OOM

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135982#comment-14135982
 ] 

Timothy St. Clair commented on MESOS-1746:
--

Maybe I'm missing something, but how is this a Mesos problem?  It seems like an 
Executor sizing constraint issue in the Spark Scheduler. 

> clear TaskStatus data to avoid OOM
> --
>
> Key: MESOS-1746
> URL: https://issues.apache.org/jira/browse/MESOS-1746
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos-0.19.0
>Reporter: Chengwei Yang
>Assignee: Chengwei Yang
>
> Spark on Mesos may use TaskStatus to transfer computed results between worker 
> and scheduler; the source code looks like the following (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment we enlarged akkaFrameSize from the default value 
> (10MB) to 128MB, and this causes our mesos-master process to OOM within tens 
> of minutes when running Spark tasks in fine-grained mode.
> As you can see, even with akkaFrameSize changed back to the default value 
> (10MB), it's very likely to make mesos-master OOM too, just more slowly.
> So I think it's good to delete the data from TaskStatus, since it is only 
> intended for the framework on top and we aren't interested in it.





[jira] [Comment Edited] (MESOS-1746) clear TaskStatus data to avoid OOM

2014-09-16 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135982#comment-14135982
 ] 

Timothy St. Clair edited comment on MESOS-1746 at 9/16/14 7:05 PM:
---

Are you saying a task status update is OOM-killing the mesos-master?


was (Author: tstclair):
Maybe I'm missing something, but how is this a Mesos problem?  It seems like an 
Executor sizing constraint issue in the Spark Scheduler. 

> clear TaskStatus data to avoid OOM
> --
>
> Key: MESOS-1746
> URL: https://issues.apache.org/jira/browse/MESOS-1746
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos-0.19.0
>Reporter: Chengwei Yang
>Assignee: Chengwei Yang
>
> Spark on Mesos may use TaskStatus to transfer computed results between worker 
> and scheduler; the source code looks like the following (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment we enlarged akkaFrameSize from the default value 
> (10MB) to 128MB, and this causes our mesos-master process to OOM within tens 
> of minutes when running Spark tasks in fine-grained mode.
> As you can see, even with akkaFrameSize changed back to the default value 
> (10MB), it's very likely to make mesos-master OOM too, just more slowly.
> So I think it's good to delete the data from TaskStatus, since it is only 
> intended for the framework on top and we aren't interested in it.





[jira] [Created] (MESOS-1799) Reconciliation can send out-of-order updates.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1799:
--

 Summary: Reconciliation can send out-of-order updates.
 Key: MESOS-1799
 URL: https://issues.apache.org/jira/browse/MESOS-1799
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


When a slave re-registers with the master, it currently sends the latest task 
state for all tasks that are not both terminal and acknowledged.

However, reconciliation assumes that we always have the latest unacknowledged 
state of the task represented in the master.

As a result, out-of-order updates are possible, e.g.

(1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
[TASK_RUNNING, TASK_FINISHED].
(2) Master fails over.
(3) New master re-registers the slave with T in TASK_FINISHED.
(4) Reconciliation request arrives, master sends TASK_FINISHED.
(5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.

I think the fix here is to preserve the task state invariants in the master, 
namely, that the master has the latest unacknowledged state of the task. This 
means when the slave re-registers, it should instead send the latest 
unacknowledged state of each task.
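As a toy illustration of steps (1)-(5) (plain C++ with made-up structures, not Mesos code), the master's reconciliation answer and the slave's pending retry cross each other:

{code}
#include <deque>
#include <iostream>
#include <string>

int main() {
  // Updates for task T still unacknowledged on the slave, oldest first.
  std::deque<std::string> unacked = {"TASK_RUNNING", "TASK_FINISHED"};

  // Steps (2)-(3): after failover, the slave re-registers with only the
  // latest state, so the new master records T as TASK_FINISHED.
  std::string masterView = unacked.back();

  // Step (4): reconciliation is answered from the master's view.
  std::cout << "reconcile -> " << masterView << std::endl;       // TASK_FINISHED

  // Step (5): the slave still retries its oldest unacknowledged update.
  std::cout << "retry     -> " << unacked.front() << std::endl;  // TASK_RUNNING

  // The framework sees TASK_FINISHED followed by TASK_RUNNING: out of order.
  // The fix described above restores the master's invariant, so reconciliation
  // can never answer ahead of an update the slave will still re-send.
  return 0;
}
{code}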





[jira] [Commented] (MESOS-1027) IPv6 support

2014-09-16 Thread Oskar Stenman (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136253#comment-14136253
 ] 

Oskar Stenman commented on MESOS-1027:
--

This would be great if it were resolved. We have a few things holding us back 
from v6-only (which would allow us to greatly simplify a lot of our 
infrastructure); one is Mesos, and the others are most likely weird services we 
haven't discovered to be an issue yet, since we can't even run Mesos on v6-only. :)

> IPv6 support
> 
>
> Key: MESOS-1027
> URL: https://issues.apache.org/jira/browse/MESOS-1027
> Project: Mesos
>  Issue Type: Epic
>  Components: framework, libprocess, master, slave
>Reporter: Dominic Hamon
> Fix For: 1.0.0
>
>
> From the CLI down through the various layers of tech we should support IPv6.





[jira] [Updated] (MESOS-1799) Reconciliation can send out-of-order updates.

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1799:
---
Description: 
When a slave re-registers with the master, it currently sends the latest task 
state for all tasks that are not both terminal and acknowledged.

However, reconciliation assumes that we always have the latest unacknowledged 
state of the task represented in the master.

As a result, out-of-order updates are possible, e.g.

(1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
[TASK_RUNNING, TASK_FINISHED].
(2) Master fails over.
(3) New master re-registers the slave with T in TASK_FINISHED.
(4) Reconciliation request arrives, master sends TASK_FINISHED.
(5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.

I think the fix here is to preserve the task state invariants in the master, 
namely, that the master has the latest unacknowledged state of the task. This 
means when the slave re-registers, it should instead send the latest 
acknowledged state of each task.

  was:
When a slave re-registers with the master, it currently sends the latest task 
state for all tasks that are not both terminal and acknowledged.

However, reconciliation assumes that we always have the latest unacknowledged 
state of the task represented in the master.

As a result, out-of-order updates are possible, e.g.

(1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
[TASK_RUNNING, TASK_FINISHED].
(2) Master fails over.
(3) New master re-registers the slave with T in TASK_FINISHED.
(4) Reconciliation request arrives, master sends TASK_FINISHED.
(5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.

I think the fix here is to preserve the task state invariants in the master, 
namely, that the master has the latest unacknowledged state of the task. This 
means when the slave re-registers, it should instead send the latest 
unacknowledged state of each task.


> Reconciliation can send out-of-order updates.
> -
>
> Key: MESOS-1799
> URL: https://issues.apache.org/jira/browse/MESOS-1799
> Project: Mesos
>  Issue Type: Bug
>  Components: master, slave
>Reporter: Benjamin Mahler
>
> When a slave re-registers with the master, it currently sends the latest task 
> state for all tasks that are not both terminal and acknowledged.
> However, reconciliation assumes that we always have the latest unacknowledged 
> state of the task represented in the master.
> As a result, out-of-order updates are possible, e.g.
> (1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
> [TASK_RUNNING, TASK_FINISHED].
> (2) Master fails over.
> (3) New master re-registers the slave with T in TASK_FINISHED.
> (4) Reconciliation request arrives, master sends TASK_FINISHED.
> (5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.
> I think the fix here is to preserve the task state invariants in the master, 
> namely, that the master has the latest unacknowledged state of the task. This 
> means when the slave re-registers, it should instead send the latest 
> acknowledged state of each task.





[jira] [Updated] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---
Description: 
In what looks like an oversight, the pending tasks and executors in the slave 
(Framework::pending) are not sent in the re-registration message.

For tasks, this can lead to spurious TASK_LOST notifications being generated by 
the master when it falsely thinks the tasks are not present on the slave.

  was:
In what looks like an oversight, the pending tasks and executors in the slave 
(Framework::pending) are not sent in the re-registration message.

For tasks, this can lead to spurious TASK_LOST notifications being generated by 
the master when it falsely thinks the tasks are not present on the slave.

For executors, this can lead to under-accounting in the master, causing an 
overcommit on the slave.


> The slave does not send pending tasks during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 0.21.0
>
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.





[jira] [Updated] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---
Summary: The slave does not send pending tasks during re-registration.  
(was: The slave does not send pending tasks / executors during re-registration.)

> The slave does not send pending tasks during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 0.21.0
>
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.
> For executors, this can lead to under-accounting in the master, causing an 
> overcommit on the slave.





[jira] [Commented] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-09-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136394#comment-14136394
 ] 

Benjamin Mahler commented on MESOS-1715:


Pulled out MESOS-1800 for the executor side of this.

> The slave does not send pending tasks during re-registration.
> -
>
> Key: MESOS-1715
> URL: https://issues.apache.org/jira/browse/MESOS-1715
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 0.21.0
>
>
> In what looks like an oversight, the pending tasks and executors in the slave 
> (Framework::pending) are not sent in the re-registration message.
> For tasks, this can lead to spurious TASK_LOST notifications being generated 
> by the master when it falsely thinks the tasks are not present on the slave.





[jira] [Created] (MESOS-1800) The slave does not send pending executors during re-registration.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1800:
--

 Summary: The slave does not send pending executors during 
re-registration.
 Key: MESOS-1800
 URL: https://issues.apache.org/jira/browse/MESOS-1800
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


In what looks like an oversight, the pending executors in the slave are not 
sent in the re-registration message.

This can lead to under-accounting in the master, causing an overcommit on the 
slave.





[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1466:
--

Assignee: (was: Benjamin Mahler)

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>  Labels: reliability
>
> The following sequence of events can cause an overcommit:
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave, which needs to commit resources for 
> the task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources, causing an overcommit of resources.
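A rough sketch of the resulting arithmetic (toy C++ with made-up numbers, not the master's actual accounting):

{code}
#include <iostream>

int main() {
  const int totalCpus = 4;

  // Master's view when the launch is processed: the executor (2 cpus) is
  // already running, so only the task (1 cpu) is newly charged.
  int masterAllocated = 2 + 1;

  // The executor-exited event, queued behind the launch, is processed next:
  // the master recovers the executor's 2 cpus and re-offers them.
  masterAllocated -= 2;                       // master now thinks 1 cpu used
  int offered = totalCpus - masterAllocated;  // 3 cpus offered out again

  // Reality on the slave: it had to start a *new* executor for the task.
  int slaveUsed = 2 + 1;

  std::cout << slaveUsed + offered << " cpus committed of "
            << totalCpus << std::endl;        // 6 of 4: overcommit
  return 0;
}
{code}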





[jira] [Updated] (MESOS-1688) No offers if no memory is allocatable

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1688:
---
Target Version/s: 0.21.0  (was: 0.20.1)

> No offers if no memory is allocatable
> -
>
> Key: MESOS-1688
> URL: https://issues.apache.org/jira/browse/MESOS-1688
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.18.1, 0.18.2, 0.19.0, 0.19.1
>Reporter: Martin Weindel
>Priority: Critical
> Fix For: 0.20.1
>
>
> The [Spark 
> scheduler|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala]
> allocates memory only for the executor and cpu only for its tasks.
> So it can happen that all memory is nearly completely allocated by Spark 
> executors, while all cpu resources are idle.
> In this case Mesos does not offer resources anymore, as less than MIN_MEM 
> (=32MB) of memory is allocatable.
> This effectively causes a deadlock in the Spark job, as it is not offered the 
> cpu resources needed for launching new tasks.
> see {{HierarchicalAllocatorProcess::allocatable(const Resources&)}} called in 
> {{HierarchicalAllocatorProcess::allocate(const hashset<SlaveID>&)}}
> {code}
> template <class RoleSorter, class FrameworkSorter>
> bool
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
>     const Resources& resources)
> {
> ...
>   Option<double> cpus = resources.cpus();
>   Option<Bytes> mem = resources.mem();
>   if (cpus.isSome() && mem.isSome()) {
>     return cpus.get() >= MIN_CPUS && mem.get() > MIN_MEM;
>   }
>   return false;
> }
> {code}
> A possible solution may be to completely drop the condition on allocatable 
> memory.
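Taken literally, dropping the memory condition would reduce the check to something like the following (a sketch only, reusing the names from the snippet above; whether this is the right trade-off is exactly the open question of this ticket):

{code}
// Sketch: offer resources whenever the cpu floor is met, regardless of
// how little memory remains unallocated.
template <class RoleSorter, class FrameworkSorter>
bool
HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
    const Resources& resources)
{
  Option<double> cpus = resources.cpus();
  return cpus.isSome() && cpus.get() >= MIN_CPUS;
}
{code}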





[jira] [Updated] (MESOS-1688) No offers if no memory is allocatable

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1688:
---
Fix Version/s: (was: 0.20.1)
   0.21.0

> No offers if no memory is allocatable
> -
>
> Key: MESOS-1688
> URL: https://issues.apache.org/jira/browse/MESOS-1688
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.18.1, 0.18.2, 0.19.0, 0.19.1
>Reporter: Martin Weindel
>Priority: Critical
> Fix For: 0.21.0
>
>
> The [Spark 
> scheduler|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala]
> allocates memory only for the executor and cpu only for its tasks.
> So it can happen that all memory is nearly completely allocated by Spark 
> executors, while all cpu resources are idle.
> In this case Mesos does not offer resources anymore, as less than MIN_MEM 
> (=32MB) of memory is allocatable.
> This effectively causes a deadlock in the Spark job, as it is not offered the 
> cpu resources needed for launching new tasks.
> see {{HierarchicalAllocatorProcess::allocatable(const Resources&)}} called in 
> {{HierarchicalAllocatorProcess::allocate(const hashset<SlaveID>&)}}
> {code}
> template <class RoleSorter, class FrameworkSorter>
> bool
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
>     const Resources& resources)
> {
> ...
>   Option<double> cpus = resources.cpus();
>   Option<Bytes> mem = resources.mem();
>   if (cpus.isSome() && mem.isSome()) {
>     return cpus.get() >= MIN_CPUS && mem.get() > MIN_MEM;
>   }
>   return false;
> }
> {code}
> A possible solution may be to completely drop the condition on allocatable 
> memory.





[jira] [Commented] (MESOS-1688) No offers if no memory is allocatable

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136441#comment-14136441
 ] 

Bhuvan Arumugam commented on MESOS-1688:


targeting it for 0.21.0.

> No offers if no memory is allocatable
> -
>
> Key: MESOS-1688
> URL: https://issues.apache.org/jira/browse/MESOS-1688
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.18.1, 0.18.2, 0.19.0, 0.19.1
>Reporter: Martin Weindel
>Priority: Critical
> Fix For: 0.21.0
>
>
> The [Spark 
> scheduler|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala]
> allocates memory only for the executor and cpu only for its tasks.
> So it can happen that all memory is nearly completely allocated by Spark 
> executors, while all cpu resources are idle.
> In this case Mesos does not offer resources anymore, as less than MIN_MEM 
> (=32MB) of memory is allocatable.
> This effectively causes a deadlock in the Spark job, as it is not offered the 
> cpu resources needed for launching new tasks.
> see {{HierarchicalAllocatorProcess::allocatable(const Resources&)}} called in 
> {{HierarchicalAllocatorProcess::allocate(const hashset<SlaveID>&)}}
> {code}
> template <class RoleSorter, class FrameworkSorter>
> bool
> HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
>     const Resources& resources)
> {
> ...
>   Option<double> cpus = resources.cpus();
>   Option<Bytes> mem = resources.mem();
>   if (cpus.isSome() && mem.isSome()) {
>     return cpus.get() >= MIN_CPUS && mem.get() > MIN_MEM;
>   }
>   return false;
> }
> {code}
> A possible solution may be to completely drop the condition on allocatable 
> memory.





[jira] [Commented] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136449#comment-14136449
 ] 

Bhuvan Arumugam commented on MESOS-1195:


[~tstclair] the patch is still not reviewed. I'm going to move it out to 0.21.0.

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails creating the isolator: 
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd





[jira] [Updated] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1219:
---
Fix Version/s: (was: 0.21.0)
   0.20.1

> Master should disallow frameworks that reconnect after failover timeout.
> 
>
> Key: MESOS-1219
> URL: https://issues.apache.org/jira/browse/MESOS-1219
> Project: Mesos
>  Issue Type: Bug
>  Components: master, webui
>Reporter: Robert Lacroix
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> When a scheduler reconnects after the failover timeout has been exceeded, the 
> framework id is usually reused, because the scheduler doesn't know that the 
> timeout was exceeded and it is actually handled as a new framework.
> The /framework/:framework_id route of the Web UI doesn't handle those cases 
> very well because its key is reused. It only shows the terminated one.
> Would it make sense to ignore the provided framework id when a scheduler 
> reconnects to a terminated framework and generate a new id to make sure it's 
> unique?





[jira] [Commented] (MESOS-1724) Can't include port in DockerInfo's image

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136482#comment-14136482
 ] 

Bhuvan Arumugam commented on MESOS-1724:


fixed in 0.20.1. it'll be part of this release.

> Can't include port in DockerInfo's image
> 
>
> Key: MESOS-1724
> URL: https://issues.apache.org/jira/browse/MESOS-1724
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>Priority: Minor
>  Labels: docker
> Fix For: 0.20.1
>
>
> The current git tree doesn't allow you to specify a docker image with 
> multiple colons, but it is valid for multiple colons to exist in a docker 
> image, e.g. docker-registry.example.com:80/centos:6u5.
> From 
> https://github.com/apache/mesos/blob/02a35ab213fb074f6c532075cada76f13eb9d552/src/slave/containerizer/docker.cpp#L441
> {code}
>   vector<string> parts = strings::split(dockerInfo.image(), ":");
>   if (parts.size() > 2) {
>     return Failure("Not expecting multiple ':' in image: " +
>                    dockerInfo.image());
>   }
> {code}
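For illustration, a parse that tolerates a registry port could key off the last '/' instead of counting colons: only a colon after the final path separator can introduce the tag. This is a hedged sketch with hypothetical names, not the committed fix:

{code}
#include <string>
#include <utility>

// Splits "registry.example.com:80/centos:6u5" into
// {"registry.example.com:80/centos", "6u5"}; the tag is empty if absent.
std::pair<std::string, std::string> splitImage(const std::string& image)
{
  const size_t slash = image.rfind('/');
  const size_t from = (slash == std::string::npos) ? 0 : slash;
  const size_t colon = image.find(':', from);

  if (colon == std::string::npos) {
    return {image, ""};
  }
  return {image.substr(0, colon), image.substr(colon + 1)};
}
{code}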





[jira] [Updated] (MESOS-1724) Can't include port in DockerInfo's image

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1724:
---
Fix Version/s: 0.20.1

> Can't include port in DockerInfo's image
> 
>
> Key: MESOS-1724
> URL: https://issues.apache.org/jira/browse/MESOS-1724
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jay Buffington
>Assignee: Timothy Chen
>Priority: Minor
>  Labels: docker
> Fix For: 0.20.1
>
>
> The current git tree doesn't allow you to specify a docker image with 
> multiple colons, but it is valid for multiple colons to exist in a docker 
> image, e.g. docker-registry.example.com:80/centos:6u5.
> From 
> https://github.com/apache/mesos/blob/02a35ab213fb074f6c532075cada76f13eb9d552/src/slave/containerizer/docker.cpp#L441
> {code}
>   vector<string> parts = strings::split(dockerInfo.image(), ":");
>   if (parts.size() > 2) {
>     return Failure("Not expecting multiple ':' in image: " +
>                    dockerInfo.image());
>   }
> {code}





[jira] [Updated] (MESOS-1737) Isolation=external result in core dump on 0.20.0

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1737:
---
Target Version/s: 0.20.1

> Isolation=external result in core dump on 0.20.0
> 
>
> Key: MESOS-1737
> URL: https://issues.apache.org/jira/browse/MESOS-1737
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.20.0
>Reporter: Tim Nolet
>Assignee: Timothy Chen
> Fix For: 0.20.1
>
>
> When upgrading from 0.19.1 to 0.20.0, any slaves started with the standard 
> deimos setup fail hard on startup. The following command spits out about 
> 20,000 errors before core dumping:
> /etc/mesos-slave# /usr/local/sbin/mesos-slave 
> --master=zk://localhost:2181/mesos --port=5051 --log_dir=/var/log/mesos 
> --ip=172.17.8.101 --work_dir=/var/lib/mesos --isolation=external 
> --containerizer_path=/usr/local/bin/deimos
> output:
> 
> W0827 15:20:18.366271   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> W0827 15:20:18.366580   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> W0827 15:20:18.366631   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> W0827 15:20:18.366683   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> W0827 15:20:18.366714   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> W0827 15:20:18.366752   721 containerizer.cpp:159] The 'external' isolation 
> flag is deprecated, please update your flags to '--containerizers=external'.
> Segmentation fault (core dumped)





[jira] [Updated] (MESOS-1643) Provide APIs to return port resource for a given role

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1643:
---
Target Version/s: 0.20.1
   Fix Version/s: (was: 0.21.0)
  0.20.1

trivial enough to accommodate in 0.20.1.

> Provide APIs to return port resource for a given role
> -
>
> Key: MESOS-1643
> URL: https://issues.apache.org/jira/browse/MESOS-1643
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zuyu Zhang
>Assignee: Zuyu Zhang
>Priority: Trivial
> Fix For: 0.20.1
>
>
> It makes more sense to return the port resources for a given role, rather 
> than all ports in Resources.
> In mesos/resource.hpp:
> Option Resources::ports(const string& role = "*");
> // Check whether Resources have the given number (num_port) of ports, and 
> // return the begin number of the port range.
> Option Resources::getPorts(long num_port, const string& role = "*");





[jira] [Updated] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1716:
---
Target Version/s: 0.20.1
   Fix Version/s: (was: 0.21.0)
  0.20.1

accommodating in 0.20.1.

> The slave does not add pending tasks as part of the staging tasks metric.
> -
>
> Key: MESOS-1716
> URL: https://issues.apache.org/jira/browse/MESOS-1716
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Trivial
> Fix For: 0.20.1
>
>
> The slave does not represent pending tasks in the "tasks_staging" metric.
> This should be a trivial fix.





[jira] [Created] (MESOS-1801) MESOS_work_dir and MESOS_master env vars not honoured

2014-09-16 Thread Cosmin Lehene (JIRA)
Cosmin Lehene created MESOS-1801:


 Summary: MESOS_work_dir and MESOS_master env vars not honoured
 Key: MESOS-1801
 URL: https://issues.apache.org/jira/browse/MESOS-1801
 Project: Mesos
  Issue Type: Bug
  Components: cli
Affects Versions: 0.20.0
 Environment: CentOS 7
Reporter: Cosmin Lehene
 Fix For: 0.20.1


The documentation states that CLI params should be substitutable by environment 
variables:

{quote}
 Each option can be set in two ways:

By passing it to the binary using --option_name=value.
By setting the environment variable MESOS_OPTION_NAME (the option name with a 
MESOS_ prefix added to it).
{quote}

However at least the master's MESOS_work_dir and slave's "MESOS_master"  env 
vars seem to be ignored:
{noformat}
[root@localhost ~]# echo $MESOS_master
zk://localhost:2181/mesos
[root@localhost ~]# mesos-slave
Missing required option --master
[root@localhost ~]# echo $MESOS_work_dir
/var/lib/mesos
[root@localhost ~]# mesos-master
I0917 08:36:46.242200 31325 main.cpp:155] Build: 2014-08-22 05:06:06 by root
I0917 08:36:46.242369 31325 main.cpp:157] Version: 0.20.0
I0917 08:36:46.242377 31325 main.cpp:160] Git tag: 0.20.0
I0917 08:36:46.242382 31325 main.cpp:164] Git SHA: 
f421ffdf8d32a8834b3a6ee483b5b59f65956497
--work_dir needed for replicated log based registry
{noformat}
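The documented lookup amounts to a fallback of roughly this shape (a schematic C++ sketch with hypothetical names, not the actual flag-loading code in stout). One thing worth checking in the transcript above is whether the variables were exported, since a child process such as mesos-master only inherits exported shell variables:

{code}
#include <cstdlib>
#include <string>

// Schematic: an explicit --option wins; otherwise consult the environment
// variable formed by prefixing the option name with MESOS_.
std::string resolveOption(const std::string& cliValue, const char* envName)
{
  if (!cliValue.empty()) {
    return cliValue;                       // e.g. --work_dir=/var/lib/mesos
  }
  const char* env = std::getenv(envName);  // e.g. "MESOS_work_dir"
  return (env != nullptr) ? std::string(env) : std::string();
}
{code}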





[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136550#comment-14136550
 ] 

Vinod Kone commented on MESOS-1675:
---

I see. For the patch you sent, which sets version to 0.0.0, do frameworks have 
to do anything specific to use the new lib (assuming it's compatible)?

> Decouple version of the mesos library from the package release version
> --
>
> Key: MESOS-1675
> URL: https://issues.apache.org/jira/browse/MESOS-1675
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> This discussion should be rolled into the larger discussion around how to 
> version Mesos (APIs, packages, libraries etc).
> Some notes from libtool docs.
> http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
> http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers





[jira] [Commented] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.

2014-09-16 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136558#comment-14136558
 ] 

Vinod Kone commented on MESOS-444:
--

This work has not been prioritized. It is a more involved change because of the 
way our tests are written.

> Remove --checkpoint flag in the slave once checkpointing is stable.
> ---
>
> Key: MESOS-444
> URL: https://issues.apache.org/jira/browse/MESOS-444
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>  Labels: newbie
>
> While slave recovery is being worked on (see MESOS-110), we've added a 
> --checkpoint flag to the slave to enable or disable the feature.
> Prior to releasing this feature, we need to remove this flag so that all 
> slaves have checkpointing available, and frameworks can choose to use it. 
> There's no need to keep this flag around and add configuration complexity.





[jira] [Created] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1802:
--

 Summary: HealthCheckTest.HealthStatusChange is flaky on jenkins.
 Key: MESOS-1802
 URL: https://issues.apache.org/jira/browse/MESOS-1802
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Timothy Chen


https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull

{noformat}
[ RUN  ] HealthCheckTest.HealthStatusChange
Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 642ns
I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 343ns
I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to STARTING
I0916 22:56:14.036603 21046 master.cpp:286] Master 
20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
67.195.81.186:47865
I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing authenticated 
slaves to register
I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 480322ns
I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
STARTING
I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : 
master@67.195.81.186:47865
I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 330251ns
I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to VOTING
I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos group
I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 92623ns
I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 144215ns
I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0
I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request for 
position 0
I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb took 
10323ns
I0916 22:56:14.040362 21047 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 79471ns
I0916 22:56:14.040375 21047 replica.cpp:676] Persisted action at 0
I0916 22:56:14.040556 21054 replica.cpp:655] Replica received learned notice 
for position 0
I0916 22:56:14.040658 21054 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 83975ns
I0916 22:56:14.04067

[jira] [Created] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1803:
--

 Summary: Strict/RegistrarTest.remove test is flaky on jenkins.
 Key: MESOS-1803
 URL: https://issues.apache.org/jira/browse/MESOS-1803
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull

{noformat}
[ RUN  ] Strict/RegistrarTest.remove/1
Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW'
I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms
I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns
I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns
I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns
I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 330ns
I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 421460ns
I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING
I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms
I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns
I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns
I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns
I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 195ns
I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 472891ns
I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING
I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms
I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms
I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns
I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns
I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db 
in 8479ns
I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms
I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms
I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns
I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns
I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db 
in 8182ns
I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery
I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status
I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated
I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering registrar
I0916 22:59:02.126597 21050 log.cpp:656] Attempting to start the writer
I0916 22:59:02.127259 21041 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:59:02.127321 21050 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:59:02.127835 21041 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 547018ns
I0916 22:59:02.127858 21041 replica.cpp:342] Persisted promised to 1
I0916 22:59:02.127835 21050 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 487588ns
I0916 22:59:02.127887 21050 replica.cpp:342] Persisted promised to 1
I0916 22:59:02.128387 21055 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0916 22:59:02.129546 21042 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:59:02.129600 21053 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:59:02.129982 21042 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 406954ns
I0916 22:59:02.129982 21053 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 357253ns
I0916 22:59:02.130009 21042 replica.cpp:676] Persisted action at 0
I0916 22:59:02.130029 21053 replica.cpp:676] Persisted action at 0
I0916 22:59:02.130543 21041 replica.cpp:508] Replica received write request for 
position 0
I0916 22:59:02.130585 21041 leveldb.cpp:438] Reading position from leveldb took 
17424ns
I0916 22:59:02.130599 21046 replica.cpp:508] Replica received write request for 
position 0
I0916 22:59:02.130635 21046 leveldb.cpp:438] Reading position from leveldb took 
12702ns
I0916 22:59:02.130728 2

[jira] [Updated] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1760:
---
Target Version/s: 0.20.1  (was: 0.21.0)

> MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
> -
>
> Key: MESOS-1760
> URL: https://issues.apache.org/jira/browse/MESOS-1760
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> Observed this on Apache CI: 
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes
> {code}
> [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z'
> I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms
> I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms
> I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns
> I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 
> 682ns
> I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 312ns
> I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery
> I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status
> I0903 22:04:33.540909 25590 master.cpp:286] Master 
> 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 
> 140.211.11.27:44122
> I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials'
> I0903 22:04:33.541337 25590 master.cpp:366] Authorization enabled
> I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@140.211.11.27:44122
> I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to 
> STARTING
> I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is 
> master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565
> I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master!
> I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar
> I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar
> I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.678563ms
> I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to 
> STARTING
> I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status
> I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING
> I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.712427ms
> I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to 
> VOTING
> I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos 
> group
> I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated
> I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer
> I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0903 22:04:33.594611 25589 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.029152ms
> I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1
> I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 15.502568ms
> I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0
> I0903 22:04:33.615435 25585 replica.cpp:508] Replica

[jira] [Updated] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1760:
---
Fix Version/s: (was: 0.21.0)
   0.20.1

> MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
> -
>
> Key: MESOS-1760
> URL: https://issues.apache.org/jira/browse/MESOS-1760
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> Observed this on Apache CI: 
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes
> {code}
> [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z'
> I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms
> I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms
> I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns
> I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 
> 682ns
> I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 312ns
> I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery
> I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status
> I0903 22:04:33.540909 25590 master.cpp:286] Master 
> 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 
> 140.211.11.27:44122
> I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials'
> I0903 22:04:33.541337 25590 master.cpp:366] Authorization enabled
> I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@140.211.11.27:44122
> I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to 
> STARTING
> I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is 
> master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565
> I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master!
> I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar
> I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar
> I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.678563ms
> I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to 
> STARTING
> I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status
> I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING
> I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.712427ms
> I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to 
> VOTING
> I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos 
> group
> I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated
> I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer
> I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0903 22:04:33.594611 25589 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.029152ms
> I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1
> I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 15.502568ms
> I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0
> I0903 22:04:33.615435 25585 repli

[jira] [Commented] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136577#comment-14136577
 ] 

Bhuvan Arumugam commented on MESOS-1760:


Including in 0.20.1.

> MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
> -
>
> Key: MESOS-1760
> URL: https://issues.apache.org/jira/browse/MESOS-1760
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> Observed this on Apache CI: 
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes
> {code}
> [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z'
> I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms
> I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms
> I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns
> I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 
> 682ns
> I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 312ns
> I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery
> I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status
> I0903 22:04:33.540909 25590 master.cpp:286] Master 
> 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 
> 140.211.11.27:44122
> I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials'
> I0903 22:04:33.541337 25590 master.cpp:366] Authorization enabled
> I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@140.211.11.27:44122
> I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to 
> STARTING
> I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is 
> master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565
> I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master!
> I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar
> I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar
> I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.678563ms
> I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to 
> STARTING
> I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status
> I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING
> I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 14.712427ms
> I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to 
> VOTING
> I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos 
> group
> I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated
> I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer
> I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0903 22:04:33.594611 25589 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 16.029152ms
> I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1
> I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 15.502568ms
> I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0
> I0903 22:04:33.61

[jira] [Updated] (MESOS-1766) MasterAuthorizationTest.DuplicateRegistration test is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1766:
---
Fix Version/s: (was: 0.21.0)
   0.20.1

> MasterAuthorizationTest.DuplicateRegistration test is flaky
> ---
>
> Key: MESOS-1766
> URL: https://issues.apache.org/jira/browse/MESOS-1766
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> {code}
> [ RUN  ] MasterAuthorizationTest.DuplicateRegistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m'
> I0905 15:53:16.398993 25769 leveldb.cpp:176] Opened db in 2.601036ms
> I0905 15:53:16.399566 25769 leveldb.cpp:183] Compacted db in 546216ns
> I0905 15:53:16.399590 25769 leveldb.cpp:198] Created db iterator in 2787ns
> I0905 15:53:16.399605 25769 leveldb.cpp:204] Seeked to beginning of db in 
> 500ns
> I0905 15:53:16.399617 25769 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 185ns
> I0905 15:53:16.399633 25769 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0905 15:53:16.399817 25786 recover.cpp:425] Starting replica recovery
> I0905 15:53:16.399952 25793 recover.cpp:451] Replica is in EMPTY status
> I0905 15:53:16.400683 25795 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0905 15:53:16.400795 25787 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0905 15:53:16.401005 25783 recover.cpp:542] Updating replica status to 
> STARTING
> I0905 15:53:16.401470 25786 master.cpp:286] Master 
> 20140905-155316-3125920579-49188-25769 (penates.apache.org) started on 
> 67.195.81.186:49188
> I0905 15:53:16.401521 25786 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0905 15:53:16.401533 25786 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0905 15:53:16.401543 25786 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m/credentials'
> I0905 15:53:16.401558 25793 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 474683ns
> I0905 15:53:16.401582 25793 replica.cpp:320] Persisted replica status to 
> STARTING
> I0905 15:53:16.401667 25793 recover.cpp:451] Replica is in STARTING status
> I0905 15:53:16.401669 25786 master.cpp:366] Authorization enabled
> I0905 15:53:16.401898 25795 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0905 15:53:16.401936 25796 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:49188
> I0905 15:53:16.402160 25784 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0905 15:53:16.402333 25790 master.cpp:1205] The newly elected leader is 
> master@67.195.81.186:49188 with id 20140905-155316-3125920579-49188-25769
> I0905 15:53:16.402359 25790 master.cpp:1218] Elected as the leading master!
> I0905 15:53:16.402371 25790 master.cpp:1036] Recovering from registrar
> I0905 15:53:16.402472 25798 registrar.cpp:313] Recovering registrar
> I0905 15:53:16.402529 25791 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0905 15:53:16.402782 25788 recover.cpp:542] Updating replica status to VOTING
> I0905 15:53:16.403002 25795 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 116403ns
> I0905 15:53:16.403020 25795 replica.cpp:320] Persisted replica status to 
> VOTING
> I0905 15:53:16.403081 25791 recover.cpp:556] Successfully joined the Paxos 
> group
> I0905 15:53:16.403197 25791 recover.cpp:440] Recover process terminated
> I0905 15:53:16.403388 25796 log.cpp:656] Attempting to start the writer
> I0905 15:53:16.403993 25784 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0905 15:53:16.404147 25784 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 132156ns
> I0905 15:53:16.404167 25784 replica.cpp:342] Persisted promised to 1
> I0905 15:53:16.404542 25795 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0905 15:53:16.405498 25787 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0905 15:53:16.405868 25787 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 347231ns
> I0905 15:53:16.405886 25787 replica.cpp:676] Persisted action at 0
> I0905 15:53:16.406553 25788 replica.cpp:508] Replica received write request 
> for position 0
> I0905 15:53:16.406582 25788 leveldb.cpp:438] Reading position from leveldb 
> took 11402ns
> I0905 15:53:16.529067 25788 leveldb.cpp:343] Persisting action (14 byt

[jira] [Commented] (MESOS-1766) MasterAuthorizationTest.DuplicateRegistration test is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136579#comment-14136579
 ] 

Bhuvan Arumugam commented on MESOS-1766:


Related to MESOS-1760; including in 0.20.1.

> MasterAuthorizationTest.DuplicateRegistration test is flaky
> ---
>
> Key: MESOS-1766
> URL: https://issues.apache.org/jira/browse/MESOS-1766
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> {code}
> [ RUN  ] MasterAuthorizationTest.DuplicateRegistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m'
> I0905 15:53:16.398993 25769 leveldb.cpp:176] Opened db in 2.601036ms
> I0905 15:53:16.399566 25769 leveldb.cpp:183] Compacted db in 546216ns
> I0905 15:53:16.399590 25769 leveldb.cpp:198] Created db iterator in 2787ns
> I0905 15:53:16.399605 25769 leveldb.cpp:204] Seeked to beginning of db in 
> 500ns
> I0905 15:53:16.399617 25769 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 185ns
> I0905 15:53:16.399633 25769 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0905 15:53:16.399817 25786 recover.cpp:425] Starting replica recovery
> I0905 15:53:16.399952 25793 recover.cpp:451] Replica is in EMPTY status
> I0905 15:53:16.400683 25795 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0905 15:53:16.400795 25787 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0905 15:53:16.401005 25783 recover.cpp:542] Updating replica status to 
> STARTING
> I0905 15:53:16.401470 25786 master.cpp:286] Master 
> 20140905-155316-3125920579-49188-25769 (penates.apache.org) started on 
> 67.195.81.186:49188
> I0905 15:53:16.401521 25786 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0905 15:53:16.401533 25786 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0905 15:53:16.401543 25786 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m/credentials'
> I0905 15:53:16.401558 25793 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 474683ns
> I0905 15:53:16.401582 25793 replica.cpp:320] Persisted replica status to 
> STARTING
> I0905 15:53:16.401667 25793 recover.cpp:451] Replica is in STARTING status
> I0905 15:53:16.401669 25786 master.cpp:366] Authorization enabled
> I0905 15:53:16.401898 25795 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0905 15:53:16.401936 25796 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:49188
> I0905 15:53:16.402160 25784 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0905 15:53:16.402333 25790 master.cpp:1205] The newly elected leader is 
> master@67.195.81.186:49188 with id 20140905-155316-3125920579-49188-25769
> I0905 15:53:16.402359 25790 master.cpp:1218] Elected as the leading master!
> I0905 15:53:16.402371 25790 master.cpp:1036] Recovering from registrar
> I0905 15:53:16.402472 25798 registrar.cpp:313] Recovering registrar
> I0905 15:53:16.402529 25791 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0905 15:53:16.402782 25788 recover.cpp:542] Updating replica status to VOTING
> I0905 15:53:16.403002 25795 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 116403ns
> I0905 15:53:16.403020 25795 replica.cpp:320] Persisted replica status to 
> VOTING
> I0905 15:53:16.403081 25791 recover.cpp:556] Successfully joined the Paxos 
> group
> I0905 15:53:16.403197 25791 recover.cpp:440] Recover process terminated
> I0905 15:53:16.403388 25796 log.cpp:656] Attempting to start the writer
> I0905 15:53:16.403993 25784 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0905 15:53:16.404147 25784 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 132156ns
> I0905 15:53:16.404167 25784 replica.cpp:342] Persisted promised to 1
> I0905 15:53:16.404542 25795 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0905 15:53:16.405498 25787 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0905 15:53:16.405868 25787 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 347231ns
> I0905 15:53:16.405886 25787 replica.cpp:676] Persisted action at 0
> I0905 15:53:16.406553 25788 replica.cpp:508] Replica received write request 
> for position 0
> I0905 15:53:16.406582 25788 leveldb.cpp:438] Reading position from leveldb 
> took 11402ns
> I0905 15:53:16.529067 25788 leve

[jira] [Updated] (MESOS-1766) MasterAuthorizationTest.DuplicateRegistration test is flaky

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1766:
---
Target Version/s: 0.20.1

> MasterAuthorizationTest.DuplicateRegistration test is flaky
> ---
>
> Key: MESOS-1766
> URL: https://issues.apache.org/jira/browse/MESOS-1766
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Vinod Kone
>Assignee: Vinod Kone
> Fix For: 0.20.1
>
>
> {code}
> [ RUN  ] MasterAuthorizationTest.DuplicateRegistration
> Using temporary directory 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m'
> I0905 15:53:16.398993 25769 leveldb.cpp:176] Opened db in 2.601036ms
> I0905 15:53:16.399566 25769 leveldb.cpp:183] Compacted db in 546216ns
> I0905 15:53:16.399590 25769 leveldb.cpp:198] Created db iterator in 2787ns
> I0905 15:53:16.399605 25769 leveldb.cpp:204] Seeked to beginning of db in 
> 500ns
> I0905 15:53:16.399617 25769 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 185ns
> I0905 15:53:16.399633 25769 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0905 15:53:16.399817 25786 recover.cpp:425] Starting replica recovery
> I0905 15:53:16.399952 25793 recover.cpp:451] Replica is in EMPTY status
> I0905 15:53:16.400683 25795 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0905 15:53:16.400795 25787 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0905 15:53:16.401005 25783 recover.cpp:542] Updating replica status to 
> STARTING
> I0905 15:53:16.401470 25786 master.cpp:286] Master 
> 20140905-155316-3125920579-49188-25769 (penates.apache.org) started on 
> 67.195.81.186:49188
> I0905 15:53:16.401521 25786 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0905 15:53:16.401533 25786 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0905 15:53:16.401543 25786 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m/credentials'
> I0905 15:53:16.401558 25793 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 474683ns
> I0905 15:53:16.401582 25793 replica.cpp:320] Persisted replica status to 
> STARTING
> I0905 15:53:16.401667 25793 recover.cpp:451] Replica is in STARTING status
> I0905 15:53:16.401669 25786 master.cpp:366] Authorization enabled
> I0905 15:53:16.401898 25795 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0905 15:53:16.401936 25796 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:49188
> I0905 15:53:16.402160 25784 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0905 15:53:16.402333 25790 master.cpp:1205] The newly elected leader is 
> master@67.195.81.186:49188 with id 20140905-155316-3125920579-49188-25769
> I0905 15:53:16.402359 25790 master.cpp:1218] Elected as the leading master!
> I0905 15:53:16.402371 25790 master.cpp:1036] Recovering from registrar
> I0905 15:53:16.402472 25798 registrar.cpp:313] Recovering registrar
> I0905 15:53:16.402529 25791 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0905 15:53:16.402782 25788 recover.cpp:542] Updating replica status to VOTING
> I0905 15:53:16.403002 25795 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 116403ns
> I0905 15:53:16.403020 25795 replica.cpp:320] Persisted replica status to 
> VOTING
> I0905 15:53:16.403081 25791 recover.cpp:556] Successfully joined the Paxos 
> group
> I0905 15:53:16.403197 25791 recover.cpp:440] Recover process terminated
> I0905 15:53:16.403388 25796 log.cpp:656] Attempting to start the writer
> I0905 15:53:16.403993 25784 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0905 15:53:16.404147 25784 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 132156ns
> I0905 15:53:16.404167 25784 replica.cpp:342] Persisted promised to 1
> I0905 15:53:16.404542 25795 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0905 15:53:16.405498 25787 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0905 15:53:16.405868 25787 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 347231ns
> I0905 15:53:16.405886 25787 replica.cpp:676] Persisted action at 0
> I0905 15:53:16.406553 25788 replica.cpp:508] Replica received write request 
> for position 0
> I0905 15:53:16.406582 25788 leveldb.cpp:438] Reading position from leveldb 
> took 11402ns
> I0905 15:53:16.529067 25788 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 535803ns
> 

[jira] [Updated] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1219:
---
Fix Version/s: (was: 0.20.1)
   0.21.0

> Master should disallow frameworks that reconnect after failover timeout.
> 
>
> Key: MESOS-1219
> URL: https://issues.apache.org/jira/browse/MESOS-1219
> Project: Mesos
>  Issue Type: Bug
>  Components: master, webui
>Reporter: Robert Lacroix
>Assignee: Vinod Kone
> Fix For: 0.21.0
>
>
> When a scheduler reconnects after the failover timeout has been exceeded, the 
> framework id is usually reused, because the scheduler doesn't know that the 
> timeout was exceeded and that it is actually being handled as a new framework.
> The /framework/:framework_id route of the Web UI doesn't handle those cases 
> very well because its key is reused: it only shows the terminated framework.
> Would it make sense to ignore the provided framework id when a scheduler 
> reconnects to a terminated framework, and to generate a new id to make sure 
> it's unique?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1801) MESOS_work_dir and MESOS_master env vars not honoured

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1801:
--
Fix Version/s: (was: 0.20.1)

> MESOS_work_dir and MESOS_master env vars not honoured
> -
>
> Key: MESOS-1801
> URL: https://issues.apache.org/jira/browse/MESOS-1801
> Project: Mesos
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.20.0
> Environment: CentOS 7
>Reporter: Cosmin Lehene
>
> The documentation states that CLI params should be substitutable by 
> environment variables:
> {quote}
>  Each option can be set in two ways:
> By passing it to the binary using --option_name=value.
> By setting the environment variable MESOS_OPTION_NAME (the option name with a 
> MESOS_ prefix added to it).
> {quote}
> However, at least the master's MESOS_work_dir and the slave's "MESOS_master" 
> env vars seem to be ignored:
> {noformat}
> [root@localhost ~]# echo $MESOS_master
> zk://localhost:2181/mesos
> [root@localhost ~]# mesos-slave
> Missing required option --master
> [root@localhost ~]# echo $MESOS_work_dir
> /var/lib/mesos
> [root@localhost ~]# mesos-master
> I0917 08:36:46.242200 31325 main.cpp:155] Build: 2014-08-22 05:06:06 by root
> I0917 08:36:46.242369 31325 main.cpp:157] Version: 0.20.0
> I0917 08:36:46.242377 31325 main.cpp:160] Git tag: 0.20.0
> I0917 08:36:46.242382 31325 main.cpp:164] Git SHA: 
> f421ffdf8d32a8834b3a6ee483b5b59f65956497
> --work_dir needed for replicated log based registry
> {noformat}
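For illustration, a minimal sketch of the documented fallback, using a hypothetical {{getFlag}} helper rather than the actual stout flags implementation: the command line wins, then the {{MESOS_}}-prefixed environment variable is consulted.

{code}
// Hypothetical helper, for illustration only: prefer the command-line
// argument, then fall back to the MESOS_-prefixed environment variable.
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

std::string getFlag(
    const std::map<std::string, std::string>& args, // parsed from argv
    const std::string& name)
{
  auto it = args.find(name);
  if (it != args.end()) {
    return it->second;
  }
  const char* env = std::getenv(("MESOS_" + name).c_str());
  return env != nullptr ? std::string(env) : std::string();
}

int main()
{
  std::map<std::string, std::string> args; // no --master given
  const std::string master = getFlag(args, "master");
  if (master.empty()) {
    std::cerr << "Missing required option --master" << std::endl;
    return 1;
  }
  std::cout << "master: " << master << std::endl;
  return 0;
}
{code}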



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1728) Libprocess: report bind parameters on failure

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1728:
--
Fix Version/s: (was: 0.20.1)
   0.21.0

> Libprocess: report bind parameters on failure
> -
>
> Key: MESOS-1728
> URL: https://issues.apache.org/jira/browse/MESOS-1728
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Nikita Vetoshkin
>Assignee: Nikita Vetoshkin
>Priority: Trivial
> Fix For: 0.21.0
>
>
> When you attempt to start slave or master and there's another one already 
> running there, it is nice to report what are the actual parameters to 
> {{bind}} call that failed.
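A minimal sketch of the requested reporting, assuming a plain BSD-sockets server rather than the libprocess internals: on failure, the error message carries the address and port that were passed to {{bind}}.

{code}
// Illustrative only: report the bind parameters, not just "bind failed".
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <cerrno>
#include <cstring>
#include <iostream>

int main()
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(5050);

  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    char ip[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &addr.sin_addr, ip, sizeof(ip));
    // The address and port make "already in use" errors actionable.
    std::cerr << "Failed to bind " << ip << ":" << ntohs(addr.sin_port)
              << ": " << std::strerror(errno) << std::endl;
    return 1;
  }
  return 0;
}
{code}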



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1643) Provide APIs to return port resource for a given role

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1643:
--
Fix Version/s: (was: 0.20.1)
   0.21.0

> Provide APIs to return port resource for a given role
> -
>
> Key: MESOS-1643
> URL: https://issues.apache.org/jira/browse/MESOS-1643
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zuyu Zhang
>Assignee: Zuyu Zhang
>Priority: Trivial
> Fix For: 0.21.0
>
>
> It makes more sense to return the port resources for a given role, rather 
> than all ports in Resources.
> In mesos/resource.hpp:
> Option Resources::ports(const string& role = "*");
> // Check whether Resources have the given number (num_port) of ports, and 
> return the begin number of the port range.
> Option Resources::getPorts(long num_port, const string& role = "*");
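The template parameters of the {{Option}} return types above appear to have been stripped by the mail rendering. A hedged sketch of the proposed API using simplified stand-in types ({{Range}}, {{Ranges}}, and {{PortResources}} are illustrative, not the real mesos::Resources types):

{code}
// Simplified stand-ins for the proposed per-role port accessors.
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<long, long>;  // inclusive [begin, end]
using Ranges = std::vector<Range>;

// role -> port ranges held by that role.
using PortResources = std::map<std::string, Ranges>;

// Return the port ranges reserved for `role`, if any.
std::optional<Ranges> ports(
    const PortResources& r, const std::string& role = "*")
{
  auto it = r.find(role);
  if (it == r.end()) return std::nullopt;
  return it->second;
}

// Return the begin number of a contiguous range holding at least
// `num_port` ports for `role`, if one exists.
std::optional<long> getPorts(
    const PortResources& r, long num_port, const std::string& role = "*")
{
  std::optional<Ranges> ranges = ports(r, role);
  if (!ranges) return std::nullopt;
  for (const Range& range : *ranges) {
    if (range.second - range.first + 1 >= num_port) return range.first;
  }
  return std::nullopt;
}
{code}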



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1716:
--
Fix Version/s: (was: 0.20.1)
   0.21.0

> The slave does not add pending tasks as part of the staging tasks metric.
> -
>
> Key: MESOS-1716
> URL: https://issues.apache.org/jira/browse/MESOS-1716
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Trivial
> Fix For: 0.21.0
>
>
> The slave does not represent pending tasks in the "tasks_staging" metric.
> This should be a trivial fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1803.

Resolution: Cannot Reproduce

The log timings here look as if the threads were starved of CPU:

{noformat}
I0916 22:59:02.136256 21049 leveldb.cpp:343] Persisting action (165 bytes) to 
leveldb took 141908ns
I0916 22:59:02.136267 21047 leveldb.cpp:343] Persisting action (165 bytes) to 
leveldb took 111061ns
I../../src/tests/registrar_tests.cpp:257: Failure
0916 22:59:02.136276 21049 replica.cpp:676] Persisted action at 1
Failed to wait 10secs for registrar.recover(master)
I0916 22:59:14.265326 21049 replica.cpp:661] Replica learned APPEND action at 
position 1
I0916 22:59:02.136291 21047 replica.cpp:676] Persisted action at 1
E0916 22:59:07.135143 21046 registrar.cpp:500] Registrar aborting: Failed to 
update 'registry': Failed to perform store within 5secs
I0916 22:59:14.265393 21047 replica.cpp:661] Replica learned APPEND action at 
position 1
{noformat}

The logging timestamp is determined at the beginning of the LOG(INFO) 
expression, when the initial LogMessage object is created. The interleaving of 
timestamps points to a stall of the VM or thread starvation:

{noformat}
22:59:02.136267 21047 // Thread 1, 1st LogMessage flushed.
22:59:02.136276 21049 // Thread 2, 2nd LogMessage flushed.
22:59:14.265326 21049 // Thread 2, 5th LogMessage flushed.
22:59:02.136291 21047 // Thread 1, 3rd LogMessage flushed.
22:59:07.135143 21046 // Thread 3, 4th LogMessage flushed.
22:59:14.265393 21047 // Thread 1, 6th LogMessage flushed.
{noformat}
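A minimal glog-like sketch of why that happens (illustrative, not the actual glog code): the timestamp is captured when the temporary message object is constructed, but the line is only written in its destructor, so a thread that stalls in between flushes an old timestamp after newer ones.

{code}
// Illustrative only: timestamp at construction, write at destruction.
#include <chrono>
#include <iostream>
#include <sstream>

class LogMessage {
public:
  LogMessage() : start_(std::chrono::system_clock::now()) {}

  ~LogMessage() {
    // The flush happens here; a starved thread writes a stale timestamp.
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
        start_.time_since_epoch()).count();
    std::cerr << us << " " << stream_.str() << std::endl;
  }

  std::ostream& stream() { return stream_; }

private:
  std::chrono::system_clock::time_point start_;
  std::ostringstream stream_;
};

#define LOG_INFO LogMessage().stream()

int main()
{
  LOG_INFO << "timestamp reflects construction, not flush";
  return 0;
}
{code}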

> Strict/RegistrarTest.remove test is flaky on jenkins.
> -
>
> Key: MESOS-1803
> URL: https://issues.apache.org/jira/browse/MESOS-1803
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] Strict/RegistrarTest.remove/1
> Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW'
> I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms
> I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns
> I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns
> I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 475ns
> I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 330ns
> I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 421460ns
> I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms
> I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns
> I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns
> I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 468ns
> I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 195ns
> I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 472891ns
> I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms
> I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms
> I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns
> I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 7977ns
> I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the 
> db in 8479ns
> I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms
> I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms
> I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns
> I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 7167ns
> I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the 
> db in 8182ns
> I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery
> I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status
> I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated

[jira] [Commented] (MESOS-1746) clear TaskStatus data to avoid OOM

2014-09-16 Thread Chengwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136605#comment-14136605
 ] 

Chengwei Yang commented on MESOS-1746:
--

[~tstclair], yes, Spark stores very large data in TaskStatus: there is a 
"data" field in TaskStatus that is intended for application-specific data, so 
we cannot prevent applications (like Spark) from using it.
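A sketch of the proposed fix, assuming the generated mesos::TaskStatus protobuf ({{has_data()}} and {{clear_data()}} are the standard protobuf-generated accessors for its {{data}} field):

{code}
// Hedged sketch: strip the framework-specific payload before the master
// caches the status, since the master never reads it.
#include <mesos/mesos.pb.h>

void stripData(mesos::TaskStatus* status)
{
  if (status->has_data()) {
    // clear_data() is the standard protobuf-generated clearer.
    status->clear_data();
  }
}
{code}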

> clear TaskStatus data to avoid OOM
> --
>
> Key: MESOS-1746
> URL: https://issues.apache.org/jira/browse/MESOS-1746
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos-0.19.0
>Reporter: Chengwei Yang
>Assignee: Chengwei Yang
>
> Spark on Mesos may use TaskStatus to transfer computed results between the 
> worker and the scheduler; the source code looks like the following (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment we enlarged akkaFrameSize from its default value 
> (10MB) to 128MB, and this caused our mesos-master process to OOM within tens 
> of minutes when running Spark tasks in fine-grained mode.
> Even with akkaFrameSize set back to the default value (10MB), mesos-master is 
> still likely to OOM, just more slowly.
> So I think it's good to delete the data from TaskStatus, since it is only 
> intended for the framework on top and the master itself is not interested in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-1746) clear TaskStatus data to avoid OOM

2014-09-16 Thread Chengwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136605#comment-14136605
 ] 

Chengwei Yang edited comment on MESOS-1746 at 9/17/14 1:06 AM:
---

[~tstclair], yes, Spark stores very large data in TaskStatus: there is a 
"data" field in TaskStatus that is intended for application-specific data, so 
we cannot prevent applications (like Spark) from using it.

please help to review: https://reviews.apache.org/r/25184/


was (Author: chengwei-yang):
[~tstclair], yes, spark stores very large data into TaskStatus, since there is 
a "data" field in TaskStatus which was supposed to be used to store application 
specific data, so we can not prevent applications (like spark) from doing so.

> clear TaskStatus data to avoid OOM
> --
>
> Key: MESOS-1746
> URL: https://issues.apache.org/jira/browse/MESOS-1746
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos-0.19.0
>Reporter: Chengwei Yang
>Assignee: Chengwei Yang
>
> Spark on Mesos may use TaskStatus to transfer computed results between the 
> worker and the scheduler; the source code looks like the following (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment we enlarged akkaFrameSize from its default value 
> (10MB) to 128MB, and this caused our mesos-master process to OOM within tens 
> of minutes when running Spark tasks in fine-grained mode.
> Even with akkaFrameSize set back to the default value (10MB), mesos-master is 
> still likely to OOM, just more slowly.
> So I think it's good to delete the data from TaskStatus, since it is only 
> intended for the framework on top and the master itself is not interested in it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhuvan Arumugam updated MESOS-1195:
---
Target Version/s: 0.21.0  (was: 0.20.1)

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure Mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails to create the isolator:
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd
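For illustration, a rough check for the failing condition, not the Mesos isolator code: a subsystem is co-mounted when a single cgroup mount lists several subsystems in its options, as {{cpu,cpuacct}} does in the listing above.

{code}
// Illustrative only: scan /proc/mounts for a cgroup mount whose option
// list names the cpu subsystem together with others.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
  std::ifstream mounts("/proc/mounts");
  std::string line;
  while (std::getline(mounts, line)) {
    std::istringstream fields(line);
    std::string device, dir, type, options;
    fields >> device >> dir >> type >> options;
    // "cpu," in the comma-separated options means another subsystem
    // follows cpu on the same hierarchy; a rough co-mount heuristic.
    if (type == "cgroup" && options.find("cpu,") != std::string::npos) {
      std::cout << "cpu is co-mounted at " << dir
                << " (options: " << options << ")" << std::endl;
    }
  }
  return 0;
}
{code}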



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.

2014-09-16 Thread Bhuvan Arumugam (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136614#comment-14136614
 ] 

Bhuvan Arumugam commented on MESOS-1195:


Moving it to 0.21.0, as discussed in ReviewBoard:
  http://reviews.apache.org/r/25695/

> systemd.slice + cgroup enablement fails in multiple ways. 
> --
>
> Key: MESOS-1195
> URL: https://issues.apache.org/jira/browse/MESOS-1195
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.18.0
>Reporter: Timothy St. Clair
>Assignee: Timothy St. Clair
>
> When attempting to configure Mesos to use systemd slices on a 'rawhide/f21' 
> machine, it fails to create the isolator:
> I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: 
> cgroups/cpu,cgroups/mem
> Failed to create a containerizer: Could not create isolator cgroups/cpu: 
> Failed to create isolator: The cpu subsystem is co-mounted at 
> /sys/fs/cgroup/cpu with other subsytems
> -- details --
> /sys/fs/cgroup
> total 0
> drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
> drwxr-xr-x.  6 root root   0 Mar 18 08:47 ..
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 blkio
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpu -> cpu,cpuacct
> lrwxrwxrwx.  1 root root  11 Mar 18 08:47 cpuacct -> cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpu,cpuacct
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 cpuset
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 devices
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 freezer
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 hugetlb
> drwxr-xr-x.  3 root root   0 Apr  3 11:26 memory
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 net_cls
> drwxr-xr-x.  2 root root   0 Mar 18 08:47 perf_event
> drwxr-xr-x.  4 root root   0 Mar 18 08:47 systemd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1804) the "store" component cause on-top framework (chronos) crash

2014-09-16 Thread Chengwei Yang (JIRA)
Chengwei Yang created MESOS-1804:


 Summary: the "store" component cause on-top framework (chronos) 
crash
 Key: MESOS-1804
 URL: https://issues.apache.org/jira/browse/MESOS-1804
 Project: Mesos
  Issue Type: Bug
 Environment: mesos-0.19.0
Reporter: Chengwei Yang
Assignee: Chengwei Yang


Chronos running with mesos-0.19.0 may crash as shown below.

{code}
[2014-09-05 15:21:36,095] INFO State J_chronos_job_34 does not exist yet. 
Adding to state (com.airbnb.scheduler.state.MesosStatePersistenceStore:146)
F0905 15:21:36.175230 27727 org_apache_mesos_state_AbstractState.cpp:319] Check 
failed: future->isReady()
*** Check failure stack trace: ***
@ 0x7f4f1ecb199d google::LogMessage::Fail()
@ 0x7f4f1ecb59b7 google::LogMessage::SendToLog()
@ 0x7f4f1ecb3839 google::LogMessage::Flush()
@ 0x7f4f1ecb3b3d google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4f1ec2ef90 Java_org_apache_mesos_state_AbstractState__1_1store_1get
@ 0x7f4f18293d45 (unknown)
Aborted (core dumped)
{code}

The related code snippet is below:
{code}
$ sed -ne '311,334p' src/java/jni/org_apache_mesos_state_AbstractState.cpp
JNIEXPORT jobject JNICALL 
Java_org_apache_mesos_state_AbstractState__1_1store_1get
  (JNIEnv* env, jobject thiz, jlong jfuture)
{
  Future<Option<Variable> >* future = (Future<Option<Variable> >*) jfuture;

  future->await();

  if (future->isFailed()) {
jclass clazz = env->FindClass("java/util/concurrent/ExecutionException");
env->ThrowNew(clazz, future->failure().c_str());
return NULL;
  } else if (future->isDiscarded()) {
// TODO(benh): Consider throwing an ExecutionException since we
// never return true for 'isCancelled'.
jclass clazz = env->FindClass("java/util/concurrent/CancellationException");
env->ThrowNew(clazz, "Future was discarded");
return NULL;
  }

  CHECK_READY(*future);

  if (future->get().isSome()) {
Variable* variable = new Variable(future->get().get());
{code}

The root cause seems to be that CHECK_READY(*future) failed, which crashed Chronos.

See chronos issue: https://github.com/airbnb/chronos/issues/253
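A hedged sketch of a more defensive ending for the handler above, not the committed fix: if the future is still pending after {{await()}} (for example because the wait was interrupted), throw a Java exception instead of CHECK-aborting the whole JVM. This fragment would replace the {{CHECK_READY(*future);}} line in the snippet above; {{env}} and {{future}} are the variables already in scope there.

{code}
// Hedged sketch, not the committed fix: raise an exception to the Java
// caller rather than aborting the process on a non-ready future.
if (!future->isReady()) {
  jclass clazz = env->FindClass("java/util/concurrent/ExecutionException");
  env->ThrowNew(clazz, "Future was not ready after await()");
  return NULL;
}
{code}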



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (MESOS-1747) Docker image parsing for private repositories

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reopened MESOS-1747:
---

> Docker image parsing for private repositories
> -
>
> Key: MESOS-1747
> URL: https://issues.apache.org/jira/browse/MESOS-1747
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, slave
>Affects Versions: 0.20.0
>Reporter: Don Laidlaw
>Assignee: Timothy Chen
>  Labels: docker
> Fix For: 0.20.1
>
>
> You cannot specify a port number for the host of a private docker repository. 
> Specified as follows: {noformat}
> "container": {
> "type": "DOCKER",
> "docker": {
> "image": "docker-repo:5000/app-base:v0.1"
> }
> }
> {noformat}
> results in an error:
> {noformat}
> Aug 29 14:33:29 ip-172-16-2-22 mesos-slave[1128]: E0829 14:33:29.487470  1153 
> slave.cpp:2484] Container '250e0479-552f-4e6f-81dd-71550e45adae' for executor 
> 't1-java.71d50bd1-2f89-11e4-ba9a-0adfe6b11716' of framework 
> '20140829-121838-184684716-5050-1177-' failed to start:Not expecting 
> multiple ':' in image: docker-repo:5000/app-base:v0.1
> {noformat}
> The message indicates only one colon character is allowed, but to supply a 
> port number for a private docker repository host you need to have two colons.
> If you use a '-' character in the host name, you also get an error:
> {noformat}
> Invalid namespace name (docker-repo), only [a-z0-9_] are allowed, size 
> between 4 and 30
> {noformat}
> The hostname parts should not be limited to [a-z0-9_].
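
A minimal parsing sketch of the idea (hypothetical, not the fix that actually landed; the Image struct and parseImage helper are made-up names): take the tag from the last ':' only when it appears after the final '/', so a registry port is never mistaken for a tag:

{code}
#include <string>

// Hypothetical helper: split [registry[:port]/]repository[:tag].
// A ':' before the last '/' belongs to the registry host, not the tag.
struct Image { std::string repository; std::string tag; };

Image parseImage(const std::string& ref)
{
  size_t slash = ref.rfind('/');
  size_t colon = ref.rfind(':');
  if (colon == std::string::npos ||
      (slash != std::string::npos && colon < slash)) {
    return {ref, "latest"};  // No explicit tag; Docker defaults to 'latest'.
  }
  return {ref.substr(0, colon), ref.substr(colon + 1)};
}

// parseImage("docker-repo:5000/app-base:v0.1")
//   => repository "docker-repo:5000/app-base", tag "v0.1"
{code}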



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-1747) Docker image parsing for private repositories

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone resolved MESOS-1747.
---
   Resolution: Duplicate
Fix Version/s: (was: 0.20.1)

> Docker image parsing for private repositories
> -
>
> Key: MESOS-1747
> URL: https://issues.apache.org/jira/browse/MESOS-1747
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, slave
>Affects Versions: 0.20.0
>Reporter: Don Laidlaw
>Assignee: Timothy Chen
>  Labels: docker
>
> You cannot specify a port number for the host of a private docker repository. 
> Specified as follows: {noformat}
> "container": {
> "type": "DOCKER",
> "docker": {
> "image": "docker-repo:5000/app-base:v0.1"
> }
> }
> {noformat}
> results in an error:
> {noformat}
> Aug 29 14:33:29 ip-172-16-2-22 mesos-slave[1128]: E0829 14:33:29.487470  1153 
> slave.cpp:2484] Container '250e0479-552f-4e6f-81dd-71550e45adae' for executor 
> 't1-java.71d50bd1-2f89-11e4-ba9a-0adfe6b11716' of framework 
> '20140829-121838-184684716-5050-1177-' failed to start:Not expecting 
> multiple ':' in image: docker-repo:5000/app-base:v0.1
> {noformat}
> The message indicates only one colon character is allowed, but to supply a 
> port number for a private docker repository host you need to have two colons.
> If you use a '-' character in the host name, you also get an error:
> {noformat}
> Invalid namespace name (docker-repo), only [a-z0-9_] are allowed, size 
> between 4 and 30
> {noformat}
> The hostname parts should not be limited to [a-z0-9_].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1746) clear TaskStatus data to avoid OOM

2014-09-16 Thread Timothy St. Clair (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy St. Clair updated MESOS-1746:
-
Shepherd: Timothy St. Clair

> clear TaskStatus data to avoid OOM
> --
>
> Key: MESOS-1746
> URL: https://issues.apache.org/jira/browse/MESOS-1746
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos-0.19.0
>Reporter: Chengwei Yang
>Assignee: Chengwei Yang
>
> Spark on Mesos may use TaskStatus to transfer a computed result between the worker 
> and the scheduler; the source code looks like the following (Spark 1.0.2):
> {code}
> val serializedResult = {
>   if (serializedDirectResult.limit >= execBackend.akkaFrameSize() -
>       AkkaUtils.reservedSizeBytes) {
>     logInfo("Storing result for " + taskId + " in local BlockManager")
>     val blockId = TaskResultBlockId(taskId)
>     env.blockManager.putBytes(
>       blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
>     ser.serialize(new IndirectTaskResult[Any](blockId))
>   } else {
>     logInfo("Sending result for " + taskId + " directly to driver")
>     serializedDirectResult
>   }
> }
> {code}
> In our test environment, we enlarged akkaFrameSize to 128MB from its default 
> value (10MB), and this caused our mesos-master process to OOM within tens of 
> minutes when running Spark tasks in fine-grained mode.
> Even with akkaFrameSize changed back to the default value (10MB), it is still 
> likely to make mesos-master OOM, only more slowly.
> So I think it's good to delete the data from TaskStatus, since it is only 
> meant for the on-top framework and Mesos itself is not interested in it.
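
A minimal sketch of the proposal (illustrative only; the cacheStatusUpdate function name is hypothetical, while clear_data() and add_statuses() are the protobuf-generated accessors for the TaskStatus 'data' bytes field and the Task 'statuses' repeated field):

{code}
// Hypothetical master-side handling: the 'data' bytes field of TaskStatus
// is only meaningful to the framework, so drop it before caching the
// status in the master's in-memory task state, bounding master memory use.
void cacheStatusUpdate(Task* task, TaskStatus status)  // status taken by copy
{
  status.clear_data();                     // protobuf-generated clearer
  task->add_statuses()->CopyFrom(status);  // keep the slimmed-down status
}
{code}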



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-1621) Docker run networking should be configurable and support bridge network

2014-09-16 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone resolved MESOS-1621.
---
Resolution: Fixed

commit 1453a477511c8f6f22ff16e3dd13d0532e019c5b
Author: Timothy Chen 
Date:   Tue Sep 16 18:29:36 2014 -0700

Enabled bridge network for Docker Containerizer.

Review: https://reviews.apache.org/r/25270


> Docker run networking should be configurable and support bridge network
> ---
>
> Key: MESOS-1621
> URL: https://issues.apache.org/jira/browse/MESOS-1621
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: Docker
> Fix For: 0.20.1
>
>
> Currently, to easily support running executors from a Docker image, we hardcode 
> --net=host into docker run so that the slave and executor can reuse the same 
> communication mechanism: the slave's IP/PORT is passed in, and the framework 
> responds with its own hostname and port information to set up the tunnel.
> We want to see how to abstract this, or even get rid of host networking 
> altogether, if we have a good way to not rely on it.
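
For illustration (the --net and -p flags are standard docker run options; the port numbers here are arbitrary, not taken from the patch), the contrast between the previously hardcoded behavior and the bridge mode this change enables:

{noformat}
# Previously hardcoded: the container shares the host's network stack.
docker run --net=host <image> ...

# Bridge mode: the container gets its own network namespace and is reached
# through an explicit host-port mapping chosen from the task's resources.
docker run --net=bridge -p 31000:8080 <image> ...
{noformat}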



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)