[jira] [Created] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite
Dario Rexin created MESOS-1797: -- Summary: Packaged Zookeeper does not compile on OSX Yosemite Key: MESOS-1797 URL: https://issues.apache.org/jira/browse/MESOS-1797 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.19.1, 0.20.0, 0.21.0 Reporter: Dario Rexin Priority: Minor I have been struggling with this for some time (due to my lack of familiarity with C compiler error messages) and finally found a way to make it compile. The problem is that Zookeeper defines a function `htonll` that is a builtin on Yosemite. Removing this function worked for me, but since the build needs to keep working on other systems as well, we would need a check for the OS version, or for whether the function is already defined. Here are the links to the source: https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73 https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
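One way to resolve this kind of clash, sketched here as an illustration rather than ZooKeeper's actual patch (the helper name `host_to_net64` is invented), is to guard the definition so it only exists where the platform does not already provide `htonll`:

```cpp
#include <arpa/inet.h>  // htonl

#include <cassert>
#include <cstdint>

// Hypothetical guard: only supply a 64-bit host-to-network conversion
// when the platform does not already define htonll as a macro
// (OS X Yosemite does, which is what triggers the redefinition error).
#if !defined(htonll)
inline uint64_t host_to_net64(uint64_t v)
{
    if (htonl(1) == 1) {
        return v;  // Big-endian host: already in network byte order.
    }
    // Little-endian host: swap the 32-bit halves, converting each via htonl.
    return (static_cast<uint64_t>(htonl(static_cast<uint32_t>(v))) << 32)
        | htonl(static_cast<uint32_t>(v >> 32));
}
#define htonll(v) host_to_net64(v)
#else
inline uint64_t host_to_net64(uint64_t v) { return htonll(v); }
#endif
```

The same effect could also come from a configure-time check, but an in-header `#if !defined(htonll)` keeps the change local to recordio.h/recordio.c.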
[jira] [Resolved] (MESOS-1764) Build Fixes from 0.20 release
[ https://issues.apache.org/jira/browse/MESOS-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy St. Clair resolved MESOS-1764. -- Resolution: Fixed Build Fixes from 0.20 release - Key: MESOS-1764 URL: https://issues.apache.org/jira/browse/MESOS-1764 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.20.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair Fix For: 0.20.1 This ticket is a catch-all for minor issues caught during a rebase and testing. + Add package configuration file to deployment + Update deploy_dir from localstatedir to sysconfdir -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1764) Build Fixes from 0.20 release
[ https://issues.apache.org/jira/browse/MESOS-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14130604#comment-14130604 ] Timothy St. Clair edited comment on MESOS-1764 at 9/16/14 2:32 PM: --- Punting last update to https://issues.apache.org/jira/browse/MESOS-1675 was (Author: tstclair): add initial -version-info for shared library http://reviews.apache.org/r/25551/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135526#comment-14135526 ] Timothy St. Clair commented on MESOS-1675: -- [~vinodkone] Did you want to elaborate on your thoughts here? Decouple version of the mesos library from the package release version -- Key: MESOS-1675 URL: https://issues.apache.org/jira/browse/MESOS-1675 Project: Mesos Issue Type: Bug Reporter: Vinod Kone This discussion should be rolled into the larger discussion around how to version Mesos (APIs, packages, libraries etc). Some notes from libtool docs. http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1621) Docker run networking should be configurable and support bridge network
[ https://issues.apache.org/jira/browse/MESOS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135531#comment-14135531 ] Timothy St. Clair commented on MESOS-1621: -- I'll open up a separate ticket to discuss the API + override conversation. Docker run networking should be configurable and support bridge network --- Key: MESOS-1621 URL: https://issues.apache.org/jira/browse/MESOS-1621 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Timothy Chen Assignee: Timothy Chen Labels: Docker Fix For: 0.20.1 Currently, to easily support running executors in a Docker image, we hardcode --net=host into Docker run so that the slave and executor can reuse the same mechanism to communicate, which is to pass the slave IP/PORT for the framework to respond with its own hostname and port information to set up the tunnel. We want to see how to abstract this, or even get rid of host networking altogether if we have a good way to not rely on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.
[ https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy St. Clair updated MESOS-1195: - Target Version/s: 0.21.0 reviews.apache.org/r/25695/ systemd.slice + cgroup enablement fails in multiple ways. -- Key: MESOS-1195 URL: https://issues.apache.org/jira/browse/MESOS-1195 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.18.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair When attempting to configure mesos to use systemd slices on a 'rawhide/f21' machine, it fails creating the isolator: I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: cgroups/cpu,cgroups/mem Failed to create a containerizer: Could not create isolator cgroups/cpu: Failed to create isolator: The cpu subsystem is co-mounted at /sys/fs/cgroup/cpu with other subsytems
-- details --
/sys/fs/cgroup total 0
drwxr-xr-x. 12 root root 280 Mar 18 08:47 .
drwxr-xr-x. 6 root root 0 Mar 18 08:47 ..
drwxr-xr-x. 2 root root 0 Mar 18 08:47 blkio
lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpuacct -> cpu,cpuacct
drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpu,cpuacct
drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpuset
drwxr-xr-x. 2 root root 0 Mar 18 08:47 devices
drwxr-xr-x. 2 root root 0 Mar 18 08:47 freezer
drwxr-xr-x. 2 root root 0 Mar 18 08:47 hugetlb
drwxr-xr-x. 3 root root 0 Apr 3 11:26 memory
drwxr-xr-x. 2 root root 0 Mar 18 08:47 net_cls
drwxr-xr-x. 2 root root 0 Mar 18 08:47 perf_event
drwxr-xr-x. 4 root root 0 Mar 18 08:47 systemd
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135698#comment-14135698 ] Vinod Kone commented on MESOS-1675: --- If adding version info is backwards compatible, i.e., the new lib can be a drop-in replacement for the old lib, then that should be fine. {quote} However the release wrangler will need to add a step to their punch-list prior to adoption. {quote} Not sure what this means? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite
[ https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135714#comment-14135714 ] Benjamin Mahler commented on MESOS-1797: Is there a ZooKeeper ticket related to this? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite
[ https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135719#comment-14135719 ] Dario Rexin commented on MESOS-1797: I didn't find one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135816#comment-14135816 ] Timothy St. Clair commented on MESOS-1675: -- Folks will need to check compatibility and update the revision in src/Makefile.am as outlined here: http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
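For reference, libtool's -version-info is a current:revision:age triple with mechanical update rules, and is distinct from the package release version. A hypothetical Makefile.am fragment (the library name and starting values are illustrative, not Mesos's actual ones):

```makefile
# libtool -version-info is CURRENT:REVISION:AGE, not the package version.
# Per the libtool manual, on each release whose sources changed:
#   1. any source change:                      REVISION++
#   2. interfaces added, removed, or changed:  CURRENT++, REVISION = 0
#   3. interfaces added since last release:    AGE++
#   4. interfaces removed or changed:          AGE = 0
libmesos_la_LDFLAGS = -version-info 0:0:0
```

The release wrangler's extra step is walking rules 1-4 against the previous release before tagging.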
[jira] [Commented] (MESOS-444) Remove --checkpoint flag in the slave once checkpointing is stable.
[ https://issues.apache.org/jira/browse/MESOS-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135880#comment-14135880 ] Kevin Sweeney commented on MESOS-444: - Any activity here? I'd like to simplify this flag: https://github.com/apache/incubator-aurora/blob/master/src/main/java/org/apache/aurora/scheduler/DriverFactory.java#L75-L101 Remove --checkpoint flag in the slave once checkpointing is stable. --- Key: MESOS-444 URL: https://issues.apache.org/jira/browse/MESOS-444 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Labels: newbie In the interim of slave recovery being worked on (see: MESOS-110), we've added a --checkpoint flag to the slave to enable or disable the feature. Prior to releasing this feature, we need to remove this flag so that all slaves have checkpointing available, and frameworks can choose to use it. There's no need to keep this flag around and add configuration complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1746) clear TaskStatus data to avoid OOM
[ https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135982#comment-14135982 ] Timothy St. Clair commented on MESOS-1746: -- Maybe I'm missing something, but how is this a Mesos problem? It seems like an Executor sizing constraint issue in the Spark Scheduler. clear TaskStatus data to avoid OOM -- Key: MESOS-1746 URL: https://issues.apache.org/jira/browse/MESOS-1746 Project: Mesos Issue Type: Bug Environment: mesos-0.19.0 Reporter: Chengwei Yang Assignee: Chengwei Yang Spark on Mesos may use TaskStatus to transfer the computed result between worker and scheduler; the source code looks like this (Spark 1.0.2):
{code}
val serializedResult = {
  if (serializedDirectResult.limit >= execBackend.akkaFrameSize() - AkkaUtils.reservedSizeBytes) {
    logInfo("Storing result for " + taskId + " in local BlockManager")
    val blockId = TaskResultBlockId(taskId)
    env.blockManager.putBytes(
      blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER)
    ser.serialize(new IndirectTaskResult[Any](blockId))
  } else {
    logInfo("Sending result for " + taskId + " directly to driver")
    serializedDirectResult
  }
}
{code}
In our test environment we enlarged akkaFrameSize to 128MB from the default value (10MB), and this causes our mesos-master process to OOM within tens of minutes when running Spark tasks in fine-grained mode. Even with akkaFrameSize changed back to the default value (10MB), it is still very likely to make mesos-master OOM, only more slowly. So I think it would be good to delete the data from TaskStatus, since it is only intended for the framework on top and Mesos itself is not interested in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1746) clear TaskStatus data to avoid OOM
[ https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135982#comment-14135982 ] Timothy St. Clair edited comment on MESOS-1746 at 9/16/14 7:05 PM: --- Are you saying a task status update is OOM-killing the mesos-master? was (Author: tstclair): Maybe I'm missing something, but how is this a Mesos problem? It seems like an Executor sizing constraint issue in the Spark Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1799) Reconciliation can send out-of-order updates.
Benjamin Mahler created MESOS-1799: -- Summary: Reconciliation can send out-of-order updates. Key: MESOS-1799 URL: https://issues.apache.org/jira/browse/MESOS-1799 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler When a slave re-registers with the master, it currently sends the latest task state for all tasks that are not both terminal and acknowledged. However, reconciliation assumes that we always have the latest unacknowledged state of the task represented in the master. As a result, out-of-order updates are possible, e.g.:
(1) Slave has task T in TASK_FINISHED, with unacknowledged updates: [TASK_RUNNING, TASK_FINISHED].
(2) Master fails over.
(3) New master re-registers the slave with T in TASK_FINISHED.
(4) Reconciliation request arrives, master sends TASK_FINISHED.
(5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.
I think the fix here is to preserve the task state invariants in the master, namely, that the master has the latest unacknowledged state of the task. This means when the slave re-registers, it should instead send the latest unacknowledged state of each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1027) IPv6 support
[ https://issues.apache.org/jira/browse/MESOS-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136253#comment-14136253 ] Oskar Stenman commented on MESOS-1027: -- This would be great to have resolved. A few things are holding us back from going v6-only (which would allow us to greatly simplify a lot of our infrastructure); one is Mesos, and the others are most likely weird services we haven't discovered are an issue yet, since we can't even run Mesos on v6-only. :) IPv6 support Key: MESOS-1027 URL: https://issues.apache.org/jira/browse/MESOS-1027 Project: Mesos Issue Type: Epic Components: framework, libprocess, master, slave Reporter: Dominic Hamon Fix For: 1.0.0 From the CLI down through the various layers of tech we should support IPv6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1800) The slave does not send pending executors during re-registration.
Benjamin Mahler created MESOS-1800: -- Summary: The slave does not send pending executors during re-registration. Key: MESOS-1800 URL: https://issues.apache.org/jira/browse/MESOS-1800 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler In what looks like an oversight, the pending executors in the slave are not sent in the re-registration message. This can lead to under-accounting in the master, causing an overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1466: -- Assignee: (was: Benjamin Mahler) Race between executor exited event and launch task can cause overcommit of resources Key: MESOS-1466 URL: https://issues.apache.org/jira/browse/MESOS-1466 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Vinod Kone Labels: reliability The following sequence of events can cause an overcommit:
-- Launch task is called for a task whose executor is already running
-- Executor's resources are not accounted for on the master
-- Executor exits and the event is enqueued behind launch tasks on the master
-- Master sends the task to the slave, which needs to commit resources for the task and the (new) executor.
-- Master processes the executor exited event and re-offers the executor's resources, causing an overcommit of resources.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1688) No offers if no memory is allocatable
[ https://issues.apache.org/jira/browse/MESOS-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136441#comment-14136441 ] Bhuvan Arumugam commented on MESOS-1688: Targeting it for 0.21.0. No offers if no memory is allocatable - Key: MESOS-1688 URL: https://issues.apache.org/jira/browse/MESOS-1688 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.18.1, 0.18.2, 0.19.0, 0.19.1 Reporter: Martin Weindel Priority: Critical Fix For: 0.21.0 The [Spark scheduler|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala] allocates memory only for the executor and cpu only for its tasks. So it can happen that all memory is nearly completely allocated by Spark executors, but all cpu resources are idle. In this case Mesos does not offer resources anymore, as less than MIN_MEM (=32MB) memory is allocatable. This effectively causes a deadlock in the Spark job, as it is not offered the cpu resources needed for launching new tasks. See {{HierarchicalAllocatorProcess::allocatable(const Resources&)}}, called in {{HierarchicalAllocatorProcess::allocate(const hashset<SlaveID>&)}}:
{code}
template <class RoleSorter, class FrameworkSorter>
bool HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
    const Resources& resources)
{
  ...
  Option<double> cpus = resources.cpus();
  Option<Bytes> mem = resources.mem();

  if (cpus.isSome() && mem.isSome()) {
    return cpus.get() >= MIN_CPUS && mem.get() > MIN_MEM;
  }

  return false;
}
{code}
A possible solution may be to completely drop the condition on allocatable memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
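Rendered as a tiny standalone predicate (the thresholds and the conjunction are taken from the ticket; this is an illustration, not the Mesos source), the starvation case is easy to see:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative thresholds from the ticket: 0.1 cpus and 32 MB.
const double MIN_CPUS = 0.1;
const uint64_t MIN_MEM_BYTES = 32ULL * 1024 * 1024;

// Conjunctive check as described in the ticket: BOTH cpu and memory must
// clear their minimums, so idle cpus alongside exhausted memory are
// never considered allocatable.
bool allocatable(double cpus, uint64_t memBytes)
{
    return cpus >= MIN_CPUS && memBytes > MIN_MEM_BYTES;
}
```

With, say, 8 idle cpus but only 16 MB of unallocated memory, allocatable() returns false and the cpus are never offered, which is exactly the deadlock the reporter describes.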
[jira] [Commented] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.
[ https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136449#comment-14136449 ] Bhuvan Arumugam commented on MESOS-1195: [~tstclair] the patch is still not reviewed. i'm going to offload it to 0.21.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1724) Can't include port in DockerInfo's image
[ https://issues.apache.org/jira/browse/MESOS-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1724: --- Fix Version/s: 0.20.1 Can't include port in DockerInfo's image Key: MESOS-1724 URL: https://issues.apache.org/jira/browse/MESOS-1724 Project: Mesos Issue Type: Bug Components: containerization Reporter: Jay Buffington Assignee: Timothy Chen Priority: Minor Labels: docker Fix For: 0.20.1 The current git tree doesn't allow you to specify a docker image with multiple colons. It is valid that multiple colons would exist in a docker image, e.g. docker-registry.example.com:80/centos:6u5 From https://github.com/apache/mesos/blob/02a35ab213fb074f6c532075cada76f13eb9d552/src/slave/containerizer/docker.cpp#L441
{code}
vector<string> parts = strings::split(dockerInfo.image(), ":");

if (parts.size() > 2) {
  return Failure("Not expecting multiple ':' in image: " + dockerInfo.image());
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
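One sketch of a tag parser that tolerates a registry port (a hypothetical helper, not the actual Mesos fix): treat the last ':' as the tag separator only when it appears after the last '/', so a port in the registry host is left alone:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Split "registry[:port]/repo[:tag]" into (repository, tag).
// Hypothetical illustration; the real check lives in
// src/slave/containerizer/docker.cpp.
std::pair<std::string, std::string> splitImageTag(const std::string& image)
{
    const std::size_t slash = image.rfind('/');
    const std::size_t colon = image.rfind(':');

    // No colon at all, or the only colon belongs to the registry port.
    if (colon == std::string::npos ||
        (slash != std::string::npos && colon < slash)) {
        return {image, "latest"};  // Docker's default tag.
    }

    return {image.substr(0, colon), image.substr(colon + 1)};
}
```

splitImageTag("docker-registry.example.com:80/centos:6u5") then yields the repository "docker-registry.example.com:80/centos" and the tag "6u5" instead of a Failure.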
[jira] [Updated] (MESOS-1737) Isolation=external result in core dump on 0.20.0
[ https://issues.apache.org/jira/browse/MESOS-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1737: --- Target Version/s: 0.20.1 Isolation=external result in core dump on 0.20.0 Key: MESOS-1737 URL: https://issues.apache.org/jira/browse/MESOS-1737 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.20.0 Reporter: Tim Nolet Assignee: Timothy Chen Fix For: 0.20.1 When upgrading from 0.19.1 to 0.20.0, any slaves started with the standard deimos setup fail hard on startup. The following command spits out about 20.000 errors before core dumping: /etc/mesos-slave# /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos --port=5051 --log_dir=/var/log/mesos --ip=172.17.8.101 --work_dir=/var/lib/mesos --isolation=external --containerizer_path=/usr/local/bin/deimos output: W0827 15:20:18.366271 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. W0827 15:20:18.366580 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. W0827 15:20:18.366631 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. W0827 15:20:18.366683 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. W0827 15:20:18.366714 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. W0827 15:20:18.366752 721 containerizer.cpp:159] The 'external' isolation flag is deprecated, please update your flags to '--containerizers=external'. Segmentation fault (core dumped) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1643) Provide APIs to return port resource for a given role
[ https://issues.apache.org/jira/browse/MESOS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1643: --- Target Version/s: 0.20.1 Fix Version/s: (was: 0.21.0) 0.20.1 Trivial enough to accommodate in 0.20.1. Provide APIs to return port resource for a given role - Key: MESOS-1643 URL: https://issues.apache.org/jira/browse/MESOS-1643 Project: Mesos Issue Type: Improvement Reporter: Zuyu Zhang Assignee: Zuyu Zhang Priority: Trivial Fix For: 0.20.1 It makes more sense to return the port resource for a given role, rather than all ports in Resources. In mesos/resource.hpp:
{code}
Option<Value::Ranges> Resources::ports(const string& role = "*");

// Check whether Resources have the given number (num_port) of ports,
// and return the begin number of the port range.
Option<long> Resources::getPorts(long num_port, const string& role = "*");
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
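As a shape check for the proposed getPorts(), here is a simplified standalone version over plain [begin, end] ranges (the types and names are stand-ins, not the actual mesos/resource.hpp API):

```cpp
#include <cassert>
#include <utility>
#include <vector>

using PortRange = std::pair<long, long>;  // inclusive [begin, end]

// Return the first port of a range that can hold num_port contiguous
// ports, or -1 if none can (a stand-in for returning Option::none()).
long getPorts(const std::vector<PortRange>& ranges, long num_port)
{
    for (const PortRange& r : ranges) {
        if (r.second - r.first + 1 >= num_port) {
            return r.first;
        }
    }
    return -1;
}
```

A role-aware version would first select the Value::Ranges belonging to the requested role (falling back to "*") and then apply the same scan.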
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136550#comment-14136550 ] Vinod Kone commented on MESOS-1675: --- I see. For the patch you sent, which sets version to 0.0.0, do frameworks have to do anything specific to use the new lib (assuming it's compatible)? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.
Benjamin Mahler created MESOS-1802: -- Summary: HealthCheckTest.HealthStatusChange is flaky on jenkins. Key: MESOS-1802 URL: https://issues.apache.org/jira/browse/MESOS-1802 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Timothy Chen https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] HealthCheckTest.HealthStatusChange Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2' I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 642ns I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 343ns I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from a replica in EMPTY status I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to STARTING I0916 22:56:14.036603 21046 master.cpp:286] Master 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 67.195.81.186:47865 I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing authenticated frameworks to register I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing authenticated slaves to register I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials' I0916 22:56:14.036686 21045 leveldb.cpp:306] 
Persisting metadata (8 bytes) to leveldb took 480322ns I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to STARTING I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising offers for all slaves I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.186:47865 I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026 I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master! I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from a replica in STARTING status I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 330251ns I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to VOTING I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos group I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 92623ns I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1 I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to fill missing position I0916 22:56:14.039676 21047 
replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 144215ns I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0 I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request for position 0 I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb took 10323ns I0916 22:56:14.040362 21047 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 79471ns I0916 22:56:14.040375 21047 replica.cpp:676] Persisted action at 0 I0916 22:56:14.040556 21054 replica.cpp:655] Replica received learned notice for position 0 I0916 22:56:14.040658 21054 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 83975ns I0916 22:56:14.040676 21054 replica.cpp:676
[jira] [Created] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.
Benjamin Mahler created MESOS-1803: -- Summary: Strict/RegistrarTest.remove test is flaky on jenkins. Key: MESOS-1803 URL: https://issues.apache.org/jira/browse/MESOS-1803 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] Strict/RegistrarTest.remove/1 Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW' I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 330ns I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 421460ns I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 195ns I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 472891ns I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted 
db in 1.188056ms I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8479ns I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8182ns I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering registrar I0916 22:59:02.126597 21050 log.cpp:656] Attempting to start the writer I0916 22:59:02.127259 21041 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:59:02.127321 21050 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:59:02.127835 21041 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 547018ns I0916 22:59:02.127858 21041 replica.cpp:342] Persisted promised to 1 I0916 22:59:02.127835 21050 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 487588ns I0916 22:59:02.127887 21050 replica.cpp:342] Persisted promised to 1 I0916 22:59:02.128387 21055 coordinator.cpp:230] Coordinator attemping to fill missing position I0916 22:59:02.129546 21042 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 
I0916 22:59:02.129600 21053 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0916 22:59:02.129982 21042 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 406954ns I0916 22:59:02.129982 21053 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 357253ns I0916 22:59:02.130009 21042 replica.cpp:676] Persisted action at 0 I0916 22:59:02.130029 21053 replica.cpp:676] Persisted action at 0 I0916 22:59:02.130543 21041 replica.cpp:508] Replica received write request for position 0 I0916 22:59:02.130585 21041 leveldb.cpp:438] Reading position from leveldb took 17424ns I0916 22:59:02.130599 21046 replica.cpp:508] Replica received write request for position 0 I0916 22:59:02.130635 21046 leveldb.cpp:438] Reading position from leveldb took 12702ns I0916 22:59:02.130728
[jira] [Updated] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
[ https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1760: --- Target Version/s: 0.20.1 (was: 0.21.0) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky - Key: MESOS-1760 URL: https://issues.apache.org/jira/browse/MESOS-1760 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.20.1 Observed this on Apache CI: https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes {code} [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration Using temporary directory '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z' I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 682ns I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the db in 312ns I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status I0903 22:04:33.540909 25590 master.cpp:286] Master 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 140.211.11.27:44122 I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing authenticated frameworks to register I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing authenticated slaves to register I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials' I0903 22:04:33.541337 25590 master.cpp:366] Authorization 
enabled I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@140.211.11.27:44122 I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising offers for all slaves I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from a replica in EMPTY status I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to STARTING I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565 I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master! I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.678563ms I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to STARTING I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from a replica in STARTING status I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.712427ms I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to VOTING I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos group I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit promise request with proposal 1 I0903 
22:04:33.594611 25589 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.029152ms I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1 I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to fill missing position I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 15.502568ms I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0 I0903 22:04:33.615435 25585 replica.cpp:508] Replica received write request for position 0 I0903 22:04:33.615463 25585 leveldb.cpp:438]
[jira] [Updated] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
[ https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1760: --- Fix Version/s: (was: 0.21.0) 0.20.1 MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky - Key: MESOS-1760 URL: https://issues.apache.org/jira/browse/MESOS-1760 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.20.1 Observed this on Apache CI: https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes {code} [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration Using temporary directory '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z' I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 682ns I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the db in 312ns I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status I0903 22:04:33.540909 25590 master.cpp:286] Master 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 140.211.11.27:44122 I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing authenticated frameworks to register I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing authenticated slaves to register I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials' I0903 22:04:33.541337 25590 master.cpp:366] Authorization enabled 
I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@140.211.11.27:44122 I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising offers for all slaves I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from a replica in EMPTY status I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to STARTING I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565 I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master! I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.678563ms I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to STARTING I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from a replica in STARTING status I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.712427ms I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to VOTING I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos group I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit promise request with proposal 1 I0903 22:04:33.594611 
25589 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.029152ms I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1 I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to fill missing position I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 15.502568ms I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0 I0903 22:04:33.615435 25585 replica.cpp:508] Replica received write request for position 0 I0903 22:04:33.615463 25585
[jira] [Commented] (MESOS-1760) MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky
[ https://issues.apache.org/jira/browse/MESOS-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136577#comment-14136577 ] Bhuvan Arumugam commented on MESOS-1760: including in 0.20.1. MasterAuthorizationTest.FrameworkRemovedBeforeReregistration is flaky - Key: MESOS-1760 URL: https://issues.apache.org/jira/browse/MESOS-1760 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.20.1 Observed this on Apache CI: https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2355/changes {code} [ RUN] MasterAuthorizationTest.FrameworkRemovedBeforeReregistration Using temporary directory '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z' I0903 22:04:33.520237 25565 leveldb.cpp:176] Opened db in 49.073821ms I0903 22:04:33.538331 25565 leveldb.cpp:183] Compacted db in 18.065051ms I0903 22:04:33.538363 25565 leveldb.cpp:198] Created db iterator in 4826ns I0903 22:04:33.538377 25565 leveldb.cpp:204] Seeked to beginning of db in 682ns I0903 22:04:33.538385 25565 leveldb.cpp:273] Iterated through 0 keys in the db in 312ns I0903 22:04:33.538399 25565 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0903 22:04:33.538624 25593 recover.cpp:425] Starting replica recovery I0903 22:04:33.538707 25598 recover.cpp:451] Replica is in EMPTY status I0903 22:04:33.540909 25590 master.cpp:286] Master 20140903-220433-453759884-44122-25565 (hemera.apache.org) started on 140.211.11.27:44122 I0903 22:04:33.540932 25590 master.cpp:332] Master only allowing authenticated frameworks to register I0903 22:04:33.540936 25590 master.cpp:337] Master only allowing authenticated slaves to register I0903 22:04:33.540941 25590 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterAuthorizationTest_FrameworkRemovedBeforeReregistration_0tw16Z/credentials' I0903 22:04:33.541337 25590 
master.cpp:366] Authorization enabled I0903 22:04:33.541508 25597 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0903 22:04:33.542343 25582 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@140.211.11.27:44122 I0903 22:04:33.542445 25592 master.cpp:120] No whitelist given. Advertising offers for all slaves I0903 22:04:33.543175 25602 recover.cpp:188] Received a recover response from a replica in EMPTY status I0903 22:04:33.543637 25587 recover.cpp:542] Updating replica status to STARTING I0903 22:04:33.544256 25579 master.cpp:1205] The newly elected leader is master@140.211.11.27:44122 with id 20140903-220433-453759884-44122-25565 I0903 22:04:33.544275 25579 master.cpp:1218] Elected as the leading master! I0903 22:04:33.544282 25579 master.cpp:1036] Recovering from registrar I0903 22:04:33.544401 25579 registrar.cpp:313] Recovering registrar I0903 22:04:33.558487 25593 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.678563ms I0903 22:04:33.558531 25593 replica.cpp:320] Persisted replica status to STARTING I0903 22:04:33.558653 25593 recover.cpp:451] Replica is in STARTING status I0903 22:04:33.559867 25588 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0903 22:04:33.560057 25602 recover.cpp:188] Received a recover response from a replica in STARTING status I0903 22:04:33.561280 25584 recover.cpp:542] Updating replica status to VOTING I0903 22:04:33.576900 25581 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.712427ms I0903 22:04:33.576942 25581 replica.cpp:320] Persisted replica status to VOTING I0903 22:04:33.577018 25581 recover.cpp:556] Successfully joined the Paxos group I0903 22:04:33.577108 25581 recover.cpp:440] Recover process terminated I0903 22:04:33.577401 25581 log.cpp:656] Attempting to start the writer I0903 22:04:33.578559 25589 replica.cpp:474] Replica received implicit promise request with 
proposal 1 I0903 22:04:33.594611 25589 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.029152ms I0903 22:04:33.594640 25589 replica.cpp:342] Persisted promised to 1 I0903 22:04:33.595391 25584 coordinator.cpp:230] Coordinator attemping to fill missing position I0903 22:04:33.597512 25588 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0903 22:04:33.613037 25588 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 15.502568ms I0903 22:04:33.613065 25588 replica.cpp:676] Persisted action at 0 I0903 22:04:33.615435 25585 replica.cpp:508] Replica received write request for position 0 I0903
[jira] [Updated] (MESOS-1766) MasterAuthorizationTest.DuplicateRegistration test is flaky
[ https://issues.apache.org/jira/browse/MESOS-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1766: --- Fix Version/s: (was: 0.21.0) 0.20.1 MasterAuthorizationTest.DuplicateRegistration test is flaky --- Key: MESOS-1766 URL: https://issues.apache.org/jira/browse/MESOS-1766 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.20.1 {code} [ RUN ] MasterAuthorizationTest.DuplicateRegistration Using temporary directory '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m' I0905 15:53:16.398993 25769 leveldb.cpp:176] Opened db in 2.601036ms I0905 15:53:16.399566 25769 leveldb.cpp:183] Compacted db in 546216ns I0905 15:53:16.399590 25769 leveldb.cpp:198] Created db iterator in 2787ns I0905 15:53:16.399605 25769 leveldb.cpp:204] Seeked to beginning of db in 500ns I0905 15:53:16.399617 25769 leveldb.cpp:273] Iterated through 0 keys in the db in 185ns I0905 15:53:16.399633 25769 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0905 15:53:16.399817 25786 recover.cpp:425] Starting replica recovery I0905 15:53:16.399952 25793 recover.cpp:451] Replica is in EMPTY status I0905 15:53:16.400683 25795 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0905 15:53:16.400795 25787 recover.cpp:188] Received a recover response from a replica in EMPTY status I0905 15:53:16.401005 25783 recover.cpp:542] Updating replica status to STARTING I0905 15:53:16.401470 25786 master.cpp:286] Master 20140905-155316-3125920579-49188-25769 (penates.apache.org) started on 67.195.81.186:49188 I0905 15:53:16.401521 25786 master.cpp:332] Master only allowing authenticated frameworks to register I0905 15:53:16.401533 25786 master.cpp:337] Master only allowing authenticated slaves to register I0905 15:53:16.401543 25786 credentials.hpp:36] Loading credentials for authentication from 
'/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m/credentials' I0905 15:53:16.401558 25793 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 474683ns I0905 15:53:16.401582 25793 replica.cpp:320] Persisted replica status to STARTING I0905 15:53:16.401667 25793 recover.cpp:451] Replica is in STARTING status I0905 15:53:16.401669 25786 master.cpp:366] Authorization enabled I0905 15:53:16.401898 25795 master.cpp:120] No whitelist given. Advertising offers for all slaves I0905 15:53:16.401936 25796 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.186:49188 I0905 15:53:16.402160 25784 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0905 15:53:16.402333 25790 master.cpp:1205] The newly elected leader is master@67.195.81.186:49188 with id 20140905-155316-3125920579-49188-25769 I0905 15:53:16.402359 25790 master.cpp:1218] Elected as the leading master! I0905 15:53:16.402371 25790 master.cpp:1036] Recovering from registrar I0905 15:53:16.402472 25798 registrar.cpp:313] Recovering registrar I0905 15:53:16.402529 25791 recover.cpp:188] Received a recover response from a replica in STARTING status I0905 15:53:16.402782 25788 recover.cpp:542] Updating replica status to VOTING I0905 15:53:16.403002 25795 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 116403ns I0905 15:53:16.403020 25795 replica.cpp:320] Persisted replica status to VOTING I0905 15:53:16.403081 25791 recover.cpp:556] Successfully joined the Paxos group I0905 15:53:16.403197 25791 recover.cpp:440] Recover process terminated I0905 15:53:16.403388 25796 log.cpp:656] Attempting to start the writer I0905 15:53:16.403993 25784 replica.cpp:474] Replica received implicit promise request with proposal 1 I0905 15:53:16.404147 25784 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 132156ns I0905 15:53:16.404167 25784 replica.cpp:342] Persisted promised to 1 I0905 
15:53:16.404542 25795 coordinator.cpp:230] Coordinator attemping to fill missing position I0905 15:53:16.405498 25787 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0905 15:53:16.405868 25787 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 347231ns I0905 15:53:16.405886 25787 replica.cpp:676] Persisted action at 0 I0905 15:53:16.406553 25788 replica.cpp:508] Replica received write request for position 0 I0905 15:53:16.406582 25788 leveldb.cpp:438] Reading position from leveldb took 11402ns I0905 15:53:16.529067 25788 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 535803ns I0905 15:53:16.529088 25788 replica.cpp:676] Persisted
[jira] [Commented] (MESOS-1766) MasterAuthorizationTest.DuplicateRegistration test is flaky
[ https://issues.apache.org/jira/browse/MESOS-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136579#comment-14136579 ] Bhuvan Arumugam commented on MESOS-1766: related to MESOS-1760. including in 0.20.1 MasterAuthorizationTest.DuplicateRegistration test is flaky --- Key: MESOS-1766 URL: https://issues.apache.org/jira/browse/MESOS-1766 Project: Mesos Issue Type: Bug Components: test Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.20.1 {code} [ RUN ] MasterAuthorizationTest.DuplicateRegistration Using temporary directory '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m' I0905 15:53:16.398993 25769 leveldb.cpp:176] Opened db in 2.601036ms I0905 15:53:16.399566 25769 leveldb.cpp:183] Compacted db in 546216ns I0905 15:53:16.399590 25769 leveldb.cpp:198] Created db iterator in 2787ns I0905 15:53:16.399605 25769 leveldb.cpp:204] Seeked to beginning of db in 500ns I0905 15:53:16.399617 25769 leveldb.cpp:273] Iterated through 0 keys in the db in 185ns I0905 15:53:16.399633 25769 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0905 15:53:16.399817 25786 recover.cpp:425] Starting replica recovery I0905 15:53:16.399952 25793 recover.cpp:451] Replica is in EMPTY status I0905 15:53:16.400683 25795 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0905 15:53:16.400795 25787 recover.cpp:188] Received a recover response from a replica in EMPTY status I0905 15:53:16.401005 25783 recover.cpp:542] Updating replica status to STARTING I0905 15:53:16.401470 25786 master.cpp:286] Master 20140905-155316-3125920579-49188-25769 (penates.apache.org) started on 67.195.81.186:49188 I0905 15:53:16.401521 25786 master.cpp:332] Master only allowing authenticated frameworks to register I0905 15:53:16.401533 25786 master.cpp:337] Master only allowing authenticated slaves to register I0905 15:53:16.401543 25786 credentials.hpp:36] Loading credentials for authentication 
from '/tmp/MasterAuthorizationTest_DuplicateRegistration_pVJg7m/credentials' I0905 15:53:16.401558 25793 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 474683ns I0905 15:53:16.401582 25793 replica.cpp:320] Persisted replica status to STARTING I0905 15:53:16.401667 25793 recover.cpp:451] Replica is in STARTING status I0905 15:53:16.401669 25786 master.cpp:366] Authorization enabled I0905 15:53:16.401898 25795 master.cpp:120] No whitelist given. Advertising offers for all slaves I0905 15:53:16.401936 25796 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.186:49188 I0905 15:53:16.402160 25784 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0905 15:53:16.402333 25790 master.cpp:1205] The newly elected leader is master@67.195.81.186:49188 with id 20140905-155316-3125920579-49188-25769 I0905 15:53:16.402359 25790 master.cpp:1218] Elected as the leading master! I0905 15:53:16.402371 25790 master.cpp:1036] Recovering from registrar I0905 15:53:16.402472 25798 registrar.cpp:313] Recovering registrar I0905 15:53:16.402529 25791 recover.cpp:188] Received a recover response from a replica in STARTING status I0905 15:53:16.402782 25788 recover.cpp:542] Updating replica status to VOTING I0905 15:53:16.403002 25795 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 116403ns I0905 15:53:16.403020 25795 replica.cpp:320] Persisted replica status to VOTING I0905 15:53:16.403081 25791 recover.cpp:556] Successfully joined the Paxos group I0905 15:53:16.403197 25791 recover.cpp:440] Recover process terminated I0905 15:53:16.403388 25796 log.cpp:656] Attempting to start the writer I0905 15:53:16.403993 25784 replica.cpp:474] Replica received implicit promise request with proposal 1 I0905 15:53:16.404147 25784 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 132156ns I0905 15:53:16.404167 25784 replica.cpp:342] Persisted promised to 1 I0905 
15:53:16.404542 25795 coordinator.cpp:230] Coordinator attemping to fill missing position I0905 15:53:16.405498 25787 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0905 15:53:16.405868 25787 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 347231ns I0905 15:53:16.405886 25787 replica.cpp:676] Persisted action at 0 I0905 15:53:16.406553 25788 replica.cpp:508] Replica received write request for position 0 I0905 15:53:16.406582 25788 leveldb.cpp:438] Reading position from leveldb took 11402ns I0905 15:53:16.529067 25788 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 535803ns I0905 15:53:16.529088
[jira] [Updated] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1219: --- Fix Version/s: (was: 0.20.1) 0.21.0 Master should disallow frameworks that reconnect after failover timeout. Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone Fix For: 0.21.0 When a scheduler reconnects after the failover timeout has elapsed, the framework id is usually reused because the scheduler doesn't know that the timeout was exceeded, and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1801) MESOS_work_dir and MESOS_master env vars not honoured
[ https://issues.apache.org/jira/browse/MESOS-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1801: -- Fix Version/s: (was: 0.20.1) MESOS_work_dir and MESOS_master env vars not honoured - Key: MESOS-1801 URL: https://issues.apache.org/jira/browse/MESOS-1801 Project: Mesos Issue Type: Bug Components: cli Affects Versions: 0.20.0 Environment: CentOS 7 Reporter: Cosmin Lehene The documentation states that cli params should be substitutable by environment variables {quote} Each option can be set in two ways: By passing it to the binary using --option_name=value. By setting the environment variable MESOS_OPTION_NAME (the option name with a MESOS_ prefix added to it). {quote} However at least the master's MESOS_work_dir and slave's MESOS_master env vars seem to be ignored: {noformat} [root@localhost ~]# echo $MESOS_master zk://localhost:2181/mesos [root@localhost ~]# mesos-slave Missing required option --master [root@localhost ~]# echo $MESOS_work_dir /var/lib/mesos [root@localhost ~]# mesos-master I0917 08:36:46.242200 31325 main.cpp:155] Build: 2014-08-22 05:06:06 by root I0917 08:36:46.242369 31325 main.cpp:157] Version: 0.20.0 I0917 08:36:46.242377 31325 main.cpp:160] Git tag: 0.20.0 I0917 08:36:46.242382 31325 main.cpp:164] Git SHA: f421ffdf8d32a8834b3a6ee483b5b59f65956497 --work_dir needed for replicated log based registry {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
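The naming rule quoted in MESOS-1801 (each `--option_name` flag maps to a `MESOS_option_name` environment variable) can be illustrated with a small shell sketch. The `flag_to_env` helper below is hypothetical, written only to demonstrate the documented convention; it is not part of Mesos, and the `mesos-master`/`mesos-slave` invocations in the comments are the ticket's own examples, not commands run here:

```shell
# Per the Mesos docs quoted above, these two invocations should be equivalent
# (the ticket reports the second form being ignored):
#
#   mesos-master --work_dir=/var/lib/mesos
#   MESOS_work_dir=/var/lib/mesos mesos-master
#
# Hypothetical helper showing how a flag name maps to its env var name:
# strip the leading "--", drop any "=value" suffix, add the MESOS_ prefix.
flag_to_env() {
  local opt="${1#--}"       # "--work_dir=/var/lib/mesos" -> "work_dir=/var/lib/mesos"
  echo "MESOS_${opt%%=*}"   # "work_dir=/var/lib/mesos"   -> "MESOS_work_dir"
}

flag_to_env --work_dir=/var/lib/mesos            # prints MESOS_work_dir
flag_to_env --master=zk://localhost:2181/mesos   # prints MESOS_master
```

Note that, going by the ticket's own session transcript, the variable keeps the option name's case (`MESOS_work_dir`, not `MESOS_WORK_DIR`).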
[jira] [Updated] (MESOS-1728) Libprocess: report bind parameters on failure
[ https://issues.apache.org/jira/browse/MESOS-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1728: -- Fix Version/s: (was: 0.20.1) 0.21.0 Libprocess: report bind parameters on failure - Key: MESOS-1728 URL: https://issues.apache.org/jira/browse/MESOS-1728 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Nikita Vetoshkin Assignee: Nikita Vetoshkin Priority: Trivial Fix For: 0.21.0 When you attempt to start a slave or master and another one is already running there, it would be nice to report the actual parameters of the {{bind}} call that failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1643) Provide APIs to return port resource for a given role
[ https://issues.apache.org/jira/browse/MESOS-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1643: -- Fix Version/s: (was: 0.20.1) 0.21.0 Provide APIs to return port resource for a given role - Key: MESOS-1643 URL: https://issues.apache.org/jira/browse/MESOS-1643 Project: Mesos Issue Type: Improvement Reporter: Zuyu Zhang Assignee: Zuyu Zhang Priority: Trivial Fix For: 0.21.0 It makes more sense to return port resource for a given role, rather than all ports in Resources. In mesos/resource.hpp: Option<Value::Ranges> Resources::ports(const string& role = "*"); // Check whether Resources have the given number (num_port) of ports, and return the begin number of the port range. Option<long> Resources::getPorts(long num_port, const string& role = "*"); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
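The proposed role-aware lookup could look something like the sketch below. This is a minimal, self-contained stand-in, not Mesos's actual `Resources` class: the `portsByRole` map and the simplified `Ranges` type are assumptions made purely for illustration.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Inclusive [begin, end] port ranges -- a simplified stand-in for
// mesos::Value::Ranges.
using Ranges = std::vector<std::pair<long, long>>;

struct Resources {
  // role -> port ranges held for that role (hypothetical layout).
  std::map<std::string, Ranges> portsByRole;

  // Return the port ranges for the given role, if any.
  std::optional<Ranges> ports(const std::string& role = "*") const {
    auto it = portsByRole.find(role);
    if (it == portsByRole.end()) return std::nullopt;
    return it->second;
  }

  // Check whether the role holds `numPorts` contiguous ports and, if so,
  // return the beginning of the first range that fits.
  std::optional<long> getPorts(long numPorts,
                               const std::string& role = "*") const {
    auto it = portsByRole.find(role);
    if (it == portsByRole.end()) return std::nullopt;
    for (const auto& range : it->second) {
      if (range.second - range.first + 1 >= numPorts) return range.first;
    }
    return std::nullopt;
  }
};
```

The key design point is that both lookups take the role as a parameter instead of flattening every role's ports into one result.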
[jira] [Updated] (MESOS-1716) The slave does not add pending tasks as part of the staging tasks metric.
[ https://issues.apache.org/jira/browse/MESOS-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1716: -- Fix Version/s: (was: 0.20.1) 0.21.0 The slave does not add pending tasks as part of the staging tasks metric. - Key: MESOS-1716 URL: https://issues.apache.org/jira/browse/MESOS-1716 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Trivial Fix For: 0.21.0 The slave does not represent pending tasks in the tasks_staging metric. This should be a trivial fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.
[ https://issues.apache.org/jira/browse/MESOS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1803. Resolution: Cannot Reproduce The log timings here look as if the threads were starved of CPU: {noformat} I0916 22:59:02.136256 21049 leveldb.cpp:343] Persisting action (165 bytes) to leveldb took 141908ns I0916 22:59:02.136267 21047 leveldb.cpp:343] Persisting action (165 bytes) to leveldb took 111061ns I../../src/tests/registrar_tests.cpp:257: Failure 0916 22:59:02.136276 21049 replica.cpp:676] Persisted action at 1 Failed to wait 10secs for registrar.recover(master) I0916 22:59:14.265326 21049 replica.cpp:661] Replica learned APPEND action at position 1 I0916 22:59:02.136291 21047 replica.cpp:676] Persisted action at 1 E0916 22:59:07.135143 21046 registrar.cpp:500] Registrar aborting: Failed to update 'registry': Failed to perform store within 5secs I0916 22:59:14.265393 21047 replica.cpp:661] Replica learned APPEND action at position 1 {noformat} The logging time stamp is determined at the beginning of the LOG(INFO) expression, when the initial LogMessage object is created. The interleaving of times looks to be a stall of the VM or thread starvation: {noformat} 22:59:02.136267 21047 // Thread 1, 1st LogMessage flushed. 22:59:02.136276 21049 // Thread 2, 2nd LogMessage flushed. 22:59:14.265326 21049 // Thread 2, 5th LogMessage flushed. 22:59:02.136291 21047 // Thread 1, 3rd LogMessage flushed. 22:59:07.135143 21046 // Thread 3, 4th LogMessage flushed. 22:59:14.265393 21047 // Thread 1, 6th LogMessage flushed. {noformat} Strict/RegistrarTest.remove test is flaky on jenkins. 
- Key: MESOS-1803 URL: https://issues.apache.org/jira/browse/MESOS-1803 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] Strict/RegistrarTest.remove/1 Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW' I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 330ns I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 421460ns I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 195ns I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 472891ns I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns I0916 
22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8479ns I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8182ns I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering
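The timestamp-interleaving analysis above rests on glog stamping a message when the LogMessage object is constructed, not when it is flushed. A minimal Python sketch of that behavior (illustrative only; the class and field names are hypothetical, not glog's):

```python
import io

# A glog-style message stamps itself at construction time, so if another
# thread's later-stamped message reaches the stream first, the log shows
# timestamps out of order -- the pattern seen in the flaky test log.
class LogMessage:
    def __init__(self, text, now):
        self.stamp = now   # fixed when the message is created
        self.text = text

    def flush(self, stream):
        stream.write("%.6f %s\n" % (self.stamp, self.text))

out = io.StringIO()
first = LogMessage("created first", now=1.0)
second = LogMessage("created second", now=2.0)
second.flush(out)  # the later-created message happens to flush first
first.flush(out)   # ...so the earlier timestamp appears later in the log
```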
[jira] [Comment Edited] (MESOS-1746) clear TaskStatus data to avoid OOM
[ https://issues.apache.org/jira/browse/MESOS-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136605#comment-14136605 ] Chengwei Yang edited comment on MESOS-1746 at 9/17/14 1:06 AM: --- [~tstclair], yes, Spark stores very large data into TaskStatus; since TaskStatus has a data field intended for application-specific data, we cannot prevent applications (like Spark) from doing so. please help to review: https://reviews.apache.org/r/25184/ was (Author: chengwei-yang): [~tstclair], yes, Spark stores very large data into TaskStatus; since TaskStatus has a data field intended for application-specific data, we cannot prevent applications (like Spark) from doing so. clear TaskStatus data to avoid OOM -- Key: MESOS-1746 URL: https://issues.apache.org/jira/browse/MESOS-1746 Project: Mesos Issue Type: Bug Environment: mesos-0.19.0 Reporter: Chengwei Yang Assignee: Chengwei Yang Spark on Mesos may use TaskStatus to transfer the computed result between worker and scheduler; the source code looks like below (Spark 1.0.2) {code} val serializedResult = { if (serializedDirectResult.limit >= execBackend.akkaFrameSize() - AkkaUtils.reservedSizeBytes) { logInfo("Storing result for " + taskId + " in local BlockManager") val blockId = TaskResultBlockId(taskId) env.blockManager.putBytes( blockId, serializedDirectResult, StorageLevel.MEMORY_AND_DISK_SER) ser.serialize(new IndirectTaskResult[Any](blockId)) } else { logInfo("Sending result for " + taskId + " directly to driver") serializedDirectResult } } {code} And in our test environment, we enlarged akkaFrameSize to 128MB from the default value (10MB), which causes our mesos-master process to OOM within tens of minutes when running Spark tasks in fine-grained mode. As you can see, even with akkaFrameSize changed back to the default value (10MB), it's still very likely to make mesos-master OOM, only more slowly. 
So I think it's good to delete the data from TaskStatus, since it is only intended for the framework on top and the master itself isn't interested in it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
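The proposed fix amounts to dropping the payload before the master caches a status update. A minimal sketch, using a plain dict as a stand-in for the TaskStatus protobuf (the helper name `strip_status_data` is hypothetical, not Mesos code):

```python
# Before the master retains a task status update, drop the (potentially
# huge) application-specific `data` payload -- the master never reads it,
# only the framework does.
def strip_status_data(task_status):
    """Return a copy of the status update without its data payload."""
    stripped = dict(task_status)
    stripped.pop("data", None)  # absent field is fine; copy is unchanged
    return stripped
```

The original update still reaches the framework intact; only the master's retained copy loses the payload.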
[jira] [Updated] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.
[ https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhuvan Arumugam updated MESOS-1195: --- Target Version/s: 0.21.0 (was: 0.20.1) systemd.slice + cgroup enablement fails in multiple ways. -- Key: MESOS-1195 URL: https://issues.apache.org/jira/browse/MESOS-1195 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.18.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair When attempting to configure mesos to use systemd slices on a 'rawhide/f21' machine, it fails creating the isolator: I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: cgroups/cpu,cgroups/mem Failed to create a containerizer: Could not create isolator cgroups/cpu: Failed to create isolator: The cpu subsystem is co-mounted at /sys/fs/cgroup/cpu with other subsytems -- details -- /sys/fs/cgroup total 0 drwxr-xr-x. 12 root root 280 Mar 18 08:47 . drwxr-xr-x. 6 root root 0 Mar 18 08:47 .. drwxr-xr-x. 2 root root 0 Mar 18 08:47 blkio lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpu -> cpu,cpuacct lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpuacct -> cpu,cpuacct drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpu,cpuacct drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpuset drwxr-xr-x. 2 root root 0 Mar 18 08:47 devices drwxr-xr-x. 2 root root 0 Mar 18 08:47 freezer drwxr-xr-x. 2 root root 0 Mar 18 08:47 hugetlb drwxr-xr-x. 3 root root 0 Apr 3 11:26 memory drwxr-xr-x. 2 root root 0 Mar 18 08:47 net_cls drwxr-xr-x. 2 root root 0 Mar 18 08:47 perf_event drwxr-xr-x. 4 root root 0 Mar 18 08:47 systemd -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1195) systemd.slice + cgroup enablement fails in multiple ways.
[ https://issues.apache.org/jira/browse/MESOS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136614#comment-14136614 ] Bhuvan Arumugam commented on MESOS-1195: moving it to 0.21.0, as discussed in reviewboard. http://reviews.apache.org/r/25695/ systemd.slice + cgroup enablement fails in multiple ways. -- Key: MESOS-1195 URL: https://issues.apache.org/jira/browse/MESOS-1195 Project: Mesos Issue Type: Bug Components: containerization Affects Versions: 0.18.0 Reporter: Timothy St. Clair Assignee: Timothy St. Clair When attempting to configure mesos to use systemd slices on a 'rawhide/f21' machine, it fails creating the isolator: I0407 12:39:28.035354 14916 containerizer.cpp:180] Using isolation: cgroups/cpu,cgroups/mem Failed to create a containerizer: Could not create isolator cgroups/cpu: Failed to create isolator: The cpu subsystem is co-mounted at /sys/fs/cgroup/cpu with other subsytems -- details -- /sys/fs/cgroup total 0 drwxr-xr-x. 12 root root 280 Mar 18 08:47 . drwxr-xr-x. 6 root root 0 Mar 18 08:47 .. drwxr-xr-x. 2 root root 0 Mar 18 08:47 blkio lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpu -> cpu,cpuacct lrwxrwxrwx. 1 root root 11 Mar 18 08:47 cpuacct -> cpu,cpuacct drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpu,cpuacct drwxr-xr-x. 2 root root 0 Mar 18 08:47 cpuset drwxr-xr-x. 2 root root 0 Mar 18 08:47 devices drwxr-xr-x. 2 root root 0 Mar 18 08:47 freezer drwxr-xr-x. 2 root root 0 Mar 18 08:47 hugetlb drwxr-xr-x. 3 root root 0 Apr 3 11:26 memory drwxr-xr-x. 2 root root 0 Mar 18 08:47 net_cls drwxr-xr-x. 2 root root 0 Mar 18 08:47 perf_event drwxr-xr-x. 4 root root 0 Mar 18 08:47 systemd -- This message was sent by Atlassian JIRA (v6.3.4#6332)
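The isolator error above comes from a co-mount check: a subsystem such as `cpu` must be mounted alone, but systemd co-mounts it with `cpuacct`. The check can be sketched as below, parsing /proc/mounts-style lines; this is an illustration under those assumptions, not Mesos's actual cgroups code:

```python
# Given /proc/mounts-style text, return the subsystems co-mounted with
# `subsystem` at its cgroup hierarchy (empty set means mounted alone).
def comounted_subsystems(mounts_text, subsystem):
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        opts = set(fields[3].split(","))
        if subsystem in opts:
            # Mount options mix flags (rw, relatime, ...) with subsystem
            # names; strip the common flags for this sketch.
            flags = {"rw", "ro", "relatime", "nosuid", "nodev", "noexec"}
            return opts - flags - {subsystem}
    return set()
```

On the reporter's machine this would return `{"cpuacct"}` for `cpu`, which is exactly the condition the isolator rejects.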
[jira] [Reopened] (MESOS-1747) Docker image parsing for private repositories
[ https://issues.apache.org/jira/browse/MESOS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reopened MESOS-1747: --- Docker image parsing for private repositories - Key: MESOS-1747 URL: https://issues.apache.org/jira/browse/MESOS-1747 Project: Mesos Issue Type: Bug Components: containerization, slave Affects Versions: 0.20.0 Reporter: Don Laidlaw Assignee: Timothy Chen Labels: docker Fix For: 0.20.1 You cannot specify a port number for the host of a private docker repository. Specified as follows: {noformat} container: { type: DOCKER, docker: { image: docker-repo:5000/app-base:v0.1 } } {noformat} results in an error: {noformat} Aug 29 14:33:29 ip-172-16-2-22 mesos-slave[1128]: E0829 14:33:29.487470 1153 slave.cpp:2484] Container '250e0479-552f-4e6f-81dd-71550e45adae' for executor 't1-java.71d50bd1-2f89-11e4-ba9a-0adfe6b11716' of framework '20140829-121838-184684716-5050-1177-' failed to start:Not expecting multiple ':' in image: docker-repo:5000/app-base:v0.1 {noformat} The message indicates only one colon character is allowed, but to supply a port number for a private docker repository host you need to have two colons. Also if you use a '-' character in a host name you also get an error: {noformat} Invalid namespace name (docker-repo), only [a-z0-9_] are allowed, size between 4 and 30 {noformat} The hostname parts should not be limited to [a-z0-9_]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-1747) Docker image parsing for private repositories
[ https://issues.apache.org/jira/browse/MESOS-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone resolved MESOS-1747. --- Resolution: Duplicate Fix Version/s: (was: 0.20.1) Docker image parsing for private repositories - Key: MESOS-1747 URL: https://issues.apache.org/jira/browse/MESOS-1747 Project: Mesos Issue Type: Bug Components: containerization, slave Affects Versions: 0.20.0 Reporter: Don Laidlaw Assignee: Timothy Chen Labels: docker You cannot specify a port number for the host of a private docker repository. Specified as follows: {noformat} container: { type: DOCKER, docker: { image: docker-repo:5000/app-base:v0.1 } } {noformat} results in an error: {noformat} Aug 29 14:33:29 ip-172-16-2-22 mesos-slave[1128]: E0829 14:33:29.487470 1153 slave.cpp:2484] Container '250e0479-552f-4e6f-81dd-71550e45adae' for executor 't1-java.71d50bd1-2f89-11e4-ba9a-0adfe6b11716' of framework '20140829-121838-184684716-5050-1177-' failed to start:Not expecting multiple ':' in image: docker-repo:5000/app-base:v0.1 {noformat} The message indicates only one colon character is allowed, but to supply a port number for a private docker repository host you need to have two colons. Also if you use a '-' character in a host name you also get an error: {noformat} Invalid namespace name (docker-repo), only [a-z0-9_] are allowed, size between 4 and 30 {noformat} The hostname parts should not be limited to [a-z0-9_]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
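The parsing bug above is that a naive split on `:` cannot tell a registry port from a tag. A tolerant parser can disambiguate by treating a leading path component containing `.` or `:` as a registry host. The sketch below is an illustration of that rule, not Mesos's (or Docker's) actual parser:

```python
# Split an image reference into (registry, repository, tag), tolerating a
# registry host with a port, e.g. docker-repo:5000/app-base:v0.1.
def parse_image(image):
    registry = None
    rest = image
    if "/" in image:
        head, tail = image.split("/", 1)
        # A leading component containing '.' or ':' is a registry host,
        # possibly with a port -- so a second ':' is legitimate.
        if "." in head or ":" in head:
            registry, rest = head, tail
    if ":" in rest:
        repository, tag = rest.rsplit(":", 1)
    else:
        repository, tag = rest, "latest"
    return registry, repository, tag
```

With this rule, `docker-repo:5000/app-base:v0.1` parses cleanly, and hostnames with `-` are naturally accepted because the registry part is never validated against the `[a-z0-9_]` namespace rule.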
[jira] [Resolved] (MESOS-1621) Docker run networking should be configurable and support bridge network
[ https://issues.apache.org/jira/browse/MESOS-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone resolved MESOS-1621. --- Resolution: Fixed commit 1453a477511c8f6f22ff16e3dd13d0532e019c5b Author: Timothy Chen tnac...@apache.org Date: Tue Sep 16 18:29:36 2014 -0700 Enabled bridge network for Docker Containerizer. Review: https://reviews.apache.org/r/25270 Docker run networking should be configurable and support bridge network --- Key: MESOS-1621 URL: https://issues.apache.org/jira/browse/MESOS-1621 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Timothy Chen Assignee: Timothy Chen Labels: Docker Fix For: 0.20.1 Currently, to easily support running executors in a Docker image, we hardcode --net=host into docker run so the slave and executor can reuse the same mechanism to communicate: the slave IP/PORT is passed in, and the framework responds with its own hostname and port information to set up the tunnel. We want to see how to abstract this, or even get rid of host networking altogether if we have a good way to not rely on it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)