[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746694#comment-14746694 ] Shuai Lin commented on MESOS-1806: -- Progress update: I have rebased the code to the latest upstream master, and made some fixes so it could compile. https://github.com/lins05/mesos/tree/etcd However the etcd_test.sh script is failing: {code:title=./src/tests/etcd_test.sh error output|borderStyle=solid} Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins {code} I'll check where the problem is this week. > Substituting etcd or ReplicatedLog for Zookeeper > > > Key: MESOS-1806 > URL: https://issues.apache.org/jira/browse/MESOS-1806 > Project: Mesos > Issue Type: Task >Reporter: Ed Ropple >Assignee: Shuai Lin >Priority: Minor > >eropple: Could you also file a new JIRA for Mesos to drop ZK > in favor of etcd or ReplicatedLog? Would love to get some momentum going on > that one. > -- > Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2907) Slave : Create Basic Functionality to handle /call endpoint
[ https://issues.apache.org/jira/browse/MESOS-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-2907: -- Description: This is the first basic step in ensuring the basic /call functionality: - Set up the route on the slave for "api/v1/executor" endpoint. - The endpoint should perform basic header/protobuf validation and return {501 NotImplemented} for now. - Introduce initial tests in executor_api_tests.cpp that just verify the status code. was: This is the first basic step in ensuring the basic /call functionality: processing a POST /call and returning: 202 if all goes well; 401 if not authorized; and 403 if the request is malformed. Also , we might need to store some identifier which enables us to reject calls to /call if the client has not issued a SUBSCRIBE/RESUBSCRIBE Request. > Slave : Create Basic Functionality to handle /call endpoint > --- > > Key: MESOS-2907 > URL: https://issues.apache.org/jira/browse/MESOS-2907 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: HTTP, mesosphere > > This is the first basic step in ensuring the basic /call functionality: > - Set up the route on the slave for "api/v1/executor" endpoint. > - The endpoint should perform basic header/protobuf validation and return > {501 NotImplemented} for now. > - Introduce initial tests in executor_api_tests.cpp that just verify the > status code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2907) Agent : Create Basic Functionality to handle /call endpoint
[ https://issues.apache.org/jira/browse/MESOS-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-2907: -- Description: This is the first basic step in ensuring the basic /call functionality: - Set up the route on the agent for "api/v1/executor" endpoint. - The endpoint should perform basic header/protobuf validation and return {{501 NotImplemented}} for now. - Introduce initial tests in executor_api_tests.cpp that just verify the status code. was: This is the first basic step in ensuring the basic /call functionality: - Set up the route on the slave for "api/v1/executor" endpoint. - The endpoint should perform basic header/protobuf validation and return {{501 NotImplemented}} for now. - Introduce initial tests in executor_api_tests.cpp that just verify the status code. > Agent : Create Basic Functionality to handle /call endpoint > --- > > Key: MESOS-2907 > URL: https://issues.apache.org/jira/browse/MESOS-2907 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: HTTP, mesosphere > > This is the first basic step in ensuring the basic /call functionality: > - Set up the route on the agent for "api/v1/executor" endpoint. > - The endpoint should perform basic header/protobuf validation and return > {{501 NotImplemented}} for now. > - Introduce initial tests in executor_api_tests.cpp that just verify the > status code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
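To make the intended behaviour of the endpoint concrete, here is a minimal standalone sketch of the control flow: validate the method, headers, and body, then answer 501 Not Implemented until the real handler lands. This is deliberately not the libprocess route/handler code; the Request and Response structs and header names below are simplified stand-ins.
{code:title=Sketch of the /api/v1/executor validation flow (stand-in types, not libprocess)|borderStyle=solid}
// Standalone sketch of the intended /api/v1/executor behaviour: validate the
// request, then always answer "501 Not Implemented" until real handling exists.
// The Request/Response structs are simplified stand-ins, not libprocess types.
#include <iostream>
#include <map>
#include <string>

struct Request {
  std::string method;                          // e.g. "POST"
  std::map<std::string, std::string> headers;  // e.g. {"Content-Type": ...}
  std::string body;                            // serialized Call protobuf
};

struct Response {
  int code;
  std::string status;
};

Response handleExecutorCall(const Request& request) {
  if (request.method != "POST") {
    return {405, "405 Method Not Allowed"};
  }

  auto contentType = request.headers.find("Content-Type");
  if (contentType == request.headers.end() ||
      (contentType->second != "application/x-protobuf" &&
       contentType->second != "application/json")) {
    return {415, "415 Unsupported Media Type"};
  }

  // A real handler would deserialize and validate the Call protobuf here.
  if (request.body.empty()) {
    return {400, "400 Bad Request"};
  }

  // Everything validated: the endpoint is wired up but not yet implemented.
  return {501, "501 Not Implemented"};
}

int main() {
  Request request{"POST", {{"Content-Type", "application/x-protobuf"}}, "call"};
  std::cout << handleExecutorCall(request).status << std::endl;  // 501 Not Implemented
}
{code}
The initial tests described above would then only need to assert on the returned status code, which is why returning a fixed 501 after validation is enough for this first step.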
[jira] [Created] (MESOS-3435) Add Hyper as Mesos Docker alike support
Deshi Xiao created MESOS-3435: - Summary: Add Hyper as Mesos Docker alike support Key: MESOS-3435 URL: https://issues.apache.org/jira/browse/MESOS-3435 Project: Mesos Issue Type: Improvement Reporter: Deshi Xiao Hyper is a hypervisor-agnostic Docker engine, and I hope Marathon can support it (https://github.com/mesosphere/marathon/issues/1815). https://hyper.sh/ In an earlier talk with Tim Chen about a possible implementation, he suggested first implementing an engine similar to mesos-src/docker/docker.hpp. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3421) Support sharing persistent volumes across task instances
[ https://issues.apache.org/jira/browse/MESOS-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743761#comment-14743761 ] Anindya Sinha edited comment on MESOS-3421 at 9/15/15 9:47 PM: --- I am proposing the following for persistent volumes to be shared across task containers: i) Add an "optional bool shared" field to the "Persistence" section of the protobuf Resource.DiskInfo. It defaults to false, which retains the current behavior. Setting this to "true" when CREATE/DELETE of volumes is done would mark them as sharable persistent volumes. ii) All persistent volumes that are sharable shall be offered to the corresponding framework(s) matching the "role" (even if a task used this volume in its task constraint). Tasks should therefore be able to use this resource in their task constraints to schedule tasks on the same slave. The idea is to maintain the list of "shared" persistent volumes, and offer them to the appropriate frameworks as valid resources in spite of them being assigned to a scheduled task on that agent node. was (Author: anindya.sinha): I am proposing the following for persistent volumes to be shared across task containers: i) Added a "optional bool shared" to the "Persistence" section of the protobuf Resource.DiskInfo. It defaults to false which means it retains the current behavior. Setting this to "true" when CREATE/DELETE of volumes is done would mark them as sharable persistent volumes. ii) All persistent volumes that are sharable shall be offered to the corresponding framework which own these persistent volumes (even if a task used this volume in its task constraint). Tasks should be therefore be able to use this resource in its task constraint to schedule tasks on the same slave. The idea is to maintain the list of "shared" persistent volumes, and offer them to the appropriate frameworks as valid resources inspite of this being assigned to a scheduled task on that agent node. > Support sharing persistent volumes across task instances > > > Key: MESOS-3421 > URL: https://issues.apache.org/jira/browse/MESOS-3421 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.23.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha > > A service that needs a persistent volume needs to have access to the same > persistent volume (RW) from multiple task instances on the same agent > node. Currently, a persistent volume once offered to the framework(s) can be > scheduled to a task, and until that task terminates, that persistent volume > cannot be used by another task. > Explore providing the capability of sharing persistent volumes across task > instances scheduled on a single agent node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
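To make the offering rule in (ii) above concrete, here is a small standalone model of when a persistent volume could still be offered while a task is using it. This is not Mesos code: the types, the role field, and the activeTasks counter are simplified assumptions used only for illustration.
{code:title=Standalone model of the proposed "shared" volume offer rule|borderStyle=solid}
// Standalone model of the proposed rule: a non-shared persistent volume is
// withheld from offers while a task holds it, whereas a "shared" volume keeps
// being offered to frameworks in the matching role even when it is in use.
// Simplified stand-in types, not the Mesos Resource/DiskInfo protobufs.
#include <iostream>
#include <string>

struct PersistentVolume {
  std::string id;
  std::string role;     // role the volume was reserved for
  bool shared = false;  // the proposed "optional bool shared" flag
  int activeTasks = 0;  // tasks currently using the volume on this agent
};

bool offerable(const PersistentVolume& volume, const std::string& frameworkRole) {
  if (volume.role != frameworkRole) {
    return false;  // only frameworks in the volume's role ever see it
  }
  // Current behaviour: in use means not offered. Proposed: shared volumes
  // remain offerable so other tasks can be co-located with them.
  return volume.shared || volume.activeTasks == 0;
}

int main() {
  PersistentVolume v{"volume-1", "analytics", /*shared=*/true, /*activeTasks=*/1};
  std::cout << std::boolalpha << offerable(v, "analytics") << std::endl;  // true
  v.shared = false;
  std::cout << offerable(v, "analytics") << std::endl;                    // false
}
{code}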
[jira] [Commented] (MESOS-3421) Support sharing persistent volumes across task instances
[ https://issues.apache.org/jira/browse/MESOS-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746327#comment-14746327 ] Anindya Sinha commented on MESOS-3421: -- I have updated my earlier comment. Since persistent volumes CREATED by a framework are reserved on a per-role basis, the persistent volumes that are sharable shall be offered to all frameworks matching that role (and not ONLY to the framework that created them). I do not think we necessarily need to move to a persistent-volume-per-framework model before moving ahead with this JIRA. Thanks for catching that. > Support sharing persistent volumes across task instances > > > Key: MESOS-3421 > URL: https://issues.apache.org/jira/browse/MESOS-3421 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.23.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha > > A service that needs a persistent volume needs to have access to the same > persistent volume (RW) from multiple task instances on the same agent > node. Currently, a persistent volume once offered to the framework(s) can be > scheduled to a task, and until that task terminates, that persistent volume > cannot be used by another task. > Explore providing the capability of sharing persistent volumes across task > instances scheduled on a single agent node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3434) Add inline build process for third-party libraries in Windows, using CMake
Alex Clemmer created MESOS-3434: --- Summary: Add inline build process for third-party libraries in Windows, using CMake Key: MESOS-3434 URL: https://issues.apache.org/jira/browse/MESOS-3434 Project: Mesos Issue Type: Task Components: build Reporter: Alex Clemmer Assignee: haosdent Right now to build Mesos on Windows, we need to start the build process in VS (or NMake or whatever), then when it fails, we need to go to the third-party libraries like glog and build them individually and separately. A better idea would be to have batch scripts that will build them inline, as part of the normal build process. haosdent has a good start here: https://reviews.apache.org/r/37273/ and https://reviews.apache.org/r/37275/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744933#comment-14744933 ] Adam B commented on MESOS-3136: --- [~greggomann] is doing some (internal) backport testing for 0.21.2,0.22.2,0.23.1,0.24.1 for the docker versioning patches from MESOS-2986 (one of which you [~haosd...@gmail.com] wrote). Although this patch is likely unrelated to those others, if we can land it soon, it may be critical enough to include in at least one of those patch releases. Let's bring it up on the release proposal email thread: http://search-hadoop.com/m/0Vlr6PBeaOUhF241 > COMMAND health checks with Marathon 0.10.0 are broken > - > > Key: MESOS-3136 > URL: https://issues.apache.org/jira/browse/MESOS-3136 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.23.0 >Reporter: Dr. Stefan Schimanski >Assignee: haosdent >Priority: Critical > > When deploying Mesos 0.23rc4 with latest Marathon 0.10.0 RC3 command health > check stop working. Rolling back to Mesos 0.22.1 fixes the problem. > Containerizer is Docker. > All packages are from official Mesosphere Ubuntu 14.04 sources. > The issue must be analyzed further. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744970#comment-14744970 ] haosdent commented on MESOS-3430: - I used CentOS 7.1 and XFS to test it. The behaviour is strange in CentOS 7.1: after doing a mount --bind of the sandbox onto itself and marking it as shared, every bind mount point under the sandbox creates two records in /proc/self/mountinfo. I could also reproduce this from a shell: {code} $ mkdir /tmp/sandbox $ mkdir /tmp/source $ mkdir /tmp/sandbox/target $ mount --bind /tmp/sandbox /tmp/sandbox $ mount --make-shared /tmp/sandbox $ mount --bind /tmp/source /tmp/sandbox/target {code} The resulting /proc/self/mountinfo entries: {code} 104 38 8:3 /tmp/sandbox /tmp/sandbox rw,relatime shared:1 - xfs /dev/sda3 rw,seclabel,attr2,inode64,noquota 107 104 8:3 /tmp/source /tmp/sandbox/target rw,relatime shared:1 - xfs /dev/sda3 rw,seclabel,attr2,inode64,noquota 108 38 8:3 /tmp/source /tmp/sandbox/target rw,relatime shared:1 - xfs /dev/sda3 rw,seclabel,attr2,inode64,noquota {code} I think it may be caused by XFS; it needs more investigation. A quick way to work around this is to mark the sandbox as slave and then mark the sandbox as shared. {code} diff --git a/src/slave/containerizer/isolators/filesystem/linux.cpp b/src/slave/containerizer/isolators/filesystem/linux.cpp index dbdbf87..9149838 100644 --- a/src/slave/containerizer/isolators/filesystem/linux.cpp +++ b/src/slave/containerizer/isolators/filesystem/linux.cpp @@ -312,6 +312,13 @@ Future
[jira] [Commented] (MESOS-3340) Command-line flags should take precedence over OS Env variables
[ https://issues.apache.org/jira/browse/MESOS-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744944#comment-14744944 ] Klaus Ma commented on MESOS-3340: - Sure, I've sent an email to the user@ mailing list :). > Command-line flags should take precedence over OS Env variables > --- > > Key: MESOS-3340 > URL: https://issues.apache.org/jira/browse/MESOS-3340 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 0.24.0 >Reporter: Marco Massenzio >Assignee: Klaus Ma > Labels: mesosphere, tech-debt > > Currently, it appears that re-defining a flag on the command line that was > already defined via an OS env var ({{MESOS_*}}) causes the Master to fail with > a not very helpful message. > For example, if one has {{MESOS_QUORUM}} defined, this happens: > {noformat} > $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1 > --hostname=192.168.1.4 --ip=192.168.1.4 > Duplicate flag 'quorum' on command line > {noformat} > which is not very helpful. > Ideally, we would parse the flags with a "well-known" priority (command line > first, environment last) - but at the very least, the error message should be > more helpful in explaining what the issue is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
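A minimal sketch of the priority the description asks for (command line first, environment last) follows. It is a standalone illustration rather than the stout flags parser; the MESOS_ prefix handling and the flag lookup are simplified assumptions.
{code:title=Standalone sketch of command-line-over-environment flag precedence|borderStyle=solid}
// Standalone sketch of the requested precedence: a flag given on the command
// line overrides the corresponding MESOS_* environment variable instead of
// failing with "Duplicate flag". Not the stout flags implementation.
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Resolve one flag: the command line wins, the environment is the fallback.
std::string resolveFlag(const std::string& name,
                        const std::vector<std::string>& args) {
  const std::string prefix = "--" + name + "=";
  for (const std::string& arg : args) {
    if (arg.compare(0, prefix.size(), prefix) == 0) {
      return arg.substr(prefix.size());  // highest priority
    }
  }

  std::string envName = "MESOS_" + name;
  for (char& c : envName) c = toupper(c);
  if (const char* value = getenv(envName.c_str())) {
    return value;  // lowest priority
  }
  return "";
}

int main(int argc, char** argv) {
  std::vector<std::string> args(argv + 1, argv + argc);
  // With MESOS_QUORUM=2 in the environment and --quorum=1 on the command
  // line, this prints "1" rather than rejecting the duplicate definition.
  std::cout << resolveFlag("quorum", args) << std::endl;
}
{code}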
[jira] [Commented] (MESOS-2063) Add InverseOffer to C++ Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744909#comment-14744909 ] Qian Zhang commented on MESOS-2063: --- Currently, there is no plan to add this to the old scheduler API; see the following mail thread for the detailed discussion: http://www.mail-archive.com/dev@mesos.apache.org/msg33184.html > Add InverseOffer to C++ Scheduler API. > -- > > Key: MESOS-2063 > URL: https://issues.apache.org/jira/browse/MESOS-2063 > Project: Mesos > Issue Type: Task > Components: c++ api >Reporter: Benjamin Mahler >Assignee: Qian Zhang > Labels: mesosphere, twitter > > The initial use case for InverseOffer in the framework API will be the > maintenance primitives in mesos: MESOS-1474. > One way to add these to the C++ Scheduler API is to add a new callback: > {code} > virtual void inverseResourceOffers( > SchedulerDriver* driver, > const std::vector<InverseOffer>& inverseOffers) = 0; > {code} > libmesos compatibility will need to be figured out here. > We may want to leave the C++ binding untouched in favor of Event/Call, in > order to not break API compatibility for schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3015) Add hooks for Slave exits
[ https://issues.apache.org/jira/browse/MESOS-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744908#comment-14744908 ] Adam B commented on MESOS-3015: --- This came up again in the context of an external storage manager that needs to know when a slave exits, so it can unmount any volumes that were attached to that slave. Maybe the real solution would be a Mesos event bus that the other process can listen to. > Add hooks for Slave exits > - > > Key: MESOS-3015 > URL: https://issues.apache.org/jira/browse/MESOS-3015 > Project: Mesos > Issue Type: Task >Reporter: Kapil Arya >Assignee: Kapil Arya > Labels: mesosphere > > The hook will be triggered on slave exits. A master hook module can use this > to do Slave-specific cleanups. > In our particular use case, the hook would trigger cleanup of IPs assigned to > the given Slave (see the [design doc | > https://docs.google.com/document/d/17mXtAmdAXcNBwp_JfrxmZcQrs7EO6ancSbejrqjLQ0g/edit#]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3420) Resolve shutdown semantics for Machine/Down
[ https://issues.apache.org/jira/browse/MESOS-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-3420: --- Assignee: Klaus Ma > Resolve shutdown semantics for Machine/Down > --- > > Key: MESOS-3420 > URL: https://issues.apache.org/jira/browse/MESOS-3420 > Project: Mesos > Issue Type: Task >Reporter: Joris Van Remoortere >Assignee: Klaus Ma > Labels: maintenance, mesosphere > > When an operator uses the {{machine/down}} endpoint, the master sends a > shutdown message to the agent. > We need to discuss and resolve the semantics that we want regarding the > operators and frameworks knowing when their tasks are terminated. > One option is to explicitly remove the agent from the master which will send > the {{TASK_LOST}} updates and {{SlaveLostMessage}} directly from the master. > The concern around this is that during a network partition, or if the agent > was down at the time, that these tasks could still be running. > This is a general problem related to task life-times being dissociated with > that life-time of the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3432) Unify the implementations of the image provisioners.
Jie Yu created MESOS-3432: - Summary: Unify the implementations of the image provisioners. Key: MESOS-3432 URL: https://issues.apache.org/jira/browse/MESOS-3432 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu The current design uses a separate provisioner implementation for each type of image (e.g., APPC, DOCKER). This creates a lot of code duplication. Since we already have unified provisioner backends (e.g., copy, bind, overlayfs), we should be able to unify the implementations of the image provisioners and hide the image-specific logic in the corresponding 'Store' implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
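A rough sketch of the proposed shape, one provisioner that delegates image-specific work to a per-type 'Store' behind a common interface, is below. The class and method names are simplified assumptions for illustration, not the actual Mesos classes.
{code:title=Sketch of a unified provisioner with per-image-type stores (assumed names)|borderStyle=solid}
// Standalone sketch of the proposed structure: a single provisioner that
// delegates image-specific work to per-type Store implementations and reuses
// one backend (copy/bind/overlayfs) for all of them. Simplified types only.
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

enum class ImageType { APPC, DOCKER };

// Image-specific logic lives behind this interface.
class Store {
 public:
  virtual ~Store() = default;
  // Fetch the image and return the layer paths to provision from.
  virtual std::vector<std::string> get(const std::string& image) = 0;
};

class AppcStore : public Store {
 public:
  std::vector<std::string> get(const std::string& image) override {
    return {"/tmp/store/appc/" + image};
  }
};

class DockerStore : public Store {
 public:
  std::vector<std::string> get(const std::string& image) override {
    return {"/tmp/store/docker/" + image};
  }
};

// One provisioner for all image types; no per-type duplication here.
class Provisioner {
 public:
  explicit Provisioner(std::map<ImageType, std::unique_ptr<Store>> stores)
    : stores_(std::move(stores)) {}

  std::string provision(ImageType type, const std::string& image) {
    std::vector<std::string> layers = stores_.at(type)->get(image);
    // A real implementation would hand the layers to a backend
    // (copy, bind, overlayfs) and return the rootfs it prepared.
    return "rootfs built from " + std::to_string(layers.size()) + " layer(s)";
  }

 private:
  std::map<ImageType, std::unique_ptr<Store>> stores_;
};

int main() {
  std::map<ImageType, std::unique_ptr<Store>> stores;
  stores[ImageType::APPC] = std::make_unique<AppcStore>();
  stores[ImageType::DOCKER] = std::make_unique<DockerStore>();
  Provisioner provisioner(std::move(stores));
  std::cout << provisioner.provision(ImageType::DOCKER, "busybox") << std::endl;
}
{code}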
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745881#comment-14745881 ] Jie Yu commented on MESOS-3430: --- [~haosd...@gmail.com] do you understand why this works? marking the sandbox as slave first and then shared makes me feel that the first operation is a no-op. > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745859#comment-14745859 ] haosdent edited comment on MESOS-3430 at 9/15/15 6:15 PM: -- Thank you very much! Ubuntu 14.04 (ext4) also has this problem if the root is marked as a shared mount point. Would it be simpler to use make-slave to stop propagation between mount points? That way we would not need to worry about the outside environment, in case someone uses make-private/make-shared to change the root mount point while the slave is running. was (Author: haosd...@gmail.com): Thank you very much! Ubuntu 14.04 (ext4) also have the problem like this if mark the root as shared mount point. Is it would more simple to use make-slave to stop propagate between mount points. So that we don't need concern about the outside environment. Assume someone use make-private/make-shared to change the root mount point during the slave running. > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3366) Allow resources/attributes discovery
[ https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745769#comment-14745769 ] Felix Abecassis commented on MESOS-3366: [~cdoyle] [~nnielsen] Thank you for the reviews on my initial patch. Before I add proper comments and tests, are you fine with this API proposal in the first place? For instance, how should we also tackle attributes decoration? Should we add a second hook or have one hook for both attributes and resources? I would say we should have 2 different hooks, but it might be at the cost of some code/tests duplication. > Allow resources/attributes discovery > > > Key: MESOS-3366 > URL: https://issues.apache.org/jira/browse/MESOS-3366 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Felix Abecassis > > In heterogeneous clusters, tasks sometimes have strong constraints on the > type of hardware they need to execute on. The current solution is to use > custom resources and attributes on the agents. Detecting non-standard > resources/attributes requires wrapping the "mesos-slave" binary behind a > script and use custom code to probe the agent. Unfortunately, this approach > doesn't allow composition. The solution would be to provide a hook/module > mechanism to allow users to use custom code performing resources/attributes > discovery. > Please review the detailed document below: > https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w > Feel free to express comments/concerns by annotating the document or by > replying to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
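To make the one-hook-versus-two-hooks question concrete, below is a standalone sketch of the two-hook option. The interface, method names, and types here are hypothetical illustrations, not the actual Mesos module/hook API.
{code:title=Hypothetical two-hook shape for resource and attribute discovery|borderStyle=solid}
// Standalone sketch of the "two hooks" option: one decorator for resources
// and one for attributes, each run at agent startup before registration.
// Hypothetical interface; not the Mesos Hook module API.
#include <iostream>
#include <map>
#include <string>

using Resources = std::map<std::string, double>;        // e.g. {"gpus": 4}
using Attributes = std::map<std::string, std::string>;  // e.g. {"gpu_model": "k40"}

class DiscoveryHook {
 public:
  virtual ~DiscoveryHook() = default;
  // Called with what the agent already knows; returns additions/overrides.
  virtual Resources slaveResourcesDecorator(const Resources& current) = 0;
  virtual Attributes slaveAttributesDecorator(const Attributes& current) = 0;
};

// Example hook that would probe the machine for GPUs (stubbed out here).
class GpuDiscoveryHook : public DiscoveryHook {
 public:
  Resources slaveResourcesDecorator(const Resources&) override {
    return {{"gpus", 4}};
  }
  Attributes slaveAttributesDecorator(const Attributes&) override {
    return {{"gpu_model", "k40"}};
  }
};

int main() {
  GpuDiscoveryHook hook;
  for (const auto& r : hook.slaveResourcesDecorator({})) {
    std::cout << r.first << "=" << r.second << std::endl;
  }
}
{code}
Keeping the two decorators separate mirrors the fact that resources and attributes are validated and merged differently on the agent, at the cost of some duplicated plumbing and tests.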
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745814#comment-14745814 ] Jie Yu commented on MESOS-3430: --- We should add a CHECK and abort if the parent mount of work_dir is a shared mount. > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
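A rough, standalone illustration of such a startup check is below: it scans /proc/self/mountinfo for the longest mount point containing the agent work_dir and looks for a shared: peer-group tag in the optional fields. The parsing is simplified, it does not use the Mesos fs helpers, and the work_dir path is a placeholder.
{code:title=Standalone sketch of a "work_dir parent mount is shared" check|borderStyle=solid}
// Standalone sketch of the suggested startup check: abort if the mount that
// contains the agent work_dir is marked "shared" (a shared:N peer group tag
// in /proc/self/mountinfo). Simplified parsing; not the Mesos fs:: helpers.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

bool parentMountIsShared(const std::string& workDir) {
  std::ifstream mountinfo("/proc/self/mountinfo");
  std::string line;
  std::string bestTarget;
  bool bestShared = false;

  while (std::getline(mountinfo, line)) {
    std::istringstream fields(line);
    std::vector<std::string> tokens;
    std::string token;
    while (fields >> token) tokens.push_back(token);
    if (tokens.size() < 7) continue;

    const std::string& target = tokens[4];  // mount point

    // Optional fields (e.g. "shared:1") sit between the options and "-".
    bool shared = false;
    for (size_t i = 6; i < tokens.size() && tokens[i] != "-"; i++) {
      if (tokens[i].rfind("shared:", 0) == 0) shared = true;
    }

    // Keep the longest mount point that is a prefix of workDir
    // (a simplified prefix check for illustration).
    if (workDir.rfind(target, 0) == 0 && target.size() > bestTarget.size()) {
      bestTarget = target;
      bestShared = shared;
    }
  }
  return bestShared;
}

int main() {
  const std::string workDir = "/tmp/mesos";  // placeholder agent work_dir
  if (parentMountIsShared(workDir)) {
    std::cerr << "work_dir is on a shared mount; aborting" << std::endl;
    abort();  // the CHECK-and-abort behaviour suggested above
  }
  std::cout << "work_dir mount is not shared" << std::endl;
}
{code}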
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745859#comment-14745859 ] haosdent commented on MESOS-3430: - Thank you very much! Ubuntu 14.04 (ext4) also have the problem like this if mark the root as shared mount point. Is it would more simple to use make-slave to stop propagate between mount points. So that we don't need concern about the outside environment. Assume someone use make-private/make-shared to change the root mount point during the slave running. > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745892#comment-14745892 ] haosdent commented on MESOS-3430: - According to this document, https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt, slave+shared is also a valid state, but I am not very clear about that. And I found another problem when I did a simple trial on Ubuntu 14.04: {code} $ mount --make-shared / $ mkdir /tmp/sandbox $ mkdir /tmp/source $ mkdir /tmp/sandbox/target $ mount --bind /tmp/sandbox /tmp/sandbox $ mount --make-shared /tmp/sandbox $ mount --bind /tmp/source /tmp/sandbox/target {code} It reproduces the problem: {code} 45 22 8:3 /tmp/sandbox /tmp/sandbox rw,relatime - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered 47 45 8:3 /tmp/source /tmp/sandbox/target rw,relatime shared:1 - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered 48 22 8:3 /tmp/source /tmp/sandbox/target rw,relatime shared:1 - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered {code} And if I change the root mount point to private before unmounting /tmp/sandbox/target, I find it cannot be fully unmounted. {code} $ mount --make-private /tmp/sandbox {code} The first unmount succeeds: {code} $ umount /tmp/sandbox/target {code} The second one fails: {code} $ umount /tmp/sandbox/target umount: /tmp/sandbox/target: not mounted {code} But the record still exists in /proc/self/mountinfo: {code} 45 22 8:3 /tmp/sandbox /tmp/sandbox rw,relatime - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered 48 22 8:3 /tmp/source /tmp/sandbox/target rw,relatime shared:1 - ext4 /dev/disk/by-uuid/98708f21-a59d-4b80-a85c-27b78c22e316 rw,errors=remount-ro,data=ordered {code} > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3386) Port remaining Stout tests to Windows
[ https://issues.apache.org/jira/browse/MESOS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745064#comment-14745064 ] Alex Clemmer commented on MESOS-3386: - [~hartem] Oh sorry I didn't see this until just now. I'll do it tomorrow (it's already almost 2) > Port remaining Stout tests to Windows > - > > Key: MESOS-3386 > URL: https://issues.apache.org/jira/browse/MESOS-3386 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: build, mesosphere, tests > > Here is a concise list of the Stout tests that don't work yet, and their > dependencies, and comments about how hard they are to port. Asterisks are > next to tests that seem to block Windows MVP. > {quote} > *dynamiclibrary_tests.cpp -- depends on dynamic load libraries [probably > easy, just map to windows dll load API] > *flags_tests.cpp -- depends on os.hpp [probably will "just work" if we port > os.hpp > *gzip_tests.cpp -- depends on gzip.hpp [need to make API-compatible impl of > gzip.hpp, which is a medium amount of work] > *ip_tests.cpp -- depends on net.hpp and abort.hpp [will probably "just work" > after we port net.hpp] > *mac_tests.cpp -- depends on abort.hpp and mac.hpp [may or may not be > nontrivial, will probably work if we can get mac.hpp] > *os_tests.cpp -- depends on a bunch of stuff [probably hardest and most > important] > *path_tests.cpp -- depends on os.hpp [will probably "just work" if we port > os.hpp] > protobuf_tests.cpp -- depends on stout/protobuf.hpp (and it can't seem to > find the protobuf include dir) > *sendfile_test.cpp -- depends on os.hpp and sendfile.hpp [simple port of > sendfile is possible; os.hpp is harder] > signals_tests.cpp -- depends on os.hpp and signal.hpp [signals will probably > be easy; os.hpp is the hard part] > *subcommand_tests.cpp -- depends on flags.hpp (which depends on os.hpp) > [probably will "just work" if we get os.hpp] > svn_tests.cpp -- depends on libapr and libsvn [simple if we get windows to > pull these deps] > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3157) only perform batch resource allocations
[ https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745087#comment-14745087 ] Klaus Ma commented on MESOS-3157: - [~bmahler], is it possible to ask the framework to handle such short tasks itself? For the short-running-task case, build a framework that loads a long-running executor to run the tasks; the framework holds the resources until all tasks are done, and dispatches the next task to the executor without any waiting time. > only perform batch resource allocations > --- > > Key: MESOS-3157 > URL: https://issues.apache.org/jira/browse/MESOS-3157 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: James Peach >Assignee: James Peach > > Our deployment environments have a lot of churn, with many short-lived > frameworks that often revive offers. Running the allocator takes a long time > (from seconds up to minutes). > In this situation, event-triggered allocation causes the event queue in the > allocator process to get very long, and the allocator effectively becomes > unresponsive (eg. a revive offers message takes too long to come to the head > of the queue). > We have been running a patch to remove all the event-triggered allocations > and only allocate from the batch task > {{HierarchicalAllocatorProcess::batch}}. This works great and really improves > responsiveness. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
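For reference, here is a standalone model of the batch-only approach described in the issue: allocation triggers merely set a dirty flag, and a periodic loop performs at most one allocation pass per interval regardless of how many events arrived. The types, the interval, and the loop shape are simplified assumptions, not the Mesos HierarchicalAllocatorProcess.
{code:title=Standalone model of batch-only (coalesced) allocation|borderStyle=solid}
// Standalone model of batch-only allocation: events (framework added, offers
// revived, ...) merely mark the allocator dirty, and a periodic loop runs at
// most one allocation pass per interval. Not the Mesos allocator itself.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> dirty{false};

// Called from event handlers instead of allocating inline.
void triggerAllocation() { dirty = true; }

void allocate() { std::cout << "running one allocation pass" << std::endl; }

int main() {
  using namespace std::chrono_literals;

  // Simulate a burst of events; they coalesce into a single pass.
  for (int i = 0; i < 100; i++) triggerAllocation();

  for (int tick = 0; tick < 3; tick++) {  // stand-in for the batch timer
    std::this_thread::sleep_for(100ms);   // stand-in allocation interval
    if (dirty.exchange(false)) {
      allocate();                         // one pass, regardless of burst size
    }
  }
}
{code}
Coalescing bursts of revive/add events into a single periodic pass is what keeps the allocator's event queue short and the process responsive.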
[jira] [Updated] (MESOS-3408) Labels field of FrameworkInfo should be added into v1 mesos.proto
[ https://issues.apache.org/jira/browse/MESOS-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang updated MESOS-3408: -- Shepherd: Adam B > Labels field of FrameworkInfo should be added into v1 mesos.proto > - > > Key: MESOS-3408 > URL: https://issues.apache.org/jira/browse/MESOS-3408 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Assignee: Qian Zhang > Fix For: 0.25.0 > > > In [MESOS-2841|https://issues.apache.org/jira/browse/MESOS-2841], a new field > "Labels" has been added into FrameworkInfo in mesos.proto, but is missed in > v1 mesos.proto. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1845) CommandInfo tasks may fail when scheduled after another task with the same id has finished.
[ https://issues.apache.org/jira/browse/MESOS-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-1845: Assignee: (was: Klaus Ma) > CommandInfo tasks may fail when scheduled after another task with the same id > has finished. > --- > > Key: MESOS-1845 > URL: https://issues.apache.org/jira/browse/MESOS-1845 > Project: Mesos > Issue Type: Bug >Reporter: Andreas Raster > > I created a little test framework where I wanted to experiment with > scheduling tasks where running one task relies on the results of another, > previously run task. So in my test framework I would first schedule a task > that would append the string "foo" to a file, and after that one finishes I > would schedule a task that appends "bar" to the same file. > This worked well when using ExecutorInfo, but when I switched to using > CommandInfo instead (specifying commands like 'echo foo >> /share/foobar.txt' > in set_value()), it would most of the time fail in the second step when > attempting to append "bar". Occasionally, but very rarely, it would work > though. > I couldn't find any meaningful log messages indicating what exactly went > wrong. The slave log would indicate that the tasks status changed to > TASK_FAILED and that that status update was sent correctly. The stdout log in > the Sandbox would indicate that the command 'exited with status 0'. > I could work around the issue when I specified task ids that were always > unique. Previously I would reuse the id of a previously run task, one that > appended "foo" to a file, after it finished in the followup task that would > append "bar" to a file. > It seems to me there might be something wrong when scheduling very short > running tasks with the same id quickly after each other. > Source code for my foobar framework: > http://paste.ubuntu.com/8459083 > Build with: > g++ -std=c++0x -g -Wall foobar_framework.cpp -I. -L/usr/local/lib -lmesos -o > foobar-framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3422) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745606#comment-14745606 ] Guangya Liu edited comment on MESOS-3422 at 9/15/15 3:39 PM: - I test on Ubuntu and works well. [~vi...@twitter.com] does this related to platform? Thansk. {code} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from MasterSlaveReconciliationTest [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn' I0915 22:28:40.800787 3733 leveldb.cpp:176] Opened db in 252.206266ms I0915 22:28:40.851069 3733 leveldb.cpp:183] Compacted db in 50.197346ms I0915 22:28:40.851210 3733 leveldb.cpp:198] Created db iterator in 63324ns I0915 22:28:40.851256 3733 leveldb.cpp:204] Seeked to beginning of db in 4562ns I0915 22:28:40.851286 3733 leveldb.cpp:273] Iterated through 0 keys in the db in 322ns I0915 22:28:40.871953 3733 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0915 22:28:40.886368 3756 recover.cpp:449] Starting replica recovery I0915 22:28:40.90 3756 recover.cpp:475] Replica is in EMPTY status I0915 22:28:40.916332 3759 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0915 22:28:40.917351 3756 recover.cpp:195] Received a recover response from a replica in EMPTY status I0915 22:28:40.918557 3755 recover.cpp:566] Updating replica status to STARTING I0915 22:28:40.928189 3759 master.cpp:380] Master 20150915-222840-16842879-54960-3733 (devstack007.cn.ibm.com) started on 127.0.1.1:54960 I0915 22:28:40.928261 3759 master.cpp:382] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials" --framework_sorter="drf" --help="false" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/master" --zk_session_timeout="10secs" I0915 22:28:40.993895 3759 master.cpp:427] Master only allowing authenticated frameworks to register I0915 22:28:40.993962 3759 master.cpp:432] Master only allowing authenticated slaves to register I0915 22:28:40.994010 3759 credentials.hpp:37] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials' I0915 22:28:40.994776 3759 master.cpp:471] Using default 'crammd5' authenticator I0915 22:28:40.995053 3759 authenticator.cpp:512] Initializing server SASL I0915 22:28:41.009496 3757 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 90.341573ms I0915 22:28:41.009570 3757 replica.cpp:323] Persisted replica status to STARTING I0915 22:28:41.010040 3756 recover.cpp:475] Replica is in STARTING status I0915 22:28:41.011255 3757 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0915 22:28:41.011551 3752 recover.cpp:195] Received a recover response 
from a replica in STARTING status I0915 22:28:41.012073 3756 recover.cpp:566] Updating replica status to VOTING I0915 22:28:41.084720 3753 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 72.469042ms I0915 22:28:41.084803 3753 replica.cpp:323] Persisted replica status to VOTING I0915 22:28:41.084935 3752 recover.cpp:580] Successfully joined the Paxos group I0915 22:28:41.085227 3752 recover.cpp:464] Recover process terminated I0915 22:28:41.191287 3759 auxprop.cpp:66] Initialized in-memory auxiliary property plugin I0915 22:28:41.191455 3759 master.cpp:508] Authorization enabled I0915 22:28:41.192039 3758 hierarchical.hpp:408] Initialized hierarchical allocator process I0915 22:28:41.210978 3752 whitelist_watcher.cpp:79] No whitelist given I0915 22:28:41.226894 3757 master.cpp:1605] The newly elected leader is master@127.0.1.1:54960 with id 20150915-222840-16842879-54960-3733 I0915 22:28:41.227022 3757 master.cpp:1618] Elected as the leading mast
[jira] [Commented] (MESOS-3422) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745606#comment-14745606 ] Guangya Liu commented on MESOS-3422: I test on Ubuntu and works well. [~vi...@twitter.com] does this related to platform? Thansk. [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from MasterSlaveReconciliationTest [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn' I0915 22:28:40.800787 3733 leveldb.cpp:176] Opened db in 252.206266ms I0915 22:28:40.851069 3733 leveldb.cpp:183] Compacted db in 50.197346ms I0915 22:28:40.851210 3733 leveldb.cpp:198] Created db iterator in 63324ns I0915 22:28:40.851256 3733 leveldb.cpp:204] Seeked to beginning of db in 4562ns I0915 22:28:40.851286 3733 leveldb.cpp:273] Iterated through 0 keys in the db in 322ns I0915 22:28:40.871953 3733 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0915 22:28:40.886368 3756 recover.cpp:449] Starting replica recovery I0915 22:28:40.90 3756 recover.cpp:475] Replica is in EMPTY status I0915 22:28:40.916332 3759 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0915 22:28:40.917351 3756 recover.cpp:195] Received a recover response from a replica in EMPTY status I0915 22:28:40.918557 3755 recover.cpp:566] Updating replica status to STARTING I0915 22:28:40.928189 3759 master.cpp:380] Master 20150915-222840-16842879-54960-3733 (devstack007.cn.ibm.com) started on 127.0.1.1:54960 I0915 22:28:40.928261 3759 master.cpp:382] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials" --framework_sorter="drf" --help="false" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/master" --zk_session_timeout="10secs" I0915 22:28:40.993895 3759 master.cpp:427] Master only allowing authenticated frameworks to register I0915 22:28:40.993962 3759 master.cpp:432] Master only allowing authenticated slaves to register I0915 22:28:40.994010 3759 credentials.hpp:37] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials' I0915 22:28:40.994776 3759 master.cpp:471] Using default 'crammd5' authenticator I0915 22:28:40.995053 3759 authenticator.cpp:512] Initializing server SASL I0915 22:28:41.009496 3757 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 90.341573ms I0915 22:28:41.009570 3757 replica.cpp:323] Persisted replica status to STARTING I0915 22:28:41.010040 3756 recover.cpp:475] Replica is in STARTING status I0915 22:28:41.011255 3757 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0915 22:28:41.011551 3752 recover.cpp:195] Received a recover response from a replica in STARTING status 
I0915 22:28:41.012073 3756 recover.cpp:566] Updating replica status to VOTING I0915 22:28:41.084720 3753 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 72.469042ms I0915 22:28:41.084803 3753 replica.cpp:323] Persisted replica status to VOTING I0915 22:28:41.084935 3752 recover.cpp:580] Successfully joined the Paxos group I0915 22:28:41.085227 3752 recover.cpp:464] Recover process terminated I0915 22:28:41.191287 3759 auxprop.cpp:66] Initialized in-memory auxiliary property plugin I0915 22:28:41.191455 3759 master.cpp:508] Authorization enabled I0915 22:28:41.192039 3758 hierarchical.hpp:408] Initialized hierarchical allocator process I0915 22:28:41.210978 3752 whitelist_watcher.cpp:79] No whitelist given I0915 22:28:41.226894 3757 master.cpp:1605] The newly elected leader is master@127.0.1.1:54960 with id 20150915-222840-16842879-54960-3733 I0915 22:28:41.227022 3757 master.cpp:1618] Elected as the leading master! I0915 22:28:41.227073 3757 master.cpp:1378] Reco
[jira] [Issue Comment Deleted] (MESOS-3422) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-3422: --- Comment: was deleted (was: [~vi...@twitter.com] I'm sorry that I updated the problem description by mistake, can you please help update the description again? Thanks!) > MasterSlaveReconciliationTest.ReconcileLostTask test is flaky > - > > Key: MESOS-3422 > URL: https://issues.apache.org/jira/browse/MESOS-3422 > Project: Mesos > Issue Type: Bug > Components: technical debt, test >Affects Versions: 0.25.0 > Environment: CentOS >Reporter: Vinod Kone > > Observed this on internal CI > {code} > DEBUG: [--] 5 tests from MasterSlaveReconciliationTest > DEBUG: [ RUN ] MasterSlaveReconciliationTest.SlaveReregisterTerminatedExecutor > DEBUG: Using temporary directory > '/tmp/MasterSlaveReconciliationTest_SlaveReregisterTerminatedExecutor_QJPUzf' > DEBUG: [ OK ] MasterSlaveReconciliationTest.SlaveReregisterTerminatedExecutor > (78 ms) > DEBUG: [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask > DEBUG: Using temporary directory > '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_16KDgE' > DEBUG: tests/master_slave_reconciliation_tests.cpp:226: Failure > DEBUG: Failed to wait 15secs for statusUpdateMessage > DEBUG: tests/master_slave_reconciliation_tests.cpp:216: Failure > DEBUG: Actual function call count doesn't match EXPECT_CALL(sched, > statusUpdate(, _))... > DEBUG: Expected: to be called once > DEBUG: Actual: never called - unsatisfied and active > DEBUG: I0914 08:51:27.825984 16062 leveldb.cpp:438] Reading position from > leveldb took 16151ns > DEBUG: I0914 08:51:27.828069 16049 registrar.cpp:342] Successfully fetched > the registry (0B) in 7648us > DEBUG: I0914 08:51:27.828119 16049 registrar.cpp:441] Applied 1 operations in > 2805ns; attempting to update the 'registry' > DEBUG: I0914 08:51:27.829991 16066 log.cpp:685] Attempting to append 222 > bytes to the log > DEBUG: I0914 08:51:27.830029 16066 coordinator.cpp:341] Coordinator > attempting to write APPEND action at position 1 > DEBUG: I0914 08:51:27.830729 16053 replica.cpp:511] Replica received write > request for position 1 > DEBUG: I0914 08:51:27.831167 16053 leveldb.cpp:343] Persisting action (241 > bytes) to leveldb took 414748ns > DEBUG: I0914 08:51:27.831185 16053 replica.cpp:679] Persisted action at 1 > DEBUG: I0914 08:51:27.831493 16058 replica.cpp:658] Replica received learned > notice for position 1 > DEBUG: I0914 08:51:27.831698 16058 leveldb.cpp:343] Persisting action (243 > bytes) to leveldb took 185223ns > DEBUG: I0914 08:51:27.831714 16058 replica.cpp:679] Persisted action at 1 > DEBUG: I0914 08:51:27.831722 16058 replica.cpp:664] Replica learned APPEND > action at position 1 > DEBUG: I0914 08:51:27.831989 16056 registrar.cpp:486] Successfully updated > the 'registry' in 3.827968ms > DEBUG: I0914 08:51:27.832041 16052 log.cpp:704] Attempting to truncate the > log to 1 > DEBUG: I0914 08:51:27.832093 16056 registrar.cpp:372] Successfully recovered > registrar > DEBUG: I0914 08:51:27.832259 16072 coordinator.cpp:341] Coordinator > attempting to write TRUNCATE action at position 2 > DEBUG: I0914 08:51:27.832259 16062 master.cpp:1404] Recovered 0 slaves from > the Registry (183B) ; allowing 10mins for slaves to re-register > DEBUG: I0914 08:51:27.832882 16060 replica.cpp:511] Replica received write > request for position 2 > DEBUG: I0914 08:51:27.833243 16060 leveldb.cpp:343] Persisting action (16 > bytes) to leveldb took 340843ns > DEBUG: I0914 08:51:27.833261 16060 
replica.cpp:679] Persisted action at 2 > DEBUG: I0914 08:51:27.833593 16050 replica.cpp:658] Replica received learned > notice for position 2 > DEBUG: I0914 08:51:27.833724 16050 leveldb.cpp:343] Persisting action (18 > bytes) to leveldb took 112560ns > DEBUG: I0914 08:51:27.833755 16050 leveldb.cpp:401] Deleting ~1 keys from > leveldb took 16580ns > DEBUG: I0914 08:51:27.833765 16050 replica.cpp:679] Persisted action at 2 > DEBUG: I0914 08:51:27.833775 16050 replica.cpp:664] Replica learned TRUNCATE > action at position 2 > DEBUG: I0914 08:51:27.843340 16057 http.cpp:333] HTTP POST for > /master/maintenance/schedule from 172.18.4.102:46471 > DEBUG: I0914 08:51:27.843801 16050 registrar.cpp:441] Applied 1 operations in > 25197ns; attempting to update the 'registry' > DEBUG: I0914 08:51:27.845721 16068 log.cpp:685] Attempting to append 328 > bytes to the log > DEBUG: I0914 08:51:27.845772 16068 coordinator.cpp:341] Coordinator > attempting to write APPEND action at position 3 > DEBUG: I0914 08:51:27.846606 16052 replica.cpp:511] Replica received write > request for position 3 > DEBUG: I0914 08:51:27.847012 16052 leveldb.cpp:343] Persisting action (347 > bytes) to leveldb took 387519ns > DEBUG:
[jira] [Updated] (MESOS-3422) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-3422: --- Description: [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from MasterSlaveReconciliationTest [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn' I0915 22:28:40.800787 3733 leveldb.cpp:176] Opened db in 252.206266ms I0915 22:28:40.851069 3733 leveldb.cpp:183] Compacted db in 50.197346ms I0915 22:28:40.851210 3733 leveldb.cpp:198] Created db iterator in 63324ns I0915 22:28:40.851256 3733 leveldb.cpp:204] Seeked to beginning of db in 4562ns I0915 22:28:40.851286 3733 leveldb.cpp:273] Iterated through 0 keys in the db in 322ns I0915 22:28:40.871953 3733 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0915 22:28:40.886368 3756 recover.cpp:449] Starting replica recovery I0915 22:28:40.90 3756 recover.cpp:475] Replica is in EMPTY status I0915 22:28:40.916332 3759 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0915 22:28:40.917351 3756 recover.cpp:195] Received a recover response from a replica in EMPTY status I0915 22:28:40.918557 3755 recover.cpp:566] Updating replica status to STARTING I0915 22:28:40.928189 3759 master.cpp:380] Master 20150915-222840-16842879-54960-3733 (devstack007.cn.ibm.com) started on 127.0.1.1:54960 I0915 22:28:40.928261 3759 master.cpp:382] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials" --framework_sorter="drf" --help="false" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/master" --zk_session_timeout="10secs" I0915 22:28:40.993895 3759 master.cpp:427] Master only allowing authenticated frameworks to register I0915 22:28:40.993962 3759 master.cpp:432] Master only allowing authenticated slaves to register I0915 22:28:40.994010 3759 credentials.hpp:37] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials' I0915 22:28:40.994776 3759 master.cpp:471] Using default 'crammd5' authenticator I0915 22:28:40.995053 3759 authenticator.cpp:512] Initializing server SASL I0915 22:28:41.009496 3757 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 90.341573ms I0915 22:28:41.009570 3757 replica.cpp:323] Persisted replica status to STARTING I0915 22:28:41.010040 3756 recover.cpp:475] Replica is in STARTING status I0915 22:28:41.011255 3757 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0915 22:28:41.011551 3752 recover.cpp:195] Received a recover response from a replica in STARTING status I0915 22:28:41.012073 3756 recover.cpp:566] Updating replica status to VOTING I0915 22:28:41.084720 3753 
leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 72.469042ms I0915 22:28:41.084803 3753 replica.cpp:323] Persisted replica status to VOTING I0915 22:28:41.084935 3752 recover.cpp:580] Successfully joined the Paxos group I0915 22:28:41.085227 3752 recover.cpp:464] Recover process terminated I0915 22:28:41.191287 3759 auxprop.cpp:66] Initialized in-memory auxiliary property plugin I0915 22:28:41.191455 3759 master.cpp:508] Authorization enabled I0915 22:28:41.192039 3758 hierarchical.hpp:408] Initialized hierarchical allocator process I0915 22:28:41.210978 3752 whitelist_watcher.cpp:79] No whitelist given I0915 22:28:41.226894 3757 master.cpp:1605] The newly elected leader is master@127.0.1.1:54960 with id 20150915-222840-16842879-54960-3733 I0915 22:28:41.227022 3757 master.cpp:1618] Elected as the leading master! I0915 22:28:41.227073 3757 master.cpp:1378] Recovering from registrar I0915 22:28:41.227442 3756 registrar.cpp:309] Recovering registrar I0915 22:28:41.228864 3759 lo
[jira] [Commented] (MESOS-3419) Add HELP message for reserve/unreserve endpoint
[ https://issues.apache.org/jira/browse/MESOS-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745253#comment-14745253 ] Guangya Liu commented on MESOS-3419: [~mcypark] Any comments on the RR? Thanks. > Add HELP message for reserve/unreserve endpoint > --- > > Key: MESOS-3419 > URL: https://issues.apache.org/jira/browse/MESOS-3419 > Project: Mesos > Issue Type: Task >Affects Versions: 0.25.0 >Reporter: Guangya Liu >Assignee: Guangya Liu > Fix For: 0.25.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2077) Ensure that TASK_LOSTs for a hard slave drain (SIGUSR1) include a Reason.
[ https://issues.apache.org/jira/browse/MESOS-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745262#comment-14745262 ] Guangya Liu commented on MESOS-2077: [~bmahler] could you please share your comments on the RR? Thanks! > Ensure that TASK_LOSTs for a hard slave drain (SIGUSR1) include a Reason. > - > > Key: MESOS-2077 > URL: https://issues.apache.org/jira/browse/MESOS-2077 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Benjamin Mahler >Assignee: Guangya Liu > Labels: mesosphere, twitter > > For maintenance, sometimes operators will force the drain of a slave (via > SIGUSR1), when deemed safe (e.g. non-critical tasks running) and/or necessary > (e.g. bad hardware). > To eliminate alerting noise, we'd like to add a 'Reason' that expresses the > forced drain of the slave, so that these are not considered to be a generic > slave removal TASK_LOST. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
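As a side note on MESOS-2077: a minimal sketch of what tagging these drain-induced TASK_LOST updates could look like on the agent side. The reason name REASON_SLAVE_DRAINED is purely hypothetical (the ticket has not settled on one), and the helper below is an illustration rather than the agent's actual code path.
{code}
// Sketch only: when the slave is hard-drained via SIGUSR1, build TASK_LOST
// updates that carry a dedicated reason instead of looking like a generic
// slave removal. REASON_SLAVE_DRAINED is a hypothetical enum value.
#include <mesos/mesos.hpp>

mesos::TaskStatus makeDrainLostStatus(const mesos::TaskID& taskId)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskId);
  status.set_state(mesos::TASK_LOST);
  status.set_source(mesos::TaskStatus::SOURCE_SLAVE);
  // status.set_reason(mesos::TaskStatus::REASON_SLAVE_DRAINED);  // hypothetical
  status.set_message("Slave drained by operator (SIGUSR1)");
  return status;
}
{code}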
[jira] [Commented] (MESOS-2224) Add explanatory comments for Allocator interface
[ https://issues.apache.org/jira/browse/MESOS-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745259#comment-14745259 ] Guangya Liu commented on MESOS-2224: [~mcypark] [~alex-mesos] can you help review this? It has been reviewed for several rounds. Thanks! > Add explanatory comments for Allocator interface > > > Key: MESOS-2224 > URL: https://issues.apache.org/jira/browse/MESOS-2224 > Project: Mesos > Issue Type: Task > Components: allocation >Affects Versions: 0.25.0 >Reporter: Alexander Rukletsov >Assignee: Guangya Liu >Priority: Minor > Fix For: 0.25.0 > > > Allocator is the public API and it would be great to have comments on all > calls to be implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3184) Scheduler driver accepts (re-)registration message while re-authentication is in progress.
[ https://issues.apache.org/jira/browse/MESOS-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727511#comment-14727511 ] Guangya Liu edited comment on MESOS-3184 at 9/15/15 11:06 AM: -- [~bmahler] can you please give more detail on this? Based on my understanding, if authentication fails, the master will send a FrameworkErrorMessage to the framework and the registration will fail. {code} if (authorizationError.isSome()) { LOG(INFO) << "Refusing subscription of framework" << " '" << frameworkInfo.name() << "'" << ": " << authorizationError.get().message; FrameworkErrorMessage message; message.set_message(authorizationError.get().message); http.send(message); http.close(); return; } {code} was (Author: gyliu): [~bmahler] can you please show more detail for this? Based on my understanding, if authentication failure, the master will send FrameworkErrorMessage to framework and framework will register failed. if (authorizationError.isSome()) { LOG(INFO) << "Refusing subscription of framework" << " '" << frameworkInfo.name() << "'" << ": " << authorizationError.get().message; FrameworkErrorMessage message; message.set_message(authorizationError.get().message); http.send(message); http.close(); return; } > Scheduler driver accepts (re-)registration message while re-authentication is > in progress. > -- > > Key: MESOS-3184 > URL: https://issues.apache.org/jira/browse/MESOS-3184 > Project: Mesos > Issue Type: Bug > Components: scheduler driver >Reporter: Benjamin Mahler >Assignee: Guangya Liu > > The scheduler driver currently accepts (re-)registration messages while it is > re-authenticating with the master. This can occur due to a race between the > authentication timeout and the master sending a (re-)registration message. > This is fairly innocuous currently, but if the subsequent re-authentication > fails, the driver keeps retrying authentication, but both the master and > driver continue to act as though the scheduler is registered. > The authentication check in _(re-)registerFramework in the master doesn't > provide any benefit here; it is still a race, so it should likely be > removed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
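For readers following the discussion, a minimal sketch of the driver-side guard being talked about: ignore (re-)registration messages while an authentication attempt is still in flight. The member name authenticating and the simplified signature are assumptions for illustration; they are not asserted to match the actual sched.cpp.
{code}
// Sketch only: drop (re-)registration messages that race with an ongoing
// (re-)authentication, so the driver never behaves as "registered" while it
// is still authenticating. Names and signature are simplified assumptions.
void SchedulerProcess::registered(
    const FrameworkID& frameworkId,
    const MasterInfo& masterInfo)
{
  if (authenticating.isSome()) {
    // Registration raced with the authentication timeout; wait for the
    // authentication round to finish instead of accepting it now.
    VLOG(1) << "Ignoring framework registered message while authenticating";
    return;
  }

  // ... existing registration handling ...
}
{code}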
[jira] [Commented] (MESOS-1935) Replace hard-coded reap interval with a constant
[ https://issues.apache.org/jira/browse/MESOS-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745263#comment-14745263 ] Guangya Liu commented on MESOS-1935: [~bmahler] [~alex-mesos] This has been reviewed for several rounds; can you help review the latest RR again? Thanks! > Replace hard-coded reap interval with a constant > > > Key: MESOS-1935 > URL: https://issues.apache.org/jira/browse/MESOS-1935 > Project: Mesos > Issue Type: Task > Components: test >Affects Versions: 0.25.0 >Reporter: Alexander Rukletsov >Assignee: Guangya Liu >Priority: Trivial > Labels: newbie > Fix For: 0.25.0 > > > With https://issues.apache.org/jira/browse/MESOS-1846 implemented, replace > the hard-coded value for the maximal reap interval (1s) with the constant > from {{reap.hpp}}. This will mostly affect tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
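For context, the change described above is essentially a one-line substitution in the affected tests; a sketch, assuming the constant is exposed as process::MAX_REAP_INTERVAL() in libprocess's reap.hpp and that the tests currently advance the test clock by the hard-coded interval:
{code}
#include <process/clock.hpp>
#include <process/reap.hpp>

// Before: the test advanced the clock by a hard-coded one second.
//   process::Clock::advance(Seconds(1));

// After: use the constant from reap.hpp so the tests stay in sync with the
// reaper's actual maximum polling interval.
process::Clock::advance(process::MAX_REAP_INTERVAL());
{code}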
[jira] [Commented] (MESOS-2647) Slave should validate tasks using oversubscribed resources
[ https://issues.apache.org/jira/browse/MESOS-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745268#comment-14745268 ] Guangya Liu commented on MESOS-2647: [~vi...@twitter.com] One question for this task: how can this case happen? In my understanding, when a task is launched the revocable resources should be available, and there should be no task already running on those revocable resources. Revocable resources are only released when the QoS controller makes a correction by killing some executors/tasks. Comments? Thanks! > Slave should validate tasks using oversubscribed resources > -- > > Key: MESOS-2647 > URL: https://issues.apache.org/jira/browse/MESOS-2647 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: Guangya Liu > Labels: twitter > > The latest oversubscribed resource estimate might render a revocable task > launch invalid. The slave should check this and send TASK_LOST with an appropriate > REASON. > We need to add a new REASON for this (REASON_RESOURCE_OVERSUBSCRIBED?). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
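To make the question concrete, here is a rough sketch of the validation the ticket asks the slave to perform. The helper is hypothetical, and REASON_RESOURCE_OVERSUBSCRIBED is only the name the ticket itself floats, not an existing enum value.
{code}
// Sketch only: before launching, verify that the task's revocable resources
// are still covered by the agent's latest oversubscribed resource estimate.
// The helper name and the REASON value are hypothetical.
#include <mesos/resources.hpp>

#include <stout/error.hpp>
#include <stout/option.hpp>
#include <stout/stringify.hpp>

Option<Error> validateRevocable(
    const mesos::Resources& taskResources,
    const mesos::Resources& oversubscribed)
{
  const mesos::Resources revocable = taskResources.revocable();

  if (!oversubscribed.contains(revocable)) {
    return Error(
        "Task uses revocable resources " + stringify(revocable) +
        " beyond the oversubscribed estimate " + stringify(oversubscribed));
  }

  return None();
}

// On failure the slave would send TASK_LOST with a new reason, e.g.
// TaskStatus::REASON_RESOURCE_OVERSUBSCRIBED (not yet in mesos.proto).
{code}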
[jira] [Commented] (MESOS-3169) FrameworkInfo should only be updated if the re-registration is valid
[ https://issues.apache.org/jira/browse/MESOS-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745255#comment-14745255 ] Guangya Liu commented on MESOS-3169: [~jvanremoortere] can you help review the RR? Thanks. > FrameworkInfo should only be updated if the re-registration is valid > > > Key: MESOS-3169 > URL: https://issues.apache.org/jira/browse/MESOS-3169 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Joris Van Remoortere >Assignee: Guangya Liu > Labels: framework, master, mesosphere, tech-debt > Fix For: 0.25.0 > > > See Ben Mahler's comment in https://reviews.apache.org/r/32961/ > FrameworkInfo should not be updated if the re-registration is invalid. This > can happen in a few cases under the branching logic, so this requires some > refactoring. > Notice that a {code}FrameworkErrorMessage{code} can be generated both inside > {code}else if (from != framework->pid){code} as well as from inside > {code}failoverFramework(framework, from);{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3037) Add a QUIESCE call to the scheduler
[ https://issues.apache.org/jira/browse/MESOS-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745258#comment-14745258 ] Guangya Liu commented on MESOS-3037: [~vi...@twitter.com] All of your comments are now addressed; can you help review? Thanks! > Add a QUIESCE call to the scheduler > --- > > Key: MESOS-3037 > URL: https://issues.apache.org/jira/browse/MESOS-3037 > Project: Mesos > Issue Type: Improvement >Affects Versions: 0.25.0 >Reporter: Vinod Kone >Assignee: Guangya Liu > Labels: September23th > Fix For: 0.25.0 > > > The SUPPRESS call is the complement of the current REVIVE call, i.e. it will > inform Mesos to stop sending offers to the framework. > For the scheduler driver to send only Call messages (MESOS-2913), > DeactivateFrameworkMessage needs to be converted to Call(s). We can implement > this by having the driver send a SUPPRESS call followed by a DECLINE call for > outstanding offers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
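For illustration, a hedged sketch of the conversion described above: the driver emulates the old DeactivateFrameworkMessage with a SUPPRESS call plus DECLINE calls for the offers it is still holding. The bookkeeping of outstanding offers and the send() helper are assumptions standing in for the driver's internals, and SUPPRESS is the call this very ticket introduces.
{code}
// Sketch only: replace DeactivateFrameworkMessage with Call messages.
// 'outstandingOffers' and 'send(master, ...)' are stand-ins for whatever
// bookkeeping and messaging the driver actually uses. The generated
// protobuf headers for Call, FrameworkID and OfferID are assumed included.
void suppressAndDecline(
    const FrameworkID& frameworkId,
    const std::vector<OfferID>& outstandingOffers)
{
  // 1. Tell the master to stop sending offers to this framework.
  Call suppress;
  suppress.mutable_framework_id()->CopyFrom(frameworkId);
  suppress.set_type(Call::SUPPRESS);
  send(master, suppress);

  // 2. Decline every offer the framework still holds so the resources are
  //    returned to the allocator immediately.
  Call decline;
  decline.mutable_framework_id()->CopyFrom(frameworkId);
  decline.set_type(Call::DECLINE);
  foreach (const OfferID& offerId, outstandingOffers) {
    decline.mutable_decline()->add_offer_ids()->CopyFrom(offerId);
  }
  send(master, decline);
}
{code}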
[jira] [Created] (MESOS-3431) Refactor Protobuf tests
Alexander Rukletsov created MESOS-3431: -- Summary: Refactor Protobuf tests Key: MESOS-3431 URL: https://issues.apache.org/jira/browse/MESOS-3431 Project: Mesos Issue Type: Task Components: test Reporter: Alexander Rukletsov Priority: Minor {{ProtobufTest.JSON}} test does several things simultaneously, including message instantiation, conversion, parsing. We should split this test into several independent ones that test just one thing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
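A hedged illustration of the requested split: rather than one test that instantiates, converts, and parses in a single body, each concern gets its own narrowly scoped test. The test names, the use of FrameworkID as the sample message, and the assumption that stout exposes a JSON::protobuf conversion helper are all placeholders, not the actual stout test contents.
{code}
// Sketch only: split ProtobufTest.JSON into focused tests. FrameworkID is
// just a convenient small message for illustration.
#include <gtest/gtest.h>

#include <mesos/mesos.hpp>

#include <stout/gtest.hpp>
#include <stout/json.hpp>
#include <stout/protobuf.hpp>
#include <stout/try.hpp>

using mesos::FrameworkID;

TEST(ProtobufTest, ConvertToJson)
{
  FrameworkID id;
  id.set_value("framework-1");

  // Only protobuf -> JSON conversion is exercised here.
  JSON::Object object = JSON::protobuf(id);
  EXPECT_EQ("framework-1", object.values["value"].as<JSON::String>().value);
}

TEST(ProtobufTest, ParseFromJson)
{
  // Only JSON -> protobuf parsing is exercised here.
  Try<JSON::Value> json = JSON::parse("{\"value\": \"framework-1\"}");
  ASSERT_SOME(json);

  Try<FrameworkID> id = ::protobuf::parse<FrameworkID>(json.get());
  ASSERT_SOME(id);
  EXPECT_EQ("framework-1", id.get().value());
}
{code}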
[jira] [Updated] (MESOS-3408) Labels field of FrameworkInfo should be added into v1 mesos.proto
[ https://issues.apache.org/jira/browse/MESOS-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-3408: -- Labels: mesosphere (was: ) > Labels field of FrameworkInfo should be added into v1 mesos.proto > - > > Key: MESOS-3408 > URL: https://issues.apache.org/jira/browse/MESOS-3408 > Project: Mesos > Issue Type: Bug >Reporter: Qian Zhang >Assignee: Qian Zhang > Labels: mesosphere > Fix For: 0.25.0 > > > In [MESOS-2841|https://issues.apache.org/jira/browse/MESOS-2841], a new field > "Labels" has been added into FrameworkInfo in mesos.proto, but is missed in > v1 mesos.proto. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3430) LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails on CentOS 7.1
[ https://issues.apache.org/jira/browse/MESOS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745810#comment-14745810 ] Jie Yu commented on MESOS-3430: --- OK, the problem is: by default, centos7.1 marks the / mount as a shared mount: {noformat} [vagrant@localhost ~]$ cat /proc/self/mountinfo 17 37 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw 18 37 0:16 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs rw,seclabel 19 37 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,seclabel,size=224872k,nr_inodes=56218,mode=755 20 18 0:15 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - securityfs securityfs rw 21 19 0:17 / /dev/shm rw,nosuid,nodev shared:3 - tmpfs tmpfs rw,seclabel 22 19 0:11 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000 23 37 0:18 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755 24 18 0:19 / /sys/fs/cgroup rw,nosuid,nodev,noexec shared:8 - tmpfs tmpfs rw,seclabel,mode=755 25 24 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 26 18 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:19 - pstore pstore rw 27 24 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,cpuset 28 24 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,cpuacct,cpu 29 24 0:24 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,memory 30 24 0:25 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,devices 31 24 0:26 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,freezer 32 24 0:27 / /sys/fs/cgroup/net_cls rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,net_cls 33 24 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,blkio 34 24 0:29 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,perf_event 35 24 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,hugetlb 36 18 0:31 / /sys/kernel/config rw,relatime shared:20 - configfs configfs rw 37 1 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota 38 18 0:14 / /sys/fs/selinux rw,relatime shared:21 - selinuxfs selinuxfs rw 39 17 0:32 / /proc/sys/fs/binfmt_misc rw,relatime shared:23 - autofs systemd-1 rw,fd=33,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 40 19 0:33 / /dev/hugepages rw,relatime shared:24 - hugetlbfs hugetlbfs rw,seclabel 41 19 0:13 / /dev/mqueue rw,relatime shared:25 - mqueue mqueue rw,seclabel 42 18 0:7 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw 44 37 8:1 / /boot rw,relatime shared:27 - xfs /dev/sda1 rw,seclabel,attr2,inode64,noquota 45 37 0:35 / /vagrant rw,nodev,relatime shared:28 - vboxsf none rw {noformat} > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem fails > on CentOS 7.1 > -- > > Key: MESOS-3430 > URL: https://issues.apache.org/jira/browse/MESOS-3430 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Marco Massenzio >Assignee: Michael Park > Labels: ROOT_Tests, flaky-test > Attachments: verbose.log > > > Just ran ROOT tests on CentOS 7.1 and had the following failure (clean build, > just pulled from {{master}}): > {noformat} > [ RUN ] > 
LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > ../../src/tests/containerizer/filesystem_isolator_tests.cpp:498: Failure > (wait).failure(): Failed to clean up an isolator when destroying container > '366b6d37-b326-4ed1-8a5f-43d483dbbace' :Failed to unmount volume > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Failed to unmount > '/tmp/LinuxFilesystemIsolatorTest_ROOT_PersistentVolumeWithoutRootFilesystem_KXgvoH/sandbox/volume': > Invalid argument > ../../src/tests/utils.cpp:75: Failure > os::rmdir(sandbox.get()): Device or resource busy > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem (1943 > ms) > [--] 1 test from LinuxFilesystemIsolatorTest (1943 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (1951 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] > LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithoutRootFilesystem > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
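Given the diagnosis above (a shared "/" mount on CentOS 7.1, so mounts created during the test propagate back into the host mount table), here is a hedged sketch of the standard remedy: run the mounting code in its own mount namespace and remount "/" recursively private first. This illustrates the general technique only; it is not asserted to be what the isolator or the test actually does.
{code}
// Sketch only: after unshare(CLONE_NEWNS), remount "/" recursively private
// so mounts created inside this namespace do not propagate to the host
// (and vice versa). This mirrors the usual fix for shared-root distros
// such as CentOS 7.1.
#include <sched.h>      // unshare, CLONE_NEWNS (glibc needs _GNU_SOURCE;
                        // g++ defines it by default).
#include <sys/mount.h>  // mount, MS_REC, MS_PRIVATE.

#include <cstdio>       // perror.

int enterPrivateMountNamespace()
{
  if (unshare(CLONE_NEWNS) != 0) {
    perror("unshare(CLONE_NEWNS)");
    return -1;
  }

  // MS_REC | MS_PRIVATE stops mount event propagation below "/".
  if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) != 0) {
    perror("mount(MS_REC | MS_PRIVATE)");
    return -1;
  }

  return 0;
}
{code}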
[jira] [Commented] (MESOS-3422) MasterSlaveReconciliationTest.ReconcileLostTask test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745611#comment-14745611 ] Guangya Liu commented on MESOS-3422: [~vi...@twitter.com] I'm sorry that I updated the problem description by mistake, can you please help update the description again? Thanks! > MasterSlaveReconciliationTest.ReconcileLostTask test is flaky > - > > Key: MESOS-3422 > URL: https://issues.apache.org/jira/browse/MESOS-3422 > Project: Mesos > Issue Type: Bug > Components: technical debt, test >Affects Versions: 0.25.0 > Environment: CentOS >Reporter: Vinod Kone > > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from MasterSlaveReconciliationTest > [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask > Using temporary directory > '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn' > I0915 22:28:40.800787 3733 leveldb.cpp:176] Opened db in 252.206266ms > I0915 22:28:40.851069 3733 leveldb.cpp:183] Compacted db in 50.197346ms > I0915 22:28:40.851210 3733 leveldb.cpp:198] Created db iterator in 63324ns > I0915 22:28:40.851256 3733 leveldb.cpp:204] Seeked to beginning of db in > 4562ns > I0915 22:28:40.851286 3733 leveldb.cpp:273] Iterated through 0 keys in the > db in 322ns > I0915 22:28:40.871953 3733 replica.cpp:744] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0915 22:28:40.886368 3756 recover.cpp:449] Starting replica recovery > I0915 22:28:40.90 3756 recover.cpp:475] Replica is in EMPTY status > I0915 22:28:40.916332 3759 replica.cpp:641] Replica in EMPTY status received > a broadcasted recover request > I0915 22:28:40.917351 3756 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I0915 22:28:40.918557 3755 recover.cpp:566] Updating replica status to > STARTING > I0915 22:28:40.928189 3759 master.cpp:380] Master > 20150915-222840-16842879-54960-3733 (devstack007.cn.ibm.com) started on > 127.0.1.1:54960 > I0915 22:28:40.928261 3759 master.cpp:382] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" > --credentials="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials" > --framework_sorter="drf" --help="false" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" > --registry_strict="true" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/master" > --zk_session_timeout="10secs" > I0915 22:28:40.993895 3759 master.cpp:427] Master only allowing > authenticated frameworks to register > I0915 22:28:40.993962 3759 master.cpp:432] Master only allowing > authenticated slaves to register > I0915 22:28:40.994010 3759 credentials.hpp:37] Loading credentials for > authentication from > '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_2tUQZn/credentials' > I0915 22:28:40.994776 3759 master.cpp:471] Using default 'crammd5' > authenticator > I0915 22:28:40.995053 3759 authenticator.cpp:512] Initializing server SASL > I0915 22:28:41.009496 3757 
leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 90.341573ms > I0915 22:28:41.009570 3757 replica.cpp:323] Persisted replica status to > STARTING > I0915 22:28:41.010040 3756 recover.cpp:475] Replica is in STARTING status > I0915 22:28:41.011255 3757 replica.cpp:641] Replica in STARTING status > received a broadcasted recover request > I0915 22:28:41.011551 3752 recover.cpp:195] Received a recover response from > a replica in STARTING status > I0915 22:28:41.012073 3756 recover.cpp:566] Updating replica status to VOTING > I0915 22:28:41.084720 3753 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 72.469
[jira] [Assigned] (MESOS-3280) Master fails to access replicated log after network partition
[ https://issues.apache.org/jira/browse/MESOS-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway reassigned MESOS-3280: -- Assignee: Neil Conway > Master fails to access replicated log after network partition > - > > Key: MESOS-3280 > URL: https://issues.apache.org/jira/browse/MESOS-3280 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.23.0 > Environment: Zookeeper version 3.4.5--1 >Reporter: Bernd Mathiske >Assignee: Neil Conway > Labels: mesosphere > > In a 5 node cluster with 3 masters and 2 slaves, and ZK on each node, when a > network partition is forced, all the masters apparently lose access to their > replicated log. The leading master halts. Unknown reasons, but presumably > related to replicated log access. The others fail to recover from the > replicated log. Unknown reasons. This could have to do with ZK setup, but it > might also be a Mesos bug. > This was observed in a Chronos test drive scenario described in detail here: > https://github.com/mesos/chronos/issues/511 > With setup instructions here: > https://github.com/mesos/chronos/issues/508 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3433) Unmount work dir and persistent volume mounts of other containers in the new mount namespace.
[ https://issues.apache.org/jira/browse/MESOS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-3433: -- Assignee: Yan Xu > Unmount work dir and persistent volume mounts of other containers in the new > mount namespace. > - > > Key: MESOS-3433 > URL: https://issues.apache.org/jira/browse/MESOS-3433 > Project: Mesos > Issue Type: Task >Reporter: Yan Xu >Assignee: Yan Xu > > As described in this > [TODO|https://github.com/apache/mesos/blob/e601e469c64594dd8339352af405cbf26a574ea8/src/slave/containerizer/isolators/filesystem/linux.cpp#L418]: > {noformat:title=} > // TODO(jieyu): Try to unmount work directory mounts and persistent > // volume mounts for other containers to release the extra > // references to those mounts. > {noformat} > This will be a best-effort attempt to alleviate the race condition between > the provisioner's container cleanup and new containers copying the host mount table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-3433) Unmount work dir and persistent volume mounts of other containers in the new mount namespace.
Yan Xu created MESOS-3433: - Summary: Unmount work dir and persistent volume mounts of other containers in the new mount namespace. Key: MESOS-3433 URL: https://issues.apache.org/jira/browse/MESOS-3433 Project: Mesos Issue Type: Task Reporter: Yan Xu As described in this [TODO|https://github.com/apache/mesos/blob/e601e469c64594dd8339352af405cbf26a574ea8/src/slave/containerizer/isolators/filesystem/linux.cpp#L418]: {noformat:title=} // TODO(jieyu): Try to unmount work directory mounts and persistent // volume mounts for other containers to release the extra // references to those mounts. {noformat} This will be a best-effort attempt to alleviate the race condition between the provisioner's container cleanup and new containers copying the host mount table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3433) Unmount work dir and persistent volume mounts of other containers in the new mount namespace.
[ https://issues.apache.org/jira/browse/MESOS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-3433: -- Sprint: Twitter Mesos Q3 Sprint 5 > Unmount work dir and persistent volume mounts of other containers in the new > mount namespace. > - > > Key: MESOS-3433 > URL: https://issues.apache.org/jira/browse/MESOS-3433 > Project: Mesos > Issue Type: Task >Reporter: Yan Xu >Assignee: Yan Xu > Labels: twitter > > As described in this > [TODO|https://github.com/apache/mesos/blob/e601e469c64594dd8339352af405cbf26a574ea8/src/slave/containerizer/isolators/filesystem/linux.cpp#L418]: > {noformat:title=} > // TODO(jieyu): Try to unmount work directory mounts and persistent > // volume mounts for other containers to release the extra > // references to those mounts. > {noformat} > This will be a best-effort attempt to alleviate the race condition between > the provisioner's container cleanup and new containers copying the host mount table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3433) Unmount work dir and persistent volume mounts of other containers in the new mount namespace.
[ https://issues.apache.org/jira/browse/MESOS-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-3433: -- Labels: twitter (was: ) > Unmount work dir and persistent volume mounts of other containers in the new > mount namespace. > - > > Key: MESOS-3433 > URL: https://issues.apache.org/jira/browse/MESOS-3433 > Project: Mesos > Issue Type: Task >Reporter: Yan Xu >Assignee: Yan Xu > Labels: twitter > > As described in this > [TODO|https://github.com/apache/mesos/blob/e601e469c64594dd8339352af405cbf26a574ea8/src/slave/containerizer/isolators/filesystem/linux.cpp#L418]: > {noformat:title=} > // TODO(jieyu): Try to unmount work directory mounts and persistent > // volume mounts for other containers to release the extra > // references to those mounts. > {noformat} > This will be a best-effort attempt to alleviate the race condition between > the provisioner's container cleanup and new containers copying the host mount table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
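To make the quoted TODO concrete, here is a hedged sketch of what such a best-effort cleanup might look like from inside a container's new mount namespace. fs::MountInfoTable and fs::unmount are the agent's existing Linux filesystem helpers, but the filtering logic, parameter names, and error handling below are illustrative assumptions only, not the eventual patch.
{code}
// Sketch only: inside the newly created mount namespace, lazily detach work
// directory and persistent volume mounts that belong to *other* containers,
// so this container does not pin extra references to them. Best effort:
// failures are logged and ignored. The path filter is illustrative.
#include <sys/mount.h>  // MNT_DETACH.

#include <string>

#include <glog/logging.h>

#include <stout/foreach.hpp>
#include <stout/nothing.hpp>
#include <stout/strings.hpp>
#include <stout/try.hpp>

#include "linux/fs.hpp"  // fs::MountInfoTable, fs::unmount.

using namespace mesos::internal;  // Assumed: agent-internal namespace.

Try<Nothing> unmountOtherContainers(
    const std::string& workDirsRoot,   // e.g. the agent work_dir subtree.
    const std::string& ownSandbox)     // this container's own sandbox path.
{
  Try<fs::MountInfoTable> table = fs::MountInfoTable::read();
  if (table.isError()) {
    return Error("Failed to read mount table: " + table.error());
  }

  foreach (const fs::MountInfoTable::Entry& entry, table.get().entries) {
    if (strings::startsWith(entry.target, workDirsRoot) &&
        !strings::startsWith(entry.target, ownSandbox)) {
      // MNT_DETACH: lazy unmount, since the mount may still be busy.
      Try<Nothing> unmount = fs::unmount(entry.target, MNT_DETACH);
      if (unmount.isError()) {
        LOG(WARNING) << "Failed to unmount '" << entry.target
                     << "': " << unmount.error();
      }
    }
  }

  return Nothing();
}
{code}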
[jira] [Commented] (MESOS-3431) Refactor Protobuf tests
[ https://issues.apache.org/jira/browse/MESOS-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14745359#comment-14745359 ] Klaus Ma commented on MESOS-3431: - Sure, I'll handle it after MESOS-3405. > Refactor Protobuf tests > --- > > Key: MESOS-3431 > URL: https://issues.apache.org/jira/browse/MESOS-3431 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Alexander Rukletsov >Assignee: Klaus Ma >Priority: Minor > > {{ProtobufTest.JSON}} test does several things simultaneously, including > message instantiation, conversion, parsing. We should split this test into > several independent ones that test just one thing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3418) Factor out V1 API test helper functions
[ https://issues.apache.org/jira/browse/MESOS-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-3418: -- Assignee: Guangya Liu > Factor out V1 API test helper functions > --- > > Key: MESOS-3418 > URL: https://issues.apache.org/jira/browse/MESOS-3418 > Project: Mesos > Issue Type: Improvement >Reporter: Joris Van Remoortere >Assignee: Guangya Liu > Labels: beginner, mesosphere, newbie, v1_api > > We currently have some helper functionality for V1 API tests. This is copied > in a few test files. > Factor this out into a common place once the API is stabilized. > {code} > // Helper class for using EXPECT_CALL since the Mesos scheduler API > // is callback based. > class Callbacks > { > public: > MOCK_METHOD0(connected, void(void)); > MOCK_METHOD0(disconnected, void(void)); > MOCK_METHOD1(received, void(const std::queue<Event>&)); > }; > {code} > {code} > // Enqueues all received events into a libprocess queue. > // TODO(jmlvanre): Factor this common code out of tests into V1 > // helper. > ACTION_P(Enqueue, queue) > { > std::queue<Event> events = arg0; > while (!events.empty()) { > // Note that we currently drop HEARTBEATs because most of these tests > // are not designed to deal with heartbeats. > // TODO(vinod): Implement DROP_HTTP_CALLS that can filter heartbeats. > if (events.front().type() == Event::HEARTBEAT) { > VLOG(1) << "Ignoring HEARTBEAT event"; > } else { > queue->put(events.front()); > } > events.pop(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
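For reference, a hedged usage example of how a test typically wires these helpers together once they are factored out. The object names and the surrounding scaffolding (constructing the scheduler library, subscribing) are assumptions based on the usual v1 API test pattern, not quoted from any specific test.
{code}
// Usage sketch (assumed names, test scaffolding omitted): wire the mock
// callbacks to a libprocess queue and await events in the test body.
// Assumes 'Event' is the v1 scheduler Event protobuf.
Callbacks callbacks;
process::Queue<Event> events;

// Funnel every batch of received events into the queue; Enqueue drops
// HEARTBEATs as noted in the helper above.
EXPECT_CALL(callbacks, received(testing::_))
  .WillRepeatedly(Enqueue(&events));

// ... construct the scheduler library with 'callbacks' and subscribe ...

process::Future<Event> event = events.get();
AWAIT_READY(event);
EXPECT_EQ(Event::SUBSCRIBED, event.get().type());
{code}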
[jira] [Assigned] (MESOS-3431) Refactor Protobuf tests
[ https://issues.apache.org/jira/browse/MESOS-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-3431: --- Assignee: Klaus Ma > Refactor Protobuf tests > --- > > Key: MESOS-3431 > URL: https://issues.apache.org/jira/browse/MESOS-3431 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Alexander Rukletsov >Assignee: Klaus Ma >Priority: Minor > > {{ProtobufTest.JSON}} test does several things simultaneously, including > message instantiation, conversion, parsing. We should split this test into > several independent ones that test just one thing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)