[jira] [Created] (MESOS-4744) mesos-execute should allow setting role
Jian Qiu created MESOS-4744: --- Summary: mesos-execute should allow setting role Key: MESOS-4744 URL: https://issues.apache.org/jira/browse/MESOS-4744 Project: Mesos Issue Type: Bug Components: cli Reporter: Jian Qiu Priority: Minor It would be quite useful if we could set the role when running mesos-execute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
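For illustration, the request amounts to something like the invocation below; the {{--role}} flag is hypothetical here (it is what this ticket asks for), while the other flags already exist in mesos-execute:

{code}
# Hypothetical --role flag: run a task under the "ads" role so it can
# consume resources reserved for that role. The other flags are real.
mesos-execute --master=127.0.0.1:5050 \
  --name=role-test \
  --command="sleep 30" \
  --role=ads
{code}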
[jira] [Commented] (MESOS-4580) Consider returning `202` (Accepted) for /reserve and related endpoints
[ https://issues.apache.org/jira/browse/MESOS-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158385#comment-15158385 ] Jay Guo commented on MESOS-4580: Hi, we find this issue interesting. Can it be confirmed as accepted so that we can proceed and contribute? We are quite new to this community and are still getting familiar with its work processes. Thanks /IBM Pair: Jay Guo & Zhou Xing > Consider returning `202` (Accepted) for /reserve and related endpoints > -- > > Key: MESOS-4580 > URL: https://issues.apache.org/jira/browse/MESOS-4580 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Assignee: Jay Guo > Labels: mesosphere > > We currently return {{200}} (OK) when a POST to {{/reserve}}, {{/unreserve}}, > {{/create-volumes}}, and {{/destroy-volumes}} is validated successfully. This > is misleading, because the underlying operation is still dispatched > asynchronously and might subsequently fail. It would be more accurate to > return {{202}} (Accepted) instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
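For context, a sketch of such a request following the reservation docs (the slave ID, principal, and resource values are placeholders); only the status line of a successfully validated request would change from {{200 OK}} to {{202 Accepted}}:

{code}
# POST a dynamic reservation to the master. Today a request that passes
# validation returns "200 OK" even though the operation is applied
# asynchronously and may still fail; the proposal is "202 Accepted".
curl -i \
  -u principal:secret \
  -d slaveId=20160223-000000-16777343-5050-1234-S0 \
  -d resources='[
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 1 },
      "role": "ads",
      "reservation": { "principal": "principal" }
    }
  ]' \
  -X POST http://127.0.0.1:5050/master/reserve
{code}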
[jira] [Commented] (MESOS-4743) Mesos fetcher not working correctly on docker apps on CoreOS
[ https://issues.apache.org/jira/browse/MESOS-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158382#comment-15158382 ] Shuai Lin commented on MESOS-4743: -- Is your mesos slave running inside a container itself? If so, it may be related to [MESOS-4249]. > Mesos fetcher not working correctly on docker apps on CoreOS > > > Key: MESOS-4743 > URL: https://issues.apache.org/jira/browse/MESOS-4743 > Project: Mesos > Issue Type: Bug > Components: docker, fetcher >Affects Versions: 0.26.0 >Reporter: Guillermo Rodriguez > > I initially sent this issue to the Marathon group. They asked me to send it > here. This is the original thread: > https://github.com/mesosphere/marathon/issues/3179 > Then they closed it so I had to ask again with more proof. > https://github.com/mesosphere/marathon/issues/3213 > In a nutshell, when I start a Marathon task that uses a URI while running > on CoreOS, the file is effectively fetched but not passed to the container. I > can see the file in the mesos UI but the file is not in the container. It is, > however, downloaded to another folder. > It is very simple to test. The original ticket has two files attached with a > Marathon JSON for a Prometheus server and a prometheus.yml config file. The > objective is to start prometheus with the config file. > CoreOS 899.6 > Mesos 0.26 > Marathon 0.15.2 > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4580) Consider returning `202` (Accepted) for /reserve and related endpoints
[ https://issues.apache.org/jira/browse/MESOS-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Guo reassigned MESOS-4580: -- Assignee: Jay Guo > Consider returning `202` (Accepted) for /reserve and related endpoints > -- > > Key: MESOS-4580 > URL: https://issues.apache.org/jira/browse/MESOS-4580 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Assignee: Jay Guo > Labels: mesosphere > > We currently return {{200}} (OK) when a POST to {{/reserve}}, {{/unreserve}}, > {{/create-volumes}}, and {{/destroy-volumes}} is validated successfully. This > is misleading, because the underlying operation is still dispatched > asynchronously and might subsequently fail. It would be more accurate to > return {{202}} (Accepted) instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4312) Porting Mesos on Power (ppc64le)
[ https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158359#comment-15158359 ] Qian Zhang edited comment on MESOS-4312 at 2/23/16 6:20 AM: Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="\*Benchmark\*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <=> master <=> slave <=> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script to automate this and improve the script if needed. was (Author: qianzhang): Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="\*Benchmark\*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <=> master <=> slave <=> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. > Porting Mesos on Power (ppc64le) > > > Key: MESOS-4312 > URL: https://issues.apache.org/jira/browse/MESOS-4312 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Qian Zhang > > The goal of this ticket is to make IBM Power (ppc64le) a supported > hardware platform for Mesos. Currently the latest Mesos code cannot be > successfully built on ppc64le; we will resolve the build errors in this > ticket and also make sure the Mesos test suite ("make check") can be run > successfully on ppc64le. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4312) Porting Mesos on Power (ppc64le)
[ https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158359#comment-15158359 ] Qian Zhang edited comment on MESOS-4312 at 2/23/16 6:19 AM: Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="\*Benchmark\*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <=> master <=> slave <=> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. was (Author: qianzhang): Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="*Benchmark*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <=> master <=> slave <=> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. > Porting Mesos on Power (ppc64le) > > > Key: MESOS-4312 > URL: https://issues.apache.org/jira/browse/MESOS-4312 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Qian Zhang > > The goal of this ticket is to make IBM Power (ppc64le) a supported > hardware platform for Mesos. Currently the latest Mesos code cannot be > successfully built on ppc64le; we will resolve the build errors in this > ticket and also make sure the Mesos test suite ("make check") can be run > successfully on ppc64le. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4312) Porting Mesos on Power (ppc64le)
[ https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158359#comment-15158359 ] Qian Zhang edited comment on MESOS-4312 at 2/23/16 6:17 AM: Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="*Benchmark*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <=> master <=> slave <=> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. was (Author: qianzhang): Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="*Benchmark*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <-> master <-> slave <-> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. > Porting Mesos on Power (ppc64le) > > > Key: MESOS-4312 > URL: https://issues.apache.org/jira/browse/MESOS-4312 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Qian Zhang > > The goal of this ticket is to make IBM Power (ppc64le) a supported > hardware platform for Mesos. Currently the latest Mesos code cannot be > successfully built on ppc64le; we will resolve the build errors in this > ticket and also make sure the Mesos test suite ("make check") can be run > successfully on ppc64le. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4312) Porting Mesos on Power (ppc64le)
[ https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158359#comment-15158359 ] Qian Zhang commented on MESOS-4312: --- Had some offline discussion with [~hartem] and [~vinodkone]; here is the final plan we all agreed on: # Port Mesos to Power by upgrading/patching some 3rd-party libraries. #* Update leveldb to v1.18, libev to v4.22, protobuf to v2.6.1, and ry-http-parser-1c3624a to nodejs/http-parser v2.6.1 for all platforms, since these are the latest upstream stable releases that officially support Power. Create a JIRA ticket for each of them and link it to this ticket. #* Patch zookeeper, glog, and protobuf only for Power. #* Change "src/linux/fs.cpp" for all platforms as I did in https://reviews.apache.org/r/42551/. # Verify that SSL, perf, and docker related test cases work as expected on all platforms. # Do the validation on all platforms with "sudo make distcheck" and "sudo make check --benchmark --gtest_filter="*Benchmark*"" and make sure the perf numbers are OK. # Run compatibility tests between scheduler <-> master <-> slave <-> executor where each of the components runs either the patched/upgraded version or not. Will contact Niklas/Kapil for their script and improve it if needed. > Porting Mesos on Power (ppc64le) > > > Key: MESOS-4312 > URL: https://issues.apache.org/jira/browse/MESOS-4312 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Qian Zhang > > The goal of this ticket is to make IBM Power (ppc64le) a supported > hardware platform for Mesos. Currently the latest Mesos code cannot be > successfully built on ppc64le; we will resolve the build errors in this > ticket and also make sure the Mesos test suite ("make check") can be run > successfully on ppc64le. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3024) HTTP endpoint authN is enabled merely by specifying --credentials
[ https://issues.apache.org/jira/browse/MESOS-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-3024: -- Fix Version/s: 0.27.0 > HTTP endpoint authN is enabled merely by specifying --credentials > - > > Key: MESOS-3024 > URL: https://issues.apache.org/jira/browse/MESOS-3024 > Project: Mesos > Issue Type: Bug > Components: master, security >Reporter: Adam B >Assignee: Till Toenshoff > Labels: authentication, http, mesosphere > Fix For: 0.27.0 > > > If I set `--credentials` on the master, framework and slave authentication > are allowed, but not required. On the other hand, http authentication is now > required for authenticated endpoints (currently only `/shutdown`). That means > that I cannot enable framework or slave authentication without also enabling > http endpoint authentication. This is undesirable. > Framework and slave authentication have separate flags (`\--authenticate` and > `\--authenticate_slaves`) to require authentication for each. It would be > great if there was also such a flag for http authentication. Or maybe we get > rid of these flags altogether and rely on ACLs to determine which > unauthenticated principals are even allowed to authenticate for each > endpoint/action. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3481) Add const accessor to Master flags
[ https://issues.apache.org/jira/browse/MESOS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158302#comment-15158302 ] Jay Guo commented on MESOS-3481: We have submitted a patch for review: https://reviews.apache.org/r/43868/ IBM community pair: Jay Guo & Zhou Xing > Add const accessor to Master flags > -- > > Key: MESOS-3481 > URL: https://issues.apache.org/jira/browse/MESOS-3481 > Project: Mesos > Issue Type: Task >Reporter: Joseph Wu >Assignee: zhou xing >Priority: Trivial > Labels: mesosphere, newbie > > It would make sense to have an accessor to the master's flags, especially for > tests. > For example, see [this > test|https://github.com/apache/mesos/blob/2876b8c918814347dd56f6f87d461e414a90650a/src/tests/master_maintenance_tests.cpp#L1231-L1235]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
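A minimal sketch of the idea (the names and member layout are assumptions for illustration, not the contents of the patch under review):

{code}
// Hypothetical shape of a read-only accessor on the Master class, so
// tests can assert on startup flags without touching master internals.
class Master
{
public:
  const master::Flags& flags() const { return flags_; }

private:
  master::Flags flags_;
};

// In a test:
//   EXPECT_EQ(expectedWorkDir, master.flags().work_dir);
{code}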
[jira] [Assigned] (MESOS-3727) File permission inconsistency for mesos-master executable and mesos-init-wrapper.
[ https://issues.apache.org/jira/browse/MESOS-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Guo reassigned MESOS-3727: -- Assignee: (was: Jay Guo) > File permission inconsistency for mesos-master executable and > mesos-init-wrapper. > - > > Key: MESOS-3727 > URL: https://issues.apache.org/jira/browse/MESOS-3727 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Sarjeet Singh >Priority: Trivial > > There seems to be a file permission inconsistency between the mesos-master > executable and the mesos-init-wrapper script with mesos version 0.25. > node-1:~# dpkg -l | grep mesos > ii mesos 0.25.0-0.2.70.ubuntu1404 > node-1:~# ls -ld /usr/sbin/mesos-master > -rwxr-xr-x 1 root root 289173 Oct 12 14:07 /usr/sbin/mesos-master > node-1:~# ls -ld /usr/bin/mesos-init-wrapper > -rwxrwx--- 1 root root 5202 Oct 1 11:17 /usr/bin/mesos-init-wrapper > Observed the issue when trying to execute the mesos-master executable as a > non-root user; since the init-wrapper doesn't have any non-root user > permissions, it didn't get executed and mesos-master didn't get started. > Should we make these file permissions consistent for the executable & init-script? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3727) File permission inconsistency for mesos-master executable and mesos-init-wrapper.
[ https://issues.apache.org/jira/browse/MESOS-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158251#comment-15158251 ] Jay Guo edited comment on MESOS-3727 at 2/23/16 4:11 AM: - We have just confirmed that the problem persists in *0.27.0-0.2.190.ubuntu1404*. We should modify the permissions of the following files in the release: /usr/bin/mesos-init-wrapper 770 --> 775 /etc/default/mesos 640 --> 644 /etc/default/mesos-master 640 --> 644 /etc/default/mesos-slave 640 --> 644 However, where is the Mesos release maintained? was (Author: guoger): We have just confirmed that the problem persists in *0.27.0-0.2.190.ubuntu1404*. We should modify the permissions of the following files in the release: /usr/bin/mesos-init-wrapper 770 --> 775 /etc/default/mesos 640 --> 644 /etc/default/mesos-master 640 --> 644 /etc/default/mesos-slave 640 --> 644 However, where is the Mesos release maintained? > File permission inconsistency for mesos-master executable and > mesos-init-wrapper. > - > > Key: MESOS-3727 > URL: https://issues.apache.org/jira/browse/MESOS-3727 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Sarjeet Singh >Assignee: Jay Guo >Priority: Trivial > > There seems to be a file permission inconsistency between the mesos-master > executable and the mesos-init-wrapper script with mesos version 0.25. > node-1:~# dpkg -l | grep mesos > ii mesos 0.25.0-0.2.70.ubuntu1404 > node-1:~# ls -ld /usr/sbin/mesos-master > -rwxr-xr-x 1 root root 289173 Oct 12 14:07 /usr/sbin/mesos-master > node-1:~# ls -ld /usr/bin/mesos-init-wrapper > -rwxrwx--- 1 root root 5202 Oct 1 11:17 /usr/bin/mesos-init-wrapper > Observed the issue when trying to execute the mesos-master executable as a > non-root user; since the init-wrapper doesn't have any non-root user > permissions, it didn't get executed and mesos-master didn't get started. > Should we make these file permissions consistent for the executable & init-script? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3727) File permission inconsistency for mesos-master executable and mesos-init-wrapper.
[ https://issues.apache.org/jira/browse/MESOS-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158251#comment-15158251 ] Jay Guo commented on MESOS-3727: We have just confirmed that the problem persists in *0.27.0-0.2.190.ubuntu1404*. We should modify the permissions of the following files in the release: /usr/bin/mesos-init-wrapper 770 --> 775 /etc/default/mesos 640 --> 644 /etc/default/mesos-master 640 --> 644 /etc/default/mesos-slave 640 --> 644 However, where is the Mesos release maintained? > File permission inconsistency for mesos-master executable and > mesos-init-wrapper. > - > > Key: MESOS-3727 > URL: https://issues.apache.org/jira/browse/MESOS-3727 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Sarjeet Singh >Assignee: Jay Guo >Priority: Trivial > > There seems to be a file permission inconsistency between the mesos-master > executable and the mesos-init-wrapper script with mesos version 0.25. > node-1:~# dpkg -l | grep mesos > ii mesos 0.25.0-0.2.70.ubuntu1404 > node-1:~# ls -ld /usr/sbin/mesos-master > -rwxr-xr-x 1 root root 289173 Oct 12 14:07 /usr/sbin/mesos-master > node-1:~# ls -ld /usr/bin/mesos-init-wrapper > -rwxrwx--- 1 root root 5202 Oct 1 11:17 /usr/bin/mesos-init-wrapper > Observed the issue when trying to execute the mesos-master executable as a > non-root user; since the init-wrapper doesn't have any non-root user > permissions, it didn't get executed and mesos-master didn't get started. > Should we make these file permissions consistent for the executable & init-script? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
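In shell terms, the proposal in the comment above boils down to the following (a sketch; the actual fix would belong in the package build rather than a post-install step):

{code}
# Proposed permission changes for the released packages:
chmod 775 /usr/bin/mesos-init-wrapper
chmod 644 /etc/default/mesos
chmod 644 /etc/default/mesos-master
chmod 644 /etc/default/mesos-slave
{code}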
[jira] [Updated] (MESOS-4743) Mesos fetcher not working correctly on docker apps on CoreOS
[ https://issues.apache.org/jira/browse/MESOS-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guillermo Rodriguez updated MESOS-4743: --- Description: I initially sent this issue to the Marathon group. They asked me to send it here. This is the original thread: https://github.com/mesosphere/marathon/issues/3179 Then they closed it so I had to ask again with more proof. https://github.com/mesosphere/marathon/issues/3213 In a nutshell, when I start a Marathon task that uses a URI while running on CoreOS, the file is effectively fetched but not passed to the container. I can see the file in the mesos UI but the file is not in the container. It is, however, downloaded to another folder. It is very simple to test. The original ticket has two files attached with a Marathon JSON for a Prometheus server and a prometheus.yml config file. The objective is to start prometheus with the config file. CoreOS 899.6 Mesos 0.26 Marathon 0.15.2 Thanks! was: I initially sent this issue to the Marathon group. They asked me to send it here. This is the original thread: https://github.com/mesosphere/marathon/issues/3179 Then they closed it so I had to ask again with more proof. https://github.com/mesosphere/marathon/issues/3213 In a nutshell, when I start a Marathon task that uses a URI while running on CoreOS, the file is effectively fetched but not passed to the container. I can see the file in the mesos UI but the file is not in the container. It is, however, downloaded to another folder. It is very simple to test. The original ticket has two files attached with a Marathon JSON for a Prometheus server and a prometheus.yml config file. The objective is to start prometheus with the config file. Thanks! > Mesos fetcher not working correctly on docker apps on CoreOS > > > Key: MESOS-4743 > URL: https://issues.apache.org/jira/browse/MESOS-4743 > Project: Mesos > Issue Type: Bug > Components: docker, fetcher >Affects Versions: 0.26.0 >Reporter: Guillermo Rodriguez > > I initially sent this issue to the Marathon group. They asked me to send it > here. This is the original thread: > https://github.com/mesosphere/marathon/issues/3179 > Then they closed it so I had to ask again with more proof. > https://github.com/mesosphere/marathon/issues/3213 > In a nutshell, when I start a Marathon task that uses a URI while running > on CoreOS, the file is effectively fetched but not passed to the container. I > can see the file in the mesos UI but the file is not in the container. It is, > however, downloaded to another folder. > It is very simple to test. The original ticket has two files attached with a > Marathon JSON for a Prometheus server and a prometheus.yml config file. The > objective is to start prometheus with the config file. > CoreOS 899.6 > Mesos 0.26 > Marathon 0.15.2 > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4743) Mesos fetcher not working correctly on docker apps on CoreOS
Guillermo Rodriguez created MESOS-4743: -- Summary: Mesos fetcher not working correctly on docker apps on CoreOS Key: MESOS-4743 URL: https://issues.apache.org/jira/browse/MESOS-4743 Project: Mesos Issue Type: Bug Components: docker, fetcher Affects Versions: 0.26.0 Reporter: Guillermo Rodriguez I initially sent this issue to the Marathon group. They asked me to send it here. This is the original thread: https://github.com/mesosphere/marathon/issues/3179 Then they closed it so I had to ask again with more proof. https://github.com/mesosphere/marathon/issues/3213 In a nutshell, when I start a Marathon task that uses a URI while running on CoreOS, the file is effectively fetched but not passed to the container. I can see the file in the mesos UI but the file is not in the container. It is, however, downloaded to another folder. It is very simple to test. The original ticket has two files attached with a Marathon JSON for a Prometheus server and a prometheus.yml config file. The objective is to start prometheus with the config file. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4742) Design doc for CNI isolator
[ https://issues.apache.org/jira/browse/MESOS-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158175#comment-15158175 ] Qian Zhang commented on MESOS-4742: --- https://docs.google.com/document/d/1FFZwPHPZqS17cRQvsbbWyQbZpwIoHFR_N6AAApRv514/edit?usp=sharing > Design doc for CNI isolator > --- > > Key: MESOS-4742 > URL: https://issues.apache.org/jira/browse/MESOS-4742 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > This ticket is for the design of an isolator for the Container Network Interface > (CNI). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4742) Design doc for CNI isolator
Qian Zhang created MESOS-4742: - Summary: Design doc for CNI isolator Key: MESOS-4742 URL: https://issues.apache.org/jira/browse/MESOS-4742 Project: Mesos Issue Type: Bug Components: isolation Reporter: Qian Zhang Assignee: Qian Zhang This ticket is for the design of an isolator for the Container Network Interface (CNI). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4741) Add role information for static reservation in /master/roles
[ https://issues.apache.org/jira/browse/MESOS-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-4741: Description: In {{/master/roles}}, it should show static reservation roles even if there are no tasks. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 93 100 93 0 0 13907 0 --:--:-- --:--:-- --:--:-- 15500 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} After submitting tasks to r1, it shows the roles. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 221 100 221 0 0 32721 0 --:--:-- --:--:-- --:--:-- 36833 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 }, { "frameworks": [ "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" ], "name": "r1", "resources": { "cpus": 1.0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} was: In {{/master/roles}}, it should show static reservation roles even if there are no tasks. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 221 100 221 0 0 28612 0 --:--:-- --:--:-- --:--:-- 31571 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 }, { "frameworks": [ "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" ], "name": "r1", "resources": { "cpus": 1.0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} After submitting tasks to r1, it shows the roles. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 221 100 221 0 0 32721 0 --:--:-- --:--:-- --:--:-- 36833 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 }, { "frameworks": [ "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" ], "name": "r1", "resources": { "cpus": 1.0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} > Add role information for static reservation in /master/roles > > > Key: MESOS-4741 > URL: https://issues.apache.org/jira/browse/MESOS-4741 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Klaus Ma >Assignee: Klaus Ma > > In {{/master/roles}}, it should show static reservation roles even if there are no > tasks. > {code} > Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json > | python -m json.tool > % Total % Received % Xferd Average Speed Time Time Time > Current > Dload Upload Total Spent Left Speed > 100 93 100 93 0 0 13907 0 --:--:-- --:--:-- --:--:-- 15500 > { > "roles": [ > { > "frameworks": [], > "name": "*", > "resources": { > "cpus": 0, > "disk": 0, > "mem": 0 > }, > "weight": 1.0 > } > ] > } > {code} > After submitting tasks to r1, it shows the roles. > {code} > Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/ro
[jira] [Created] (MESOS-4741) Add role information for static reservation in /master/roles
Klaus Ma created MESOS-4741: --- Summary: Add role information for static reservation in /master/roles Key: MESOS-4741 URL: https://issues.apache.org/jira/browse/MESOS-4741 Project: Mesos Issue Type: Bug Components: HTTP API Reporter: Klaus Ma Assignee: Klaus Ma In {{/master/roles}}, it should show static reservation roles even if there are no tasks. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles.json | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 221 100 221 0 0 28612 0 --:--:-- --:--:-- --:--:-- 31571 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 }, { "frameworks": [ "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" ], "name": "r1", "resources": { "cpus": 1.0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} After submitting tasks to r1, it shows the roles. {code} Klauss-MacBook-Pro:mesos klaus$ curl http://localhost:5050/master/roles | python -m json.tool % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 221 100 221 0 0 32721 0 --:--:-- --:--:-- --:--:-- 36833 { "roles": [ { "frameworks": [], "name": "*", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "weight": 1.0 }, { "frameworks": [ "b4f15a2e-5d9a-4d31-a29e-7737af41c8e4-0002" ], "name": "r1", "resources": { "cpus": 1.0, "disk": 0, "mem": 0 }, "weight": 1.0 } ] } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2198) Document that TaskIDs should not be reused
[ https://issues.apache.org/jira/browse/MESOS-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-2198: - Assignee: Qian Zhang > Document that TaskIDs should not be reused > -- > > Key: MESOS-2198 > URL: https://issues.apache.org/jira/browse/MESOS-2198 > Project: Mesos > Issue Type: Bug > Components: documentation, framework >Reporter: Robert Lacroix >Assignee: Qian Zhang > Labels: documentation > > Let's update the documentation for TaskID to indicate that reuse is not > recommended, as per the discussion below. > - > Old Summary: Scheduler#statusUpdate should not be called multiple times for > the same status update > Currently Scheduler#statusUpdate can be called multiple times for the same > status update, for example when the slave retransmits a status update because > it's not acknowledged in time. Especially for terminal status updates this > can lead to unexpected scheduler behavior when task IDs are being reused. > Consider this scenario: > * Scheduler schedules task > * Task fails, slave sends TASK_FAILED > * Scheduler is busy and libmesos doesn't acknowledge the update in time > * Slave retransmits TASK_FAILED > * Scheduler eventually receives first TASK_FAILED and reschedules task > * Second TASK_FAILED triggers statusUpdate again and the scheduler can't > determine if the TASK_FAILED belongs to the first or second run of the task. > It would be a lot better if libmesos would dedupe status updates and only > call Scheduler#statusUpdate once per status update it received. Retries with > the same UUID shouldn't cause Scheduler#statusUpdate to be executed again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4602) Invalid usage of ATOMIC_FLAG_INIT in member initialization
[ https://issues.apache.org/jira/browse/MESOS-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158047#comment-15158047 ] Yong Tang commented on MESOS-4602: -- That seems to be an easy fix. Just submitted a review request: https://reviews.apache.org/r/43859/ > Invalid usage of ATOMIC_FLAG_INIT in member initialization > -- > > Key: MESOS-4602 > URL: https://issues.apache.org/jira/browse/MESOS-4602 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier >Assignee: Yong Tang > Labels: newbie, tech-debt > > MESOS-2925 fixed a few instances where {{ATOMIC_FLAG_INIT}} was used in > initializer lists, but missed fixing > {{3rdparty/libprocess/src/libevent_ssl_socket.cpp}} (even though the > corresponding header was touched). > There, {{LibeventSSLSocketImpl}}'s {{lock}} member is still (incorrectly) > initialized in initializer lists, even though the member is already > initialized in the class declaration, so it appears the initializer-list > entries should be dropped. > Clang from trunk incorrectly diagnoses the initializations in the initializer > lists as benign redundant braces in initialization of a scalar, but they > should be fixed for the reasons stated in MESOS-2925. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
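For illustration, the C++11 rule at play: {{ATOMIC_FLAG_INIT}} is only guaranteed to work as an initializer at the point of declaration. A simplified sketch of the invalid and valid patterns (not the actual libprocess code):

{code}
#include <atomic>

class SocketImpl
{
public:
  // Invalid: the standard does not permit ATOMIC_FLAG_INIT in a
  // constructor initializer list.
  //
  //   SocketImpl() : lock(ATOMIC_FLAG_INIT) {}

  SocketImpl() = default; // Rely on the in-class initializer below.

private:
  // Correct: initialize the flag where it is declared.
  std::atomic_flag lock = ATOMIC_FLAG_INIT;
};
{code}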
[jira] [Assigned] (MESOS-4602) Invalid usage of ATOMIC_FLAG_INIT in member initialization
[ https://issues.apache.org/jira/browse/MESOS-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Tang reassigned MESOS-4602: Assignee: Yong Tang > Invalid usage of ATOMIC_FLAG_INIT in member initialization > -- > > Key: MESOS-4602 > URL: https://issues.apache.org/jira/browse/MESOS-4602 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier >Assignee: Yong Tang > Labels: newbie, tech-debt > > MESOS-2925 fixed a few instances where {{ATOMIC_FLAG_INIT}} was used in > initializer lists, but missed fixing > {{3rdparty/libprocess/src/libevent_ssl_socket.cpp}} (even though the > corresponding header was touched). > There, {{LibeventSSLSocketImpl}}'s {{lock}} member is still (incorrectly) > initialized in initializer lists, even though the member is already > initialized in the class declaration, so it appears the initializer-list > entries should be dropped. > Clang from trunk incorrectly diagnoses the initializations in the initializer > lists as benign redundant braces in initialization of a scalar, but they > should be fixed for the reasons stated in MESOS-2925. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4676) ROOT_DOCKER_Logs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158030#comment-15158030 ] Joseph Wu commented on MESOS-4676: -- Confirmed that this is a docker issue. I fished out a command string from a failed test with {{GLOG_v=1}}, then ran it repeatedly on its own. On Ubuntu 12 (another place we're seeing the failure):
{code}
sh -c 'while true; do docker -H unix:///var/run/docker.sock run --cpu-shares 2048 --memory 1073741824 \
  -e MESOS_SANDBOX=/mnt/mesos/sandbox \
  -e MESOS_CONTAINER_NAME=mesos-3672e44c-2c92-48d5-825e-e8475227ad88-S0.bdb7f52c-5d3e-46f9-b676-4e693fb0d1f2 \
  -v /tmp/DockerContainerizerTest_ROOT_DOCKER_Logs_vSwYXT/slaves/3672e44c-2c92-48d5-825e-e8475227ad88-S0/frameworks/3672e44c-2c92-48d5-825e-e8475227ad88-/executors/1/runs/bdb7f52c-5d3e-46f9-b676-4e693fb0d1f2:/mnt/mesos/sandbox \
  --net host \
  --entrypoint /bin/sh \
  alpine -c "echo outd5d895af-0c86-41bc-9f27-037ab12d8035 ; echo errd5d895af-0c86-41bc-9f27-037ab12d8035 1>&2"; done' 2>&1 | grep -v \
  -e "^outd5d895af-0c86-41bc-9f27-037ab12d8035$" \
  -e "^errd5d895af-0c86-41bc-9f27-037ab12d8035$" \
  -e "^WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.$"
{code}
After about an hour (don't know exactly how many iterations), got the following output:
{code}
(outd5d895af-0c86-41bc-9f27-037abUnrecognized input header
(outd5d895af-0c86-41bc-9f27-037abUnrecognized input header
(errd5d895af-0c86-41bc-9f27-037abUnrecognized input header
(errd5d895af-0c86-41bc-9f27-037abUnrecognized input header
...
{code}
> ROOT_DOCKER_Logs is flaky. > -- > > Key: MESOS-4676 > URL: https://issues.apache.org/jira/browse/MESOS-4676 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27 > Environment: CentOS 7 with SSL.
>Reporter: Bernd Mathiske > Labels: flaky, mesosphere, test > > {noformat} > [18:06:25][Step 8/8] [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Logs > [18:06:25][Step 8/8] I0215 17:06:25.256103 1740 leveldb.cpp:174] Opened db > in 6.548327ms > [18:06:25][Step 8/8] I0215 17:06:25.258002 1740 leveldb.cpp:181] Compacted > db in 1.837816ms > [18:06:25][Step 8/8] I0215 17:06:25.258059 1740 leveldb.cpp:196] Created db > iterator in 22044ns > [18:06:25][Step 8/8] I0215 17:06:25.258076 1740 leveldb.cpp:202] Seeked to > beginning of db in 2347ns > [18:06:25][Step 8/8] I0215 17:06:25.258091 1740 leveldb.cpp:271] Iterated > through 0 keys in the db in 571ns > [18:06:25][Step 8/8] I0215 17:06:25.258152 1740 replica.cpp:779] Replica > recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [18:06:25][Step 8/8] I0215 17:06:25.258936 1758 recover.cpp:447] Starting > replica recovery > [18:06:25][Step 8/8] I0215 17:06:25.259177 1758 recover.cpp:473] Replica is > in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.260327 1757 replica.cpp:673] Replica in > EMPTY status received a broadcasted recover request from > (13608)@172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.260545 1758 recover.cpp:193] Received a > recover response from a replica in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.261065 1757 master.cpp:376] Master > 112363e2-c680-4946-8fee-d0626ed8b21e (ip-172-30-2-239.mesosphere.io) started > on 172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.261209 1761 recover.cpp:564] Updating > replica status to STARTING > [18:06:25][Step 8/8] I0215 17:06:25.261086 1757 master.cpp:378] Flags at > startup: --acls="" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate="true" > --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/HncLLj/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/HncLLj/master" > --zk_session_timeout="10secs" > [18:06:25][Step 8/8] I0215 17:06:25.261446 1757 master.cpp:423] Master only > allowing authenticated frameworks to register > [18:06:25][Step 8/8] I0215 17:06:25.261456 1757 master.cpp:428] Master only > allowing authenticated slaves to register > [18:06:25][Step 8/8] I0215 17:06:25.261462 1757 credentials.hpp:35]
[jira] [Created] (MESOS-4740) Improve metrics/snapshot performance
Cong Wang created MESOS-4740: Summary: Improve metrics/snapshot performance Key: MESOS-4740 URL: https://issues.apache.org/jira/browse/MESOS-4740 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang David Robinson noticed that retrieving metrics/snapshot statistics could be very inefficient and cause the Mesos master to get stuck.
{noformat}
[root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot

real    2m7.302s
user    0m0.001s
sys     0m0.004s
{noformat}
From a quick glance at the code, this *seems* to be because we sort all the values saved in the time series when calculating percentiles.
{noformat}
foreach (const typename TimeSeries<T>::Value& value, values_) {
  values.push_back(value.data);
}

std::sort(values.begin(), values.end());
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
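One possible direction, sketched under the assumption that the cost is dominated by that sort: a percentile only needs the k-th order statistic, so {{std::nth_element}} (expected linear time) can replace the full {{std::sort}} (O(n log n)). This is an illustration, not the fix chosen in the ticket:

{code}
#include <algorithm>
#include <vector>

// Return the p-th percentile (p in [0.0, 1.0]) of a copy of `values`.
// std::nth_element only partially orders the vector, so this runs in
// expected O(n) instead of the O(n log n) of a full sort.
double percentile(std::vector<double> values, double p)
{
  if (values.empty()) {
    return 0.0;
  }

  const size_t index = static_cast<size_t>(p * (values.size() - 1));

  std::nth_element(values.begin(), values.begin() + index, values.end());

  return values[index];
}
{code}

Note that when several percentiles are computed from the same window, a single sort may still win; the trade-off depends on how many percentiles the endpoint reports per time series.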
[jira] [Commented] (MESOS-4738) Make ingress and egress bandwidth a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157981#comment-15157981 ] Sargun Dhillon commented on MESOS-4738: --- We could build something to determine the bandwidth of the machine, or the kinds of NICs on it, to model bandwidth. I would say we should have a plan for implementing this resource, because it is in demand. > Make ingress and egress bandwidth a resource > > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource so that we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3007) Support systemd with Mesos
[ https://issues.apache.org/jira/browse/MESOS-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3007: Summary: Support systemd with Mesos (was: Support systemd with Mesos containerizer) > Support systemd with Mesos > -- > > Key: MESOS-3007 > URL: https://issues.apache.org/jira/browse/MESOS-3007 > Project: Mesos > Issue Type: Epic >Reporter: Artem Harutyunyan >Assignee: Joris Van Remoortere > Labels: mesosphere > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4738) Make ingress and egress bandwidth a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157959#comment-15157959 ] Ian Downes commented on MESOS-4738: --- [~adam-mesos] The port_mapping isolator does measure ingress and egress bandwidth per container (plus many other network-related metrics) and can set egress limits (currently at the agent level, but the actual isolating code is flexible). We are definitely interested in supporting bandwidth as a resource! [~wangcong] for visibility. > Make ingress and egress bandwidth a resource > > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource so that we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4737) document TaskID uniqueness requirement
[ https://issues.apache.org/jira/browse/MESOS-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157920#comment-15157920 ] Adam B commented on MESOS-4737: --- Duplicate of MESOS-2198 > document TaskID uniqueness requirement > -- > > Key: MESOS-4737 > URL: https://issues.apache.org/jira/browse/MESOS-4737 > Project: Mesos > Issue Type: Task > Components: documentation >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Erik Weathers >Priority: Minor > Labels: documentation > Attachments: Reusing Task IDs.pdf > > > There are comments above the definition of TaskID in > [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] > which lead one to believe it is ok to reuse TaskID values so long as you > guarantee there will only ever be 1 such TaskID running at the same time. > {code: title=existing comments for TaskID} > * A framework generated ID to distinguish a task. The ID must remain > * unique while the task is active. However, a framework can reuse an > * ID _only_ if a previous task with the same ID has reached a > * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). > {code} > However, there are a few scenarios where problems can arise. > # The checkpointing-and-recovery feature of mesos-slave/agent clashes with > tasks that reuse an ID and get assigned to the same executor. > #* See [this > email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] > for more info, as well as the attachment on this issue. > # Issues during network partitions and master failover, where a TaskID might > appear to be unique in the system, whereas in actuality another Task is > running with that ID and was just partitioned away for some time. > In light of these issues, we should simply update the document(s) to make it > abundantly clear that reusing TaskIDs is never ok. At the minimum this > should involve updating the afore-mentioned comments in {{mesos.proto}}. > Also any framework development guides that talk about TaskID creation should > be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4738) Make ingress and egress bandwidth a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157896#comment-15157896 ] Adam B commented on MESOS-4738: --- Have you tried this with custom resource types yet? Does that work for you? I'm not sure we should first-class a new resource type unless we can measure and isolate it. > Make ingress and egress bandwidth a resource > > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource so that we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
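For reference, a custom scalar can already be advertised through the agent's existing {{--resources}} flag; the {{bandwidth}} name and unit below are illustrative, and Mesos would treat it as an opaque scalar with no measurement or isolation behind it:

{code}
# Advertise 10000 (e.g. Mbit/s) of a custom "bandwidth" scalar on the
# agent; frameworks can then request it like any other scalar resource,
# but nothing enforces the limit.
mesos-slave --master=127.0.0.1:5050 \
  --work_dir=/var/lib/mesos \
  --resources="cpus:8;mem:16384;bandwidth:10000"
{code}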
[jira] [Commented] (MESOS-4739) libprocess CHECK failure in SlaveRecoveryTest/0.ReconnectHTTPExecutor
[ https://issues.apache.org/jira/browse/MESOS-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157869#comment-15157869 ] Neil Conway commented on MESOS-4739: cc [~anandmazumdar] [~mcypark] > libprocess CHECK failure in SlaveRecoveryTest/0.ReconnectHTTPExecutor > - > > Key: MESOS-4739 > URL: https://issues.apache.org/jira/browse/MESOS-4739 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: Neil Conway > Labels: flaky-test, libprocess, mesosphere > > {noformat} > [ RUN ] SlaveRecoveryTest/0.ReconnectHTTPExecutor > I0223 09:38:55.434953 11158 executor.cpp:172] Version: 0.28.0 > Received a SUBSCRIBED event > Starting task 1 > Finishing task 1 > Received an ERROR event > Received an ERROR event > E0223 09:38:55.504820 11159 executor.cpp:553] End-Of-File received from > agent. The agent closed the event stream > Received an ERROR event > Received an ERROR event > Received an ERROR event > F0223 09:39:00.535778 22159 process.cpp:1114] Check failed: items.size() > 0 > *** Check failure stack trace: *** > Received an ERROR event > Received an ERROR event > @ 0x7f4affd0e754 google::LogMessage::Fail() > Received an ERROR event > Received an ERROR event > Received an ERROR event > Received an ERROR event > @ 0x7f4affd0e6ad google::LogMessage::SendToLog() > @ 0x7f4affd0e0a3 google::LogMessage::Flush() > @ 0x7f4affd10f14 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4affc618d4 process::HttpProxy::waited() > @ 0x7f4affc8f57f > _ZZN7process8dispatchINS_9HttpProxyERKNS_6FutureINS_4http8ResponseEEES5_EEvRKNS_3PIDIT_EEMS9_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESI_ > @ 0x7f4affcac946 > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchINS0_9HttpProxyERKNS0_6FutureINS0_4http8ResponseEEES9_EEvRKNS0_3PIDIT_EEMSD_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_ > @ 0x7f4affc89961 std::function<>::operator()() > @ 0x7f4affc6ef02 process::ProcessBase::visit() > @ 0x7f4affc74e52 process::DispatchEvent::visit() > @ 0xa3afe8 process::ProcessBase::serve() > @ 0x7f4affc6b073 process::ProcessManager::resume() > @ 0x7f4affc6813b > _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt6atomicIbEE_clES4_ > @ 0x7f4affc745fa > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS4_EEE6__callIvJEJLm0T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE > @ 0x7f4affc745a8 > _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS4_EEEclIJEvEET0_DpOT_ > @ 0x7f4affc74556 > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS5_EEEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x7f4affc744bf > _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS5_EEEvEEclEv > @ 0x7f4affc7445e > _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS7_EEEvEEE6_M_runEv > @ 0x7f4afa6ddc40 execute_native_thread_routine > @ 0x7f4afadba424 start_thread > @ 0x7f4af9e50cbd __clone > @ (nil) (unknown) > Aborted (core dumped) > {noformat} > This crash was observed in a recent ArchLinux VM (Virtualbox), running > concurrently with {{stress --cpu 4}}. Repro'd with {{./src/mesos-tests > --gtest_filter="SlaveRecovery*" --gtest_repeat=100 > --gtest_break_on_failure}}; took about 20 iterations to trigger a crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2717) Qemu/KVM containerizer
[ https://issues.apache.org/jira/browse/MESOS-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157860#comment-15157860 ] James Peach commented on MESOS-2717: Not really. It is still on my TODO list, but I'd be happy to pass it on if someone else can work on it immediately :) > Qemu/KVM containerizer > -- > > Key: MESOS-2717 > URL: https://issues.apache.org/jira/browse/MESOS-2717 > Project: Mesos > Issue Type: Wish > Components: containerization >Reporter: Pierre-Yves Ritschard >Assignee: James Peach > > I think it would make sense for Mesos to have the ability to treat > hypervisors as containerizers, and the most sensible one to start with would > probably be Qemu/KVM. > There are a few workloads that can require full-fledged VMs (the most obvious > one being Windows workloads). > The containerization code is well decoupled and seems simple enough; I can > definitely take a shot at it. VMs do bring some questions with them; here is > my take on them: > 1. Routing, network strategy > == > The simplest approach here might very well be to go for bridged networks > and leave the setup and inter-slave routing up to the administrator. > 2. IP Address assignment > > At first, it can be up to the frameworks to deal with IP assignment. > The simplest way to address this could be to have an executor running > on slaves providing the qemu/kvm containerizer, which would instrument a DHCP > server and collect IP + MAC address resources from slaves. While it may be up > to the frameworks to provide this, an example should most likely be provided. > 3. VM Templates > == > VM templates should probably leverage the fetcher and could thus be copied > locally or fetched from HTTP(S) / HDFS. > 4. Resource limiting > > Mapping resource constraints to the qemu command line is probably the easiest > part. Additional command-line arguments should also be fetchable. For Unix VMs, the > sandbox could show the output of the serial console. > 5. Libvirt / plain Qemu > = > I tend to favor limiting the number of necessary hoops to jump through and > would thus investigate working directly with Qemu, maintaining an open > connection to the monitor to assert status. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4739) libprocess CHECK failure in SlaveRecoveryTest/0.ReconnectHTTPExecutor
Neil Conway created MESOS-4739: -- Summary: libprocess CHECK failure in SlaveRecoveryTest/0.ReconnectHTTPExecutor Key: MESOS-4739 URL: https://issues.apache.org/jira/browse/MESOS-4739 Project: Mesos Issue Type: Bug Components: HTTP API, libprocess Reporter: Neil Conway {noformat} [ RUN ] SlaveRecoveryTest/0.ReconnectHTTPExecutor I0223 09:38:55.434953 11158 executor.cpp:172] Version: 0.28.0 Received a SUBSCRIBED event Starting task 1 Finishing task 1 Received an ERROR event Received an ERROR event E0223 09:38:55.504820 11159 executor.cpp:553] End-Of-File received from agent. The agent closed the event stream Received an ERROR event Received an ERROR event Received an ERROR event F0223 09:39:00.535778 22159 process.cpp:1114] Check failed: items.size() > 0 *** Check failure stack trace: *** Received an ERROR event Received an ERROR event @ 0x7f4affd0e754 google::LogMessage::Fail() Received an ERROR event Received an ERROR event Received an ERROR event Received an ERROR event @ 0x7f4affd0e6ad google::LogMessage::SendToLog() @ 0x7f4affd0e0a3 google::LogMessage::Flush() @ 0x7f4affd10f14 google::LogMessageFatal::~LogMessageFatal() @ 0x7f4affc618d4 process::HttpProxy::waited() @ 0x7f4affc8f57f _ZZN7process8dispatchINS_9HttpProxyERKNS_6FutureINS_4http8ResponseEEES5_EEvRKNS_3PIDIT_EEMS9_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESI_ @ 0x7f4affcac946 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchINS0_9HttpProxyERKNS0_6FutureINS0_4http8ResponseEEES9_EEvRKNS0_3PIDIT_EEMSD_FvT0_ET1_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_ @ 0x7f4affc89961 std::function<>::operator()() @ 0x7f4affc6ef02 process::ProcessBase::visit() @ 0x7f4affc74e52 process::DispatchEvent::visit() @ 0xa3afe8 process::ProcessBase::serve() @ 0x7f4affc6b073 process::ProcessManager::resume() @ 0x7f4affc6813b _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt6atomicIbEE_clES4_ @ 0x7f4affc745fa _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS4_EEE6__callIvJEJLm0T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE @ 0x7f4affc745a8 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS4_EEEclIJEvEET0_DpOT_ @ 0x7f4affc74556 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS5_EEEvEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE @ 0x7f4affc744bf _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS5_EEEvEEclEv @ 0x7f4affc7445e _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt6atomicIbEE_St17reference_wrapperIS7_EEEvEEE6_M_runEv @ 0x7f4afa6ddc40 execute_native_thread_routine @ 0x7f4afadba424 start_thread @ 0x7f4af9e50cbd __clone @ (nil) (unknown) Aborted (core dumped) {noformat} This crash was observed in a recent ArchLinux VM (Virtualbox), running concurrently with {{stress --cpu 4}}. Repro'd with {{./src/mesos-tests --gtest_filter="SlaveRecovery*" --gtest_repeat=100 --gtest_break_on_failure}}; took about 20 iterations to trigger a crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4738) Make ingress and egress bandwidth a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sargun Dhillon updated MESOS-4738: -- Description: Some of our users care about variable network isolation. Although we cannot fundamentally limit ingress network bandwidth, having it as a resource so that we can drop packets above a specific limit would be attractive. It would be nice to expose egress and ingress bandwidth as an agent resource, perhaps with a default of 10,000 Mbps, and we can allow people to adjust as needed. Alternatively, a more advanced design would involve generating heuristics based on an analysis of the network MII / PHY. was: Some of our users care about variable network isolation. Although we cannot fundamentally limit ingress network bandwidth, having it as a resource so that we can drop packets above a specific limit would be attractive. > Make ingress and egress bandwidth a resource > > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource so that we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
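If these resources were introduced, an operator could model them much like today's custom scalar resources on the agent; the resource names below are hypothetical, chosen only for illustration:
{noformat}
# Hypothetical resource names (in Mbps); Mesos already accepts arbitrary
# custom scalar resources via the --resources flag.
mesos-slave --master=zk://host:2181/mesos \
  --resources="ingress_bandwidth:10000;egress_bandwidth:10000"
{noformat}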
[jira] [Updated] (MESOS-4738) Make ingress and egress bandwidth a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sargun Dhillon updated MESOS-4738: -- Labels: mesosphere (was: ) > Make ingress and egress bandwidth a resource > > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource so that we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4736) DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes is flaky
[ https://issues.apache.org/jira/browse/MESOS-4736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-4736: Assignee: Joseph Wu > DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes is flaky > > > Key: MESOS-4736 > URL: https://issues.apache.org/jira/browse/MESOS-4736 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.28.0 > Environment: Centos6 + GCC 4.9 on AWS >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, mesosphere, test > > This test passes consistently on other OS's, but fails consistently on CentOS > 6. > Verbose logs from test failure: > {code} > [ RUN ] DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes > I0222 18:16:12.327957 26681 leveldb.cpp:174] Opened db in 7.466102ms > I0222 18:16:12.330528 26681 leveldb.cpp:181] Compacted db in 2.540139ms > I0222 18:16:12.330580 26681 leveldb.cpp:196] Created db iterator in 16908ns > I0222 18:16:12.330592 26681 leveldb.cpp:202] Seeked to beginning of db in > 1403ns > I0222 18:16:12.330600 26681 leveldb.cpp:271] Iterated through 0 keys in the > db in 315ns > I0222 18:16:12.330634 26681 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0222 18:16:12.331082 26698 recover.cpp:447] Starting replica recovery > I0222 18:16:12.331289 26698 recover.cpp:473] Replica is in EMPTY status > I0222 18:16:12.332162 26703 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (13761)@172.30.2.148:35274 > I0222 18:16:12.332701 26701 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0222 18:16:12.333230 26699 recover.cpp:564] Updating replica status to > STARTING > I0222 18:16:12.334102 26698 master.cpp:376] Master > 652149b4-3932-4d8b-ba6f-8c9d9045be70 (ip-172-30-2-148.mesosphere.io) started > on 172.30.2.148:35274 > I0222 18:16:12.334116 26698 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/QEhLBS/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/QEhLBS/master" > --zk_session_timeout="10secs" > I0222 18:16:12.334354 26698 master.cpp:423] Master only allowing > authenticated frameworks to register > I0222 18:16:12.334363 26698 master.cpp:428] Master only allowing > authenticated slaves to register > I0222 18:16:12.334369 26698 credentials.hpp:35] Loading credentials for > authentication from '/tmp/QEhLBS/credentials' > I0222 18:16:12.335366 26698 master.cpp:468] Using default 'crammd5' > authenticator > I0222 18:16:12.335492 26698 master.cpp:537] Using default 'basic' HTTP > authenticator > I0222 18:16:12.335623 26698 master.cpp:571] Authorization enabled > I0222 18:16:12.335752 26703 leveldb.cpp:304] Persisting metadata 
(8 bytes) to > leveldb took 2.314693ms > I0222 18:16:12.335769 26700 whitelist_watcher.cpp:77] No whitelist given > I0222 18:16:12.335778 26703 replica.cpp:320] Persisted replica status to > STARTING > I0222 18:16:12.335821 26697 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0222 18:16:12.335965 26701 recover.cpp:473] Replica is in STARTING status > I0222 18:16:12.336771 26703 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (13763)@172.30.2.148:35274 > I0222 18:16:12.337191 26696 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0222 18:16:12.337635 26700 recover.cpp:564] Updating replica status to VOTING > I0222 18:16:12.337671 26703 master.cpp:1712] The newly elected leader is > master@172.30.2.148:35274 with id 652149b4-3932-4d8b-ba6f-8c9d9045be70 > I0222 18:16:12.337698 26703 master.cpp:1725] Elected as the leading master! > I0222 18:16:12.337713 26703 master.cpp:1470] Recovering from registrar > I0222 18:16:12.337828 26696 registrar.cpp:307] Recovering registrar > I0222 18:16:12.339972 26702 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 2.060
[jira] [Created] (MESOS-4738) Make ingress and egress bandwidth a resource
Sargun Dhillon created MESOS-4738: - Summary: Make ingress and egress bandwidth a resource Key: MESOS-4738 URL: https://issues.apache.org/jira/browse/MESOS-4738 Project: Mesos Issue Type: Improvement Reporter: Sargun Dhillon Priority: Minor Some of our users care about variable network isolation. Although we cannot fundamentally limit ingress network bandwidth, having it as a resource so that we can drop packets above a specific limit would be attractive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4737) document TaskID uniqueness requirement
[ https://issues.apache.org/jira/browse/MESOS-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated MESOS-4737: - Attachment: Reusing Task IDs.pdf > document TaskID uniqueness requirement > -- > > Key: MESOS-4737 > URL: https://issues.apache.org/jira/browse/MESOS-4737 > Project: Mesos > Issue Type: Task > Components: documentation >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Erik Weathers >Priority: Minor > Labels: documentation > Attachments: Reusing Task IDs.pdf > > > There are comments above the definition of TaskID in > [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] > which lead one to believe it is ok to reuse TaskID values so long as you > guarantee there will only ever be 1 such TaskID running at the same time. > {code: title=existing comments for TaskID} > * A framework generated ID to distinguish a task. The ID must remain > * unique while the task is active. However, a framework can reuse an > * ID _only_ if a previous task with the same ID has reached a > * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). > {code} > However, there are a few scenarios where problems can arise. > # The checkpointing-and-recovery feature of mesos-slave/agent clashes with > tasks that reuse an ID and get assigned to the same executor. > #* See [this > email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] > for more info, as well as the attachment on this issue. > # Issues during network partitions and master failover, where a TaskID might > appear to be unique in the system, whereas in actuality another Task is > running with that ID and was just partitioned away for some time. > In light of these issues, we should simply update the document(s) to make it > abundantly clear that reusing TaskIDs is never ok. At the minimum this > should involve updating the afore-mentioned comments in {{mesos.proto}}. > Also any framework development guides that talk about TaskID creation should > be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4737) document TaskID uniqueness requirement
Erik Weathers created MESOS-4737: Summary: document TaskID uniqueness requirement Key: MESOS-4737 URL: https://issues.apache.org/jira/browse/MESOS-4737 Project: Mesos Issue Type: Task Components: documentation Affects Versions: 0.27.0 Reporter: Erik Weathers Assignee: Erik Weathers Priority: Minor There are comments above the definition of TaskID in [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] which lead one to believe it is ok to reuse TaskID values so long as you guarantee there will only ever be 1 such TaskID running at the same time. {code title=existing comments for TaskID} * A framework generated ID to distinguish a task. The ID must remain * unique while the task is active. However, a framework can reuse an * ID _only_ if a previous task with the same ID has reached a * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). {code} However, there are a few scenarios where problems can arise. # The checkpointing-and-recovery feature of mesos-slave/agent clashes with tasks that reuse an ID and get assigned to the same executor. #* See [this email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] for more info, as well as the attachment on this issue. # Issues during network partitions and master failover, where a TaskID might appear to be unique in the system, whereas in actuality another Task is running with that ID and was just partitioned away for some time. In light of these issues, we should simply update the document(s) to make it abundantly clear that reusing TaskIDs is never ok. At the minimum this should involve updating the afore-mentioned comments in {{mesos.proto}}. Also any framework development guides that talk about TaskID creation should be updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
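Until the documentation is fixed, the safest framework-side practice is to never reuse IDs at all, for example by embedding a UUID; a minimal sketch (an illustration, not ticket content):
{code}
#include <string>

#include <mesos/mesos.pb.h>

#include <stout/uuid.hpp>

// Sketch: make a TaskID unique for the framework's lifetime by suffixing
// the logical task name with a random UUID. This sidesteps the reuse
// pitfalls described above at the cost of less readable IDs.
mesos::TaskID makeTaskId(const std::string& name)
{
  mesos::TaskID id;
  id.set_value(name + "-" + UUID::random().toString());
  return id;
}
{code}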
[jira] [Updated] (MESOS-4737) document TaskID uniqueness requirement
[ https://issues.apache.org/jira/browse/MESOS-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Weathers updated MESOS-4737: - Description: There are comments above the definition of TaskID in [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] which lead one to believe it is ok to reuse TaskID values so long as you guarantee there will only ever be 1 such TaskID running at the same time. {code: title=existing comments for TaskID} * A framework generated ID to distinguish a task. The ID must remain * unique while the task is active. However, a framework can reuse an * ID _only_ if a previous task with the same ID has reached a * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). {code} However, there are a few scenarios where problems can arise. # The checkpointing-and-recovery feature of mesos-slave/agent clashes with tasks that reuse an ID and get assigned to the same executor. #* See [this email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] for more info, as well as the attachment on this issue. # Issues during network partitions and master failover, where a TaskID might appear to be unique in the system, whereas in actuality another Task is running with that ID and was just partitioned away for some time. In light of these issues, we should simply update the document(s) to make it abundantly clear that reusing TaskIDs is never ok. At the minimum this should involve updating the afore-mentioned comments in {{mesos.proto}}. Also any framework development guides that talk about TaskID creation should be updated. was: There are comments above the definition of TaskID in [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] which lead one to believe it is ok to reuse TaskID values so long as you guarantee there will only ever be 1 such TaskID running at the same time. {code title=existing comments for TaskID} * A framework generated ID to distinguish a task. The ID must remain * unique while the task is active. However, a framework can reuse an * ID _only_ if a previous task with the same ID has reached a * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). {code} However, there are a few scenarios where problems can arise. # The checkpointing-and-recovery feature of mesos-slave/agent clashes with tasks that reuse an ID and get assigned to the same executor. #* See [this email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] for more info, as well as the attachment on this issue. # Issues during network partitions and master failover, where a TaskID might appear to be unique in the system, whereas in actuality another Task is running with that ID and was just partitioned away for some time. In light of these issues, we should simply update the document(s) to make it abundantly clear that reusing TaskIDs is never ok. At the minimum this should involve updating the afore-mentioned comments in {{mesos.proto}}. Also any framework development guides that talk about TaskID creation should be updated. 
> document TaskID uniqueness requirement > -- > > Key: MESOS-4737 > URL: https://issues.apache.org/jira/browse/MESOS-4737 > Project: Mesos > Issue Type: Task > Components: documentation >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Erik Weathers >Priority: Minor > Labels: documentation > > There are comments above the definition of TaskID in > [mesos.proto|https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66] > which lead one to believe it is ok to reuse TaskID values so long as you > guarantee there will only ever be 1 such TaskID running at the same time. > {code: title=existing comments for TaskID} > * A framework generated ID to distinguish a task. The ID must remain > * unique while the task is active. However, a framework can reuse an > * ID _only_ if a previous task with the same ID has reached a > * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.). > {code} > However, there are a few scenarios where problems can arise. > # The checkpointing-and-recovery feature of mesos-slave/agent clashes with > tasks that reuse an ID and get assigned to the same executor. > #* See [this > email|https://mail-archives.apache.org/mod_mbox/mesos-user/201602.mbox/%3CCAO5KYW8%2BXMWc1dXtEo20BAsfGow028jwjL2ubMinP%2BK%2BvdOh8w%40mail.gmail.com%3E] > for more info, as well as the attachment on this issue. > # Issues during network partitions and master failover, where a TaskID might > appear t
[jira] [Commented] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157704#comment-15157704 ] Alexander Rojas commented on MESOS-4047: Reproduced again with following message (CentOS 6.7): {noformat} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.000394345 s, 2.7 GB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery I0222 09:32:20.622694 20868 leveldb.cpp:174] Opened db in 5.153509ms I0222 09:32:20.624688 20868 leveldb.cpp:181] Compacted db in 1.914323ms I0222 09:32:20.624778 20868 leveldb.cpp:196] Created db iterator in 24549ns I0222 09:32:20.624795 20868 leveldb.cpp:202] Seeked to beginning of db in 2610ns I0222 09:32:20.624804 20868 leveldb.cpp:271] Iterated through 0 keys in the db in 323ns I0222 09:32:20.624874 20868 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0222 09:32:20.625977 20888 recover.cpp:447] Starting replica recovery I0222 09:32:20.626901 20888 recover.cpp:473] Replica is in EMPTY status I0222 09:32:20.634701 20889 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (11193)@127.0.0.1:54769 I0222 09:32:20.634953 20888 master.cpp:376] Master 17b7da64-0c4d-4e46-ae1f-2b356dc5f266 (localhost) started on 127.0.0.1:54769 I0222 09:32:20.634986 20888 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/0rXncF/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/0rXncF/master" --zk_session_timeout="10secs" W0222 09:32:20.635417 20888 master.cpp:381] ** Master bound to loopback interface! Cannot communicate with remote schedulers or slaves. You might want to set '--ip' flag to a routable IP address. 
** I0222 09:32:20.635587 20888 master.cpp:423] Master only allowing authenticated frameworks to register I0222 09:32:20.635601 20888 master.cpp:428] Master only allowing authenticated slaves to register I0222 09:32:20.635622 20888 credentials.hpp:35] Loading credentials for authentication from '/tmp/0rXncF/credentials' I0222 09:32:20.636018 20888 master.cpp:468] Using default 'crammd5' authenticator I0222 09:32:20.636190 20888 master.cpp:537] Using default 'basic' HTTP authenticator I0222 09:32:20.636174 20887 recover.cpp:193] Received a recover response from a replica in EMPTY status I0222 09:32:20.636425 20888 master.cpp:571] Authorization enabled I0222 09:32:20.637810 20885 recover.cpp:564] Updating replica status to STARTING I0222 09:32:20.640805 20887 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 2.741248ms I0222 09:32:20.640964 20887 replica.cpp:320] Persisted replica status to STARTING I0222 09:32:20.641525 20885 recover.cpp:473] Replica is in STARTING status I0222 09:32:20.642133 20888 master.cpp:1712] The newly elected leader is master@127.0.0.1:54769 with id 17b7da64-0c4d-4e46-ae1f-2b356dc5f266 I0222 09:32:20.642236 20888 master.cpp:1725] Elected as the leading master! I0222 09:32:20.642253 20888 master.cpp:1470] Recovering from registrar I0222 09:32:20.642496 20885 registrar.cpp:307] Recovering registrar I0222 09:32:20.643162 20889 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (11195)@127.0.0.1:54769 I0222 09:32:20.643590 20885 recover.cpp:193] Received a recover response from a replica in STARTING status I0222 09:32:20.644120 20887 recover.cpp:564] Updating replica status to VOTING I0222 09:32:20.646817 20889 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.190281ms I0222 09:32:20.646870 20889 replica.cpp:320] Persisted replica status to VOTING I0222 09:32:20.647094 20885 recover.cpp:578] Successfully joined the Paxos group I0222 09:32:20.647337 20885 recover.cpp:462] Recover process terminated I0222 09:32:20.647781 20887 log.cpp:659] Attempting to start the writer I0222 09:32:20.648854 20890 replica.cpp:493] Repli
[jira] [Updated] (MESOS-4686) Implement master failover tests for the scheduler library.
[ https://issues.apache.org/jira/browse/MESOS-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-4686: -- Sprint: Mesosphere Sprint 29 Review Chain: https://reviews.apache.org/r/43846/ > Implement master failover tests for the scheduler library. > -- > > Key: MESOS-4686 > URL: https://issues.apache.org/jira/browse/MESOS-4686 > Project: Mesos > Issue Type: Task >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar > Labels: mesosphere > > Currently, the scheduler library creates its own {{MasterDetector}} object > internally. We would need to create a standalone detector and create new > tests for testing that callbacks are invoked correctly in the event of a > master failover. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
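A rough skeleton of such a test, assuming the scheduler library gains a constructor that accepts an injected {{MasterDetector}} (that injection point is the proposed change, not an existing API):
{code}
// Skeleton only; a TestMesos constructor taking a detector is
// hypothetical here and is what this ticket would introduce.
Try<Owned<cluster::Master>> master = StartMaster();
ASSERT_SOME(master);

StandaloneMasterDetector detector(master.get()->pid);

scheduler::TestMesos mesos(
    master.get()->pid, ContentType::PROTOBUF, scheduler, &detector);

// Fail over: tear down the master, start a replacement, and re-appoint.
// The library should invoke disconnected() followed by connected().
master->reset();
master = StartMaster();
ASSERT_SOME(master);

detector.appoint(master.get()->pid);
{code}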
[jira] [Created] (MESOS-4736) DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes is flaky
Joseph Wu created MESOS-4736: Summary: DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes is flaky Key: MESOS-4736 URL: https://issues.apache.org/jira/browse/MESOS-4736 Project: Mesos Issue Type: Bug Affects Versions: 0.28.0 Environment: Centos6 + GCC 4.9 on AWS Reporter: Joseph Wu This test passes consistently on other OS's, but fails consistently on CentOS 6. Verbose logs from test failure: {code} [ RUN ] DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes I0222 18:16:12.327957 26681 leveldb.cpp:174] Opened db in 7.466102ms I0222 18:16:12.330528 26681 leveldb.cpp:181] Compacted db in 2.540139ms I0222 18:16:12.330580 26681 leveldb.cpp:196] Created db iterator in 16908ns I0222 18:16:12.330592 26681 leveldb.cpp:202] Seeked to beginning of db in 1403ns I0222 18:16:12.330600 26681 leveldb.cpp:271] Iterated through 0 keys in the db in 315ns I0222 18:16:12.330634 26681 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0222 18:16:12.331082 26698 recover.cpp:447] Starting replica recovery I0222 18:16:12.331289 26698 recover.cpp:473] Replica is in EMPTY status I0222 18:16:12.332162 26703 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (13761)@172.30.2.148:35274 I0222 18:16:12.332701 26701 recover.cpp:193] Received a recover response from a replica in EMPTY status I0222 18:16:12.333230 26699 recover.cpp:564] Updating replica status to STARTING I0222 18:16:12.334102 26698 master.cpp:376] Master 652149b4-3932-4d8b-ba6f-8c9d9045be70 (ip-172-30-2-148.mesosphere.io) started on 172.30.2.148:35274 I0222 18:16:12.334116 26698 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/QEhLBS/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/QEhLBS/master" --zk_session_timeout="10secs" I0222 18:16:12.334354 26698 master.cpp:423] Master only allowing authenticated frameworks to register I0222 18:16:12.334363 26698 master.cpp:428] Master only allowing authenticated slaves to register I0222 18:16:12.334369 26698 credentials.hpp:35] Loading credentials for authentication from '/tmp/QEhLBS/credentials' I0222 18:16:12.335366 26698 master.cpp:468] Using default 'crammd5' authenticator I0222 18:16:12.335492 26698 master.cpp:537] Using default 'basic' HTTP authenticator I0222 18:16:12.335623 26698 master.cpp:571] Authorization enabled I0222 18:16:12.335752 26703 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 2.314693ms I0222 18:16:12.335769 26700 whitelist_watcher.cpp:77] No whitelist given I0222 18:16:12.335778 26703 replica.cpp:320] Persisted replica status to STARTING I0222 18:16:12.335821 26697 hierarchical.cpp:144] Initialized hierarchical allocator process I0222 18:16:12.335965 26701 
recover.cpp:473] Replica is in STARTING status I0222 18:16:12.336771 26703 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (13763)@172.30.2.148:35274 I0222 18:16:12.337191 26696 recover.cpp:193] Received a recover response from a replica in STARTING status I0222 18:16:12.337635 26700 recover.cpp:564] Updating replica status to VOTING I0222 18:16:12.337671 26703 master.cpp:1712] The newly elected leader is master@172.30.2.148:35274 with id 652149b4-3932-4d8b-ba6f-8c9d9045be70 I0222 18:16:12.337698 26703 master.cpp:1725] Elected as the leading master! I0222 18:16:12.337713 26703 master.cpp:1470] Recovering from registrar I0222 18:16:12.337828 26696 registrar.cpp:307] Recovering registrar I0222 18:16:12.339972 26702 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 2.06039ms I0222 18:16:12.339994 26702 replica.cpp:320] Persisted replica status to VOTING I0222 18:16:12.340082 26700 recover.cpp:578] Successfully joined the Paxos group I0222 18:16:12.340267 26700 recover.cpp:462] Recover process terminated I0222 18:16:12.340591 26699 log.cpp:659] Attempting to start the writer I0222 18:16:12.341594 26698 replica.cpp:493] Replica received implicit promise request from (13764)@172.30.2.148:35274 with proposal 1
[jira] [Commented] (MESOS-4642) Mesos Agent Json API can dump binary data from log files out as invalid JSON
[ https://issues.apache.org/jira/browse/MESOS-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157470#comment-15157470 ] Chris Pennello commented on MESOS-4642: --- bq. In my opinion, encoding everything as UTF-8 would be an incorrect approach. Doing this introduces ambiguity with regards to the original data. Indeed, without an extra encoding scheme, this seems like an unavoidable consequence of using JSON. bq. For us to actually output valid JSON, we need to encode the output as unicode. I think it's a little worse than that; per the above, there simply exists data that is unrepresentable. Clients still do have {{files/read}} to use if they want the raw data, right? One idea is to add extra, Unicode-friendly encoding to {{files/read.json}} for raw data. For example, it could be Base64-encoded and _then_ dumped to JSON. Maybe as a more client-friendly idea, perhaps we could augment {{files/read.json}} such that sequences of bytes that can't be interpreted as UTF-8 encoded Unicode are replaced by a {{?}} character? [(This is kind of akin to Python's {{unicode(..., errors='replace')}}.)|https://docs.python.org/2/howto/unicode.html#the-unicode-type] That way, we'd be able to get valid JSON out of {{files/read.json}} (a plus!), and have "reasonable" behavior for unrepresentable data. As a wild idea, perhaps if we still wanted endpoints that could represent arbitrary, but _structured_ data, we might consider adding an additional serialization format, such as [MessagePack|http://msgpack.org/]. > Mesos Agent Json API can dump binary data from log files out as invalid JSON > > > Key: MESOS-4642 > URL: https://issues.apache.org/jira/browse/MESOS-4642 > Project: Mesos > Issue Type: Bug > Components: json api, slave >Affects Versions: 0.27.0 >Reporter: Steven Schlansker >Priority: Critical > > One of our tasks accidentally started logging binary data to stderr. This > was not intentional and generally should not happen -- however, it causes > severe problems with the Mesos Agent "files/read.json" API, since it gladly > dumps this binary data out as invalid JSON. 
> {code} > # hexdump -C /path/to/task/stderr | tail > 0003d1f0 6f 6e 6e 65 63 74 69 6f 6e 0a 4e 45 54 3a 20 31 |onnection.NET: 1| > 0003d200 20 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 | onread ENOENT 2| > 0003d210 39 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 |95456 251 295707| > 0003d220 0a 01 00 00 00 00 00 00 ac 57 65 64 2c 20 31 30 |.........Wed, 10| > 0003d230 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 20 69 6e | Unrecognized in| > 0003d240 70 75 74 20 68 65 61 64 65 72 0a |put header.| > {code} > {code} > # curl > 'http://agent-host:5051/files/read.json?path=/path/to/task/stderr&offset=220443&length=9&grep=' > | hexdump -C > 7970 6e 65 63 74 69 6f 6e 5c 6e 4e 45 54 3a 20 31 20 |nection\nNET: 1 | > 7980 6f 6e 72 65 61 64 20 45 4e 4f 45 4e 54 20 32 39 |onread ENOENT 29| > 7990 35 34 35 36 20 32 35 31 20 32 39 35 37 30 37 5c |5456 251 295707\| > 79a0 6e 5c 75 30 30 30 31 5c 75 30 30 30 30 5c 75 30 |n\u0001\u0000\u0| > 79b0 30 30 30 5c 75 30 30 30 30 5c 75 30 30 30 30 5c |000\u0000\u0000\| > 79c0 75 30 30 30 30 5c 75 30 30 30 30 ac 57 65 64 2c |u0000\u0000.Wed,| > 79d0 20 31 30 20 55 6e 72 65 63 6f 67 6e 69 7a 65 64 | 10 Unrecognized| > 79e0 20 69 6e 70 75 74 20 68 65 61 64 65 72 5c 6e 22 | input header\n"| > 79f0 2c 22 6f 66 66 73 65 74 22 3a 32 32 30 34 34 33 |,"offset":220443| > 7a00 7d |}| > {code} > This causes downstream sadness: > {code} > ERROR [2016-02-10 18:55:12,303] > io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: > 0ee749630f8b26f1 > ! com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xac > ! at [Source: org.jboss.netty.buffer.ChannelBufferInputStream@6d69ee8; line: > 1, column: 31181] > ! at > com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1487) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:518) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3339) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2360) > ~[singularity-0.4.9.jar:0.4.9] > ! at > com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8Strea
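A minimal sketch of the "replace with {{?}}" idea discussed in the comment above, assuming a simplified byte-level UTF-8 validator (lead and continuation bytes only, no overlong checks); illustrative, not Mesos code:
{code}
#include <string>

// Replace every byte that does not start or continue a well-formed UTF-8
// sequence with '?', so the result can always be serialized as JSON.
std::string replaceInvalidUtf8(const std::string& input)
{
  std::string output;

  for (size_t i = 0; i < input.size();) {
    const unsigned char c = input[i];

    // Expected sequence length, derived from the lead byte.
    const size_t length =
      c < 0x80 ? 1 :
      (c & 0xE0) == 0xC0 ? 2 :
      (c & 0xF0) == 0xE0 ? 3 :
      (c & 0xF8) == 0xF0 ? 4 : 0;

    bool valid = length > 0 && i + length <= input.size();

    // Every trailing byte must look like 10xxxxxx.
    for (size_t j = 1; valid && j < length; j++) {
      valid = (static_cast<unsigned char>(input[i + j]) & 0xC0) == 0x80;
    }

    if (valid) {
      output.append(input, i, length);
      i += length;
    } else {
      output += '?';
      i++;
    }
  }

  return output;
}
{code}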
[jira] [Assigned] (MESOS-4700) Allow agent to configure net_cls handle minor range.
[ https://issues.apache.org/jira/browse/MESOS-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan reassigned MESOS-4700: Assignee: Avinash Sridharan > Allow agent to configure net_cls handle minor range. > > > Key: MESOS-4700 > URL: https://issues.apache.org/jira/browse/MESOS-4700 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Avinash Sridharan > Labels: mesosphere > > A bug exists in some user libraries that prevents certain minor net_cls > handles from being used. It'll be great if we can configure the minor range through > agent flags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3368) Add device support in cgroups abstraction
[ https://issues.apache.org/jira/browse/MESOS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-3368: --- Assignee: Abhishek Dasgupta (was: Kevin Klues) > Add device support in cgroups abstraction > - > > Key: MESOS-3368 > URL: https://issues.apache.org/jira/browse/MESOS-3368 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Abhishek Dasgupta > > Add support for [device > cgroups|https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt] to > aid isolators controlling access to devices. > In the future, we could think about how to enumerate and control access to > devices as a resource or task/container policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
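For reference, the kernel interface being wrapped boils down to writing entries such as {{c 1:3 rwm}} into a cgroup's {{devices.allow}} file; a bare-bones sketch of that write, independent of the eventual Mesos abstraction:
{code}
#include <fstream>
#include <string>

// Allow read/write/mknod access to /dev/null (char device 1:3) for all
// processes in the given devices cgroup, per the kernel documentation
// linked above.
bool allowDevNull(const std::string& cgroup)
{
  std::ofstream file(cgroup + "/devices.allow");
  file << "c 1:3 rwm";
  return static_cast<bool>(file);
}
{code}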
[jira] [Commented] (MESOS-4623) Add a stub Nvidia GPU isolator.
[ https://issues.apache.org/jira/browse/MESOS-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157241#comment-15157241 ] Abhishek Dasgupta commented on MESOS-4623: -- ok..no worries > Add a stub Nvidia GPU isolator. > --- > > Key: MESOS-4623 > URL: https://issues.apache.org/jira/browse/MESOS-4623 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > We'll first wire up a skeleton Nvidia GPU isolator, which needs to be guarded > by a configure flag due to the dependency on NVML. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4623) Add a stub Nvidia GPU isolator.
[ https://issues.apache.org/jira/browse/MESOS-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157238#comment-15157238 ] Kevin Klues commented on MESOS-4623: I already have most of this in place; it is just not out for review yet. I've been developing on the nvidia-isolator branch of my mesos fork: https://github.com/klueska-mesosphere/mesos/tree/nvidia-isolator > Add a stub Nvidia GPU isolator. > --- > > Key: MESOS-4623 > URL: https://issues.apache.org/jira/browse/MESOS-4623 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > We'll first wire up a skeleton Nvidia GPU isolator, which needs to be guarded > by a configure flag due to the dependency on NVML. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4623) Add a stub Nvidia GPU isolator.
[ https://issues.apache.org/jira/browse/MESOS-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-4623: -- Assignee: Kevin Klues > Add a stub Nvidia GPU isolator. > --- > > Key: MESOS-4623 > URL: https://issues.apache.org/jira/browse/MESOS-4623 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > We'll first wire up a skeleton Nvidia GPU isolator, which needs to be guarded > by a configure flag due to the dependency on NVML. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4424) Initial support for GPU resources.
[ https://issues.apache.org/jira/browse/MESOS-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-4424: -- Assignee: Kevin Klues > Initial support for GPU resources. > -- > > Key: MESOS-4424 > URL: https://issues.apache.org/jira/browse/MESOS-4424 > Project: Mesos > Issue Type: Epic > Components: isolation >Reporter: Benjamin Mahler >Assignee: Kevin Klues > > Mesos already has generic mechanisms for expressing / isolating resources, > and we'd like to expose GPUs as resources that can be consumed and isolated. > However, GPUs present unique challenges: > * Users may rely on vendor-specific libraries to interact with the device > (e.g. CUDA, HSA, etc.); others may rely on portable libraries like OpenCL or > OpenGL. These libraries need to be available from within the container. > * GPU hardware has many attributes that may impose scheduling constraints > (e.g. core count, total memory, topology (via PCI-E, NVLINK, etc.), driver > versions, etc.). > * Obtaining utilization information requires vendor-specific approaches. > * Isolated sharing of a GPU device requires vendor-specific approaches. > As such, the focus is on supporting a narrow initial use case: homogeneous > device-level GPU support: > * Fractional sharing of GPU devices across containers will not be supported > initially, unlike CPU cores. > * Heterogeneity will be supported via other means for now (e.g. using agent > attributes to differentiate hardware profiles, using portable libraries like > OpenCL, etc.). > Working group email list: https://groups.google.com/forum/#!forum/mesos-gpus -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3368) Add device support in cgroups abstraction
[ https://issues.apache.org/jira/browse/MESOS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues reassigned MESOS-3368: -- Assignee: Kevin Klues > Add device support in cgroups abstraction > - > > Key: MESOS-3368 > URL: https://issues.apache.org/jira/browse/MESOS-3368 > Project: Mesos > Issue Type: Task >Reporter: Niklas Quarfot Nielsen >Assignee: Kevin Klues > > Add support for [device > cgroups|https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt] to > aid isolators controlling access to devices. > In the future, we could think about how to enumerate and control access to > devices as a resource or task/container policy. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4595) Add support for newest pre-defined Perf events to PerfEventIsolator
[ https://issues.apache.org/jira/browse/MESOS-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartek Plotka updated MESOS-4595: - Description: Currently, the Perf Event Isolator is able to monitor all (specified in {{--perf_events=...}}) Perf Events, but it can map only part of them in {{ResourceUsage.proto}} (to be more exact, in [PerfStatistics.proto | https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862]). Since the last time {{PerfStatistics.proto}} was updated, the list of supported events has expanded greatly and is growing constantly. I have created a comparison table: || Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 4.3.3 events || | HW events | 8 | 8 | | SW events | 9 | 10 | | HW cache event | 20 | 20 | | *Kernel PMU events* | *0* | *37* | | Tracepoint events | 0 | billion (: | For advanced analysis (e.g. during Oversubscription in the QoS Controller) having support for additional events is crucial. For instance in [Serenity|https://github.com/mesosphere/serenity] we based some of our revocation algorithms on the new [CMT| https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data] feature, which gives an additional, useful event called {{llc_occupancy}}. I think we all agree that it would be great to support more (or even all) perf events in {{Mesos PerfEventIsolator}} (: Let's start a discussion over the approach. Within this task we have three issues: # What events do we want to support in Mesos? ## all? ## only add Kernel PMU Events? --- I don't have a strong opinion on that, since I have never used {{Tracepoint events}}. We currently need PMU events. # How to add new (or modify existing) events in {{mesos.proto}}? We can distinguish three approaches here: *# Add new events statically in {{PerfStatistics.proto}} as separate optional fields (as it is currently). *# Instead of optional fields in the {{PerfStatistics.proto}} message, we could have a {{key-value}} map (something like {{labels}} in other messages) and feed it dynamically in the {{PerfEventIsolator}}. *# We could mix the above approaches and just add the mentioned map to the existing {{PerfStatistics.proto}} for additional events (: --- IMO: Approach 1 is somewhat explicit - users can see what events to expect (although they are parsed in a different manner, e.g. {{"-"}} to {{"_"}}), but we would end up with a looong message and a lot of copy-paste work. And we have to maintain that! Approaches 2 & 3 are more flexible, and we don't have the problem mentioned in the issue below (: And we *always* support *all* perf events in all kernel versions (: IMO approaches 2 & 3 are the best. # How to support different naming formats? For instance {{intel_cqm/llc_occupancy/}} with {{"/"}} in the name, or {{migrate:mm_migrate_pages}} with {{":"}}. I don't think it is possible to have these as field names in {{.proto}} syntax. Currently, approach #3 is chosen (adding a dynamic map to the existing {{PerfStatistics.proto}} for additional events specified in {{--perf_events=...}}). was: Currently, the Perf Event Isolator is able to monitor all (specified in {{--perf_events=...}}) Perf Events, but it can map only part of them in {{ResourceUsage.proto}} (to be more exact, in [PerfStatistics.proto | https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862]). Since the last time {{PerfStatistics.proto}} was updated, the list of supported events has expanded greatly and is growing constantly.
I have created a comparison table: || Events type || Num of matched events in PerfStatistics vs perf 4.3.3 || perf 4.3.3 events || | HW events | 8 | 8 | | SW events | 9 | 10 | | HW cache event | 20 | 20 | | *Kernel PMU events* | *0* | *37* | | Tracepoint events | 0 | billion (: | For advanced analysis (e.g. during Oversubscription in the QoS Controller) having support for additional events is crucial. For instance in [Serenity|https://github.com/mesosphere/serenity] we based some of our revocation algorithms on the new [CMT| https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data] feature, which gives an additional, useful event called {{llc_occupancy}}. I think we all agree that it would be great to support more (or even all) perf events in {{Mesos PerfEventIsolator}} (: Let's start a discussion over the approach. Within this task we have three issues: # What events do we want to support in Mesos? ## all? ## only add Kernel PMU Events? --- I don't have a strong opinion on that, since I have never used {{Tracepoint events}}. We currently need PMU events. # How to add new (or modify existing) events in {{mesos.proto}}? We can distinguish three approaches here: *# Add new events statically in {{PerfStatistics.proto}} as separate optional fields (as it is currently). *# Instead of optional fields in {{Pe
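To illustrate why the key-value approaches (2 and 3) are attractive: the isolator could forward whatever {{perf}} reports without any further proto changes. A sketch of such dynamic collection, assuming the CSV field order of {{perf stat -x,}} in perf 4.x (value, unit, event):
{code}
#include <map>
#include <sstream>
#include <string>

// Parse `perf stat -x,` CSV output into an event -> value map that could
// back a dynamic key-value field in PerfStatistics. The field order is
// an assumption to verify against the perf version in use.
std::map<std::string, double> parsePerfStat(const std::string& output)
{
  std::map<std::string, double> statistics;

  std::istringstream lines(output);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string value, unit, event;
    if (std::getline(fields, value, ',') &&
        std::getline(fields, unit, ',') &&
        std::getline(fields, event, ',')) {
      try {
        statistics[event] = std::stod(value);
      } catch (...) {
        // Skip markers such as "<not counted>" or "<not supported>".
      }
    }
  }

  return statistics;
}
{code}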
[jira] [Commented] (MESOS-4269) Minor typo in src/linux/cgroups.cpp
[ https://issues.apache.org/jira/browse/MESOS-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157038#comment-15157038 ] Ryuichi Okumura commented on MESOS-4269: Any takers? > Minor typo in src/linux/cgroups.cpp > --- > > Key: MESOS-4269 > URL: https://issues.apache.org/jira/browse/MESOS-4269 > Project: Mesos > Issue Type: Bug >Reporter: Ryuichi Okumura >Assignee: Ryuichi Okumura >Priority: Minor > > There is a typo in "src/linux/cgroups.cpp" as follows: > https://github.com/apache/mesos/blob/765c025dd43e04360b29c19bd9a66837954c5a20/src/linux/cgroups.cpp#L1438 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4547) Introduce TASK_KILLING state.
[ https://issues.apache.org/jira/browse/MESOS-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156825#comment-15156825 ] Guangya Liu commented on MESOS-4547: A new file related to TASK_KILLING was added for the command executor, but there is no RR for it here: https://github.com/apache/mesos/blob/master/src/tests/command_executor_tests.cpp > Introduce TASK_KILLING state. > - > > Key: MESOS-4547 > URL: https://issues.apache.org/jira/browse/MESOS-4547 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Abhishek Dasgupta > Labels: mesosphere > Fix For: 0.28.0 > > > Currently there is no state to express that a task is being killed, but is > not yet killed (see MESOS-4140). In a similar way to how we have > TASK_STARTING to indicate the task is starting but not yet running, a > TASK_KILLING state would indicate the task is being killed but is not yet > killed. > This would need to be guarded by a framework capability to protect old > frameworks that cannot understand the TASK_KILLING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
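The capability guard mentioned in the description comes down to a check like the following before a TASK_KILLING update is sent (a condensed sketch of the pattern, not the exact Mesos code):
{code}
#include <mesos/mesos.pb.h>

// Only send TASK_KILLING if the framework registered with the
// TASK_KILLING_STATE capability; older frameworks would otherwise
// receive a task state they cannot interpret.
bool supportsTaskKillingState(const mesos::FrameworkInfo& framework)
{
  for (int i = 0; i < framework.capabilities_size(); i++) {
    if (framework.capabilities(i).type() ==
        mesos::FrameworkInfo::Capability::TASK_KILLING_STATE) {
      return true;
    }
  }
  return false;
}
{code}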
[jira] [Assigned] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu reassigned MESOS-4735: -- Assignee: Guangya Liu > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Assignee: Guangya Liu >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extraction of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could work around the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4735) CommandInfo.URI should allow specifying target filename
[ https://issues.apache.org/jira/browse/MESOS-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156822#comment-15156822 ] Guangya Liu commented on MESOS-4735: [~erikdw] Can you please give more detail on your desired URI, with an example? I am a bit confused: if only a file name is given, how can the fetcher get the file? > CommandInfo.URI should allow specifying target filename > --- > > Key: MESOS-4735 > URL: https://issues.apache.org/jira/browse/MESOS-4735 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Affects Versions: 0.27.0 >Reporter: Erik Weathers >Priority: Minor > > The {{CommandInfo.URI}} message should allow explicitly choosing the > downloaded file's name, to better mimic functionality present in tools like > {{wget}} and {{curl}}. > This relates to issues when the {{CommandInfo.URI}} is pointing to a URL that > has query parameters at the end of the path, resulting in the downloaded > filename having those elements. This also prevents extraction of such files, > since the extraction logic is simply looking at the file's suffix. See > MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue was > fixed, then I could work around the other issues not being fixed by modifying > my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3937) Test DockerContainerizerTest.ROOT_DOCKER_Launch_Executor fails.
[ https://issues.apache.org/jira/browse/MESOS-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-3937: -- Assignee: (was: Till Toenshoff) > Test DockerContainerizerTest.ROOT_DOCKER_Launch_Executor fails. > --- > > Key: MESOS-3937 > URL: https://issues.apache.org/jira/browse/MESOS-3937 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 0.26.0 > Environment: Ubuntu 14.04, gcc 4.8.4, Docker version 1.6.2 > 8 CPUs, 16 GB memory > Vagrant, libvirt/Virtual Box or VMware >Reporter: Bernd Mathiske > Labels: mesosphere > Fix For: 0.26.0 > > > {noformat} > ../configure > make check > sudo ./bin/mesos-tests.sh > --gtest_filter="DockerContainerizerTest.ROOT_DOCKER_Launch_Executor" --verbose > {noformat} > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from DockerContainerizerTest > I1117 15:08:09.265943 26380 leveldb.cpp:176] Opened db in 3.199666ms > I1117 15:08:09.267761 26380 leveldb.cpp:183] Compacted db in 1.684873ms > I1117 15:08:09.267902 26380 leveldb.cpp:198] Created db iterator in 58313ns > I1117 15:08:09.267966 26380 leveldb.cpp:204] Seeked to beginning of db in > 4927ns > I1117 15:08:09.267997 26380 leveldb.cpp:273] Iterated through 0 keys in the > db in 1605ns > I1117 15:08:09.268156 26380 replica.cpp:780] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1117 15:08:09.270148 26396 recover.cpp:449] Starting replica recovery > I1117 15:08:09.272105 26396 recover.cpp:475] Replica is in EMPTY status > I1117 15:08:09.275640 26396 replica.cpp:676] Replica in EMPTY status received > a broadcasted recover request from (4)@10.0.2.15:50088 > I1117 15:08:09.276578 26399 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1117 15:08:09.277600 26397 recover.cpp:566] Updating replica status to > STARTING > I1117 15:08:09.279613 26396 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.016098ms > I1117 15:08:09.279731 26396 replica.cpp:323] Persisted replica status to > STARTING > I1117 15:08:09.280306 26399 recover.cpp:475] Replica is in STARTING status > I1117 15:08:09.282181 26400 replica.cpp:676] Replica in STARTING status > received a broadcasted recover request from (5)@10.0.2.15:50088 > I1117 15:08:09.282552 26400 master.cpp:367] Master > 59c600f1-92ff-4926-9c84-073d9b81f68a (vagrant-ubuntu-trusty-64) started on > 10.0.2.15:50088 > I1117 15:08:09.283021 26400 master.cpp:369] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/40AlT8/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/40AlT8/master" > --zk_session_timeout="10secs" > I1117 15:08:09.283920 26400 master.cpp:414] Master only allowing > authenticated frameworks to register > I1117 15:08:09.283972 26400 master.cpp:419] Master only allowing > 
authenticated slaves to register > I1117 15:08:09.284032 26400 credentials.hpp:37] Loading credentials for > authentication from '/tmp/40AlT8/credentials' > I1117 15:08:09.282944 26401 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1117 15:08:09.284639 26401 recover.cpp:566] Updating replica status to VOTING > I1117 15:08:09.285539 26400 master.cpp:458] Using default 'crammd5' > authenticator > I1117 15:08:09.285995 26401 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.075466ms > I1117 15:08:09.286062 26401 replica.cpp:323] Persisted replica status to > VOTING > I1117 15:08:09.286200 26401 recover.cpp:580] Successfully joined the Paxos > group > I1117 15:08:09.286471 26401 recover.cpp:464] Recover process terminated > I1117 15:08:09.287303 26400 authenticator.cpp:520] Initializing server SASL > I1117 15:08:09.289371 26400 master.cpp:495] Authorization enabled > I1117 15:08:09.296018 26399 master.cpp:1606] The newly elected leader is > master@10.0.2.15:50088 with id 59c600f1-92ff-4926-9c84-073d9b81f68a > I1117 15:08:09.296115 26399 master.cpp:1619] Elected as the leading master! > I1117 15:08:09.2
[jira] [Commented] (MESOS-3937) Test DockerContainerizerTest.ROOT_DOCKER_Launch_Executor fails.
[ https://issues.apache.org/jira/browse/MESOS-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156763#comment-15156763 ] Till Toenshoff commented on MESOS-3937: --- Reopening this ticket for suggesting the already mentioned "medium-term" solution as also hinted by the last comment on https://reviews.apache.org/r/40748/. We might want to try to identify the lack of a resolvable hostname before attempting to run the above tests. Doing so within the test environment analysis allows us to exclude this/these test/s while also making the user aware of this configuration problem. > Test DockerContainerizerTest.ROOT_DOCKER_Launch_Executor fails. > --- > > Key: MESOS-3937 > URL: https://issues.apache.org/jira/browse/MESOS-3937 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 0.26.0 > Environment: Ubuntu 14.04, gcc 4.8.4, Docker version 1.6.2 > 8 CPUs, 16 GB memory > Vagrant, libvirt/Virtual Box or VMware >Reporter: Bernd Mathiske >Assignee: Till Toenshoff > Labels: mesosphere > Fix For: 0.26.0 > > > {noformat} > ../configure > make check > sudo ./bin/mesos-tests.sh > --gtest_filter="DockerContainerizerTest.ROOT_DOCKER_Launch_Executor" --verbose > {noformat} > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from DockerContainerizerTest > I1117 15:08:09.265943 26380 leveldb.cpp:176] Opened db in 3.199666ms > I1117 15:08:09.267761 26380 leveldb.cpp:183] Compacted db in 1.684873ms > I1117 15:08:09.267902 26380 leveldb.cpp:198] Created db iterator in 58313ns > I1117 15:08:09.267966 26380 leveldb.cpp:204] Seeked to beginning of db in > 4927ns > I1117 15:08:09.267997 26380 leveldb.cpp:273] Iterated through 0 keys in the > db in 1605ns > I1117 15:08:09.268156 26380 replica.cpp:780] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1117 15:08:09.270148 26396 recover.cpp:449] Starting replica recovery > I1117 15:08:09.272105 26396 recover.cpp:475] Replica is in EMPTY status > I1117 15:08:09.275640 26396 replica.cpp:676] Replica in EMPTY status received > a broadcasted recover request from (4)@10.0.2.15:50088 > I1117 15:08:09.276578 26399 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1117 15:08:09.277600 26397 recover.cpp:566] Updating replica status to > STARTING > I1117 15:08:09.279613 26396 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.016098ms > I1117 15:08:09.279731 26396 replica.cpp:323] Persisted replica status to > STARTING > I1117 15:08:09.280306 26399 recover.cpp:475] Replica is in STARTING status > I1117 15:08:09.282181 26400 replica.cpp:676] Replica in STARTING status > received a broadcasted recover request from (5)@10.0.2.15:50088 > I1117 15:08:09.282552 26400 master.cpp:367] Master > 59c600f1-92ff-4926-9c84-073d9b81f68a (vagrant-ubuntu-trusty-64) started on > 10.0.2.15:50088 > I1117 15:08:09.283021 26400 master.cpp:369] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/40AlT8/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > 
--registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/40AlT8/master" > --zk_session_timeout="10secs" > I1117 15:08:09.283920 26400 master.cpp:414] Master only allowing > authenticated frameworks to register > I1117 15:08:09.283972 26400 master.cpp:419] Master only allowing > authenticated slaves to register > I1117 15:08:09.284032 26400 credentials.hpp:37] Loading credentials for > authentication from '/tmp/40AlT8/credentials' > I1117 15:08:09.282944 26401 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1117 15:08:09.284639 26401 recover.cpp:566] Updating replica status to VOTING > I1117 15:08:09.285539 26400 master.cpp:458] Using default 'crammd5' > authenticator > I1117 15:08:09.285995 26401 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 1.075466ms > I1117 15:08:09.286062 26401 replica.cpp:323] Persisted replica status to > VOTING > I1117 15:08:09.286200 26401 recover.cpp:580] Successfully joined the
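For illustration, here is a minimal sketch of the resolvability probe suggested above, assuming a plain POSIX approach; the function name and its placement in the test environment analysis are hypothetical, not actual Mesos test code:
{noformat}
// Sketch only: checks whether the local hostname resolves to at least
// one address. The test environment analysis could skip the Docker
// executor tests (and warn the user) when this returns false.
#include <netdb.h>
#include <unistd.h>

bool hostnameIsResolvable()
{
  char hostname[256];
  if (gethostname(hostname, sizeof(hostname)) != 0) {
    return false;
  }

  addrinfo* result = nullptr;
  if (getaddrinfo(hostname, nullptr, nullptr, &result) != 0) {
    return false;
  }

  freeaddrinfo(result);
  return true;
}
{noformat}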
[jira] [Created] (MESOS-4735) CommandInfo.URI should allow specifying target filename
Erik Weathers created MESOS-4735: Summary: CommandInfo.URI should allow specifying target filename Key: MESOS-4735 URL: https://issues.apache.org/jira/browse/MESOS-4735 Project: Mesos Issue Type: Improvement Components: fetcher Affects Versions: 0.27.0 Reporter: Erik Weathers Priority: Minor The {{CommandInfo.URI}} message should allow explicitly choosing the downloaded file's name, to better mimic functionality present in tools like {{wget}} and {{curl}}. This relates to issues where the {{CommandInfo.URI}} points to a URL with query parameters at the end of the path, resulting in a downloaded filename that includes those elements. This also prevents such files from being extracted, since the extraction logic simply looks at the file's suffix. See MESOS-3367, MESOS-1686, and MESOS-1509 for more info. If this issue were fixed, I could work around the other issues, even while they remain unfixed, by modifying my framework's scheduler to set the target filename. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
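A hedged sketch of the filename derivation this asks for; the {{output_file}} parameter stands in for the proposed {{CommandInfo.URI}} field and is hypothetical, not existing Mesos API:
{noformat}
#include <string>

// Mirrors the behavior the ticket complains about: the basename of the
// URI path is kept as-is, including any query string, so a fetch of
// "http://host/app.tar.gz?token=abc" saves as "app.tar.gz?token=abc".
std::string basenameWithQuery(const std::string& uri)
{
  const auto slash = uri.find_last_of('/');
  return slash == std::string::npos ? uri : uri.substr(slash + 1);
}

// With an explicit target filename (as proposed), the caller chooses
// the name, and extraction can key off a clean ".tar.gz" suffix.
std::string targetFilename(
    const std::string& uri,
    const std::string& outputFile)  // hypothetical new field
{
  if (!outputFile.empty()) {
    return outputFile;
  }

  std::string name = basenameWithQuery(uri);

  // Fall back to stripping query parameters from the basename.
  const auto query = name.find('?');
  if (query != std::string::npos) {
    name = name.substr(0, query);
  }

  return name;
}
{noformat}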
[jira] [Commented] (MESOS-4547) Introduce TASK_KILLING state.
[ https://issues.apache.org/jira/browse/MESOS-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156737#comment-15156737 ] Abhishek Dasgupta commented on MESOS-4547: -- Yes, there are test cases and docs for this feature: For docs: https://reviews.apache.org/r/43827/ https://reviews.apache.org/r/43821/ [~bmahler] Could you please provide the RRs for the test cases here, if any? > Introduce TASK_KILLING state. > - > > Key: MESOS-4547 > URL: https://issues.apache.org/jira/browse/MESOS-4547 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Abhishek Dasgupta > Labels: mesosphere > Fix For: 0.28.0 > > > Currently there is no state to express that a task is being killed, but is > not yet killed (see MESOS-4140). In a similar way to how we have > TASK_STARTING to indicate the task is starting but not yet running, a > TASK_KILLING state would indicate the task is being killed but is not yet > killed. > This would need to be guarded by a framework capability to protect old > frameworks that cannot understand the TASK_KILLING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4547) Introduce TASK_KILLING state.
[ https://issues.apache.org/jira/browse/MESOS-4547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156719#comment-15156719 ] Bernd Mathiske commented on MESOS-4547: --- The RR for tests (https://reviews.apache.org/r/43490/) has been discarded. Are there going to be tests and documentation for this feature? > Introduce TASK_KILLING state. > - > > Key: MESOS-4547 > URL: https://issues.apache.org/jira/browse/MESOS-4547 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler >Assignee: Abhishek Dasgupta > Labels: mesosphere > Fix For: 0.28.0 > > > Currently there is no state to express that a task is being killed, but is > not yet killed (see MESOS-4140). In a similar way to how we have > TASK_STARTING to indicate the task is starting but not yet running, a > TASK_KILLING state would indicate the task is being killed but is not yet > killed. > This would need to be guarded by a framework capability to protect old > frameworks that cannot understand the TASK_KILLING state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
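A sketch of how a framework might opt in to and handle the new state, per the capability guard described in the ticket; the capability name {{TASK_KILLING_STATE}} follows that description but should be treated as an assumption here:
{noformat}
#include <mesos/mesos.pb.h>

// Opt in so the master will send TASK_KILLING updates to this
// framework instead of withholding them for compatibility.
void registerWithKillingCapability(mesos::FrameworkInfo* framework)
{
  framework->add_capabilities()->set_type(
      mesos::FrameworkInfo::Capability::TASK_KILLING_STATE);
}

void handleStatusUpdate(const mesos::TaskStatus& status)
{
  switch (status.state()) {
    case mesos::TASK_KILLING:
      // The kill was initiated but the task is still winding down,
      // e.g. in a grace period before a forceful kill.
      break;
    case mesos::TASK_KILLED:
      // The task has now reached a terminal state.
      break;
    default:
      break;
  }
}
{noformat}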
[jira] [Created] (MESOS-4734) Add running cluster upgrade section for 0.25 => 0.26 and 0.26 => 0.27
Michael Park created MESOS-4734: --- Summary: Add running cluster upgrade section for 0.25 => 0.26 and 0.26 => 0.27 Key: MESOS-4734 URL: https://issues.apache.org/jira/browse/MESOS-4734 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Michael Park In {{docs/upgrades.md}}, the 0.25-to-0.26 and 0.26-to-0.27 sections are missing the "In order to upgrade a running cluster" steps. We should add these sections. -- This message was sent by Atlassian JIRA (v6.3.4#6332)