[jira] [Commented] (MESOS-1806) Etcd-based master contender/detector module
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281232#comment-15281232 ] Jay Guo commented on MESOS-1806: We created a repo to temporarily host this module. Your comments and reviews are highly appreciated. https://github.com/guoger/mesos-etcd-module > Etcd-based master contender/detector module > --- > > Key: MESOS-1806 > URL: https://issues.apache.org/jira/browse/MESOS-1806 > Project: Mesos > Issue Type: Epic > Components: leader election >Reporter: Ed Ropple >Assignee: Shuai Lin >Priority: Minor > >eropple: Could you also file a new JIRA for Mesos to drop ZK > in favor of etcd or ReplicatedLog? Would love to get some momentum going on > that one. > -- > Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5366) Update documentation to include contender/detector module
Jay Guo created MESOS-5366: -- Summary: Update documentation to include contender/detector module Key: MESOS-5366 URL: https://issues.apache.org/jira/browse/MESOS-5366 Project: Mesos Issue Type: Documentation Reporter: Jay Guo Assignee: Jay Guo Priority: Minor Since contender and detector are modularized, the documentation should be updated to reflect this change as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5106) Improve test_http_framework so it can load master detector from modules
[ https://issues.apache.org/jira/browse/MESOS-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Guo reassigned MESOS-5106: -- Assignee: zhou xing > Improve test_http_framework so it can load master detector from modules > --- > > Key: MESOS-5106 > URL: https://issues.apache.org/jira/browse/MESOS-5106 > Project: Mesos > Issue Type: Task >Reporter: Shuai Lin >Assignee: zhou xing > > I'm planning to restart the work of [MESOS-1806] (etcd contender/detector) > based on [MESOS-4610]. One thing I need to address first is when writing a > script test, I need a framework that can use a master detector loaded from a > module. The best way to do this seems to be adding {{\-\-modules}} and > {{\-\-master_detector}} flags to {{test_http_framework.cpp}} so we can reuse > it in tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
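The proposed change amounts to two new command-line flags on the test framework. A rough sketch of their shape, using Python's argparse rather than Mesos' C++ flags helper (the module name below is hypothetical):

```python
import argparse

# Sketch of the two flags proposed for test_http_framework, shown with
# argparse instead of Mesos' C++ flags helper. The detector module name
# used in the example is hypothetical.
parser = argparse.ArgumentParser(description="test_http_framework flags (sketch)")
parser.add_argument("--modules", default=None,
                    help="path to a JSON file listing modules to load")
parser.add_argument("--master_detector", default=None,
                    help="name of the master detector module to instantiate")

args = parser.parse_args(["--modules", "/tmp/modules.json",
                          "--master_detector", "org_apache_mesos_TestMasterDetector"])
```

With both flags optional, existing invocations of the framework keep working while script tests can point it at a modularized detector.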
[jira] [Commented] (MESOS-4434) Install 3rdparty package boost, glog, protobuf and picojson when installing Mesos
[ https://issues.apache.org/jira/browse/MESOS-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281227#comment-15281227 ] Jay Guo commented on MESOS-4434: This definitely eases the compilation of modules. It would be good to have the flag `--enable-install-module-dependencies` reflected in the documentation as well. > Install 3rdparty package boost, glog, protobuf and picojson when installing > Mesos > - > > Key: MESOS-4434 > URL: https://issues.apache.org/jira/browse/MESOS-4434 > Project: Mesos > Issue Type: Bug > Components: build, modules >Reporter: Kapil Arya >Assignee: Kapil Arya > Labels: mesosphere > Fix For: 0.29.0 > > > Mesos modules depend on having these packages installed with the exact > version as Mesos was compiled with. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3637) Port process/process.hpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281166#comment-15281166 ] Michael Park commented on MESOS-3637: - {noformat} commit 0676792bfeb8dc0abd90d71e84441feeb9e207fa Author: Alex Clemmer Date: Wed May 11 21:35:09 2016 -0600 Libprocess: Implemented `subprocess_windows.cpp`. Review: https://reviews.apache.org/r/46608/ {noformat} {noformat} commit aa281adf8a6eedbfa83f35e8561b2bb3e001b155 Author: Alex Clemmer Date: Wed May 11 21:35:01 2016 -0600 Windows: Forked `subprocess.cpp`. Review: https://reviews.apache.org/r/46423/ {noformat} > Port process/process.hpp to Windows > --- > > Key: MESOS-3637 > URL: https://issues.apache.org/jira/browse/MESOS-3637 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2516) Move allocation-related types to mesos::master namespace
[ https://issues.apache.org/jira/browse/MESOS-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281116#comment-15281116 ] José Guilherme Vanz commented on MESOS-2516: https://reviews.apache.org/r/47281/ > Move allocation-related types to mesos::master namespace > > > Key: MESOS-2516 > URL: https://issues.apache.org/jira/browse/MESOS-2516 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Alexander Rukletsov >Assignee: José Guilherme Vanz >Priority: Minor > Labels: easyfix, newbie > > {{Allocator}}, {{Sorter}} and {{Comparator}} types live in > {{master::allocator}} namespace. This is not consistent with the rest of the > codebase: {{Isolator}}, {{Fetcher}}, {{Containerizer}} all live in {{slave}} > namespace. Namespace {{allocator}} should be killed for consistency. > Since sorters are poorly named, they should be renamed (or namespaced) prior > to this change in order not to pollute {{master}} namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5287) boto is no longer a Mesos dependency.
[ https://issues.apache.org/jira/browse/MESOS-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281099#comment-15281099 ] Chen Zhiwei commented on MESOS-5287: Thanks, I also planned to update this patch to include the getting-started.md updates. > boto is no longer a Mesos dependency. > - > > Key: MESOS-5287 > URL: https://issues.apache.org/jira/browse/MESOS-5287 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu >Assignee: Chen Zhiwei > Labels: easyfix, newbie > Fix For: 0.29.0 > > > Since 'mesos-ec2' has been removed from the repo in MESOS-2640. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5336) Add authorization to GET /quota
[ https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281073#comment-15281073 ] Zhitao Li commented on MESOS-5336: -- [~adam-mesos], I'll leave this decision to you and [~alexr] to reconcile. I'm fine with either way; just let me know whether you decide to keep GET_ENDPOINT_WITH_PATH or not. The patch below is built on top of GET_ENDPOINT_WITH_PATH but I can rebase to drop that one if we decide to go otherwise. > Add authorization to GET /quota > --- > > Key: MESOS-5336 > URL: https://issues.apache.org/jira/browse/MESOS-5336 > Project: Mesos > Issue Type: Improvement > Components: master, security >Reporter: Adam B > Labels: mesosphere, security > Fix For: 0.29.0 > > > We already authorize which http users can set/remove quota for particular > roles, but even knowing of the existence of these roles (let alone their > quotas) may be sensitive information. We should add authz around GET > operations on /quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5336) Add authorization to GET /quota
[ https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281071#comment-15281071 ] Zhitao Li commented on MESOS-5336: -- https://reviews.apache.org/r/47274/ > Add authorization to GET /quota > --- > > Key: MESOS-5336 > URL: https://issues.apache.org/jira/browse/MESOS-5336 > Project: Mesos > Issue Type: Improvement > Components: master, security >Reporter: Adam B > Labels: mesosphere, security > Fix For: 0.29.0 > > > We already authorize which http users can set/remove quota for particular > roles, but even knowing of the existence of these roles (let alone their > quotas) may be sensitive information. We should add authz around GET > operations on /quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5308) ROOT_XFS_QuotaTest.NoCheckpointRecovery failed.
[ https://issues.apache.org/jira/browse/MESOS-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281024#comment-15281024 ] Yan Xu commented on MESOS-5308: --- [~jpe...@apache.org] I committed the patch you had {noformat:title=} commit 95e670cd41a33e68afab701f9c28dd968a6f8011 Author: James Peach Date: Wed May 11 16:43:05 2016 -0700 Fix race conditions in ROOT_XFS_QuotaTest.NoCheckpointRecovery. There are two race conditions in ROOT_XFS_QuotaTest.NoCheckpointRecovery. The first is when we were checking the disk resources consumed without knowing whether the dd command had completed. We can just eliminate this check since other tests cover the resource usage case. The second race was installing the MesosContainerizerProcess::___recover expectation after starting the slave. We need to install this before starting. Review: https://reviews.apache.org/r/47001/ {noformat} Weird check failures though. Any theories? Let me see if I can repro. > ROOT_XFS_QuotaTest.NoCheckpointRecovery failed. 
> --- > > Key: MESOS-5308 > URL: https://issues.apache.org/jira/browse/MESOS-5308 > Project: Mesos > Issue Type: Bug > Components: isolation > Environment: Fedora 23 with/without SSL >Reporter: Gilbert Song >Assignee: James Peach > Labels: isolation > > Here is the log: > {code} > [01:07:51] : [Step 10/10] [ RUN ] > ROOT_XFS_QuotaTest.NoCheckpointRecovery > [01:07:51] : [Step 10/10] meta-data=/dev/loop0 isize=512 > agcount=2, agsize=5120 blks > [01:07:51] : [Step 10/10] = sectsz=512 > attr=2, projid32bit=1 > [01:07:51] : [Step 10/10] = crc=1 > finobt=1, sparse=0 > [01:07:51] : [Step 10/10] data = bsize=4096 > blocks=10240, imaxpct=25 > [01:07:51] : [Step 10/10] = sunit=0 > swidth=0 blks > [01:07:51] : [Step 10/10] naming =version 2 bsize=4096 > ascii-ci=0 ftype=1 > [01:07:51] : [Step 10/10] log =internal log bsize=4096 > blocks=855, version=2 > [01:07:51] : [Step 10/10] = sectsz=512 > sunit=0 blks, lazy-count=1 > [01:07:51] : [Step 10/10] realtime =none extsz=4096 > blocks=0, rtextents=0 > [01:07:51]W: [Step 10/10] I0429 01:07:51.690585 17604 cluster.cpp:149] > Creating default 'local' authorizer > [01:07:51]W: [Step 10/10] I0429 01:07:51.706126 17604 leveldb.cpp:174] > Opened db in 15.452988ms > [01:07:51]W: [Step 10/10] I0429 01:07:51.707135 17604 leveldb.cpp:181] > Compacted db in 984939ns > [01:07:51]W: [Step 10/10] I0429 01:07:51.707154 17604 leveldb.cpp:196] > Created db iterator in 4159ns > [01:07:51]W: [Step 10/10] I0429 01:07:51.707159 17604 leveldb.cpp:202] > Seeked to beginning of db in 517ns > [01:07:51]W: [Step 10/10] I0429 01:07:51.707165 17604 leveldb.cpp:271] > Iterated through 0 keys in the db in 305ns > [01:07:51]W: [Step 10/10] I0429 01:07:51.707176 17604 replica.cpp:779] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [01:07:51]W: [Step 10/10] I0429 01:07:51.707320 17621 recover.cpp:447] > Starting replica recovery > [01:07:51]W: [Step 10/10] I0429 01:07:51.707381 17621 recover.cpp:473] > Replica is in EMPTY status > 
[01:07:51]W: [Step 10/10] I0429 01:07:51.707638 17619 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > (17889)@172.30.2.13:37618 > [01:07:51]W: [Step 10/10] I0429 01:07:51.707732 17624 recover.cpp:193] > Received a recover response from a replica in EMPTY status > [01:07:51]W: [Step 10/10] I0429 01:07:51.707885 17624 recover.cpp:564] > Updating replica status to STARTING > [01:07:51]W: [Step 10/10] I0429 01:07:51.708389 17618 master.cpp:382] > Master 0c1e0a50-1212-4104-a148-661131b79f27 > (ip-172-30-2-13.ec2.internal.mesosphere.io) started on 172.30.2.13:37618 > [01:07:51]W: [Step 10/10] I0429 01:07:51.708406 17618 master.cpp:384] Flags > at startup: --acls="" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate="true" > --authenticate_http="true" --authenticate_http_frameworks="true" > --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" > --credentials="/mnt/teamcity/temp/buildTmp/ROOT_XFS_QuotaTest_NoCheckpointRecovery_ZsRNg9/mnt/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000"
[jira] [Commented] (MESOS-5336) Add authorization to GET /quota
[ https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281009#comment-15281009 ] Adam B commented on MESOS-5336: --- I think GET_QUOTA_WITH_ROLE should be sufficient, and GET_ENDPOINT_WITH_PATH can be emulated with GET_QUOTA_WITH_ROLE, ANY > Add authorization to GET /quota > --- > > Key: MESOS-5336 > URL: https://issues.apache.org/jira/browse/MESOS-5336 > Project: Mesos > Issue Type: Improvement > Components: master, security >Reporter: Adam B > Labels: mesosphere, security > Fix For: 0.29.0 > > > We already authorize which http users can set/remove quota for particular > roles, but even knowing of the existence of these roles (let alone their > quotas) may be sensitive information. We should add authz around GET > operations on /quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5277) Need to add REMOVE semantics to the copy backend
[ https://issues.apache.org/jira/browse/MESOS-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-5277: - Description: Some Dockerfiles run the `rm` command to remove files from the base image using the "RUN" directive in the Dockerfile. An example can be found here: https://github.com/ngineered/nginx-php-fpm.git In the final rootfs the removed files should not be present. Presence of these files in the final image can make the container misbehave. For example, the nginx-php-fpm docker image that is referenced tries to remove the default nginx config and replaces it with its own config to point to a different HTML root. If the default nginx config is still present after building the image, nginx will start pointing to a different HTML root than the one set in the Dockerfile. Currently the copy backend cannot handle removal of files from intermediate layers. This can cause issues with docker images built using a Dockerfile similar to the one listed here. Hence, we need to add REMOVE semantics to the copy backend. was: Some Dockerfile run the `rm` command to remove files from the base image using the "RUN" directive in the Dockerfile. An example can be found here: https://github.com/ngineered/nginx-php-fpm.git In the final rootfs the removed files should not be present. Presence of these files in the final image can make the container misbehave. For example, the nginx-php-fpm docker image that is reference tries to remove the default nginx config and replace it with it own config to point a different HTML root. If the default nginx config is still present after the building the image, nginx will start pointing to a different HTML root than the one set in the Dockerfile. Currently the copy backend cannot handle removal of files from intermediate layers. This can cause issues with docker images built using a Dockerfile similar to the one listed here. Hence, we need to add REMOVE semantics to the copy backend. 
> Need to add REMOVE semantics to the copy backend > > > Key: MESOS-5277 > URL: https://issues.apache.org/jira/browse/MESOS-5277 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: linux >Reporter: Avinash Sridharan >Assignee: Gilbert Song > Labels: mesosphere > Fix For: 0.29.0 > > > Some Dockerfiles run the `rm` command to remove files from the base image > using the "RUN" directive in the Dockerfile. An example can be found here: > https://github.com/ngineered/nginx-php-fpm.git > In the final rootfs the removed files should not be present. Presence of > these files in the final image can make the container misbehave. For example, > the nginx-php-fpm docker image that is referenced tries to remove the default > nginx config and replaces it with its own config to point to a different HTML > root. If the default nginx config is still present after building the > image, nginx will start pointing to a different HTML root than the one set in > the Dockerfile. > Currently the copy backend cannot handle removal of files from intermediate > layers. This can cause issues with docker images built using a Dockerfile > similar to the one listed here. Hence, we need to add REMOVE semantics to the > copy backend. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
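For context, Docker's AUFS-style layers represent deletions with `.wh.`-prefixed whiteout files, so REMOVE semantics in the copy backend amount to honoring those markers while flattening layers. A minimal sketch of the idea (simplified path handling, not the actual backend code):

```python
WHITEOUT_PREFIX = ".wh."  # AUFS whiteout marker used in Docker layers

def flatten_layers(layers):
    """Flatten an ordered list of layers (base first) into the final set
    of rootfs paths, honoring whiteout markers: a file named
    '.wh.<name>' in an upper layer removes '<name>' from lower layers."""
    rootfs = set()
    for layer in layers:
        for path in layer:
            directory, _, name = path.rpartition("/")
            if name.startswith(WHITEOUT_PREFIX):
                removed = name[len(WHITEOUT_PREFIX):]
                rootfs.discard(directory + "/" + removed if directory else removed)
            else:
                rootfs.add(path)
    return rootfs

# Base layer ships a default nginx config; the upper layer deletes it and
# adds its own, as the referenced Dockerfile does with `RUN rm`.
layers = [
    ["etc/nginx/nginx.conf", "usr/sbin/nginx"],
    ["etc/nginx/.wh.nginx.conf", "etc/nginx/custom.conf"],
]
final = flatten_layers(layers)
```

Without the whiteout handling, the first layer's `nginx.conf` would survive into the final rootfs, which is exactly the misbehavior described above.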
[jira] [Assigned] (MESOS-5365) Introduce a timeout for docker volume driver mount/unmount operation.
[ https://issues.apache.org/jira/browse/MESOS-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-5365: - Assignee: Jie Yu > Introduce a timeout for docker volume driver mount/unmount operation. > - > > Key: MESOS-5365 > URL: https://issues.apache.org/jira/browse/MESOS-5365 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > > 'dvdcli' might hang indefinitely. We should introduce timeout for both > mount/unmount operation so that launch/cleanup are not blocked forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5365) Introduce a timeout for docker volume driver mount/unmount operation.
Jie Yu created MESOS-5365: - Summary: Introduce a timeout for docker volume driver mount/unmount operation. Key: MESOS-5365 URL: https://issues.apache.org/jira/browse/MESOS-5365 Project: Mesos Issue Type: Task Reporter: Jie Yu 'dvdcli' might hang indefinitely. We should introduce a timeout for both mount and unmount operations so that launch/cleanup are not blocked forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
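The idea of bounding a potentially hung helper CLI can be sketched with a plain subprocess timeout (Python here for brevity; the actual Mesos change would use libprocess futures around the `dvdcli` invocation):

```python
import subprocess
import sys

def run_with_timeout(argv, timeout_secs):
    """Run an external helper (think `dvdcli mount ...`) but give up after
    `timeout_secs`, so a hung helper cannot block launch/cleanup forever."""
    result = subprocess.run(argv, capture_output=True, timeout=timeout_secs)
    return result.returncode, result.stdout

# A well-behaved command completes normally...
code, out = run_with_timeout([sys.executable, "-c", "print('mounted')"], 10)

# ...while a hung one raises TimeoutExpired, which the caller can turn
# into a failed mount/unmount instead of waiting forever.
try:
    run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 1)
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True
```

On expiry `subprocess.run` kills the child before raising, which matches the cleanup goal: the stuck helper does not linger after the operation is abandoned.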
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280897#comment-15280897 ] Anand Mazumdar commented on MESOS-5332: --- [~StephanErb] That took some catching! Since we have identified the root cause and filed corresponding tickets for the action items on our end, i.e. MESOS-5361/MESOS-5364 (also linked to this JIRA), I am resolving this issue. Feel free to re-open if you have any further queries/concerns. > TASK_LOST on slave restart potentially due to executor race condition > - > > Key: MESOS-5332 > URL: https://issues.apache.org/jira/browse/MESOS-5332 > Project: Mesos > Issue Type: Bug > Components: libprocess, slave >Affects Versions: 0.26.0 > Environment: Mesos 0.26 > Aurora 0.13 >Reporter: Stephan Erb > Attachments: executor-logs.tar.gz, executor-stderr.log, > executor-stderrV2.log, mesos-slave.log > > > When restarting the Mesos agent binary, tasks can end up as LOST. We lose > from 20% to 50% of all tasks. They are killed by the Mesos agent via: > {code} > I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered > executors > I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-28854-0-6a88d62e-656 > 4-4e33-b0bb-1d8039d97afc' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541 > I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699 > 4-4cba-a9df-3dfc1552667f' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757 > I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8 > -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at > executor(1)@10.X.X.X:51463 > ... 
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery > {code} > We have verified that the tasks and their executors are killed by the agent > during startup. When stopping the agent using supervisorctl stop, the > executors are still running (verified via {{ps aux}}). They are only killed > once the agent tries to reregister. > The issue is hard to reproduce: > * When restarting the agent binary multiple times, tasks are only lost for > the first restart. > * It is much more likely to occur if the agent binary has been running for a > longer period of time (> 7 days) > Mesos is correctly sticking to the 2 seconds wait time before killing > un-reregistered executors. The failed executors receive the reregistration > request, but it seems like they fail to send a reply. > A successful reregistration (not leading to LOST): > {code} > I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has > checkpointing enabled. Waiting 15mins to reconnect with slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took > 1.492339ms > {code} > A failed one: > {code} > I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with > fd 11: Transport endpoint is not connected > I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has > checkpointing enabled. Waiting 15mins to reconnect with slave > 20160118-141153-92471562-5050-6270-S17 > {code} > All task ending up in LOST have an output similar to the one posted above, > i.e. log messages are in a wrong order. > Anyone an idea what might be going on here? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5005) Enforce that DiskInfo principal is equal to framework/operator principal
[ https://issues.apache.org/jira/browse/MESOS-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-5005: - Description: Currently, we require that {{ReservationInfo.principal}} be equal to the principal provided for authentication, which means that when HTTP authentication is disabled this field cannot be set. Based on comments in 'mesos.proto', the original intention was to enforce this same constraint for {{Persistence.principal}}, but it seems that we don't enforce it. This should be changed to make the two fields equivalent, with one exception: when the framework/operator principal is {{None}}, we should allow the principal in {{DiskInfo}} to take any value, along the same lines as MESOS-5212. (was: Currently, we require that {{ReservationInfo.principal}} be equal to the principal provided for authentication, which means that when HTTP authentication is disabled this field cannot be set. Based on comments in 'mesos.proto', the original intention was to enforce this same constraint for {{Persistence.principal}}, but it seems that we don't enforce it. This should be changed to make the two fields equivalent.) > Enforce that DiskInfo principal is equal to framework/operator principal > > > Key: MESOS-5005 > URL: https://issues.apache.org/jira/browse/MESOS-5005 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes, reservations > Fix For: 0.29.0 > > > Currently, we require that {{ReservationInfo.principal}} be equal to the > principal provided for authentication, which means that when HTTP > authentication is disabled this field cannot be set. Based on comments in > 'mesos.proto', the original intention was to enforce this same constraint for > {{Persistence.principal}}, but it seems that we don't enforce it. 
This should > be changed to make the two fields equivalent, with one exception: when the > framework/operator principal is {{None}}, we should allow the principal in > {{DiskInfo}} to take any value, along the same lines as MESOS-5212. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
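The proposed rule is simple to state as code. A sketch with hypothetical names (not Mesos' actual validation path), covering both the equality check and the unauthenticated exception:

```python
def validate_disk_principal(caller_principal, disk_principal):
    """Sketch of the proposed rule: DiskInfo's principal must equal the
    framework/operator principal, except that an unauthenticated caller
    (principal None) may set any value, along the lines of MESOS-5212."""
    if caller_principal is None:
        return True  # unauthenticated: accept any DiskInfo principal
    return disk_principal == caller_principal

# Authenticated callers must match; unauthenticated callers are unrestricted.
ok_unauthenticated = validate_disk_principal(None, "anything")
ok_matching = validate_disk_principal("framework-1", "framework-1")
rejected = validate_disk_principal("framework-1", "someone-else")
```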
[jira] [Commented] (MESOS-5360) Set death signal for dvdcli subprocess in docker volume isolator.
[ https://issues.apache.org/jira/browse/MESOS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280590#comment-15280590 ] Jie Yu commented on MESOS-5360: --- This is for the case where slave crashes. In that case, waiting for the subprocess to finish does not make sense because no one will read its output. > Set death signal for dvdcli subprocess in docker volume isolator. > - > > Key: MESOS-5360 > URL: https://issues.apache.org/jira/browse/MESOS-5360 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu >Assignee: Jie Yu > > If the slave crashes, we should kill the dvdcli subprocess. Otherwise, if the > dvdcli subprocess gets stuck, it'll not be cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
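On Linux, a "death signal" is set with `prctl(PR_SET_PDEATHSIG, ...)` in the child between fork and exec, so the kernel kills the child when its parent dies. A Linux-only sketch via ctypes (the actual Mesos change would do this in C++ in the subprocess setup hooks):

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # option constant from <linux/prctl.h>

def set_death_signal(sig=signal.SIGKILL):
    """Ask the Linux kernel to deliver `sig` to the calling process when
    its parent dies. Meant to run in the child between fork and exec,
    e.g. subprocess.Popen([...], preexec_fn=set_death_signal), so a
    helper like dvdcli cannot outlive a crashed agent."""
    if not sys.platform.startswith("linux"):
        return  # prctl(2) is Linux-only; no-op elsewhere in this sketch
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
```

This complements the timeout in MESOS-5365: the timeout bounds a hung helper while the agent is alive, and the death signal cleans it up when the agent itself crashes.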
[jira] [Assigned] (MESOS-5360) Set death signal for dvdcli subprocess in docker volume isolator.
[ https://issues.apache.org/jira/browse/MESOS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-5360: - Assignee: Jie Yu > Set death signal for dvdcli subprocess in docker volume isolator. > - > > Key: MESOS-5360 > URL: https://issues.apache.org/jira/browse/MESOS-5360 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu >Assignee: Jie Yu > > If the slave crashes, we should kill the dvdcli subprocess. Otherwise, if the > dvdcli subprocess gets stuck, it'll not be cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5287) boto is no longer a Mesos dependency.
[ https://issues.apache.org/jira/browse/MESOS-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-5287: -- Shepherd: Yan Xu > boto is no longer a Mesos dependency. > - > > Key: MESOS-5287 > URL: https://issues.apache.org/jira/browse/MESOS-5287 > Project: Mesos > Issue Type: Bug >Reporter: Yan Xu >Assignee: Chen Zhiwei > Labels: easyfix, newbie > > Since 'mesos-ec2' has been removed from the repo in MESOS-2640. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5336) Add authorization to GET /quota
[ https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280485#comment-15280485 ] Zhitao Li commented on MESOS-5336: -- I think I can come up with an implementation for {{GET_QUOTA_WITH_ROLE}} using {{stout::collect}} on a list of futures. Question: if we have {{GET_QUOTA_WITH_ROLE}}, do you think we still want to guard the {{/quota}} endpoint with {{GET_ENDPOINT_WITH_PATH}}? The closest alternative would be an ACL of {{ANY}} or {{NONE}} role, but it probably would return an empty map rather than {{Forbidden}}. I have no strong opinion here. I'll try a diff on top of my previous review while waiting for your answer. > Add authorization to GET /quota > --- > > Key: MESOS-5336 > URL: https://issues.apache.org/jira/browse/MESOS-5336 > Project: Mesos > Issue Type: Improvement > Components: master, security >Reporter: Adam B > Labels: mesosphere, security > Fix For: 0.29.0 > > > We already authorize which http users can set/remove quota for particular > roles, but even knowing of the existence of these roles (let alone their > quotas) may be sensitive information. We should add authz around GET > operations on /quota. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
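The fine-grained approach discussed here amounts to one authorizer query per role, collected and then used to filter the response. A sketch using Python futures as a stand-in for collecting a list of libprocess futures (the authorizer callback and all names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def filter_quotas(quotas, may_view_role):
    """Query the authorizer once per role (each call yields a future,
    analogous to collecting libprocess futures), then keep only the
    roles the requesting principal is allowed to view."""
    with ThreadPoolExecutor() as pool:
        verdicts = {role: pool.submit(may_view_role, role) for role in quotas}
    # Pool shutdown waits for all verdicts, like collect() on the futures.
    return {role: quota for role, quota in quotas.items()
            if verdicts[role].result()}

# Hypothetical authorizer: this principal may only view the 'dev' role.
quotas = {"dev": "cpus:4", "prod": "cpus:32"}
visible = filter_quotas(quotas, lambda role: role == "dev")
```

Note the design trade-off raised above: a principal authorized for no roles gets an empty map back rather than a 403, which is what makes the coarse-grained endpoint ACL attractive as a complement.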
[jira] [Created] (MESOS-5364) Consider adding `unlink` functionality to libprocess
Anand Mazumdar created MESOS-5364: - Summary: Consider adding `unlink` functionality to libprocess Key: MESOS-5364 URL: https://issues.apache.org/jira/browse/MESOS-5364 Project: Mesos Issue Type: Improvement Reporter: Anand Mazumdar Currently we don't have {{unlink}} functionality in libprocess, i.e. the equivalent of Erlang's http://erlang.org/doc/man/erlang.html#unlink-1. We have a lot of places in our current code with {{TODO}}s for implementing it. It can benefit us in a couple of ways: - Based on the business logic of the actor, it would want to authoritatively communicate that it is no longer interested in {{ExitedEvent}} for the external remote link. - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to the remote instance being unavailable (e.g., partition, network intermediaries not sending RST's etc). I did not find any old JIRAs pertaining to this, but I did come across an initial attempt to add this, albeit for injecting {{exited}} events, as part of the initial review for MESOS-1059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
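Erlang-style unlink means withdrawing interest in a remote pid so that no {{ExitedEvent}} is delivered afterwards. A toy model of the bookkeeping involved (not libprocess code; pid strings are illustrative):

```python
class LinkRegistry:
    """Toy model of link/unlink: a linked observer gets an 'exited'
    callback when the remote pid dies; unlink withdraws that interest,
    so no event is delivered afterwards."""

    def __init__(self):
        self._links = {}  # remote pid -> list of callbacks

    def link(self, pid, on_exited):
        self._links.setdefault(pid, []).append(on_exited)

    def unlink(self, pid, on_exited):
        callbacks = self._links.get(pid, [])
        if on_exited in callbacks:
            callbacks.remove(on_exited)

    def notify_exited(self, pid):
        for callback in self._links.pop(pid, []):
            callback(pid)

registry = LinkRegistry()
events = []

# A link that is later withdrawn: no exited callback fires.
registry.link("executor(1)@10.0.0.1:40541", events.append)
registry.unlink("executor(1)@10.0.0.1:40541", events.append)
registry.notify_exited("executor(1)@10.0.0.1:40541")

# A link left in place still delivers the event.
registry.link("executor(1)@10.0.0.2:35757", events.append)
registry.notify_exited("executor(1)@10.0.0.2:35757")
```

The second motivation in the ticket (delayed or dropped {{ExitedEvent}}s) is what this local bookkeeping cannot solve by itself: unlink only makes the receiver's intent explicit, regardless of what the network delivers.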
[jira] [Updated] (MESOS-3637) Port process/process.hpp to Windows
[ https://issues.apache.org/jira/browse/MESOS-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-3637: Fix Version/s: (was: 0.29.0) > Port process/process.hpp to Windows > --- > > Key: MESOS-3637 > URL: https://issues.apache.org/jira/browse/MESOS-3637 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4788) Mesos UI should show the role and principal of a framework
[ https://issues.apache.org/jira/browse/MESOS-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshna jain reassigned MESOS-4788: -- Assignee: deshna jain > Mesos UI should show the role and principal of a framework > -- > > Key: MESOS-4788 > URL: https://issues.apache.org/jira/browse/MESOS-4788 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Zhitao Li >Assignee: deshna jain >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4823) Implement port forwarding in `network/cni` isolator
[ https://issues.apache.org/jira/browse/MESOS-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4823: - Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31 (was: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 35) > Implement port forwarding in `network/cni` isolator > --- > > Key: MESOS-4823 > URL: https://issues.apache.org/jira/browse/MESOS-4823 > Project: Mesos > Issue Type: Task > Components: containerization > Environment: linux >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Critical > Labels: mesosphere > > Most docker and appc images wish to expose ports that micro-services are > listening on, to the outside world. When containers are running on bridged > (or ptp) networking this can be achieved by installing port forwarding rules > on the agent (using iptables). This can be done in the `network/cni` > isolator. > The reason we would like this functionality to be implemented in the > `network/cni` isolator, and not a CNI plugin, is that the specifications > currently do not support specifying port forwarding rules. Further, to > install these rules the isolator needs two pieces of information, the exposed > ports and the IP address associated with the container. Both are available > to the isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5066) Create an iptables interface in Mesos
[ https://issues.apache.org/jira/browse/MESOS-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-5066: - Sprint: (was: Mesosphere Sprint 35) > Create an iptables interface in Mesos > - > > Key: MESOS-5066 > URL: https://issues.apache.org/jira/browse/MESOS-5066 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan > Labels: mesosphere > > To support port mapping functionality in the network CNI isolator we need to > enable DNAT rules in iptables. We therefore need to create an iptables > interface in Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5336) Add authorization to GET /quota
[ https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279881#comment-15279881 ] Alexander Rukletsov commented on MESOS-5336: Currently we have {{SET_QUOTA_WITH_ROLE}} and {{DESTROY_QUOTA_WITH_PRINCIPAL}} authz actions, which will eventually be subsumed by a single {{UPDATE_QUOTA_WITH_ROLE}} for all {{POST}}, {{PUT}}, and {{DELETE}}. These are fine-grained authz actions. We can additionally implement coarse-grained authz actions: {{GET_ENDPOINT_WITH_PATH}}, {{POST_ENDPOINT_WITH_PATH}} and so on for {{/quota}}. I can see benefits of having both coarse- and fine-grained authz actions, but maybe we don't need to implement them now. For now, let's either do {{GET_ENDPOINT_WITH_PATH}} or {{GET_QUOTA_WITH_ROLE}}. The former does not allow fine-grained filtering, while the latter is harder to implement since we have to filter quotas based on the authorizer's response. I see you have opted for {{GET_ENDPOINT_WITH_PATH}}. Do you think we can implement the latter fast enough? For now, we'll have to query the authorizer for each role, but in the future we should be able to send a BatchRequest. > Add authorization to GET /quota > --- > > Key: MESOS-5336 > URL: https://issues.apache.org/jira/browse/MESOS-5336 > Project: Mesos > Issue Type: Improvement > Components: master, security >Reporter: Adam B > Labels: mesosphere, security > Fix For: 0.29.0 > > > We already authorize which http users can set/remove quota for particular > roles, but even knowing of the existence of these roles (let alone their > quotas) may be sensitive information. We should add authz around GET > operations on /quota.
[jira] [Updated] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated MESOS-5332: --- Description: When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. They are killed by the Mesos agent via: {code} I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-656 4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541 I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699 4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757 I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8 -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:51463 ... I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery {code} We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using supervisorctl stop, the executors are still running (verified via {{ps aux}}). They are only killed once the agent tries to reregister. The issue is hard to reproduce: * When restarting the agent binary multiple times, tasks are only lost for the first restart. * It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days) Mesos is correctly sticking to the 2 seconds wait time before killing un-reregistered executors. The failed executors receive the reregistration request, but it seems like they fail to send a reply. 
A successful reregistration (not leading to LOST): {code} I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17 I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17 I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 20160118-141153-92471562-5050-6270-S17 I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms {code} A failed one: {code} I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from slave 20160118-141153-92471562-5050-6270-S17 E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with slave 20160118-141153-92471562-5050-6270-S17 {code} All tasks ending up in LOST have an output similar to the one posted above, i.e. the log messages are in the wrong order. Does anyone have an idea what might be going on here? was: When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 20% to 50% of all tasks. 
They are killed by the Mesos agent via: {code} I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered executors I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-28854-0-6a88d62e-656 4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541 I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699 4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757 I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8 -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:51463 ... I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery {code} We have verified that the tasks and their executors are killed by the agent during startup. When stopping the agent using supervisorctl stop, the executors are still running (verified via {{ps aux}}). They are only killed once the agent tries to reregister. The issue is hard to reproduce: * When restarting the agent binary multiple times, tasks are only lost for the first restart. * It is much more likely to occur if the agent binary has been running for a longer period of time (> 7 days) * It tends to be more likely if the host has many cores (30-40) and thus many libprocess workers. Mesos is correctly sticking to the 2 seconds wait time before killing un-reregistered executors. The failed executors receive the reregistration request, but it seems like they fail to send a reply. A successful reregistration (not leading
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279791#comment-15279791 ] Stephan Erb commented on MESOS-5332: I was able to assemble a reproducing example (using Aurora master and Mesos 0.27.2): https://gist.github.com/StephanErb/5798b0d87c11473fb0ec147272ea0288 Summary of events: * An iptables firewall is terminating idle TCP connections after the iptables default of 5 days (reduced to 60 seconds in the example above). * Mesos does not detect broken, half-open TCP connections that occur when connections are terminated by iptables. * Mesos tries to use the old, broken TCP connection when answering the agent reconnect request. The message therefore never makes it to the agent. * The agent ends up killing the executor because it does not receive a reply to its reconnect request. I'd conclude that there are several areas that need improvement: * *Firewalling*: We have to fix our in-house iptables firewall scripts so that they do not apply connection tracking to local connections. * *KeepAlive*: Mesos has to enable TCP keepalives in libprocess. As detailed here (http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/) this has two advantages: ** Detection of dead peers in case there has been a hard crash, an unplugged network cable, or a firewall that has silently dropped a connection. ** Prevention of disconnection due to network inactivity. * *Unlink*: By adding an {{unlink}} function, libprocess could handle exit events more gracefully by making sure it always creates a new connection when talking to a new process. 
> TASK_LOST on slave restart potentially due to executor race condition > - > > Key: MESOS-5332 > URL: https://issues.apache.org/jira/browse/MESOS-5332 > Project: Mesos > Issue Type: Bug > Components: libprocess, slave >Affects Versions: 0.26.0 > Environment: Mesos 0.26 > Aurora 0.13 >Reporter: Stephan Erb > Attachments: executor-logs.tar.gz, executor-stderr.log, > executor-stderrV2.log, mesos-slave.log > > > When restarting the Mesos agent binary, tasks can end up as LOST. We lose > from 20% to 50% of all tasks. They are killed by the Mesos agent via: > {code} > I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered > executors > I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-28854-0-6a88d62e-656 > 4-4e33-b0bb-1d8039d97afc' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541 > I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699 > 4-4cba-a9df-3dfc1552667f' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757 > I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8 > -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at > executor(1)@10.X.X.X:51463 > ... > I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery > {code} > We have verified that the tasks and their executors are killed by the agent > during startup. When stopping the agent using supervisorctl stop, the > executors are still running (verified via {{ps aux}}). They are only killed > once the agent tries to reregister. > The issue is hard to reproduce: > * When restarting the agent binary multiple times, tasks are only lost for > the first restart. 
> * It is much more likely to occur if the agent binary has been running for a > longer period of time (> 7 days) > * It tends to be more likely if the host has many cores (30-40) and thus many > libprocess workers. > Mesos is correctly sticking to the 2 seconds wait time before killing > un-reregistered executors. The failed executors receive the reregistration > request, but it seems like they fail to send a reply. > A successful reregistration (not leading to LOST): > {code} > I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has > checkpointing enabled. Waiting 15mins to reconnect with slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took > 1.492339ms > {code} > A failed one: > {code} > I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with > fd 11: Transport endpoint is
[jira] [Commented] (MESOS-2731) Allow frameworks to deploy storage drivers on demand.
[ https://issues.apache.org/jira/browse/MESOS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279787#comment-15279787 ] Ken Sipe commented on MESOS-2731: - This is consistent with use case #2, but we also want to support access to S3, which includes permission-based access. > Allow frameworks to deploy storage drivers on demand. > - > > Key: MESOS-2731 > URL: https://issues.apache.org/jira/browse/MESOS-2731 > Project: Mesos > Issue Type: Epic >Reporter: Joerg Schad > Labels: mesosphere > > Certain storage options require storage drivers to access them, including an HDFS > driver, Quobyte client, database driver, and so on. > When tasks in Mesos require access to such storage they also need access to > the respective driver on the node where they were scheduled. > As it is not desirable to deploy the driver onto all nodes in the cluster, it > would be good to deploy the driver on demand. > Use Cases: > 1. Fetcher Cache pulling resources from user-provided URIs > 2. Framework executors/tasks requiring r/w access to HDFS/DFS > 3. Framework executors/tasks requiring r/w database access (requiring > drivers)