[jira] [Commented] (MESOS-1806) Etcd-based master contender/detector module

2016-05-11 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281232#comment-15281232
 ] 

Jay Guo commented on MESOS-1806:


We created a repo to temporarily host this module. Your comments and reviews 
are highly appreciated.
https://github.com/guoger/mesos-etcd-module

> Etcd-based master contender/detector module
> ---
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Epic
>  Components: leader election
>Reporter: Ed Ropple
>Assignee: Shuai Lin
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5366) Update documentation to include contender/detector module

2016-05-11 Thread Jay Guo (JIRA)
Jay Guo created MESOS-5366:
--

 Summary: Update documentation to include contender/detector module
 Key: MESOS-5366
 URL: https://issues.apache.org/jira/browse/MESOS-5366
 Project: Mesos
  Issue Type: Documentation
Reporter: Jay Guo
Assignee: Jay Guo
Priority: Minor


Since the contender and detector are now modularized, the documentation should 
be updated to reflect this change as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5106) Improve test_http_framework so it can load master detector from modules

2016-05-11 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo reassigned MESOS-5106:
--

Assignee: zhou xing

> Improve test_http_framework so it can load master detector from modules
> ---
>
> Key: MESOS-5106
> URL: https://issues.apache.org/jira/browse/MESOS-5106
> Project: Mesos
>  Issue Type: Task
>Reporter: Shuai Lin
>Assignee: zhou xing
>
> I'm planning to restart the work of [MESOS-1806] (etcd contender/detector) 
> based on [MESOS-4610]. One thing I need to address first is that, when 
> writing a script test, I need a framework that can use a master detector 
> loaded from a module. The best way to do this seems to be adding 
> {{\-\-modules}} and {{\-\-master_detector}} flags to 
> {{test_http_framework.cpp}} so we can reuse it in tests.
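
For illustration, a minimal standalone sketch of parsing the two flags proposed 
above (flag names taken from the proposal; the real patch would extend the 
framework's existing flags handling rather than hand-rolling argv parsing):

{code}
// Hypothetical sketch only: parse --modules=<file> and --master_detector=<name>
// from argv, mirroring the flags proposed above.
#include <iostream>
#include <map>
#include <string>

int main(int argc, char** argv)
{
  std::map<std::string, std::string> flags;

  for (int i = 1; i < argc; i++) {
    const std::string arg = argv[i];
    const size_t eq = arg.find('=');
    if (arg.rfind("--", 0) == 0 && eq != std::string::npos) {
      flags[arg.substr(2, eq - 2)] = arg.substr(eq + 1);
    }
  }

  // A module manifest (JSON) describing the detector module to load.
  if (flags.count("modules")) {
    std::cout << "modules manifest: " << flags["modules"] << std::endl;
  }

  // The name of the MasterDetector implementation to instantiate.
  if (flags.count("master_detector")) {
    std::cout << "master detector: " << flags["master_detector"] << std::endl;
  }

  return 0;
}
{code}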



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4434) Install 3rdparty package boost, glog, protobuf and picojson when installing Mesos

2016-05-11 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281227#comment-15281227
 ] 

Jay Guo commented on MESOS-4434:


This definitely eases the compilation of modules. It would be good to have the 
`--enable-install-module-dependencies` flag reflected in the documentation as well.

> Install 3rdparty package boost, glog, protobuf and picojson when installing 
> Mesos
> -
>
> Key: MESOS-4434
> URL: https://issues.apache.org/jira/browse/MESOS-4434
> Project: Mesos
>  Issue Type: Bug
>  Components: build, modules
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Mesos modules depend on having these packages installed with the exact 
> version as Mesos was compiled with.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3637) Port process/process.hpp to Windows

2016-05-11 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281166#comment-15281166
 ] 

Michael Park commented on MESOS-3637:
-

{noformat}
commit 0676792bfeb8dc0abd90d71e84441feeb9e207fa
Author: Alex Clemmer 
Date:   Wed May 11 21:35:09 2016 -0600

Libprocess: Implemented `subprocess_windows.cpp`.

Review: https://reviews.apache.org/r/46608/
{noformat}
{noformat}
commit aa281adf8a6eedbfa83f35e8561b2bb3e001b155
Author: Alex Clemmer 
Date:   Wed May 11 21:35:01 2016 -0600

Windows: Forked `subprocess.cpp`.

Review: https://reviews.apache.org/r/46423/
{noformat}

> Port process/process.hpp to Windows
> ---
>
> Key: MESOS-3637
> URL: https://issues.apache.org/jira/browse/MESOS-3637
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, windows
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2516) Move allocation-related types to mesos::master namespace

2016-05-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281116#comment-15281116
 ] 

José Guilherme Vanz commented on MESOS-2516:


https://reviews.apache.org/r/47281/

> Move allocation-related types to mesos::master namespace
> 
>
> Key: MESOS-2516
> URL: https://issues.apache.org/jira/browse/MESOS-2516
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Alexander Rukletsov
>Assignee: José Guilherme Vanz
>Priority: Minor
>  Labels: easyfix, newbie
>
> {{Allocator}}, {{Sorter}} and {{Comparator}} types live in the 
> {{master::allocator}} namespace. This is not consistent with the rest of the 
> codebase: {{Isolator}}, {{Fetcher}}, and {{Containerizer}} all live in the 
> {{slave}} namespace. The {{allocator}} namespace should be killed for 
> consistency.
> Since sorters are poorly named, they should be renamed (or namespaced) prior 
> to this change in order not to pollute the {{master}} namespace. 
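
A small illustrative sketch of the proposed namespace change (the renamed 
sorter below is a hypothetical name, not a decided one):

{code}
// Today: allocation types sit in an extra nested namespace.
namespace mesos {
namespace master {
namespace allocator {

class Allocator {};
class DRFSorter {};

} // namespace allocator
} // namespace master
} // namespace mesos

// Proposed: drop the extra namespace so the types live directly in
// mesos::master, and give sorters more descriptive names (hypothetical
// name below) so they do not pollute the flatter namespace.
namespace mesos {
namespace master {

class Allocator {};
class DRFResourceSorter {};

} // namespace master
} // namespace mesos
{code}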



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5287) boto is no longer a Mesos dependency.

2016-05-11 Thread Chen Zhiwei (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281099#comment-15281099
 ] 

Chen Zhiwei commented on MESOS-5287:


Thanks, I also planned to update this patch to include the getting-started.md 
updates.

> boto is no longer a Mesos dependency.
> -
>
> Key: MESOS-5287
> URL: https://issues.apache.org/jira/browse/MESOS-5287
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Assignee: Chen Zhiwei
>  Labels: easyfix, newbie
> Fix For: 0.29.0
>
>
> Since 'mesos-ec2' has been removed from the repo in MESOS-2640.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5336) Add authorization to GET /quota

2016-05-11 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281073#comment-15281073
 ] 

Zhitao Li commented on MESOS-5336:
--

[~adam-mesos], I'll leave this decision for you and [~alexr] to reconcile. I'm 
fine with either way; just let me know whether you decide to keep 
GET_ENDPOINT_WITH_PATH or not.

The patch below is built on top of GET_ENDPOINT_WITH_PATH, but I can rebase to 
drop that one if we decide to go the other way.

> Add authorization to GET /quota
> ---
>
> Key: MESOS-5336
> URL: https://issues.apache.org/jira/browse/MESOS-5336
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can set/remove quota for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> quotas) may be sensitive information. We should add authz around GET 
> operations on /quota.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5336) Add authorization to GET /quota

2016-05-11 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281071#comment-15281071
 ] 

Zhitao Li commented on MESOS-5336:
--

https://reviews.apache.org/r/47274/

> Add authorization to GET /quota
> ---
>
> Key: MESOS-5336
> URL: https://issues.apache.org/jira/browse/MESOS-5336
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can set/remove quota for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> quotas) may be sensitive information. We should add authz around GET 
> operations on /quota.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5308) ROOT_XFS_QuotaTest.NoCheckpointRecovery failed.

2016-05-11 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281024#comment-15281024
 ] 

Yan Xu commented on MESOS-5308:
---

[~jpe...@apache.org] I committed the patch you had:

{noformat}
commit 95e670cd41a33e68afab701f9c28dd968a6f8011
Author: James Peach 
Date:   Wed May 11 16:43:05 2016 -0700

Fix race conditions in ROOT_XFS_QuotaTest.NoCheckpointRecovery.

There are two race conditions in
ROOT_XFS_QuotaTest.NoCheckpointRecovery. The first is when we were
checking the disk resources consumed without knowing whether the dd
command had completed. We can just eliminate this check since other
tests cover the resource usage case. The second race was installing the
MesosContainerizerProcess::___recover expectation after starting the
slave. We need to install this before starting.

Review: https://reviews.apache.org/r/47001/
{noformat}

Weird check failures though. Any theories? Let me see if I can repro.
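
As a generic illustration of the second race described in the commit message 
(all names below are hypothetical stand-ins, not the actual Mesos test code): a 
mock expectation has to be installed before the component that triggers the 
call is started.

{code}
#include <gmock/gmock.h>
#include <gtest/gtest.h>

// Hypothetical stand-in for the component whose recovery path we observe.
class Recoverable
{
public:
  virtual ~Recoverable() {}
  virtual void recover() = 0;
};

class MockRecoverable : public Recoverable
{
public:
  MOCK_METHOD0(recover, void());
};

// Simulates "starting the slave", which immediately drives recovery.
static void start(Recoverable* component)
{
  component->recover();
}

TEST(ExpectationOrdering, InstallBeforeStart)
{
  MockRecoverable mock;

  // Correct order: install the expectation first...
  EXPECT_CALL(mock, recover()).Times(1);

  // ...then start the component. Reversing these two steps is the kind of
  // race the patch above fixes.
  start(&mock);
}
{code}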

> ROOT_XFS_QuotaTest.NoCheckpointRecovery failed.
> ---
>
> Key: MESOS-5308
> URL: https://issues.apache.org/jira/browse/MESOS-5308
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
> Environment: Fedora 23 with/without SSL
>Reporter: Gilbert Song
>Assignee: James Peach
>  Labels: isolation
>
> Here is the log:
> {code}
> [01:07:51] :   [Step 10/10] [ RUN  ] 
> ROOT_XFS_QuotaTest.NoCheckpointRecovery
> [01:07:51] :   [Step 10/10] meta-data=/dev/loop0 isize=512
> agcount=2, agsize=5120 blks
> [01:07:51] :   [Step 10/10]  =   sectsz=512   
> attr=2, projid32bit=1
> [01:07:51] :   [Step 10/10]  =   crc=1
> finobt=1, sparse=0
> [01:07:51] :   [Step 10/10] data =   bsize=4096   
> blocks=10240, imaxpct=25
> [01:07:51] :   [Step 10/10]  =   sunit=0  
> swidth=0 blks
> [01:07:51] :   [Step 10/10] naming   =version 2  bsize=4096   
> ascii-ci=0 ftype=1
> [01:07:51] :   [Step 10/10] log  =internal log   bsize=4096   
> blocks=855, version=2
> [01:07:51] :   [Step 10/10]  =   sectsz=512   
> sunit=0 blks, lazy-count=1
> [01:07:51] :   [Step 10/10] realtime =none   extsz=4096   
> blocks=0, rtextents=0
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.690585 17604 cluster.cpp:149] 
> Creating default 'local' authorizer
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.706126 17604 leveldb.cpp:174] 
> Opened db in 15.452988ms
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707135 17604 leveldb.cpp:181] 
> Compacted db in 984939ns
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707154 17604 leveldb.cpp:196] 
> Created db iterator in 4159ns
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707159 17604 leveldb.cpp:202] 
> Seeked to beginning of db in 517ns
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707165 17604 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 305ns
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707176 17604 replica.cpp:779] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707320 17621 recover.cpp:447] 
> Starting replica recovery
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707381 17621 recover.cpp:473] 
> Replica is in EMPTY status
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707638 17619 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> (17889)@172.30.2.13:37618
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707732 17624 recover.cpp:193] 
> Received a recover response from a replica in EMPTY status
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.707885 17624 recover.cpp:564] 
> Updating replica status to STARTING
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.708389 17618 master.cpp:382] 
> Master 0c1e0a50-1212-4104-a148-661131b79f27 
> (ip-172-30-2-13.ec2.internal.mesosphere.io) started on 172.30.2.13:37618
> [01:07:51]W:   [Step 10/10] I0429 01:07:51.708406 17618 master.cpp:384] Flags 
> at startup: --acls="" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate="true" 
> --authenticate_http="true" --authenticate_http_frameworks="true" 
> --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
> --credentials="/mnt/teamcity/temp/buildTmp/ROOT_XFS_QuotaTest_NoCheckpointRecovery_ZsRNg9/mnt/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 

[jira] [Commented] (MESOS-5336) Add authorization to GET /quota

2016-05-11 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15281009#comment-15281009
 ] 

Adam B commented on MESOS-5336:
---

I think GET_QUOTA_WITH_ROLE should be sufficient, and
GET_ENDPOINT_WITH_PATH can be emulated with GET_QUOTA_WITH_ROLE, ANY.




> Add authorization to GET /quota
> ---
>
> Key: MESOS-5336
> URL: https://issues.apache.org/jira/browse/MESOS-5336
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can set/remove quota for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> quotas) may be sensitive information. We should add authz around GET 
> operations on /quota.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5277) Need to add REMOVE semantics to the copy backend

2016-05-11 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5277:
-
Description: 
Some Dockerfiles run the `rm` command to remove files from the base image using 
the "RUN" directive in the Dockerfile. An example can be found here:
https://github.com/ngineered/nginx-php-fpm.git

In the final rootfs the removed files should not be present. Presence of these 
files in the final image can make the container misbehave. For example, the 
nginx-php-fpm docker image that is referenced tries to remove the default nginx 
config and replaces it with its own config to point to a different HTML root. 
If the default nginx config is still present after building the image, 
nginx will start pointing to a different HTML root than the one set in the 
Dockerfile.


Currently the copy backend cannot handle removal of files from intermediate 
layers. This can cause issues with docker images built using a Dockerfile 
similar to the one listed here. Hence, we need to add REMOVE semantics to the 
copy backend.  

  was:
Some Dockerfile run the `rm` command to remove files from the base image using 
the "RUN" directive in the Dockerfile. An example can be found here:
https://github.com/ngineered/nginx-php-fpm.git

In the final rootfs the removed files should not be present. Presence of these 
files in the final image can make the container misbehave. For example, the 
nginx-php-fpm docker image that is reference tries to remove the default nginx 
config and replace it with it own config to point a different HTML root. If the 
default nginx config is still present after the building the image, nginx will 
start pointing to a different HTML root than the one set in the Dockerfile.


Currently the copy backend cannot handle removal of files from intermediate 
layers. This can cause issues with docker images built using a Dockerfile 
similar to the one listed here. Hence, we need to add REMOVE semantics to the 
copy backend.  
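
For illustration, the REMOVE semantics described above are commonly encoded in 
Docker image layers as AUFS-style whiteout files (a {{.wh.}} prefix marking a 
deleted path). A minimal standalone C++17 sketch of honoring such markers while 
applying a layer, under that assumption and not based on the actual copy 
backend code:

{code}
// Hypothetical sketch: apply one layer onto a rootfs, honoring AUFS-style
// ".wh." whiteout markers so files deleted in the layer are also removed
// from the rootfs instead of being left behind.
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

void applyLayer(const fs::path& layer, const fs::path& rootfs)
{
  for (const auto& entry : fs::recursive_directory_iterator(layer)) {
    const fs::path relative = fs::relative(entry.path(), layer);
    const std::string name = entry.path().filename().string();

    if (name.rfind(".wh.", 0) == 0) {
      // Whiteout marker: remove the corresponding path from the rootfs.
      const fs::path target =
        rootfs / relative.parent_path() / name.substr(4);
      fs::remove_all(target);
    } else if (entry.is_directory()) {
      fs::create_directories(rootfs / relative);
    } else {
      fs::create_directories(rootfs / relative.parent_path());
      fs::copy(entry.path(), rootfs / relative,
               fs::copy_options::overwrite_existing);
    }
  }
}
{code}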


> Need to add REMOVE semantics to the copy backend
> 
>
> Key: MESOS-5277
> URL: https://issues.apache.org/jira/browse/MESOS-5277
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Gilbert Song
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Some Dockerfiles run the `rm` command to remove files from the base image 
> using the "RUN" directive in the Dockerfile. An example can be found here:
> https://github.com/ngineered/nginx-php-fpm.git
> In the final rootfs the removed files should not be present. Presence of 
> these files in the final image can make the container misbehave. For example, 
> the nginx-php-fpm docker image that is referenced tries to remove the default 
> nginx config and replaces it with its own config to point to a different HTML 
> root. If the default nginx config is still present after building the 
> image, nginx will start pointing to a different HTML root than the one set in 
> the Dockerfile.
> Currently the copy backend cannot handle removal of files from intermediate 
> layers. This can cause issues with docker images built using a Dockerfile 
> similar to the one listed here. Hence, we need to add REMOVE semantics to the 
> copy backend.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5365) Introduce a timeout for docker volume driver mount/unmount operation.

2016-05-11 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-5365:
-

Assignee: Jie Yu

> Introduce a timeout for docker volume driver mount/unmount operation.
> -
>
> Key: MESOS-5365
> URL: https://issues.apache.org/jira/browse/MESOS-5365
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> 'dvdcli' might hang indefinitely. We should introduce a timeout for both the 
> mount and unmount operations so that launch/cleanup are not blocked forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5365) Introduce a timeout for docker volume driver mount/unmount operation.

2016-05-11 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5365:
-

 Summary: Introduce a timeout for docker volume driver 
mount/unmount operation.
 Key: MESOS-5365
 URL: https://issues.apache.org/jira/browse/MESOS-5365
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


'dvdcli' might hang indefinitely. We should introduce a timeout for both the 
mount and unmount operations so that launch/cleanup are not blocked forever.
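
A minimal standalone sketch of the timeout idea (standard C++ futures are used 
here as a stand-in for libprocess; the real isolator would wrap the {{dvdcli}} 
subprocess and likely use something like {{Future::after}} instead):

{code}
// Hypothetical sketch: run a (possibly hanging) mount operation against a
// deadline so launch/cleanup is never blocked forever.
#include <chrono>
#include <future>
#include <memory>
#include <optional>
#include <string>
#include <thread>

std::string mountVolume(const std::string& driver, const std::string& name)
{
  // Placeholder for invoking `dvdcli mount ...`, which might hang.
  return "/var/lib/" + driver + "/volumes/" + name;
}

std::optional<std::string> mountWithTimeout(
    const std::string& driver,
    const std::string& name,
    std::chrono::seconds timeout)
{
  auto promise = std::make_shared<std::promise<std::string>>();
  std::future<std::string> result = promise->get_future();

  // Detached worker: if the mount hangs, we abandon it instead of blocking.
  std::thread([promise, driver, name]() {
    promise->set_value(mountVolume(driver, name));
  }).detach();

  if (result.wait_for(timeout) != std::future_status::ready) {
    return std::nullopt;  // Caller treats this as a failed mount.
  }

  return result.get();
}
{code}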



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-11 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280897#comment-15280897
 ] 

Anand Mazumdar commented on MESOS-5332:
---

[~StephanErb] That took some catching! Since we have identified the root cause 
and filed corresponding tickets for the action items on our end, i.e. 
MESOS-5361/MESOS-5364 (also linked to this JIRA), I am resolving this issue. 
Feel free to re-open if you have any further queries/concerns.

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, slave
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, 
> i.e. the log messages are in the wrong order.
> Does anyone have an idea what might be going on here? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5005) Enforce that DiskInfo principal is equal to framework/operator principal

2016-05-11 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5005:
-
Description: Currently, we require that {{ReservationInfo.principal}} be 
equal to the principal provided for authentication, which means that when HTTP 
authentication is disabled this field cannot be set. Based on comments in 
'mesos.proto', the original intention was to enforce this same constraint for 
{{Persistence.principal}}, but it seems that we don't enforce it. This should 
be changed to make the two fields equivalent, with one exception: when the 
framework/operator principal is {{None}}, we should allow the principal in 
{{DiskInfo}} to take any value, along the same lines as MESOS-5212.  (was: 
Currently, we require that {{ReservationInfo.principal}} be equal to the 
principal provided for authentication, which means that when HTTP 
authentication is disabled this field cannot be set. Based on comments in 
'mesos.proto', the original intention was to enforce this same constraint for 
{{Persistence.principal}}, but it seems that we don't enforce it. This should 
be changed to make the two fields equivalent.)
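
A standalone sketch of the rule described above, with {{std::optional}} 
standing in for Mesos' {{Option}} and plain strings for the protobuf fields 
(names hypothetical):

{code}
// Hypothetical sketch of the constraint: DiskInfo.Persistence.principal must
// equal the framework/operator principal, except that when the
// framework/operator principal is none, any value is accepted.
#include <optional>
#include <string>

bool validateDiskPrincipal(
    const std::optional<std::string>& frameworkPrincipal,
    const std::optional<std::string>& diskPrincipal)
{
  // Exception: with no framework/operator principal (e.g. HTTP
  // authentication disabled), allow any DiskInfo principal.
  if (!frameworkPrincipal.has_value()) {
    return true;
  }

  // Otherwise the field, if set, must match the authenticated principal.
  return !diskPrincipal.has_value() ||
         diskPrincipal.value() == frameworkPrincipal.value();
}
{code}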

> Enforce that DiskInfo principal is equal to framework/operator principal
> 
>
> Key: MESOS-5005
> URL: https://issues.apache.org/jira/browse/MESOS-5005
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes, reservations
> Fix For: 0.29.0
>
>
> Currently, we require that {{ReservationInfo.principal}} be equal to the 
> principal provided for authentication, which means that when HTTP 
> authentication is disabled this field cannot be set. Based on comments in 
> 'mesos.proto', the original intention was to enforce this same constraint for 
> {{Persistence.principal}}, but it seems that we don't enforce it. This should 
> be changed to make the two fields equivalent, with one exception: when the 
> framework/operator principal is {{None}}, we should allow the principal in 
> {{DiskInfo}} to take any value, along the same lines as MESOS-5212.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5360) Set death signal for dvdcli subprocess in docker volume isolator.

2016-05-11 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280590#comment-15280590
 ] 

Jie Yu commented on MESOS-5360:
---

This is for the case where the slave crashes. In that case, waiting for the 
subprocess to finish does not make sense because no one will read its output.
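
A minimal Linux-only sketch of the "death signal" mechanism: the child asks the 
kernel to SIGKILL it if its parent dies, so a stuck {{dvdcli}} cannot outlive a 
crashed agent. Plain fork/exec is used for illustration; this is not the actual 
Mesos subprocess hook, and the {{dvdcli}} arguments are just an example.

{code}
// Hypothetical sketch (Linux only): set a parent-death signal in the child
// before exec'ing dvdcli, so the child is killed if the parent crashes.
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

int main()
{
  pid_t pid = fork();

  if (pid == -1) {
    perror("fork");
    return EXIT_FAILURE;
  }

  if (pid == 0) {
    // Child: if the parent dies, have the kernel send us SIGKILL.
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) == -1) {
      perror("prctl");
      _exit(EXIT_FAILURE);
    }

    // Guard against the race where the parent already died before prctl().
    if (getppid() == 1) {
      _exit(EXIT_FAILURE);
    }

    execlp("dvdcli", "dvdcli", "mount", "--volumedriver=rexray",
           "--volumename=test", (char*) NULL);
    perror("execlp");
    _exit(EXIT_FAILURE);
  }

  // Parent: in the agent this would be the isolator continuing its work.
  return EXIT_SUCCESS;
}
{code}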

> Set death signal for dvdcli subprocess in docker volume isolator.
> -
>
> Key: MESOS-5360
> URL: https://issues.apache.org/jira/browse/MESOS-5360
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> If the slave crashes, we should kill the dvdcli subprocess. Otherwise, if the 
> dvdcli subprocess gets stuck, it'll not be cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5360) Set death signal for dvdcli subprocess in docker volume isolator.

2016-05-11 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-5360:
-

Assignee: Jie Yu

> Set death signal for dvdcli subprocess in docker volume isolator.
> -
>
> Key: MESOS-5360
> URL: https://issues.apache.org/jira/browse/MESOS-5360
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> If the slave crashes, we should kill the dvdcli subprocess. Otherwise, if the 
> dvdcli subprocess gets stuck, it'll not be cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5287) boto is no longer a Mesos dependency.

2016-05-11 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5287:
--
Shepherd: Yan Xu

> boto is no longer a Mesos dependency.
> -
>
> Key: MESOS-5287
> URL: https://issues.apache.org/jira/browse/MESOS-5287
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Assignee: Chen Zhiwei
>  Labels: easyfix, newbie
>
> Since 'mesos-ec2' has been removed from the repo in MESOS-2640.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5336) Add authorization to GET /quota

2016-05-11 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280485#comment-15280485
 ] 

Zhitao Li commented on MESOS-5336:
--

I think I can come up with an implementation for {{GET_QUOTA_WITH_ROLE}} using 
{{process::collect}} on a list of futures.

Question: if we have {{GET_QUOTA_WITH_ROLE}}, do you think we still want to 
guard the {{/quota}} endpoint with {{GET_ENDPOINT_WITH_PATH}}? The closest 
alternative would be an ACL with the {{ANY}} or {{NONE}} role, but it would 
probably return an empty map rather than {{Forbidden}}.

I have no strong opinion here. I'll try a diff on top of my previous review 
while waiting for your answer.
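
A rough standalone sketch of that idea: one authorization check per role, 
gathered and then used to filter the quota map. {{std::future}} stands in for 
libprocess futures and {{collect}}, and the authorizer call is a placeholder 
rather than the real Mesos API.

{code}
// Hypothetical sketch: authorize GET_QUOTA_WITH_ROLE for each role in
// parallel, then return only the quotas the principal may see.
#include <future>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Placeholder for the authorizer call; returns whether `principal`
// may view the quota of `role`.
bool authorizeGetQuota(const std::string& principal, const std::string& role)
{
  return role != "secret";  // Stand-in policy for the example.
}

std::map<std::string, double> filterQuotas(
    const std::string& principal,
    const std::map<std::string, double>& quotas)  // role -> cpus guarantee
{
  std::vector<std::pair<std::string, std::future<bool>>> results;

  // Kick off one authorization request per role (collect-style fan-out).
  for (const auto& quota : quotas) {
    results.emplace_back(
        quota.first,
        std::async(std::launch::async, authorizeGetQuota, principal,
                   quota.first));
  }

  // Gather the results and keep only the authorized roles.
  std::map<std::string, double> visible;
  for (auto& result : results) {
    if (result.second.get()) {
      visible[result.first] = quotas.at(result.first);
    }
  }

  return visible;
}
{code}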

> Add authorization to GET /quota
> ---
>
> Key: MESOS-5336
> URL: https://issues.apache.org/jira/browse/MESOS-5336
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can set/remove quota for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> quotas) may be sensitive information. We should add authz around GET 
> operations on /quota.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5364) Consider adding `unlink` functionality to libprocess

2016-05-11 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5364:
-

 Summary: Consider adding `unlink` functionality to libprocess
 Key: MESOS-5364
 URL: https://issues.apache.org/jira/browse/MESOS-5364
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar


Currently we don't have {{unlink}} functionality in libprocess, i.e. the 
equivalent of Erlang's http://erlang.org/doc/man/erlang.html#unlink-1. We have 
a lot of places in our current code with {{TODO}}s for implementing it.

It can benefit us in a couple of ways:
- Based on its business logic, an actor may want to authoritatively 
communicate that it is no longer interested in {{ExitedEvent}} for an external 
remote link.
- Sometimes, the {{ExitedEvent}} might be delayed or dropped due to the remote 
instance being unavailable (e.g., a partition, network intermediaries not 
sending RSTs, etc.).

I did not find any old JIRAs pertaining to this, but I did come across an 
initial attempt to add it, albeit for injecting {{exited}} events, as part of 
the initial review for MESOS-1059.
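
Purely to make the intent concrete, a hypothetical bookkeeping sketch (not 
libprocess code, and not a concrete API proposal):

{code}
// Hypothetical sketch: a link/unlink bookkeeping API loosely mirroring
// Erlang's link/unlink. After unlink(), the actor no longer receives
// exited events for that peer.
#include <set>
#include <string>

class LinkedActor
{
public:
  // Start watching `peer`; an exited event is delivered if it terminates.
  void link(const std::string& peer) { links.insert(peer); }

  // Stop watching `peer`; later terminations of `peer` are not delivered.
  void unlink(const std::string& peer) { links.erase(peer); }

  // Called by the (hypothetical) runtime when `peer` terminates; returns
  // whether an exited event should still be enqueued for this actor.
  bool wantsExitedEvent(const std::string& peer) const
  {
    return links.count(peer) > 0;
  }

private:
  std::set<std::string> links;  // Peers (UPIDs) we are currently linked to.
};
{code}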



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3637) Port process/process.hpp to Windows

2016-05-11 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-3637:

Fix Version/s: (was: 0.29.0)

> Port process/process.hpp to Windows
> ---
>
> Key: MESOS-3637
> URL: https://issues.apache.org/jira/browse/MESOS-3637
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, windows
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4788) Mesos UI should show the role and principal of a framework

2016-05-11 Thread deshna jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deshna jain reassigned MESOS-4788:
--

Assignee: deshna jain

> Mesos UI should show the role and principal of a framework
> --
>
> Key: MESOS-4788
> URL: https://issues.apache.org/jira/browse/MESOS-4788
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Zhitao Li
>Assignee: deshna jain
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4823) Implement port forwarding in `network/cni` isolator

2016-05-11 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-4823:
-
Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31  (was: Mesosphere Sprint 
30, Mesosphere Sprint 31, Mesosphere Sprint 35)

> Implement port forwarding in `network/cni` isolator
> ---
>
> Key: MESOS-4823
> URL: https://issues.apache.org/jira/browse/MESOS-4823
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>Priority: Critical
>  Labels: mesosphere
>
> Most docker and appc images wish to expose ports that micro-services are 
> listening on, to the outside world. When containers are running on bridged 
> (or ptp) networking this can be achieved by installing port forwarding rules 
> on the agent (using iptables). This can be done in the `network/cni` 
> isolator. 
> The reason we would like this functionality to be implemented in the 
> `network/cni` isolator, and not a CNI plugin, is that the specifications 
> currently do not support specifying port forwarding rules. Further, to 
> install these rules the isolator needs two pieces of information: the exposed 
> ports and the IP address associated with the container. Both are available 
> to the isolator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5066) Create an iptables interface in Mesos

2016-05-11 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5066:
-
Sprint:   (was: Mesosphere Sprint 35)

> Create an iptables interface in Mesos
> -
>
> Key: MESOS-5066
> URL: https://issues.apache.org/jira/browse/MESOS-5066
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> To support port mapping functionality in the network CNI isolator we need to 
> enable DNAT rules in iptables. We therefore need to create an iptables 
> interface in Mesos. 
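
A hypothetical sketch of what such a thin wrapper could look like: shelling out 
to the {{iptables}} binary to install a DNAT rule that forwards a host port to 
a container IP/port. The real interface would more likely go through 
libprocess' subprocess and return Try/Future types; names and structure here 
are illustrative only.

{code}
// Hypothetical sketch of a thin iptables wrapper for DNAT port forwarding.
#include <cstdlib>
#include <string>

// Returns the shell's exit status (0 on success).
int runCommand(const std::string& command)
{
  return std::system(command.c_str());
}

int addDnatRule(
    int hostPort,
    const std::string& containerIp,
    int containerPort)
{
  const std::string rule =
      "iptables -t nat -A PREROUTING -p tcp --dport " +
      std::to_string(hostPort) +
      " -j DNAT --to-destination " + containerIp + ":" +
      std::to_string(containerPort);

  return runCommand(rule);
}
{code}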



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5336) Add authorization to GET /quota

2016-05-11 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279881#comment-15279881
 ] 

Alexander Rukletsov commented on MESOS-5336:


Currently we have {{SET_QUOTA_WITH_ROLE}} and {{DESTROY_QUOTA_WITH_PRINCIPAL}} 
authz actions, which will be eventually subsumed by a single 
{{UPDATE_QUOTA_WITH_ROLE}} for all {{POST}}, {{PUT}}, and {{DELETE}}. These are 
fine-grained authz actions.

We can additionally implement coarse-grained authz actions: 
{{GET_ENDPOINT_WITH_PATH}}, {{POST_ENDPOINT_WITH_PATH}} and so on for 
{{/quota}}. I can see benefits of having both coarse- and fine-grained authz 
actions, but maybe we don't need to implement them now.

For now, let's either do {{GET_ENDPOINT_WITH_PATH}} or {{GET_QUOTA_WITH_ROLE}}. 
The former does not allow fine-grained filtering, while the latter is harder to 
implement since we have to filter quotas based on the authorizer's response.

I see you have opted for {{GET_ENDPOINT_WITH_PATH}}. Do you think we can 
implement the latter fast enough? For now, we'll have to query the authorizer for 
each role, but in the future we should be able to send a BatchRequest.

> Add authorization to GET /quota
> ---
>
> Key: MESOS-5336
> URL: https://issues.apache.org/jira/browse/MESOS-5336
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can set/remove quota for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> quotas) may be sensitive information. We should add authz around GET 
> operations on /quota.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-11 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated MESOS-5332:
---
Description: 
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 
20% to 50% of all tasks. They are killed by the Mesos agent via:

{code}
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-28854-0-6a88d62e-656
4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976- 
at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976- 
at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
{code}


We have verified that the tasks and their executors are killed by the agent 
during startup. When stopping the agent using supervisorctl stop, the executors 
are still running (verified via {{ps aux}}). They are only killed once the 
agent tries to reregister.

The issue is hard to reproduce:

* When restarting the agent binary multiple times, tasks are only lost for the 
first restart.
* It is much more likely to occur if the agent binary has been running for a 
longer period of time (> 7 days)


Mesos is correctly sticking to the 2 seconds wait time before killing 
un-reregistered executors. The failed executors receive the reregistration 
request, but it seems like they fail to send a reply.

A successful reregistration (not leading to LOST):
{code}
I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
20160118-141153-92471562-5050-6270-S17
I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 1.492339ms
{code}

A failed one:
{code}
I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from slave 
20160118-141153-92471562-5050-6270-S17
E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with fd 
11: Transport endpoint is not connected
I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with slave 
20160118-141153-92471562-5050-6270-S17
{code}

All tasks ending up in LOST have output similar to the one posted above, i.e. 
the log messages are in the wrong order.

Does anyone have an idea what might be going on here? 

  was:
When restarting the Mesos agent binary, tasks can end up as LOST. We lose from 
20% to 50% of all tasks. They are killed by the Mesos agent via:

{code}
I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
executors
I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-28854-0-6a88d62e-656
4-4e33-b0bb-1d8039d97afc' of framework 20151001-085346-58917130-5050-37976- 
at executor(1)@10.X.X.X:40541
I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
4-4cba-a9df-3dfc1552667f' of framework 20151001-085346-58917130-5050-37976- 
at executor(1)@10.X.X.X:35757
I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
-af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
executor(1)@10.X.X.X:51463
...
I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
{code}


We have verified that the tasks and their executors are killed by the agent 
during startup. When stopping the agent using supervisorctl stop, the executors 
are still running (verified via {{ps aux}}). They are only killed once the 
agent tries to reregister.

The issue is hard to reproduce:

* When restarting the agent binary multiple times, tasks are only lost for the 
first restart.
* It is much more likely to occur if the agent binary has been running for a 
longer period of time (> 7 days)
* It tends to be more likely if the host has many cores (30-40) and thus many 
libprocess workers. 


Mesos is correctly sticking to the 2 seconds wait time before killing 
un-reregistered executors. The failed executors receive the reregistration 
request, but it seems like they fail to send a reply.

A successful reregistration (not leading 

[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-11 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279791#comment-15279791
 ] 

Stephan Erb commented on MESOS-5332:


I was able to assemble a reproducing example (using Aurora master and Mesos 
0.27.2): https://gist.github.com/StephanErb/5798b0d87c11473fb0ec147272ea0288 

Summary of events: 
* An iptables firewall is terminating idle TCP connections after the iptables 
default of 5 days (reduced to 60 seconds in the example above).
* Mesos does not detect broken, half-open TCP connections that occur when 
connections are terminated by iptables.
* Mesos tries to use the old, broken TCP connection when answering the agent 
reconnect request. The message therefore never makes it to the agent.
* The agent ends up killing the executor because it does not receive a reply 
for its reconnect request.

I'd conclude that there are several areas that need improvement:

* *Firewalling*: We have to fix our in-house iptables firewall scripts so that 
they do not apply connection tracking to local connections.
* *KeepAlive*: Mesos has to enable TCP keepalives in libprocess (a minimal 
sketch follows this list). As detailed here 
(http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/) this has two 
advantages:
** Detection of dead peers in case there has been a hard crash, an unplugged 
network cable, or a firewall that has silently dropped a connection.
** Prevention of disconnection due to network inactivity.
* *Unlink*: With the addition of an {{unlink}} function, libprocess could 
handle exit events more gracefully by making sure it always creates a new 
connection when talking to a new process.
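
For reference, enabling keepalives on an already-connected socket is a few 
{{setsockopt}} calls (a minimal sketch; the interval knobs shown are 
Linux-specific, and the values are arbitrary examples):

{code}
// Minimal Linux sketch: enable TCP keepalive on an already-connected socket
// so dead peers (hard crash, silently dropped conntrack entry, unplugged
// cable) are eventually detected instead of the connection idling forever.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enableKeepalive(int fd)
{
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
    return -1;
  }

  // Linux-specific tuning: start probing after 60s of idle time, probe
  // every 10s, and give up (reset the connection) after 5 failed probes.
  int idle = 60;
  int interval = 10;
  int count = 5;

  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                 sizeof(interval)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0) {
    return -1;
  }

  return 0;
}
{code}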

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, slave
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> * It tends to be more likely if the host has many cores (30-40) and thus many 
> libprocess workers. 
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is 

[jira] [Commented] (MESOS-2731) Allow frameworks to deploy storage drivers on demand.

2016-05-11 Thread Ken Sipe (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279787#comment-15279787
 ] 

Ken Sipe commented on MESOS-2731:
-

This is consistent with use case #2, but we also want to support access to S3, 
which includes permission-based access.

> Allow frameworks to deploy storage drivers on demand.
> -
>
> Key: MESOS-2731
> URL: https://issues.apache.org/jira/browse/MESOS-2731
> Project: Mesos
>  Issue Type: Epic
>Reporter: Joerg Schad
>  Labels: mesosphere
>
> Certain storage options require storage drivers to access them, including the 
> HDFS driver, the Quobyte client, database drivers, and so on.
> When tasks in Mesos require access to such storage they also need access to 
> the respective driver on the node where they were scheduled.
> As it is not desirable to deploy the driver onto all nodes in the cluster, it 
> would be good to deploy the driver on demand.
> Use Cases:
> 1. Fetcher Cache pulling resources from user-provided URIs
> 2. Framework executors/tasks requiring r/w access to HDFS/DFS
> 3. Framework executors/tasks requiring r/w database access (requiring 
> drivers)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)