[jira] [Assigned] (MESOS-5229) Mesos containerizer should support file mounts

2016-04-18 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta reassigned MESOS-5229:


Assignee: Abhishek Dasgupta

> Mesos containerizer should support file mounts
> --
>
> Key: MESOS-5229
> URL: https://issues.apache.org/jira/browse/MESOS-5229
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Joshua Cohen
>Assignee: Abhishek Dasgupta
>
> When using an image to represent a container's file system, it's currently 
> not possible to mount a single file into the filesystem. I had to resort to 
> adding {{RUN touch /path/to/my/file}} in my Dockerfile in order to get the 
> filesystem provisioned properly.
> It would be great if this wasn't necessary. Even better would be if Mesos 
> would create all mount points on demand, rather than requiring them to be 
> present in the container filesystem (c.f. 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L522-L527)
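
For context, a minimal sketch of the workaround described above (the base image is illustrative; the path is the report's placeholder):

{code}
FROM ubuntu:14.04

# Pre-create the mount target so the bind mount has an existing file to
# mount over; Mesos currently requires mount points to already be
# present in the container filesystem (see linux.cpp#L522-L527 above).
RUN touch /path/to/my/file
{code}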



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4553) Manage offers in allocator.

2016-04-18 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-4553:

Epic Name: Manage Offers in Allocator  (was: Manage Offers ni Allocator)

> Manage offers in allocator.
> ---
>
> Key: MESOS-4553
> URL: https://issues.apache.org/jira/browse/MESOS-4553
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Currently, the {{offers}} are managed by {{Master}}, which introduces two 
> issues:
> 1. In Quota, the master rescinds more offers than necessary to address a 
> race condition.
> 2. The allocator can not modify offers: resources must return to the 
> allocator and be offered again, which impacts resource utilisation & 
> performance (MESOS-3078).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5231) Create Design Doc for Manage offers in allocator

2016-04-18 Thread Klaus Ma (JIRA)
Klaus Ma created MESOS-5231:
---

 Summary: Create Design Doc for Manage offers in allocator
 Key: MESOS-5231
 URL: https://issues.apache.org/jira/browse/MESOS-5231
 Project: Mesos
  Issue Type: Bug
Reporter: Klaus Ma
Assignee: Klaus Ma






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4553) Manage offers in allocator.

2016-04-18 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-4553:

Epic Name: Manage Offers ni Allocator

> Manage offers in allocator.
> ---
>
> Key: MESOS-4553
> URL: https://issues.apache.org/jira/browse/MESOS-4553
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Currently, the {{offers}} are managed by {{Master}}, which introduces two 
> issues:
> 1. In Quota, the master rescinds more offers than necessary to address a 
> race condition.
> 2. The allocator can not modify offers: resources must return to the 
> allocator and be offered again, which impacts resource utilisation & 
> performance (MESOS-3078).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4553) Manage offers in allocator.

2016-04-18 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-4553:

Issue Type: Epic  (was: Bug)

> Manage offers in allocator.
> ---
>
> Key: MESOS-4553
> URL: https://issues.apache.org/jira/browse/MESOS-4553
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Currently, the {{offers}} are managed by {{Master}}, which introduces two 
> issues:
> 1. In Quota, the master rescinds more offers than necessary to address a 
> race condition.
> 2. The allocator can not modify offers: resources must return to the 
> allocator and be offered again, which impacts resource utilisation & 
> performance (MESOS-3078).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5031) Authorization Action enum does not support upgrades.

2016-04-18 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247097#comment-15247097
 ] 

Yong Tang commented on MESOS-5031:
--

Hi [~bmahler], I just created a pull request to address the "default" issue:

https://reviews.apache.org/r/46364/

Please let me know if there are any issues.

cc [~vinodkone] [~adam-mesos]



> Authorization Action enum does not support upgrades.
> 
>
> Key: MESOS-5031
> URL: https://issues.apache.org/jira/browse/MESOS-5031
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.29.0
>Reporter: Adam B
>Assignee: Yong Tang
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We need to make the Action enum optional in authorization::Request, and add 
> an `UNKNOWN = 0;` enum value. See MESOS-4997 for details.
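
A rough sketch of the shape of the change (a hypothetical fragment, not the actual patch; the second enum value is just an example action):

{code}
enum Action {
  // 0 is reserved for UNKNOWN so a value added in a newer version
  // deserializes safely on an older master.
  UNKNOWN = 0;
  REGISTER_FRAMEWORK_WITH_ROLE = 1;
}

message Request {
  // Optional, so "not set" is distinguishable from a real value.
  optional Action action = 1;
}
{code}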



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5139) ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247090#comment-15247090
 ] 

haosdent commented on MESOS-5139:
-

I am not sure whether this is because we use {{tar -czf}} to create the tar 
and {{tar -xf}} to extract it. We create a gzip-compressed tar and extract it 
the normal way; tar can use gzip to decompress automatically, but I am not 
sure whether that fails in some cases.

I think we need to replace 
{code}
ASSERT_SOME(os::tar(".", "../layer.tar"));
{code}
with {{command::tar}} to keep it consistent.
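
For reference, a quick sketch of the two invocations in question (paths illustrative):

{code}
tar -czf ../layer.tar .        # create: gzip-compressed archive
tar -xf layer.tar -C rootfs    # extract: relies on GNU tar auto-detecting gzip
tar -xzf layer.tar -C rootfs   # extract: requests gzip decompression explicitly
{code}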


> ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky
> --
>
> Key: MESOS-5139
> URL: https://issues.apache.org/jira/browse/MESOS-5139
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Ubuntu14.04
>Reporter: Vinod Kone
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Found this on ASF CI while testing 0.28.1-rc2
> {code}
> [ RUN  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar
> E0406 18:29:30.870481   520 shell.hpp:93] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> E0406 18:29:30.870576   520 fetcher.cpp:59] Failed to create URI fetcher 
> plugin 'hadoop': Failed to create HDFS client: Failed to execute 'hadoop 
> version 2>&1'; the command was either not found or exited with a non-zero 
> exit status: 127
> I0406 18:29:30.871052   520 local_puller.cpp:90] Creating local puller with 
> docker registry '/tmp/3l8ZBv/images'
> I0406 18:29:30.873325   539 metadata_manager.cpp:159] Looking for image 'abc'
> I0406 18:29:30.874438   539 local_puller.cpp:142] Untarring image 'abc' from 
> '/tmp/3l8ZBv/images/abc.tar' to '/tmp/3l8ZBv/store/staging/5tw8bD'
> I0406 18:29:30.901916   547 local_puller.cpp:162] The repositories JSON file 
> for image 'abc' is '{"abc":{"latest":"456"}}'
> I0406 18:29:30.902304   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/rootfs'
> I0406 18:29:30.909144   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs'
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:183: Failure
> (imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar, -C, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' failed: tar: This does not look 
> like a tar archive
> tar: Exiting with failure status due to previous errors
> [  FAILED  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar (243 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1575) master sets failover timeout to 0 when framework requests a high value

2016-04-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247043#comment-15247043
 ] 

José Guilherme Vanz commented on MESOS-1575:


I changed the default failover timeout to 1 week:

{code}
[vanz@london build]$ git diff
diff --git a/include/mesos/mesos.proto b/include/mesos/mesos.proto
index 87af4a0..3af43b7 100644
--- a/include/mesos/mesos.proto
+++ b/include/mesos/mesos.proto
@@ -228,7 +228,7 @@ message FrameworkInfo {
   //
   // NOTE: To avoid accidental destruction of tasks, production
   // frameworks typically set this to a large value (e.g., 1 week).
-  optional double failover_timeout = 4 [default = 0.0];
+  optional double failover_timeout = 4 [default = 604800];
 
   // If set, framework pid, executor pids and status updates are
   // checkpointed to disk by the slaves. Checkpointing allows a
diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index 34da0a1..a576f11 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -228,7 +228,7 @@ message FrameworkInfo {
   //
   // NOTE: To avoid accidental destruction of tasks, production
   // frameworks typically set this to a large value (e.g., 1 week).
-  optional double failover_timeout = 4 [default = 0.0];
+  optional double failover_timeout = 4 [default = 604800];
 
   // If set, framework pid, executor pids and status updates are
   // checkpointed to disk by the agents. Checkpointing allows a
{code}

As a result, the new default failover timeout is used when the framework disconnects:

{code}
I0418 23:34:10.686487 11890 master.cpp:1375] Framework 
0445e63c-c455-4c88-893e-61d740493432- (Rendler Framework (Java)) at 
scheduler-8059b7d5-5257-450c-8930-c32f7a1103b3@127.0.0.1:41977 disconnected
I0418 23:34:10.686553 11890 master.cpp:2764] Disconnecting framework 
0445e63c-c455-4c88-893e-61d740493432- (Rendler Framework (Java)) at 
scheduler-8059b7d5-5257-450c-8930-c32f7a1103b3@127.0.0.1:41977
I0418 23:34:10.686590 11890 master.cpp:2788] Deactivating framework 
0445e63c-c455-4c88-893e-61d740493432- (Rendler Framework (Java)) at 
scheduler-8059b7d5-5257-450c-8930-c32f7a1103b3@127.0.0.1:41977
W0418 23:34:10.686745 11890 master.cpp:1394] Using the default value for 
'failover_timeout' because the input value is invalid: Argument out of the 
range that a Duration can represent due to int64_t's size limit
I0418 23:34:10.686766 11890 master.cpp:1399] Giving framework 
0445e63c-c455-4c88-893e-61d740493432- (Rendler Framework (Java)) at 
scheduler-8059b7d5-5257-450c-8930-c32f7a1103b3@127.0.0.1:41977 1weeks to 
failover
I0418 23:34:10.686920 11890 hierarchical.cpp:375] Deactivated framework 
0445e63c-c455-4c88-893e-61d740493432-
{code}
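
To see why the input is rejected while the new default works (a back-of-the-envelope check, not Mesos code): {{Duration}} stores nanoseconds in an int64_t, so the largest representable timeout is about 2^63 ns, roughly 292 years. A framework passing {{Long.MAX_VALUE}} seconds overflows that by a factor of 10^9, while the new default of 604800 seconds (one week) fits easily:

{code}
#include <cstdint>
#include <iostream>

int main()
{
  // Largest timeout a Duration backed by int64_t nanoseconds can hold.
  const double maxSeconds = static_cast<double>(INT64_MAX) / 1e9; // ~9.2e9 s

  std::cout << (604800.0 < maxSeconds) << std::endl; // 1: one week is valid
  std::cout << (9.22e18 < maxSeconds) << std::endl;  // 0: Long.MAX_VALUE overflows
  return 0;
}
{code}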

Should I change the code to validate the value instead of falling back to the 
default? What do you think is the better approach?

> master sets failover timeout to 0 when framework requests a high value
> --
>
> Key: MESOS-1575
> URL: https://issues.apache.org/jira/browse/MESOS-1575
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Sweeney
>Assignee: José Guilherme Vanz
>  Labels: newbie, twitter
>
> In response to a registered RPC we observed the following behavior:
> {noformat}
> W0709 19:07:32.982997 11400 master.cpp:612] Using the default value for 
> 'failover_timeout' becausethe input value is invalid: Argument out of the 
> range that a Duration can represent due to int64_t's size limit
> I0709 19:07:32.983008 11404 hierarchical_allocator_process.hpp:408] 
> Deactivated framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983013 11400 master.cpp:617] Giving framework 
> 20140709-184342-119646400-5050-11380-0003 0ns to failover
> I0709 19:07:32.983271 11404 master.cpp:2201] Framework failover timeout, 
> removing framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983294 11404 master.cpp:2688] Removing framework 
> 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983678 11404 hierarchical_allocator_process.hpp:363] Removed 
> framework 20140709-184342-119646400-5050-11380-0003
> {noformat}
> This was using the following frameworkInfo.
> {code}
> FrameworkInfo frameworkInfo = FrameworkInfo.newBuilder()
> .setUser("test")
> .setName("jvm")
> .setFailoverTimeout(Long.MAX_VALUE)
> .build();
> 

[jira] [Commented] (MESOS-5223) MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247015#comment-15247015
 ] 

haosdent commented on MESOS-5223:
-

Hi, [~gyliu]. The log you posted is from 
{{ContentType/SchedulerTest.SchedulerReconnect/1}}. Do you have the log 
associated with {{MasterAllocatorTest/1.RebalancedForUpdatedWeights}}?

> MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky
> --
>
> Key: MESOS-5223
> URL: https://issues.apache.org/jira/browse/MESOS-5223
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, flaky, test
>Reporter: Guangya Liu
>  Labels: flaky, flaky-test
>
> {code}
> I0415 06:52:22.243783 31906 cluster.cpp:149] Creating default 'local' 
> authorizer
> I0415 06:52:22.365927 31906 leveldb.cpp:174] Opened db in 121.715227ms
> I0415 06:52:22.413648 31906 leveldb.cpp:181] Compacted db in 47.651756ms
> I0415 06:52:22.413713 31906 leveldb.cpp:196] Created db iterator in 25647ns
> I0415 06:52:22.413729 31906 leveldb.cpp:202] Seeked to beginning of db in 
> 1890ns
> I0415 06:52:22.413741 31906 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 317ns
> I0415 06:52:22.413800 31906 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0415 06:52:22.414681 31939 recover.cpp:447] Starting replica recovery
> I0415 06:52:22.414999 31939 recover.cpp:473] Replica is in EMPTY status
> I0415 06:52:22.416792 31939 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (17242)@172.17.0.2:44024
> I0415 06:52:22.417222 31925 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0415 06:52:22.417966 31925 recover.cpp:564] Updating replica status to 
> STARTING
> I0415 06:52:22.421860 31933 master.cpp:382] Master 
> c4bfcab0-cd45-4c65-953a-f810c14806e0 (37d6f4eebe29) started on 
> 172.17.0.2:44024
> I0415 06:52:22.421900 31933 master.cpp:384] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_http="true" 
> --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ImAAfx/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ImAAfx/master" --zk_session_timeout="10secs"
> I0415 06:52:22.422327 31933 master.cpp:435] Master allowing unauthenticated 
> frameworks to register
> I0415 06:52:22.422339 31933 master.cpp:438] Master only allowing 
> authenticated agents to register
> I0415 06:52:22.422349 31933 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/ImAAfx/credentials'
> I0415 06:52:22.422750 31933 master.cpp:480] Using default 'crammd5' 
> authenticator
> I0415 06:52:22.422914 31933 master.cpp:551] Using default 'basic' HTTP 
> authenticator
> I0415 06:52:22.423054 31933 master.cpp:589] Authorization enabled
> I0415 06:52:22.423259 31926 hierarchical.cpp:142] Initialized hierarchical 
> allocator process
> I0415 06:52:22.423327 31926 whitelist_watcher.cpp:77] No whitelist given
> I0415 06:52:22.425593 31937 master.cpp:1832] The newly elected leader is 
> master@172.17.0.2:44024 with id c4bfcab0-cd45-4c65-953a-f810c14806e0
> I0415 06:52:22.425631 31937 master.cpp:1845] Elected as the leading master!
> I0415 06:52:22.425650 31937 master.cpp:1532] Recovering from registrar
> I0415 06:52:22.425915 31937 registrar.cpp:331] Recovering registrar
> I0415 06:52:22.458044 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.766176ms
> I0415 06:52:22.458093 31928 replica.cpp:320] Persisted replica status to 
> STARTING
> I0415 06:52:22.458391 31928 recover.cpp:473] Replica is in STARTING status
> I0415 06:52:22.459728 31930 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (17245)@172.17.0.2:44024
> I0415 06:52:22.459952 31928 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0415 06:52:22.460414 31925 recover.cpp:564] Updating replica status to VOTING
> I0415 06:52:22.499866 31925 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.170393ms
> I0415 

[jira] [Updated] (MESOS-5223) MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky

2016-04-18 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5223:

Labels: flaky flaky-test  (was: )

> MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky
> --
>
> Key: MESOS-5223
> URL: https://issues.apache.org/jira/browse/MESOS-5223
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, flaky, test
>Reporter: Guangya Liu
>  Labels: flaky, flaky-test
>
> {code}
> I0415 06:52:22.243783 31906 cluster.cpp:149] Creating default 'local' 
> authorizer
> I0415 06:52:22.365927 31906 leveldb.cpp:174] Opened db in 121.715227ms
> I0415 06:52:22.413648 31906 leveldb.cpp:181] Compacted db in 47.651756ms
> I0415 06:52:22.413713 31906 leveldb.cpp:196] Created db iterator in 25647ns
> I0415 06:52:22.413729 31906 leveldb.cpp:202] Seeked to beginning of db in 
> 1890ns
> I0415 06:52:22.413741 31906 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 317ns
> I0415 06:52:22.413800 31906 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0415 06:52:22.414681 31939 recover.cpp:447] Starting replica recovery
> I0415 06:52:22.414999 31939 recover.cpp:473] Replica is in EMPTY status
> I0415 06:52:22.416792 31939 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (17242)@172.17.0.2:44024
> I0415 06:52:22.417222 31925 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0415 06:52:22.417966 31925 recover.cpp:564] Updating replica status to 
> STARTING
> I0415 06:52:22.421860 31933 master.cpp:382] Master 
> c4bfcab0-cd45-4c65-953a-f810c14806e0 (37d6f4eebe29) started on 
> 172.17.0.2:44024
> I0415 06:52:22.421900 31933 master.cpp:384] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_http="true" 
> --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ImAAfx/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ImAAfx/master" --zk_session_timeout="10secs"
> I0415 06:52:22.422327 31933 master.cpp:435] Master allowing unauthenticated 
> frameworks to register
> I0415 06:52:22.422339 31933 master.cpp:438] Master only allowing 
> authenticated agents to register
> I0415 06:52:22.422349 31933 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/ImAAfx/credentials'
> I0415 06:52:22.422750 31933 master.cpp:480] Using default 'crammd5' 
> authenticator
> I0415 06:52:22.422914 31933 master.cpp:551] Using default 'basic' HTTP 
> authenticator
> I0415 06:52:22.423054 31933 master.cpp:589] Authorization enabled
> I0415 06:52:22.423259 31926 hierarchical.cpp:142] Initialized hierarchical 
> allocator process
> I0415 06:52:22.423327 31926 whitelist_watcher.cpp:77] No whitelist given
> I0415 06:52:22.425593 31937 master.cpp:1832] The newly elected leader is 
> master@172.17.0.2:44024 with id c4bfcab0-cd45-4c65-953a-f810c14806e0
> I0415 06:52:22.425631 31937 master.cpp:1845] Elected as the leading master!
> I0415 06:52:22.425650 31937 master.cpp:1532] Recovering from registrar
> I0415 06:52:22.425915 31937 registrar.cpp:331] Recovering registrar
> I0415 06:52:22.458044 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.766176ms
> I0415 06:52:22.458093 31928 replica.cpp:320] Persisted replica status to 
> STARTING
> I0415 06:52:22.458391 31928 recover.cpp:473] Replica is in STARTING status
> I0415 06:52:22.459728 31930 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (17245)@172.17.0.2:44024
> I0415 06:52:22.459952 31928 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0415 06:52:22.460414 31925 recover.cpp:564] Updating replica status to VOTING
> I0415 06:52:22.499866 31925 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.170393ms
> I0415 06:52:22.499905 31925 replica.cpp:320] Persisted replica status to 
> VOTING
> I0415 06:52:22.500013 31927 recover.cpp:578] Successfully joined the Paxos 
> group
> I0415 06:52:22.500238 31927 

[jira] [Updated] (MESOS-5223) MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky

2016-04-18 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5223:

Component/s: test
 flaky

> MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky
> --
>
> Key: MESOS-5223
> URL: https://issues.apache.org/jira/browse/MESOS-5223
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, flaky, test
>Reporter: Guangya Liu
>
> {code}
> I0415 06:52:22.243783 31906 cluster.cpp:149] Creating default 'local' 
> authorizer
> I0415 06:52:22.365927 31906 leveldb.cpp:174] Opened db in 121.715227ms
> I0415 06:52:22.413648 31906 leveldb.cpp:181] Compacted db in 47.651756ms
> I0415 06:52:22.413713 31906 leveldb.cpp:196] Created db iterator in 25647ns
> I0415 06:52:22.413729 31906 leveldb.cpp:202] Seeked to beginning of db in 
> 1890ns
> I0415 06:52:22.413741 31906 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 317ns
> I0415 06:52:22.413800 31906 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0415 06:52:22.414681 31939 recover.cpp:447] Starting replica recovery
> I0415 06:52:22.414999 31939 recover.cpp:473] Replica is in EMPTY status
> I0415 06:52:22.416792 31939 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (17242)@172.17.0.2:44024
> I0415 06:52:22.417222 31925 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0415 06:52:22.417966 31925 recover.cpp:564] Updating replica status to 
> STARTING
> I0415 06:52:22.421860 31933 master.cpp:382] Master 
> c4bfcab0-cd45-4c65-953a-f810c14806e0 (37d6f4eebe29) started on 
> 172.17.0.2:44024
> I0415 06:52:22.421900 31933 master.cpp:384] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="false" --authenticate_http="true" 
> --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/ImAAfx/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/ImAAfx/master" --zk_session_timeout="10secs"
> I0415 06:52:22.422327 31933 master.cpp:435] Master allowing unauthenticated 
> frameworks to register
> I0415 06:52:22.422339 31933 master.cpp:438] Master only allowing 
> authenticated agents to register
> I0415 06:52:22.422349 31933 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/ImAAfx/credentials'
> I0415 06:52:22.422750 31933 master.cpp:480] Using default 'crammd5' 
> authenticator
> I0415 06:52:22.422914 31933 master.cpp:551] Using default 'basic' HTTP 
> authenticator
> I0415 06:52:22.423054 31933 master.cpp:589] Authorization enabled
> I0415 06:52:22.423259 31926 hierarchical.cpp:142] Initialized hierarchical 
> allocator process
> I0415 06:52:22.423327 31926 whitelist_watcher.cpp:77] No whitelist given
> I0415 06:52:22.425593 31937 master.cpp:1832] The newly elected leader is 
> master@172.17.0.2:44024 with id c4bfcab0-cd45-4c65-953a-f810c14806e0
> I0415 06:52:22.425631 31937 master.cpp:1845] Elected as the leading master!
> I0415 06:52:22.425650 31937 master.cpp:1532] Recovering from registrar
> I0415 06:52:22.425915 31937 registrar.cpp:331] Recovering registrar
> I0415 06:52:22.458044 31928 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.766176ms
> I0415 06:52:22.458093 31928 replica.cpp:320] Persisted replica status to 
> STARTING
> I0415 06:52:22.458391 31928 recover.cpp:473] Replica is in STARTING status
> I0415 06:52:22.459728 31930 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (17245)@172.17.0.2:44024
> I0415 06:52:22.459952 31928 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0415 06:52:22.460414 31925 recover.cpp:564] Updating replica status to VOTING
> I0415 06:52:22.499866 31925 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 39.170393ms
> I0415 06:52:22.499905 31925 replica.cpp:320] Persisted replica status to 
> VOTING
> I0415 06:52:22.500013 31927 recover.cpp:578] Successfully joined the Paxos 
> group
> I0415 06:52:22.500238 31927 recover.cpp:462] Recover process 

[jira] [Commented] (MESOS-1837) failed to determine cgroup for the 'cpu' subsystem

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246934#comment-15246934
 ] 

haosdent commented on MESOS-1837:
-

[~doctapp] After your task failed, the cgroup update may have failed because 
the process folder under {{/proc}} no longer exists.

Your task failed at {{15:35:44.278214}}, and the update happened after that, 
at {{15:35:44.285468}}.
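
To illustrate the failure mode with the PID from the log below: once the task's process has exited, its {{/proc}} entry is gone, so the cgroup lookup fails.

{code}
$ cat /proc/9792/cgroup
cat: /proc/9792/cgroup: No such file or directory
{code}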

> failed to determine cgroup for the 'cpu' subsystem
> --
>
> Key: MESOS-1837
> URL: https://issues.apache.org/jira/browse/MESOS-1837
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.20.1
> Environment: Ubuntu 14.04
>Reporter: Chris Fortier
>Assignee: Timothy Chen
>
> Attempting to launch Docker container with Marathon. Container is launched 
> then fails. 
> A search of /var/log/syslog reveals:
> Sep 27 03:01:43 vagrant-ubuntu-trusty-64 mesos-slave[1409]: E0927 
> 03:01:43.546957  1463 slave.cpp:2205] Failed to update resources for 
> container 8c2429d9-f090-4443-8108-0206ca37f3fd of executor 
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 running task 
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 on status update for 
> terminal task, destroying container: Failed to determine cgroup for the 'cpu' 
> subsystem: Failed to read /proc/9792/cgroup: Failed to open file 
> '/proc/9792/cgroup': No such file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2638) Add support for Optional parameters to protobuf handlers to wrap option fields

2016-04-18 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2638:
---
Description: 
We currently don't have a way to install a protobuf handler for an optional 
field where the handler takes an Option parameter for the 'optional' field in 
the protobuf message. The goal is to be able to do:
{code:title=example|borderStyle=solid}
message Person {
  required string name = 1;
  optional uint32 age = 2;
}

void person(const std::string& name, const Option<uint32_t>& age)
{
  if (age.isSome()) { ... }
}

install(
    person,
    &Person::name,
    &Person::age);
{code}
We can then use this to test whether the field was provided, as opposed to 
capturing a reference to a default constructed value of the type.

For now, the workaround is to take the entire message in the handler:

{code}
void person(const Person& person)
{
  if (person.has_age()) { ... }
}

install(person);
{code}

  was:
We currently don't have a way to install a protobuf handler for an optional 
field where the handler takes an Option parameter for the 'optional' field in 
the protobuf message. The goal is to be able to do:
{code:title=example|borderStyle=solid}
message Person {
  required string name = 1;
  optional uint32 age = 2;
}

void person(const std::string& name, const Option<uint32_t>& age)
{
  if (age.isKnown()) {
    ...
  }
}

install(person,
        &Person::name,
        &Person::age);
{code}
We can then use this to test whether the field was provided, as opposed to 
capturing a reference to a default constructed value of the type.


> Add support for Optional parameters to protobuf handlers to wrap option fields
> --
>
> Key: MESOS-2638
> URL: https://issues.apache.org/jira/browse/MESOS-2638
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Joris Van Remoortere
>
> We currently don't have a way to install a protobuf handler for an optional 
> field where the handler takes an Option parameter for the 'optional' field 
> in the protobuf message. The goal is to be able to do:
> {code:title=example|borderStyle=solid}
> message Person {
>   required string name = 1;
>   optional uint32 age = 2;
> }
> void person(const std::string& name, const Option<uint32_t>& age)
> {
>   if (age.isSome()) { ... }
> }
> install(
>     person,
>     &Person::name,
>     &Person::age);
> {code}
> We can then use this to test whether the field was provided, as opposed to 
> capturing a reference to a default constructed value of the type.
> For now, the workaround is to take the entire message in the handler:
> {code}
> void person(const Person& person)
> {
>   if (person.has_age()) { ... }
> }
> install(person);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)

2016-04-18 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246883#comment-15246883
 ] 

zhou xing commented on MESOS-3782:
--

[~klueska], Added MESOS-5230 to track this one

> Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create 
> symlinks)
> --
>
> Key: MESOS-3782
> URL: https://issues.apache.org/jira/browse/MESOS-3782
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: zhou xing
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5230) Slave/Agent Rename Phase I: Rename '/include/mesos/slave' folder

2016-04-18 Thread zhou xing (JIRA)
zhou xing created MESOS-5230:


 Summary: Slave/Agent Rename Phase I: Rename '/include/mesos/slave' 
folder
 Key: MESOS-5230
 URL: https://issues.apache.org/jira/browse/MESOS-5230
 Project: Mesos
  Issue Type: Bug
  Components: general
Reporter: zhou xing
Assignee: zhou xing
Priority: Minor


During the implementation of MESOS-3782, we thought it would be good to open a 
new ticket to track the rename of folder "/include/mesos/slave". Please refer 
to the discussion in review https://reviews.apache.org/r/45806/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-18 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5225:
-
Assignee: Qian Zhang  (was: Avinash Sridharan)

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b31
> 

[jira] [Commented] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-18 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246843#comment-15246843
 ] 

Avinash Sridharan commented on MESOS-5225:
--

Thanks for taking this on, Qian! Will review it. 

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for 

[jira] [Updated] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-18 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5225:
-
Labels: mesosphere  (was: )

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b31
> 4-477f-b734-7771d07d41e3
> 

[jira] [Comment Edited] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)

2016-04-18 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246821#comment-15246821
 ] 

Kevin Klues edited comment on MESOS-3782 at 4/19/16 12:17 AM:
--

[~dongdong] Did you ever get a chance to file a ticket for:

{noformat}
The only odd man out is the `./include/mesos/slave` folder, which we've decided 
to create
a new ticket for dealing with (i.e. we should copy this folder to 
`./include/mesos/agent` as
part of the install process).
{noformat}


was (Author: klueska):
[~dongdong] Did you ever get a chance to file a ticket for:

{noformat}
The only odd man out is the `./include/mesos/slave` folder, which we've decided 
to create a new ticket for dealing with (i.e. we should copy this folder to 
`./include/mesos/agent` as part of the install process).
{noformat}

> Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create 
> symlinks)
> --
>
> Key: MESOS-3782
> URL: https://issues.apache.org/jira/browse/MESOS-3782
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: zhou xing
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)

2016-04-18 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246821#comment-15246821
 ] 

Kevin Klues commented on MESOS-3782:


[~dongdong] Did you ever get a chance to file a ticket for:

{noformat}
The only odd man out is the `./include/mesos/slave` folder, which we've decided 
to create a new ticket for dealing with (i.e. we should copy this folder to 
`./include/mesos/agent` as part of the install process).
{noformat}

> Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create 
> symlinks)
> --
>
> Key: MESOS-3782
> URL: https://issues.apache.org/jira/browse/MESOS-3782
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: zhou xing
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5031) Authorization Action enum does not support upgrades.

2016-04-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246807#comment-15246807
 ] 

Benjamin Mahler commented on MESOS-5031:


[~adam-mesos] [~yongtang] See my comment in the review on preferring explicit 
case statements over using 'default': 
https://reviews.apache.org/r/45342/

See context in MESOS-2664 and MESOS-3754.
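
For anyone skimming, the gist of that preference (an illustrative sketch, not the review's code; the enum values and handler are hypothetical stand-ins): with an exhaustive case list and no {{default}}, the compiler's -Wswitch warning flags any newly added enum value the switch forgets to handle, whereas a {{default}} silently swallows it.

{code}
enum class Action { UNKNOWN, REGISTER_FRAMEWORK_WITH_ROLE };

bool authorized(Action action)
{
  // Deliberately no 'default': with -Wswitch, adding a new Action
  // value without a case here becomes a compile-time warning rather
  // than a silent fallthrough at runtime.
  switch (action) {
    case Action::UNKNOWN:
      return false;
    case Action::REGISTER_FRAMEWORK_WITH_ROLE:
      return true;
  }
  return false; // Unreachable for valid enum values.
}
{code}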

> Authorization Action enum does not support upgrades.
> 
>
> Key: MESOS-5031
> URL: https://issues.apache.org/jira/browse/MESOS-5031
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.29.0
>Reporter: Adam B
>Assignee: Yong Tang
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We need to make the Action enum optional in authorization::Request, and add 
> an `UNKNOWN = 0;` enum value. See MESOS-4997 for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5229) Mesos containerizer should support file mounts

2016-04-18 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated MESOS-5229:

Description: 
When using an image to represent a container's file system, it's currently not 
possible to mount a single file into the filesystem. I had to resort to adding 
{{RUN touch /path/to/my/file}} in my Dockerfile in order to get the filesystem 
provisioned properly.

It would be great if this wasn't necessary. Even better would be if Mesos would 
create all mount points on demand, rather than requiring them to be present in 
the container filesystem (c.f. 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L522-L527)

  was:
When using an image to represent a container's file system, it's currently not 
possible to mount a single file into the filesystem. I had to resort to adding 
`RUN touch /path/to/my/file` in my Dockerfile in order to get the filesystem 
provisioned properly.

It would be great if this wasn't necessary. Even better would be if Mesos would 
create all mount points on demand, rather than requiring them to be present in 
the container filesystem (c.f. 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L522-L527)


> Mesos containerizer should support file mounts
> --
>
> Key: MESOS-5229
> URL: https://issues.apache.org/jira/browse/MESOS-5229
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Joshua Cohen
>
> When using an image to represent a container's file system, it's currently 
> not possible to mount a single file into the filesystem. I had to resort to 
> adding {{RUN touch /path/to/my/file}} in my Dockerfile in order to get the 
> filesystem provisioned properly.
> It would be great if this wasn't necessary. Even better would be if Mesos 
> would create all mount points on demand, rather than requiring them to be 
> present in the container filesystem (c.f. 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L522-L527)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5229) Mesos containerizer should support file mounts

2016-04-18 Thread Joshua Cohen (JIRA)
Joshua Cohen created MESOS-5229:
---

 Summary: Mesos containerizer should support file mounts
 Key: MESOS-5229
 URL: https://issues.apache.org/jira/browse/MESOS-5229
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Joshua Cohen


When using an image to represent a container's file system, it's currently not 
possible to mount a single file into the filesystem. I had to resort to adding 
`RUN touch /path/to/my/file` in my Dockerfile in order to get the filesystem 
provisioned properly.

It would be great if this wasn't necessary. Even better would be if Mesos would 
create all mount points on demand, rather than requiring them to be present in 
the container filesystem (c.f. 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/isolators/filesystem/linux.cpp#L522-L527)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5228) Add tests for Capability API.

2016-04-18 Thread Jojy Varghese (JIRA)
Jojy Varghese created MESOS-5228:


 Summary: Add tests for Capability API.
 Key: MESOS-5228
 URL: https://issues.apache.org/jira/browse/MESOS-5228
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Jojy Varghese
Assignee: Jojy Varghese


Add basic tests for the capability API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5224) buffer overflow error in slave upon processing malformed UUIDs

2016-04-18 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-5224:
--
Summary: buffer overflow error in slave upon processing malformed UUIDs  
(was: buffer overflow error in slave upon processing status update from 
executor v1 http API)

> buffer overflow error in slave upon processing malformed UUIDs
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Implementing support for the executor HTTP v1 API in mesos-go:next, my 
> executor can't send status updates because the slave dies upon receiving 
> them. The protobufs were generated from 0.28.1.
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53236daa8 
> 

[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246492#comment-15246492
 ] 

James DeFelice commented on MESOS-5224:
---

Update: thanks for the suggestion; no more errors after setting the UUID to 
the binary-encoded form instead of the string form.

That said, incorrectly encoded UUIDs should NOT crash the slave, so I'm 
leaving this ticket open until that's resolved.

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Implementing support for the executor HTTP v1 API in mesos-go:next; my 
> executor can't send status updates because the slave dies upon receiving 
> them. Protobufs were generated from 0.28.1.
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> 

[jira] [Comment Edited] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

2016-04-18 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246371#comment-15246371
 ] 

Greg Mann edited comment on MESOS-5180 at 4/18/16 8:38 PM:
---

We're currently running into this in a long-running cluster with Mesos and 
Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 
master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) 
at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 
master.cpp:2658] Disconnecting framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 
master.cpp:2682] Deactivating framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 
hierarchical.cpp:375] Deactivated framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393815 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393875 21960 
master.cpp:1299] Giving framework 29b0cddb-f239-47cd-9d43-84624751d5ad- 
(marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 
1weeks to failover
{code}

But looking in the Marathon logs around the same time doesn't yield an 
indication that the scheduler has disconnected. It continues to receive task 
status updates, but doesn't receive offers, as expected.

It would be great if the master's logging messages could provide more 
information about the disconnection when it occurs, if possible.


was (Author: greggomann):
We're currently running into this in a long-running cluster with Mesos and 
Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 
master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) 
at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 
master.cpp:2658] Disconnecting framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 
master.cpp:2682] Deactivating framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 
hierarchical.cpp:375] Deactivated framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 

[jira] [Commented] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

2016-04-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246469#comment-15246469
 ] 

Vinod Kone commented on MESOS-5180:
---

The offers are not being received by the scheduler because the framework is 
marked as deactivated in the allocator.

The fact that status updates and other messages are received by the scheduler 
indicates that the master is able to open new temporary sockets to send them.

{quote}
It would be great if the master's logging messages could provide more 
information about the disconnection when it occurs, if possible.
{quote}

Master logs can provide more info if libprocess can provide more info (socket 
error?) in the exited() message. Can you see if that's possible?

> Scheduler driver does not detect disconnection with master and reregister.
> --
>
> Key: MESOS-5180
> URL: https://issues.apache.org/jira/browse/MESOS-5180
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 0.24.0
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master 
> changing.  (Currently, the scheduler driver will only re-register if the 
> master changes).
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> if link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should 
> implement a `::exited` event handler for the master's {{pid}} and trigger a 
> master (re-)detection upon a disconnection. This in turn should make the 
> driver (re)-register with the master. The scheduler library already does 
> this: 
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
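
A self-contained toy model of the proposed handler pattern (the member names
and {{detectMaster}} below are illustrative stand-ins for the libprocess
machinery, not the actual driver code):

{code}
#include <functional>
#include <iostream>
#include <string>

struct SchedulerDriver
{
  std::string masterPid;
  bool connected = false;

  // Stand-in for the real (re-)registration path.
  std::function<void()> reregister = [] {
    std::cout << "re-registering with master" << std::endl;
  };

  // The `::exited` hook the description proposes installing for the
  // master's pid; invoked when the persistent link breaks.
  void exited(const std::string& pid)
  {
    if (pid == masterPid) {
      connected = false;
      detectMaster();  // re-detect even though the master did not change
    }
  }

  void detectMaster()
  {
    // In the real driver this would be an asynchronous detection future;
    // here we "detect" the same master immediately and re-register.
    connected = true;
    reregister();
  }
};

int main()
{
  SchedulerDriver driver;
  driver.masterPid = "master@127.0.1.1:5050";
  driver.connected = true;

  driver.exited("master@127.0.1.1:5050");  // simulate link (1) breakage
  return 0;
}
{code}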



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246418#comment-15246418
 ] 

James DeFelice commented on MESOS-5224:
---

The docs say that all byte fields should be base64 encoded in JSON.

And, yes, my example executor is sending protobufs to the slave, but JSON makes 
for easier debugging when logged to a text stream :)

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Implementing support for the executor HTTP v1 API in mesos-go:next; my 
> executor can't send status updates because the slave dies upon receiving 
> them. Protobufs were generated from 0.28.1.
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E19_E9_M_invokeERKSt9_Any_dataS7_
> Apr 17 17:53:53 node-1 

[jira] [Commented] (MESOS-1837) failed to determine cgroup for the 'cpu' subsystem

2016-04-18 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246393#comment-15246393
 ] 

Martin Tapp commented on MESOS-1837:


I have the same problem using Mesos 0.28.1 + Marathon 1.1.1, with 3 Mesos 
masters + 3 Marathon masters and quorum set to 2.

{code}
I0418 15:35:44.278214  4849 slave.cpp:3002] Handling status update TASK_FAILED 
(UUID: 44321081-f2f8-4f32-ac17-ea32d2925d62) for task 
dst-app.971f328f-059c-11e6-b40f-44a84205ccb0 of framework 
4c095edc-a5c1-4540-9a07-72e4e3ec8c48- from executor(1)@10.48.176.41:45067
E0418 15:35:44.285468  4774 slave.cpp:3252] Failed to update resources for 
container 1954b97d-ee79-470d-885f-a4b13deae876 of executor 
'dst-app.971f328f-059c-11e6-b40f-44a84205ccb0' running task 
dst-app.971f328f-059c-11e6-b40f-44a84205ccb0 on status update for terminal 
task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: 
Failed to read /proc/35264/cgroup: Failed to open file '/proc/35264/cgroup': No 
such file or directory
I0418 15:35:44.285833  4800 docker.cpp:1696] Destroying container 
'1954b97d-ee79-470d-885f-a4b13deae876'
{code}

Any idea?

Thanks

> failed to determine cgroup for the 'cpu' subsystem
> --
>
> Key: MESOS-1837
> URL: https://issues.apache.org/jira/browse/MESOS-1837
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.20.1
> Environment: Ubuntu 14.04
>Reporter: Chris Fortier
>Assignee: Timothy Chen
>
> Attempting to launch Docker container with Marathon. Container is launched 
> then fails. 
> A search of /var/log/syslog reveals:
> Sep 27 03:01:43 vagrant-ubuntu-trusty-64 mesos-slave[1409]: E0927 
> 03:01:43.546957  1463 slave.cpp:2205] Failed to update resources for 
> container 8c2429d9-f090-4443-8108-0206ca37f3fd of executor 
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 running task 
> hello-world.970dbe74-45f2-11e4-8b1d-56847afe9799 on status update for 
> terminal task, destroying container: Failed to determine cgroup for the 'cpu' 
> subsystem: Failed to read /proc/9792/cgroup: Failed to open file 
> '/proc/9792/cgroup': No such file or directory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

2016-04-18 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246371#comment-15246371
 ] 

Greg Mann commented on MESOS-5180:
--

We're currently running into this in a long-running cluster with Mesos and 
Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 
master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) 
at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 
master.cpp:2658] Disconnecting framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 
master.cpp:2682] Deactivating framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 
hierarchical.cpp:375] Deactivated framework 
29b0cddb-f239-47cd-9d43-84624751d5ad-
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393815 21960 
master.hpp:1825] Master attempted to send message to disconnected framework 
29b0cddb-f239-47cd-9d43-84624751d5ad- (marathon) at 
scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393875 21960 
master.cpp:1299] Giving framework 29b0cddb-f239-47cd-9d43-84624751d5ad- 
(marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 
1weeks to failover
{code}

But looking in the Marathon logs around the same time doesn't yield an 
indication that the scheduler has disconnected. It continues to receive task 
status updates, but doesn't seem to be receiving offers.

It would be great if the master's logging messages could provide more 
information about the disconnection when it occurs, if possible.

> Scheduler driver does not detect disconnection with master and reregister.
> --
>
> Key: MESOS-5180
> URL: https://issues.apache.org/jira/browse/MESOS-5180
> Project: Mesos
>  Issue Type: Bug
>  Components: scheduler driver
>Affects Versions: 0.24.0
>Reporter: Joseph Wu
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with 
> the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master 
> changing.  (Currently, the scheduler driver will only re-register if the 
> master changes).
> If both links break or if just link (1) breaks, the master views the 
> framework as {{inactive}} and {{disconnected}}.  This means the framework 
> will not receive any more events (such as offers) from the master until it 
> re-registers.  There is currently no way for the scheduler to detect a 
> one-way link breakage.
> if link (2) breaks, it makes (almost) no difference to the scheduler.  The 
> scheduler usually uses the link to send messages to the master, but 
> libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should 
> implement a `::exited` event handler for the master's {{pid}} and trigger a 
> master (re-)detection upon a disconnection. This in turn should make the 
> driver (re)-register with the master. The scheduler library already does 
> this: 
> https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5227) Implement HTTP Docker Executor that uses the Executor Library

2016-04-18 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-5227:
-

 Summary: Implement HTTP Docker Executor that uses the Executor 
Library
 Key: MESOS-5227
 URL: https://issues.apache.org/jira/browse/MESOS-5227
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


Similar to what we did with the HTTP command executor in MESOS-3558, we should 
have an HTTP Docker executor that can speak the v1 Executor API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2331) MasterSlaveReconciliationTest.ReconcileRace is flaky

2016-04-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246251#comment-15246251
 ] 

Benjamin Mahler commented on MESOS-2331:


[~haosd...@gmail.com] good eye! I had the same conclusion last week when I took 
a look, but didn't send out a patch. Here is my version of the fix: 
https://reviews.apache.org/r/46339/

> MasterSlaveReconciliationTest.ReconcileRace is flaky
> 
>
> Key: MESOS-2331
> URL: https://issues.apache.org/jira/browse/MESOS-2331
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Yan Xu
>Assignee: Qian Zhang
>  Labels: flaky
>
> {noformat:title=}
> [ RUN  ] MasterSlaveReconciliationTest.ReconcileRace
> Using temporary directory 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV'
> I0206 19:09:44.196542 32362 leveldb.cpp:175] Opened db in 38.230192ms
> I0206 19:09:44.206826 32362 leveldb.cpp:182] Compacted db in 9.988493ms
> I0206 19:09:44.207164 32362 leveldb.cpp:197] Created db iterator in 29979ns
> I0206 19:09:44.207641 32362 leveldb.cpp:203] Seeked to beginning of db in 
> 4478ns
> I0206 19:09:44.207929 32362 leveldb.cpp:272] Iterated through 0 keys in the 
> db in 737ns
> I0206 19:09:44.208222 32362 replica.cpp:743] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0206 19:09:44.209132 32384 recover.cpp:448] Starting replica recovery
> I0206 19:09:44.209524 32384 recover.cpp:474] Replica is in EMPTY status
> I0206 19:09:44.211094 32384 replica.cpp:640] Replica in EMPTY status received 
> a broadcasted recover request
> I0206 19:09:44.211385 32384 recover.cpp:194] Received a recover response from 
> a replica in EMPTY status
> I0206 19:09:44.211902 32384 recover.cpp:565] Updating replica status to 
> STARTING
> I0206 19:09:44.236177 32381 master.cpp:344] Master 
> 20150206-190944-16842879-36452-32362 (lucid) started on 127.0.1.1:36452
> I0206 19:09:44.236291 32381 master.cpp:390] Master only allowing 
> authenticated frameworks to register
> I0206 19:09:44.236305 32381 master.cpp:395] Master only allowing 
> authenticated slaves to register
> I0206 19:09:44.236327 32381 credentials.hpp:35] Loading credentials for 
> authentication from 
> '/tmp/MasterSlaveReconciliationTest_ReconcileRace_NE9nhV/credentials'
> I0206 19:09:44.236601 32381 master.cpp:439] Authorization enabled
> I0206 19:09:44.238539 32381 hierarchical_allocator_process.hpp:284] 
> Initialized hierarchical allocator process
> I0206 19:09:44.238662 32381 whitelist_watcher.cpp:64] No whitelist given
> I0206 19:09:44.239364 32381 master.cpp:1350] The newly elected leader is 
> master@127.0.1.1:36452 with id 20150206-190944-16842879-36452-32362
> I0206 19:09:44.239392 32381 master.cpp:1363] Elected as the leading master!
> I0206 19:09:44.239413 32381 master.cpp:1181] Recovering from registrar
> I0206 19:09:44.239645 32381 registrar.cpp:312] Recovering registrar
> I0206 19:09:44.241142 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 29.029117ms
> I0206 19:09:44.241189 32384 replica.cpp:322] Persisted replica status to 
> STARTING
> I0206 19:09:44.241478 32384 recover.cpp:474] Replica is in STARTING status
> I0206 19:09:44.243075 32384 replica.cpp:640] Replica in STARTING status 
> received a broadcasted recover request
> I0206 19:09:44.243398 32384 recover.cpp:194] Received a recover response from 
> a replica in STARTING status
> I0206 19:09:44.243964 32384 recover.cpp:565] Updating replica status to VOTING
> I0206 19:09:44.255692 32384 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 11.502759ms
> I0206 19:09:44.255765 32384 replica.cpp:322] Persisted replica status to 
> VOTING
> I0206 19:09:44.256009 32384 recover.cpp:579] Successfully joined the Paxos 
> group
> I0206 19:09:44.256253 32384 recover.cpp:463] Recover process terminated
> I0206 19:09:44.257669 32384 log.cpp:659] Attempting to start the writer
> I0206 19:09:44.259944 32377 replica.cpp:476] Replica received implicit 
> promise request with proposal 1
> I0206 19:09:44.268805 32377 leveldb.cpp:305] Persisting metadata (8 bytes) to 
> leveldb took 8.45858ms
> I0206 19:09:44.269067 32377 replica.cpp:344] Persisted promised to 1
> I0206 19:09:44.277974 32383 coordinator.cpp:229] Coordinator attemping to 
> fill missing position
> I0206 19:09:44.279767 32383 replica.cpp:377] Replica received explicit 
> promise request for position 0 with proposal 2
> I0206 19:09:44.288940 32383 leveldb.cpp:342] Persisting action (8 bytes) to 
> leveldb took 9.128603ms
> I0206 19:09:44.289294 32383 replica.cpp:678] Persisted action at 0
> I0206 19:09:44.296417 32377 replica.cpp:510] Replica received write request 
> for position 0
> I0206 19:09:44.296944 32377 leveldb.cpp:437] Reading position from 

[jira] [Updated] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-18 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5196:
--
Shepherd: Yan Xu

> Sandbox GC shouldn't return early in the face of an error.
> --
>
> Key: MESOS-5196
> URL: https://issues.apache.org/jira/browse/MESOS-5196
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Yan Xu
>Assignee: Megha
>
> Since GC's purpose is to clean up stuff that no one cares about anymore, it 
> should do its best to recover as much disk space as possible.
> In practice it's not easy for GC to anticipate what the task has done to the 
> sandbox in a generic manner, e.g., immutable file attribute, mount points, 
> etc. The least it can do is to log the error and continue with the rest of 
> the files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-18 Thread Elizabeth Lingg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246240#comment-15246240
 ] 

Elizabeth Lingg commented on MESOS-5196:


+1. Let's clean up everything we can in the sandbox. For example, if GC 
encounters an immutable file, a file owned by another user (other than the 
task user), or any other error, we should log the error, skip that particular 
file, and continue with GC instead of exiting.
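
A minimal sketch of that log-and-continue loop (illustrative only, using
standard C++17 <filesystem> rather than Mesos's actual gc code; assumes the
sandbox directory itself exists):

{code}
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Remove everything under 'sandbox', logging failures (immutable files,
// foreign-owned files, mount points, ...) instead of returning early.
// Returns the number of entries that could not be removed.
int gcSandbox(const fs::path& sandbox)
{
  int failures = 0;

  for (const auto& entry : fs::directory_iterator(sandbox)) {
    std::error_code ec;
    fs::remove_all(entry.path(), ec);
    if (ec) {
      std::cerr << "Failed to remove '" << entry.path().string()
                << "': " << ec.message() << std::endl;
      ++failures;  // log and keep going with the rest of the sandbox
    }
  }

  return failures;
}

int main()
{
  std::cout << gcSandbox("/tmp/sandbox") << " entries left behind"
            << std::endl;
}
{code}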

> Sandbox GC shouldn't return early in the face of an error.
> --
>
> Key: MESOS-5196
> URL: https://issues.apache.org/jira/browse/MESOS-5196
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Yan Xu
>Assignee: Megha
>
> Since GC's purpose is to clean up stuff that no one cares about anymore, it 
> should do its best to recover as much disk space as possible.
> In practice it's not easy for GC to anticipate what the task has done to the 
> sandbox in a generic manner, e.g., immutable file attribute, mount points, 
> etc. The least it can do is to log the error and continue with the rest of 
> the files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5101) Add CMake build to docker_build.sh

2016-04-18 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5101:
--
  Sprint: Mesosphere Sprint 33
Story Points: 2

> Add CMake build to docker_build.sh
> --
>
> Key: MESOS-5101
> URL: https://issues.apache.org/jira/browse/MESOS-5101
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Juan Larriba
>Assignee: Juan Larriba
>
> Add the CMake build system to docker_build.sh to automatically test the build 
> on Jenkins alongside gcc and clang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-18 Thread Megha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha reassigned MESOS-5196:


Assignee: Megha

> Sandbox GC shouldn't return early in the face of an error.
> --
>
> Key: MESOS-5196
> URL: https://issues.apache.org/jira/browse/MESOS-5196
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Yan Xu
>Assignee: Megha
>
> Since GC's purpose is to clean up stuff that no one cares about anymore, it 
> should do its best to recover as much disk space as possible.
> In practice it's not easy for GC to anticipate what the task has done to the 
> sandbox in a generic manner, e.g., immutable file attribute, mount points, 
> etc. The least it can do is to log the error and continue with the rest of 
> the files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-18 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-5196:
--
Assignee: (was: Yan Xu)

> Sandbox GC shouldn't return early in the face of an error.
> --
>
> Key: MESOS-5196
> URL: https://issues.apache.org/jira/browse/MESOS-5196
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Yan Xu
>
> Since GC's purpose is to clean up stuff that no one cares about anymore, it 
> should do its best to recover as much disk space as possible.
> In practice it's not easy for GC to anticipate what the task has done to the 
> sandbox in a generic manner, e.g., immutable file attribute, mount points, 
> etc. The least it can do is to log the error and continue with the rest of 
> the files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246188#comment-15246188
 ] 

Vinod Kone commented on MESOS-5224:
---

Hmm. Looks like you are base64-encoding the string version of the UUID? You 
should encode the bytes version of the UUID. See the example here: 
https://github.com/apache/mesos/blob/master/src/launcher/http_command_executor.cpp#L808

Also, it looks like you are using the protobuf codec, not the JSON codec, to 
send messages?
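
For illustration, the size difference between the two forms (a plain C++
sketch; in Mesos itself this corresponds to stout's {{UUID::toBytes()}} vs.
{{UUID::toString()}}):

{code}
#include <iomanip>
#include <iostream>
#include <random>
#include <sstream>
#include <string>

int main()
{
  // A UUID is 16 raw bytes; generate a random one for demonstration.
  std::string bytes(16, '\0');
  std::mt19937 rng(std::random_device{}());
  for (char& c : bytes) {
    c = static_cast<char>(rng() & 0xFF);
  }

  // The string form is the 36-character "8-4-4-4-12" hex rendering.
  std::ostringstream out;
  out << std::hex << std::setfill('0');
  for (size_t i = 0; i < bytes.size(); ++i) {
    if (i == 4 || i == 6 || i == 8 || i == 10) {
      out << '-';
    }
    out << std::setw(2)
        << static_cast<int>(static_cast<unsigned char>(bytes[i]));
  }

  // Per the discussion above, the slave expects the 16-byte form in the
  // status update's uuid field; sending the 36-character string form
  // appears to be what tripped the fixed-size buffer here.
  std::cout << bytes.size() << " vs " << out.str().size() << std::endl;  // 16 vs 36
}
{code}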

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> Implementing support for the executor HTTP v1 API in mesos-go:next; my 
> executor can't send status updates because the slave dies upon receiving 
> them. Protobufs were generated from 0.28.1.
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531cc617a 
> mesos::internal::operator<<()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d71837 
> mesos::internal::slave::Slave::statusUpdate()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d302a2 
> mesos::internal::slave::Slave::Http::executor()
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc531d4d4a3 
> 

[jira] [Assigned] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-18 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-5196:
-

Assignee: Yan Xu

> Sandbox GC shouldn't return early in the face of an error.
> --
>
> Key: MESOS-5196
> URL: https://issues.apache.org/jira/browse/MESOS-5196
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> Since GC's purpose is to clean up stuff that no one cares about anymore, it 
> should do its best to recover as much disk space as possible.
> In practice it's not easy for GC to anticipate what the task has done to the 
> sandbox in a generic manner, e.g., immutable file attribute, mount points, 
> etc. The least it can do is to log the error and continue with the rest of 
> the files in the sandbox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246131#comment-15246131
 ] 

haosdent commented on MESOS-4279:
-

I saw [~tnachen]'s and [~jojy]'s comments in 
https://github.com/bydga/mesos/commit/acf79781c04ad9309083dc39131e2c8305331431 
and 
https://github.com/bydga/mesos/commit/a32369005073eee443329f8a6fd6c84f82a1483b
And the latest patch from [~bydga] is 
https://github.com/bydga/mesos/commit/21dcf7453d8fd4a501bfa9878ad3f590f93672ee 
I think [~bydga] could post it on Review Board for further review.

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the folks from 
> Mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> getting a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246093#comment-15246093
 ] 

Vinod Kone commented on MESOS-4279:
---

Thanks for your patience [~bydga].

What's the action item here? [~alexr] are you going to shepherd the fixes?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the folks from 
> Mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> getting a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245965#comment-15245965
 ] 

haosdent commented on MESOS-4812:
-

The gif is a bit large although I have already compressed it.
!https://issues.apache.org/jira/secure/attachment/12799279/health_task.gif!

It shows the steps in Marathon I mentioned above.

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this 
> JIRA. I don't know if this only affects the command health check.
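
A small sketch of the escaping being discussed ({{naiveWrap}} and
{{escapedWrap}} are illustrative names, not Mesos functions):

{code}
#include <iostream>
#include <string>

// Naive wrapping: the user's inner double quotes terminate the outer
// quoting, so the shell sees a different command than intended.
std::string naiveWrap(const std::string& cmd)
{
  return "/bin/sh -c \"" + cmd + "\"";
}

// Escaped wrapping: backslash-escape '"' and '\' before wrapping, so the
// inner quotes survive intact.
std::string escapedWrap(const std::string& cmd)
{
  std::string escaped;
  for (char c : cmd) {
    if (c == '"' || c == '\\') {
      escaped += '\\';
    }
    escaped += c;
  }
  return "/bin/sh -c \"" + escaped + "\"";
}

int main()
{
  const std::string check = "curl -f \"http://localhost:8080/health\"";
  std::cout << naiveWrap(check) << std::endl;    // broken quoting
  std::cout << escapedWrap(check) << std::endl;  // quoting preserved
}
{code}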



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4812:

Attachment: health_task.gif

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this 
> JIRA. I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-18 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245949#comment-15245949
 ] 

Michael Gummelt commented on MESOS-5197:


How can we make it so the commands are printed?

> Log executor commands w/o verbose logs enabled
> --
>
> Key: MESOS-5197
> URL: https://issues.apache.org/jira/browse/MESOS-5197
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Gummelt
>Assignee: Yong Tang
>  Labels: mesosphere
>
> To debug executors, it's often necessary to know the command that ran the 
> executor.  For example, when Spark executors fail, I'd like to know the 
> command used to invoke the executor (Spark uses the command executor in a 
> docker container).  Currently, it's only output if GLOG_v is enabled, but I 
> don't think this should be a "verbose" output.  It's a common debugging need.
> https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
> cc [~kaysoky]
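
A minimal sketch of the glog distinction in question (assuming the fix is
simply promoting the call site from {{VLOG(1)}} to {{LOG(INFO)}};
{{launchExecutor}} below is hypothetical, not the actual docker.cpp code):

{code}
#include <string>

#include <glog/logging.h>

// VLOG(1) output only shows up when the agent runs with GLOG_v >= 1,
// while LOG(INFO) always reaches the INFO log.
void launchExecutor(const std::string& command)
{
  VLOG(1) << "Running " << command;    // today: hidden by default
  LOG(INFO) << "Running " << command;  // proposed: visible by default
}

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;

  launchExecutor("docker run ...");
  return 0;
}
{code}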



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245931#comment-15245931
 ] 

haosdent commented on MESOS-4812:
-

{quote}
What you have to send to Mesos is
/bin/bash -c \"
{quote}

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this 
> JIRA. I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-18 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245948#comment-15245948
 ] 

Michael Gummelt commented on MESOS-5197:


How can we make it so the commands are printed?

> Log executor commands w/o verbose logs enabled
> --
>
> Key: MESOS-5197
> URL: https://issues.apache.org/jira/browse/MESOS-5197
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Gummelt
>Assignee: Yong Tang
>  Labels: mesosphere
>
> To debug executors, it's often necessary to know the command that ran the 
> executor.  For example, when Spark executors fail, I'd like to know the 
> command used to invoke the executor (Spark uses the command executor in a 
> docker container).  Currently, it's only output if GLOG_v is enabled, but I 
> don't think this should be a "verbose" output.  It's a common debugging need.
> https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-18 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated MESOS-5197:
---
Comment: was deleted

(was: How can we make it so the commands are printed?)

> Log executor commands w/o verbose logs enabled
> --
>
> Key: MESOS-5197
> URL: https://issues.apache.org/jira/browse/MESOS-5197
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Gummelt
>Assignee: Yong Tang
>  Labels: mesosphere
>
> To debug executors, it's often necessary to know the command that ran the 
> executor.  For example, when Spark executors fail, I'd like to know the 
> command used to invoke the executor (Spark uses the command executor in a 
> docker container).  Currently, it's only output if GLOG_v is enabled, but I 
> don't think this should be a "verbose" output.  It's a common debugging need.
> https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3335) FlagsBase copy-ctor leads to dangling pointer

2016-04-18 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245869#comment-15245869
 ] 

Benjamin Bannier edited comment on MESOS-3335 at 4/18/16 3:44 PM:
--

This appears to be a bigger problem than just possibly mutilating some help 
string.

The above call to {{add}} uses a pointer to memory not necessarily associated 
with the {{Flags}} object for both reading and writing, and as you show, even 
in situations where the scope holding on to that memory has already been 
cleaned up.  We do use this {{add}} overload (there are actually two of them) 
to add a large portion of existing {{Flag}} values.

This being undefined behavior, we cannot rely on this error just messing up 
things locally; in fact, e.g., an optimized test build built with clang-3.9.0 
will immediately segfault because of this.

I believe a possible fix would be to enforce that we only pass pointers to 
member variables, and at the usage sites get the actual values via the current 
{{Flags}} object; this is already used partially to implement {{add}} support 
of {{Flag}} values from derived {{Flags}}.
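
For illustration, a tiny standalone sketch of the proposed direction 
(a pointer-to-member resolved at the usage site instead of a captured raw 
pointer); the real stout API is more involved:
{code}
#include <iostream>
#include <string>

struct Flags
{
  bool help = false;

  // Instead of capturing a raw pointer to a member (which dangles after a
  // copy, or once the owning stack frame dies), resolve a pointer-to-member
  // against whichever Flags object is current at the usage site.
  std::string stringify(bool Flags::*member) const
  {
    return (this->*member) ? "true" : "false";
  }
};

int main()
{
  Flags original;
  original.help = true;

  Flags copy = original;  // safe: nothing in 'copy' points at 'original'
  std::cout << copy.stringify(&Flags::help) << std::endl;  // prints "true"
  return 0;
}
{code}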



was (Author: bbannier):
This appears a bigger problem than just possibly mutilating some help string.

Above call to {{add}} uses a pointer to memory not necessarily associated to 
the {{Flags}} object for both reading and writing, and like you show even in 
situations where the scope holding on to that memory has already been cleaned 
up.  We do use this {{add}} overload (there are actually two of them) to add a 
large portion of existing {{Flag}} values.

This being undefined behavior we cannot rely on this error just messing up 
things locally; in fact e.g., an optimized test built built with clang-3.9.0 
will immediately segfault because of this.

I believe a possible fix would be to enforce that we only pass pointers to 
member variables and add the usage sites get the actual values via the current 
{{Flags}} object; this is already used partially to implement {{add}} support 
of {{Flag}} values from derived {{Flags}}.


> FlagsBase copy-ctor leads to dangling pointer
> -
>
> Key: MESOS-3335
> URL: https://issues.apache.org/jira/browse/MESOS-3335
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>Priority: Minor
> Attachments: lambda_capture_bug.cpp
>
>
> Per [#3328], ubsan detects the following problem:
> [ RUN ] FaultToleranceTest.ReregisterCompletedFrameworks
> /mesos/3rdparty/libprocess/3rdparty/stout/include/stout/flags/flags.hpp:303:25:
>  runtime error: load of value 33, which is not a valid value for type 'bool'
> I believe what is going on here is the following:
> * The test calls StartMaster(), which does MesosTest::CreateMasterFlags()
> * MesosTest::CreateMasterFlags() allocates a new master::Flags on the stack, 
> which is subsequently copy-constructed back to StartMaster()
> * The FlagsBase constructor is:
> bq. {{FlagsBase() { add(&help, "help", "...", false); }}}
> where "help" is a member variable -- i.e., it is allocated on the stack in 
> this case.
> * {{FlagsBase()::add}} captures {{&help}}, e.g.:
> {noformat}
> flag.stringify = [t1](const FlagsBase&) -> Option<std::string> {
>   return stringify(*t1);
> };
> {noformat}
> * The implicit copy constructor for FlagsBase is just going to copy the 
> lambda above, i.e., the result of the copy constructor will have a lambda 
> that points into MesosTest::CreateMasterFlags()'s stack frame, which is bad 
> news.
> Not sure the right fix -- comments welcome. You could define a copy-ctor for 
> FlagsBase that does something gross (basically remove the old help flag and 
> define a new one that points into the target of the copy), but that seems, 
> well, gross.
> Probably not a pressing-problem to fix -- AFAICS worst symptom is that we end 
> up reading one byte from some random stack location when serving 
> {{state.json}}, for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3335) FlagsBase copy-ctor leads to dangling pointer

2016-04-18 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245869#comment-15245869
 ] 

Benjamin Bannier commented on MESOS-3335:
-

This appears to be a bigger problem than just possibly mutilating some help 
string.

The above call to {{add}} uses a pointer to memory not necessarily associated 
with the {{Flags}} object for both reading and writing, and as you show, even 
in situations where the scope holding on to that memory has already been 
cleaned up.  We do use this {{add}} overload (there are actually two of them) 
to add a large portion of existing {{Flag}} values.

This being undefined behavior, we cannot rely on this error just messing up 
things locally; in fact, e.g., an optimized test build built with clang-3.9.0 
will immediately segfault because of this.

I believe a possible fix would be to enforce that we only pass pointers to 
member variables, and at the usage sites get the actual values via the current 
{{Flags}} object; this is already used partially to implement {{add}} support 
of {{Flag}} values from derived {{Flags}}.


> FlagsBase copy-ctor leads to dangling pointer
> -
>
> Key: MESOS-3335
> URL: https://issues.apache.org/jira/browse/MESOS-3335
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>Priority: Minor
> Attachments: lambda_capture_bug.cpp
>
>
> Per [#3328], ubsan detects the following problem:
> [ RUN ] FaultToleranceTest.ReregisterCompletedFrameworks
> /mesos/3rdparty/libprocess/3rdparty/stout/include/stout/flags/flags.hpp:303:25:
>  runtime error: load of value 33, which is not a valid value for type 'bool'
> I believe what is going on here is the following:
> * The test calls StartMaster(), which does MesosTest::CreateMasterFlags()
> * MesosTest::CreateMasterFlags() allocates a new master::Flags on the stack, 
> which is subsequently copy-constructed back to StartMaster()
> * The FlagsBase constructor is:
> bq. {{FlagsBase() { add(&help, "help", "...", false); }}}
> where "help" is a member variable -- i.e., it is allocated on the stack in 
> this case.
> * {{FlagsBase()::add}} captures {{&help}}, e.g.:
> {noformat}
> flag.stringify = [t1](const FlagsBase&) -> Option<std::string> {
>   return stringify(*t1);
> };
> {noformat}
> * The implicit copy constructor for FlagsBase is just going to copy the 
> lambda above, i.e., the result of the copy constructor will have a lambda 
> that points into MesosTest::CreateMasterFlags()'s stack frame, which is bad 
> news.
> Not sure the right fix -- comments welcome. You could define a copy-ctor for 
> FlagsBase that does something gross (basically remove the old help flag and 
> define a new one that points into the target of the copy), but that seems, 
> well, gross.
> Probably not a pressing-problem to fix -- AFAICS worst symptom is that we end 
> up reading one byte from some random stack location when serving 
> {{state.json}}, for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread Lukas Loesche (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245786#comment-15245786
 ] 

Lukas Loesche edited comment on MESOS-4812 at 4/18/16 2:49 PM:
---

I think you're mixing up syntax. What you wrote was
{noformat}
command.set_value("bash -c \"
{noformat}

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread Lukas Loesche (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245786#comment-15245786
 ] 

Lukas Loesche edited comment on MESOS-4812 at 4/18/16 2:48 PM:
---

I think you're mixing up syntax. What you wrote was
{noformat}
command.set_value("bash -c \"
{noformat}

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread Lukas Loesche (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245786#comment-15245786
 ] 

Lukas Loesche commented on MESOS-4812:
--

I think you're mixing up syntax. What you wrote was
{noformat}
command.set_value("bash -c \"
{noformat}

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245784#comment-15245784
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

It's usually not in my job description to make videos... :) But I knew I should 
have added the Mission Impossible theme song to the background to make it more 
dramatic and attractive!

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245754#comment-15245754
 ] 

haosdent commented on MESOS-4279:
-

Your video looks funny :-)

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245741#comment-15245741
 ] 

haosdent commented on MESOS-4812:
-

[~lloesche] Sorry for my unclear explanation. I mean I could not reproduce the 
problem you mentioned in the GitHub link. As you see, I use
{code}
"healthChecks": [
...
"command": {
  "value": "bash -c \"
{code}

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5226) The image-less task launched by mesos-execute can not join CNI network

2016-04-18 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245690#comment-15245690
 ] 

Qian Zhang commented on MESOS-5226:
---

RR: https://reviews.apache.org/r/46329/

> The image-less task launched by mesos-execute can not join CNI network
> --
>
> Key: MESOS-5226
> URL: https://issues.apache.org/jira/browse/MESOS-5226
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> With {{mesos-execute}}, if we launch a task which wants to join a CNI 
> network but has no image specified, like:
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --networks=net1 --command="ifconfig" --shell=true
> {code}
> The corresponding command executor actually will not join the specified CNI 
> network; instead it is still in the agent host network namespace.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-18 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245684#comment-15245684
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Just to make it transparent, I've exchanged a few emails with Alex to clarify 
and summarize the issues. Today I confirmed both issues still persist on mesos 
0.29 + marathon 1.1.1. Here's a video demonstrating both of them: 
https://www.youtube.com/watch?v=vDUA9_ASYW0. 

Btw, don't forget my GitHub, I've already fixed both problems there... ;)

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I 
> came across the following issue:
> (it was already discussed on 
> https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere 
> got to the point that it's probably a docker containerizer problem...)
> To sum it up:
> When I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when i wrap this python script in a docker:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: yes
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately without 
> having a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-18 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245671#comment-15245671
 ] 

Qian Zhang commented on MESOS-5225:
---

[~avin...@mesosphere.io], yes, I think for the command executor which has its 
own rootfs, we should do the bind mounts in both the host filesystem and its 
own filesystem.

I was already working on a patch for this bug and have posted it here: 
https://reviews.apache.org/r/46331/, please review and let me know if you have 
any comments :-)

> Command executor can not start when joining a CNI network
> -
>
> Key: MESOS-5225
> URL: https://issues.apache.org/jira/browse/MESOS-5225
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Avinash Sridharan
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 
> --containerizers=mesos --image_providers=docker 
> --isolation=filesystem/linux,docker/runtime,network/cni 
> --network_cni_config_dir=/opt/cni/net_configs 
> --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
> --docker_image=library/busybox --networks=net1 --command="sleep 10" 
> --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
> 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
> '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
> est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
> framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources 
> cpus(*):0.1; mem(*):32 in work directory '/t
> mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
> '3c4796f0-eee7-4939-a036-7c6387c370eb-00
> 00'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
> 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
> '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
> 1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process 
> with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
> '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' 
> for container 2b29d6d6-b314-4
> 77f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
> '192.168.1.2/24' from CNI network 'net1' for container 
> 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
> container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
> '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
> 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
> 

[jira] [Comment Edited] (MESOS-5142) Add agent flags for HTTP authorization

2016-04-18 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231895#comment-15231895
 ] 

Jan Schlicht edited comment on MESOS-5142 at 4/18/16 1:37 PM:
--

https://reviews.apache.org/r/45922/
https://reviews.apache.org/r/46203/


was (Author: nfnt):
https://reviews.apache.org/r/45922/

> Add agent flags for HTTP authorization
> --
>
> Key: MESOS-5142
> URL: https://issues.apache.org/jira/browse/MESOS-5142
> Project: Mesos
>  Issue Type: Bug
>  Components: security, slave
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> Flags should be added to the agent to:
> 1. Enable authorization ({{--authorizers}})
> 2. Provide ACLs ({{--acls}})



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5225) Command executor can not start when joining a CNI network

2016-04-18 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-5225:
--
Description: 
Reproduce steps:
1. Start master
{code}
sudo ./bin/mesos-master.sh --work_dir=/tmp
{code}
 
2. Start agent
{code}
sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 --containerizers=mesos 
--image_providers=docker 
--isolation=filesystem/linux,docker/runtime,network/cni 
--network_cni_config_dir=/opt/cni/net_configs 
--network_cni_plugins_dir=/opt/cni/plugins
{code}
 
3. Launch a command task with mesos-execute, and it will join a CNI network 
{{net1}}.
{code}
sudo src/mesos-execute --master=192.168.122.171:5050 --name=test 
--docker_image=library/busybox --networks=net1 --command="sleep 10" --shell=true
I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-'
Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
Received status update TASK_FAILED for task 'test'
  message: 'Executor terminated'
  source: SOURCE_AGENT
  reason: REASON_EXECUTOR_TERMINATED
{code}

So the task failed with the reason "executor terminated". Here is the agent log:
{code}
I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for 
framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 
3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown 
'/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/t
est/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of 
framework 3c4796f0-eee7-4939-a036-7c6387c370eb- with resources cpus(*):0.1; 
mem(*):32 in work directory '/t
mp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container 
'2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework 
'3c4796f0-eee7-4939-a036-7c6387c370eb-00
00'
I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 
'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-
I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs 
'/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea3
1-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process with 
flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for 
container 2b29d6d6-b314-4
77f-b734-7771d07d41e3
I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address 
'192.168.1.2/24' from CNI network 'net1' for container 
2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for 
container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container 
'2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container 
'2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
6.061056ms
I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 
2.46016ms
I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for 
container 2b29d6d6-b31
4-477f-b734-7771d07d41e3
I0418 08:25:37.874194 24914 cni.cpp:1132] Removed the container directory 
'/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3'
I0418 08:25:37.877306 24912 linux.cpp:814] Ignoring unmounting sandbox/work 
directory for container 2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.879295 24912 provisioner.cpp:338] Destroying container rootfs at 
'/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3
a-ea31-45f6-b578-a62cd02392e7' for container 
2b29d6d6-b314-477f-b734-7771d07d41e3
I0418 08:25:37.970871 24914 slave.cpp:4113] Executor 'test' of framework 

[jira] [Commented] (MESOS-4610) MasterContender/MasterDetector should be loadable as modules

2016-04-18 Thread Kapil Arya (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245577#comment-15245577
 ] 

Kapil Arya commented on MESOS-4610:
---

Followup cleanup commits:

{code}
commit 9ac2ddae78a26c2730b2a1c15e14f64b17a8583b
Author: Kapil Arya 
Date:   Thu Apr 7 18:24:06 2016 -0400

    Removed unused headers from master contender/detector files.

Review: https://reviews.apache.org/r/45901

commit a4e11f8186a15a8159ba0eb46335595843e053ba
Author: Kapil Arya 
Date:   Thu Apr 7 18:21:08 2016 -0400

Removed stale contender/detector files.

Also updated Makefile.am to move contender/detector module library
declarations into the existing list.

Review: https://reviews.apache.org/r/45900
{code}

> MasterContender/MasterDetector should be loadable as modules
> 
>
> Key: MESOS-4610
> URL: https://issues.apache.org/jira/browse/MESOS-4610
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Mark Cavage
>Assignee: Mark Cavage
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Currently mesos depends on Zookeeper for leader election and notification to 
> slaves, although there is a C++ hierarchy in the code to support alternatives 
> (e.g., unit tests use an in-memory implementation). From an operational 
> perspective, many organizations/users do not want to take a dependency on 
> Zookeeper, and use an alternative solution to implementing leader election. 
> Our organization in particular, very much wants this, and as a reference 
> there have been several requests from the community (see referenced tickets) 
> to replace with etcd/consul/etc.
> This ticket will serve as the work effort to modularize the 
> MasterContender/MasterDetector APIs such that integrators can build a 
> pluggable solution of their choice; this ticket will not fold in any 
> implementations such as etcd et al., but simply move this hierarchy to be 
> fully pluggable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2016-04-18 Thread Lukas Loesche (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245575#comment-15245575
 ] 

Lukas Loesche commented on MESOS-4812:
--

Hi, thanks for looking into this! I know how to make the health check pass. In 
the GitHub link above I explained how I worked around the problem, which is 
essentially your first solution.

The second solution is broken for systems that don't use bash for /bin/sh, as 
/dev/tcp is a bash-only thing.

Anyway, finding some workaround is not the problem. This issue is about Mesos 
(or Marathon) doing the wrong thing, imho. Why would the user need to know 
about the details of the implementation to get a valid shell command executed?

Why would you expect the user to read the Mesos source code to find out that 
his command is executed inside a
{noformat}
/bin/sh -c ""
{noformat}
and that's why he has to escape double quotes in his own command before sending 
it to Mesos? That seems unreasonable to me.
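
For illustration, a minimal sketch of the kind of escaping Mesos would have to 
apply before wrapping a user command in {{/bin/sh -c ""}}; the helper below is 
hypothetical, not the actual fix:
{code}
#include <iostream>
#include <string>

// Hypothetical helper: escape the characters that are special inside a
// double-quoted sh string, so the user's command survives the wrapping.
std::string escapeForDoubleQuotes(const std::string& command)
{
  std::string escaped;
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      escaped += '\\';
    }
    escaped += c;
  }
  return escaped;
}

int main()
{
  const std::string command = "echo \"hello\"";
  // Prints: /bin/sh -c "echo \"hello\""
  std::cout << "/bin/sh -c \"" << escapeForDoubleQuotes(command) << "\""
            << std::endl;
  return 0;
}
{code}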


> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c "
> {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3335) FlagsBase copy-ctor leads to dangling pointer

2016-04-18 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-3335:
---

Assignee: Benjamin Bannier

> FlagsBase copy-ctor leads to dangling pointer
> -
>
> Key: MESOS-3335
> URL: https://issues.apache.org/jira/browse/MESOS-3335
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benjamin Bannier
>Priority: Minor
> Attachments: lambda_capture_bug.cpp
>
>
> Per [#3328], ubsan detects the following problem:
> [ RUN ] FaultToleranceTest.ReregisterCompletedFrameworks
> /mesos/3rdparty/libprocess/3rdparty/stout/include/stout/flags/flags.hpp:303:25:
>  runtime error: load of value 33, which is not a valid value for type 'bool'
> I believe what is going on here is the following:
> * The test calls StartMaster(), which does MesosTest::CreateMasterFlags()
> * MesosTest::CreateMasterFlags() allocates a new master::Flags on the stack, 
> which is subsequently copy-constructed back to StartMaster()
> * The FlagsBase constructor is:
> bq. {{FlagsBase() { add(&help, "help", "...", false); }}}
> where "help" is a member variable -- i.e., it is allocated on the stack in 
> this case.
> * {{FlagsBase()::add}} captures {{&help}}, e.g.:
> {noformat}
> flag.stringify = [t1](const FlagsBase&) -> Option<std::string> {
>   return stringify(*t1);
> };
> {noformat}
> * The implicit copy constructor for FlagsBase is just going to copy the 
> lambda above, i.e., the result of the copy constructor will have a lambda 
> that points into MesosTest::CreateMasterFlags()'s stack frame, which is bad 
> news.
> Not sure the right fix -- comments welcome. You could define a copy-ctor for 
> FlagsBase that does something gross (basically remove the old help flag and 
> define a new one that points into the target of the copy), but that seems, 
> well, gross.
> Probably not a pressing-problem to fix -- AFAICS worst symptom is that we end 
> up reading one byte from some random stack location when serving 
> {{state.json}}, for example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5155) Consolidate authorization actions for quota.

2016-04-18 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245342#comment-15245342
 ] 

Alexander Rukletsov commented on MESOS-5155:


I'd also vote for Adam's option 1, with the addition that if only old types are 
used, a deprecation warning is issued.

Regarding `READ_QUOTA` — I think we can add it later if necessary. As [~zhitao] 
points out, we don't currently have ACLs for read-only actions.

> Consolidate authorization actions for quota.
> 
>
> Key: MESOS-5155
> URL: https://issues.apache.org/jira/browse/MESOS-5155
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Zhitao Li
>  Labels: mesosphere
>
> We should have just a single authz action: {{UPDATE_QUOTA_WITH_ROLE}}. It was 
> a mistake in retrospect to introduce multiple actions.
> Actions that are not symmetrical are register/teardown and dynamic 
> reservations. They are implemented this way because the entities that perform 
> one action differ from the entities that perform the other. For example, 
> register framework is issued by a framework, teardown by an operator. What is 
> a good way to identify a framework? The role it runs in, which may differ 
> between launches and makes no sense in a multi-role frameworks setup, or 
> better, a sort of group id, which is its principal. Dynamic reservations and 
> persistent volumes can both be issued by frameworks and operators, hence 
> similar reasoning applies. 
> Now, quota is associated with a role and set only by operators. Do we need to 
> care about principals that set it? Not that much. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5155) Consolidate authorization actions for quota.

2016-04-18 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245342#comment-15245342
 ] 

Alexander Rukletsov edited comment on MESOS-5155 at 4/18/16 9:13 AM:
-

I'd also vote for Adam's option 1, with the addition that if only old types are 
used, a deprecation warning is issued.

Regarding {{READ_QUOTA}} — I think we can add it later if necessary. As 
[~zhitao] points out, we don't currently have ACLs for read-only actions.


was (Author: alexr):
I'd also vote for Adam's option 1, with the addition that if only old types are 
used, a deprecation warning is issued.

Regarding `READ_QUOTA` — I think we can add it later if necessary. As [~zhitao] 
points out, we don't currently have ACLs for read-only actions.

> Consolidate authorization actions for quota.
> 
>
> Key: MESOS-5155
> URL: https://issues.apache.org/jira/browse/MESOS-5155
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: Zhitao Li
>  Labels: mesosphere
>
> We should have just a single authz action: {{UPDATE_QUOTA_WITH_ROLE}}. It was 
> a mistake in retrospect to introduce multiple actions.
> Actions that are not symmetrical are register/teardown and dynamic 
> reservations. They are implemented this way because the entities that perform 
> one action differ from the entities that perform the other. For example, 
> register framework is issued by a framework, teardown by an operator. What is 
> a good way to identify a framework? The role it runs in, which may differ 
> between launches and makes no sense in a multi-role frameworks setup, or 
> better, a sort of group id, which is its principal. Dynamic reservations and 
> persistent volumes can both be issued by frameworks and operators, hence 
> similar reasoning applies. 
> Now, quota is associated with a role and set only by operators. Do we need to 
> care about principals that set it? Not that much. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4576) Introduce a stout helper for "which"

2016-04-18 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu reassigned MESOS-4576:
--

Assignee: Guangya Liu  (was: Disha Singh)

> Introduce a stout helper for "which"
> 
>
> Key: MESOS-4576
> URL: https://issues.apache.org/jira/browse/MESOS-4576
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Joseph Wu
>Assignee: Guangya Liu
>  Labels: mesosphere
>
> We may want to add a helper to {{stout/os.hpp}} that will natively emulate 
> the functionality of the Linux utility {{which}}.  i.e.
> {code}
> Option<std::string> which(const std::string& command)
> {
>   Option<std::string> path = os::getenv("PATH");
>   // Loop through path and return the first one which os::exists(...).
>   return None();
> }
> {code}
> This helper may be useful:
> * for test filters in {{src/tests/environment.cpp}}
> * a few tests in {{src/tests/containerizer/port_mapping_tests.cpp}}
> * the {{sha512}} utility in {{src/common/command_utils.cpp}}
> * as runtime checks in the {{LogrotateContainerLogger}}
> * etc.
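
For reference, one way the loop might be fleshed out with existing stout 
primitives; this is a sketch only, and the final helper's API may well differ:
{code}
#include <string>

#include <stout/foreach.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>
#include <stout/os.hpp>
#include <stout/path.hpp>
#include <stout/strings.hpp>

Option<std::string> which(const std::string& command)
{
  Option<std::string> path = os::getenv("PATH");
  if (path.isNone()) {
    return None();
  }

  // Return the first candidate under a PATH entry which os::exists(...).
  foreach (const std::string& token, strings::tokenize(path.get(), ":")) {
    const std::string candidate = path::join(token, command);
    if (os::exists(candidate)) {
      return candidate;
    }
  }

  return None();
}
{code}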



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-18 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245223#comment-15245223
 ] 

Jay Guo edited comment on MESOS-3781 at 4/18/16 7:42 AM:
-

Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase
2. While loading flag values from cmd/env in FlagsBase::load(), it generates 
warnings by determining the actual name being used. (Add check logic to the 
*Flag.load* lambda? It takes *DeprecatedNames* in its capture, as well as the 
*name* actually used to load the value, and generates a warning if *name* falls 
into DeprecatedNames.) Something like this:
{code}
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
{code}
3. Add duplicate names to all applicable flags

My concerns:
1. Why both _Name_ and _deprecatedName_ structs? We only need to know 
whether a name is deprecated. Also, I don't see any instance that has *multiple* 
deprecated names, so why a vector of structs?
2. If the sole purpose of having this vector of structs is to search for 
deprecated names, I suggest using a _set_ instead.
3. Are we overengineering this? 'slave' flags will eventually be removed, along 
with deprecatedNames. Nevertheless, I like the idea of having multi-name flags.
4. If we are renaming the original flag names, we ought to rename them in the 
codebase where they are used. Is that within the scope of this ticket?

Thanks!
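
For what it's worth, a standalone toy of the _set_ lookup suggested in concern 
2; all names here are made up:
{code}
#include <iostream>
#include <set>
#include <string>

// Toy stand-in for the Flag.load check: warn when the name actually used
// on the command line is one of the deprecated aliases.
void load(const std::string& name, const std::set<std::string>& deprecated)
{
  if (deprecated.count(name) > 0) {
    std::cerr << "WARNING: flag '" << name << "' is deprecated" << std::endl;
  }
  // ... load the value as usual ...
}

int main()
{
  const std::set<std::string> deprecated = {"slave"};
  load("slave", deprecated);  // warns
  load("agent", deprecated);  // silent
  return 0;
}
{code}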


was (Author: guoger):
Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase
2. While loading flag values from cmd/env in FlagsBase::load(), it generates 
warnings by determining actual name being used. (Add check logic to *Flag.load* 
lambda? It takes *DeprecatedNames* in capture, as well as the *name* used to 
actual load the value, and generate warnings if *name* falls into 
DeprecatedNames) Something like this:
{code}
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try {
  ...
  if (deprecatedNames.find(name)) { deprecationWarning(name); }
  ...
};
{code}
3. Add duplicate names to all applicable flags

My concerns:
1. Why both _Name_ and _deprecatedName_ structs? Since we only need to know 
whether it is deprecated. Also, I don't see any instance that has *multiple* 
deprecated names, so why vector of structs?
2. If the sole purpose of having this vector of structs is to search for 
deprecated names, I suggest to use _set_ instead.
3. Are we overengineering this? 'slave' flags will eventually be removed, along 
with deprecatedNames. Nevertheless, I like the idea of having multi-name flags.

Thanks!

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5060) Requesting /files/read.json with a negative length value causes subsequent /files requests to 404.

2016-04-18 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245232#comment-15245232
 ] 

zhou xing edited comment on MESOS-5060 at 4/18/16 7:16 AM:
---

Hi [~greggomann], I took a look at the code today and, based on your 
suggestions, I'm thinking we could change the logic as follows:

* for {{length}}:
  (1). if the user provides a non-negative number, then keep the current code 
logic, i.e. set {{length=min(length, pageSize*16)}}
  (2). if the user provides a negative number, set {{length=pageSize * 1}} and 
log a warning message.
  (3). if the user does not provide this argument, I tend to think that this 
means the user wants to view the content after {{offset}}, which means 
{{length=min(size - offset, pageSize * 16)}}. Log a message on this.

* for {{offset}}:
  (1). if the user provides a non-negative number, then keep the current code 
logic, i.e. set the offset to what the user provides.
  (2). if the user provides a negative number, then set {{offset=size}} (note 
that currently, if the user gives a negative offset, it leads to a file read 
failure). Log a warning message for this.
  (3). if the user does not provide this argument, I tend to think this means 
the user wants to view the whole file, so set {{offset=0}}. Log a message for 
this.

If you think the above logic is ok, then I'll submit a patch. Thanks for your 
time reviewing this!
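
For illustration, a small sketch of the proposed clamping with the constants 
and types simplified; this is not the actual files.cpp code:
{code}
#include <algorithm>
#include <cstddef>

const size_t PAGE_SIZE = 4096;  // assumed page size for the sketch

// 'provided' is false when the request carries no length argument;
// 'size' is the file size and 'offset' the already-validated offset.
size_t clampLength(long length, bool provided, size_t offset, size_t size)
{
  if (!provided) {
    // No length given: view the content after offset, capped at 16 pages.
    return std::min(size - offset, PAGE_SIZE * 16);
  }

  if (length < 0) {
    // Negative length: warn and fall back to one page.
    return PAGE_SIZE * 1;
  }

  return std::min(static_cast<size_t>(length), PAGE_SIZE * 16);
}
{code}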



was (Author: dongdong):
Hi Greg, I took a look at the code today and based on your suggestions, I'm 
thinking whether we can change the logic as the followings:

* for {{length}}:
  (1). if user provides a non-negative number, then keep current code logic, 
set {{length=min(length, pageSize*16)}}
  (2). if user provides a negative number,  set {{length=pageSize * 1}}. Log a 
warning message on this.
  (3). if user does not provide this argument, I tend to think that this means 
user wants to view the content after {{offset}}, which means {{length=min(size 
- offset, pageSize * 16)}}. Log a message on this.

* for {{offset}}:
  (1). if user provides a non-negative number, then keep current code logic, 
set the offset as what user provides.
  (2). if user provides a negative number, then set {{offset=size}} (note that, 
now, if user gives a negative offset in the argument, it will lead to a reading 
file failure error). Log a warning message for this.
  (3). if user does not provide this argument, I tend to think this means user 
wants to view the whole file, set {{offset=0}}. Log a message for this.

If you think the above logic is ok, then I'll submit a patch on this. Thanks 
for your time on reviewing this!


> Requesting /files/read.json with a negative length value causes subsequent 
> /files requests to 404.
> --
>
> Key: MESOS-5060
> URL: https://issues.apache.org/jira/browse/MESOS-5060
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
> Environment: Mesos 0.23.0 on CentOS 6, also Mesos 0.28.0 on OSX
>Reporter: Tom Petr
>Assignee: zhou xing
>Priority: Minor
> Fix For: 0.29.0
>
>
> I accidentally hit a slave's /files/read.json endpoint with a negative length 
> (ex. http://hostname:5051/files/read.json?path=XXX&offset=0&length=-100). The 
> HTTP request timed out after 30 seconds with nothing relevant in the slave 
> logs, and subsequent calls to any of the /files endpoints on that slave 
> immediately returned a HTTP 404 response. We ultimately got things working 
> again by restarting the mesos-slave process (checkpointing FTW!), but it'd be 
> wise to guard against negative lengths on the slave's end too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5060) Requesting /files/read.json with a negative length value causes subsequent /files requests to 404.

2016-04-18 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245232#comment-15245232
 ] 

zhou xing commented on MESOS-5060:
--

Hi Greg, I took a look at the code today and, based on your suggestions, I'm 
thinking we could change the logic as follows:

* for {{length}}:
  (1). if the user provides a non-negative number, then keep the current code 
logic, i.e. set {{length=min(length, pageSize*16)}}
  (2). if the user provides a negative number, set {{length=pageSize * 1}} and 
log a warning message.
  (3). if the user does not provide this argument, I tend to think that this 
means the user wants to view the content after {{offset}}, which means 
{{length=min(size - offset, pageSize * 16)}}. Log a message on this.

* for {{offset}}:
  (1). if the user provides a non-negative number, then keep the current code 
logic, i.e. set the offset to what the user provides.
  (2). if the user provides a negative number, then set {{offset=size}} (note 
that currently, if the user gives a negative offset, it leads to a file read 
failure). Log a warning message for this.
  (3). if the user does not provide this argument, I tend to think this means 
the user wants to view the whole file, so set {{offset=0}}. Log a message for 
this.

If you think the above logic is ok, then I'll submit a patch. Thanks for your 
time reviewing this!


> Requesting /files/read.json with a negative length value causes subsequent 
> /files requests to 404.
> --
>
> Key: MESOS-5060
> URL: https://issues.apache.org/jira/browse/MESOS-5060
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
> Environment: Mesos 0.23.0 on CentOS 6, also Mesos 0.28.0 on OSX
>Reporter: Tom Petr
>Assignee: zhou xing
>Priority: Minor
> Fix For: 0.29.0
>
>
> I accidentally hit a slave's /files/read.json endpoint with a negative length 
> (ex. http://hostname:5051/files/read.json?path=XXX&offset=0&length=-100). The 
> HTTP request timed out after 30 seconds with nothing relevant in the slave 
> logs, and subsequent calls to any of the /files endpoints on that slave 
> immediately returned a HTTP 404 response. We ultimately got things working 
> again by restarting the mesos-slave process (checkpointing FTW!), but it'd be 
> wise to guard against negative lengths on the slave's end too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-18 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245223#comment-15245223
 ] 

Jay Guo edited comment on MESOS-3781 at 4/18/16 7:12 AM:
-

Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase
2. While loading flag values from cmd/env in FlagsBase::load(), it generates 
warnings by determining the actual name being used. (Add check logic to the 
*Flag.load* lambda? It takes *DeprecatedNames* in its capture, as well as the 
*name* actually used to load the value, and generates a warning if *name* falls 
into DeprecatedNames.) Something like this:
{code}
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
{code}
3. Add duplicate names to all applicable flags

My concerns:
1. Why both _Name_ and _deprecatedName_ structs? We only need to know 
whether a name is deprecated. Also, I don't see any instance that has *multiple* 
deprecated names, so why a vector of structs?
2. If the sole purpose of having this vector of structs is to search for 
deprecated names, I suggest using a _set_ instead.
3. Are we overengineering this? 'slave' flags will eventually be removed, along 
with deprecatedNames. Nevertheless, I like the idea of having multi-name flags.

Thanks!
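
For concern 2, a minimal, self-contained sketch of the set-based lookup (the 
names here are assumptions for illustration; the real stout FlagsBase wiring 
differs):
{code}
// Sketch of the set-based deprecated-name check suggested above
// (assumed names; not actual stout code).
#include <iostream>
#include <set>
#include <string>

void deprecationWarning(const std::string& name)
{
  std::cerr << "WARNING: flag name '" << name
            << "' is deprecated" << std::endl;
}

int main()
{
  const std::set<std::string> deprecatedNames = {"slave"};

  // `name` is the spelling actually used on the command line / env.
  const std::string name = "slave";

  if (deprecatedNames.count(name) > 0) {
    deprecationWarning(name);  // emits the deprecation warning
  }
  return 0;
}
{code}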


was (Author: guoger):
Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase.
2. While loading flag values from the command line/environment in 
FlagsBase::load(), it generates warnings by determining the actual name being 
used. (Add check logic to the __Flag.load__ lambda? It takes the 
__DeprecatedName__ struct in its capture, as well as the _name_ actually used 
to load the value, and generates a warning if _name_ falls into 
DeprecatedName.) Something like this:
{code}
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
{code}
3. Add duplicate names to all applicable flags.

My concerns:
1. Why have both _Name_ and _deprecatedName_ structs, since we only need to 
know whether a name is deprecated? Also, I don't see any instance that has 
__multiple__ deprecated names, so why a vector of structs?
2. If the sole purpose of this vector of structs is to search for deprecated 
names, I suggest using a _set_ instead.
3. Are we overengineering this? The 'slave' flags will eventually be removed, 
along with deprecatedNames. Nevertheless, I like the idea of having 
multi-name flags.

Thanks!

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-18 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245223#comment-15245223
 ] 

Jay Guo edited comment on MESOS-3781 at 4/18/16 7:10 AM:
-

Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase.
2. While loading flag values from the command line/environment in 
FlagsBase::load(), it generates warnings by determining the actual name being 
used. (Add check logic to the __Flag.load__ lambda? It takes the 
__DeprecatedName__ struct in its capture, as well as the _name_ actually used 
to load the value, and generates a warning if _name_ falls into 
DeprecatedName.) Something like this:
{code}
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
{code}
3. Add duplicate names to all applicable flags.

My concerns:
1. Why have both _Name_ and _deprecatedName_ structs, since we only need to 
know whether a name is deprecated? Also, I don't see any instance that has 
__multiple__ deprecated names, so why a vector of structs?
2. If the sole purpose of this vector of structs is to search for deprecated 
names, I suggest using a _set_ instead.
3. Are we overengineering this? The 'slave' flags will eventually be removed, 
along with deprecatedNames. Nevertheless, I like the idea of having 
multi-name flags.

Thanks!


was (Author: guoger):
Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase.
2. While loading flag values from the command line/environment in 
FlagsBase::load(), it generates warnings by determining the actual name being 
used. (Add check logic to the __Flag.load__ lambda? It takes the 
__DeprecatedName__ struct in its capture, as well as the _name_ actually used 
to load the value, and generates a warning if _name_ falls into 
DeprecatedName.) Something like this:
```cpp
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
```
3. Add duplicate names to all applicable flags.

My concerns:
1. Why have both _Name_ and _deprecatedName_ structs, since we only need to 
know whether a name is deprecated? Also, I don't see any instance that has 
__multiple__ deprecated names, so why a vector of structs?
2. If the sole purpose of this vector of structs is to search for deprecated 
names, I suggest using a _set_ instead.
3. Are we overengineering this? The 'slave' flags will eventually be removed, 
along with deprecatedNames. Nevertheless, I like the idea of having 
multi-name flags.

Thanks!

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-18 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245223#comment-15245223
 ] 

Jay Guo commented on MESOS-3781:


Here's what I understand from your comments:
1. We should enable multi-named flags in FlagsBase.
2. While loading flag values from the command line/environment in 
FlagsBase::load(), it generates warnings by determining the actual name being 
used. (Add check logic to the __Flag.load__ lambda? It takes the 
__DeprecatedName__ struct in its capture, as well as the _name_ actually used 
to load the value, and generates a warning if _name_ falls into 
DeprecatedName.) Something like this:
```cpp
flag.load = [t1, deprecatedNames](FlagsBase*, const std::string& name, const 
std::string& value) -> Try<Nothing> {
  ...
  if (deprecatedNames.count(name) > 0) { deprecationWarning(name); }
  ...
};
```
3. Add duplicate names to all applicable flags.

My concerns:
1. Why have both _Name_ and _deprecatedName_ structs, since we only need to 
know whether a name is deprecated? Also, I don't see any instance that has 
__multiple__ deprecated names, so why a vector of structs?
2. If the sole purpose of this vector of structs is to search for deprecated 
names, I suggest using a _set_ instead.
3. Are we overengineering this? The 'slave' flags will eventually be removed, 
along with deprecatedNames. Nevertheless, I like the idea of having 
multi-name flags.

Thanks!

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters

2016-04-18 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245208#comment-15245208
 ] 

haosdent commented on MESOS-3367:
-

Got it. I think MESOS-4735 is a better approach, so let me close this. Feel 
free to reopen if you think it is still necessary.
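
For reference, a minimal sketch of the kind of fix discussed here: strip the 
query string before the extension check ({{basenameWithoutQuery}} is a 
hypothetical helper for illustration, not actual fetcher code):
{code}
// Sketch: drop "?..." from the saved file name so the ".tgz" suffix
// is recognized again (illustrative only).
#include <iostream>
#include <string>

std::string basenameWithoutQuery(const std::string& uri)
{
  // Take everything after the last '/', then cut at the first '?'.
  std::string base = uri.substr(uri.find_last_of('/') + 1);
  const size_t query = base.find('?');
  return query == std::string::npos ? base : base.substr(0, query);
}

int main()
{
  // "file.tgz?hasi" would be saved verbatim; dropping "?hasi" restores
  // the recognizable ".tgz" extension.
  std::cout << basenameWithoutQuery("https://foo.com/file.tgz?hasi")
            << std::endl;  // prints "file.tgz"
  return 0;
}
{code}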

> Mesos fetcher does not extract archives for URI with parameters
> ---
>
> Key: MESOS-3367
> URL: https://issues.apache.org/jira/browse/MESOS-3367
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.22.1, 0.23.0
> Environment: DCOS 1.1
>Reporter: Renat Zubairov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere
>
> I'm deploying Marathon applications with sources served from S3. I'm using 
> a signed URL to give only temporary access to the S3 resources, so the URL 
> of the resource has some query parameters.
> So the URI is 'https://foo.com/file.tgz?hasi' and the fetcher stores it in 
> a file named 'file.tgz?hasi'; it then thinks the extension 'hasi' is not 
> tgz, so extraction is skipped, despite the fact that the MIME type of the 
> HTTP resource is 'application/x-tar'.
> Workaround - add an additional parameter like '=.tgz'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5224) buffer overflow error in slave upon processing status update from executor v1 http API

2016-04-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245206#comment-15245206
 ] 

James DeFelice commented on MESOS-5224:
---

Here's the JSON of an update that's "rejected" by the slave. I don't know if 
this is THE update that's crashing the slave, but it seems likely, since the 
connection is dropped and I see an EOF on the executor. All of the updates 
are generated in exactly the same way (via 
https://github.com/mesos/mesos-go/blob/executor_proto/cmd/example-executor/main.go#L208).
{code}
{ 
  "executor_id":{"value":"default"},
  "framework_id":{"value":"ad9e5972-8b5e-4042-b97f-ecc36f2c046f-0011"},
  "type":"UPDATE",
  "update":{
"status":{ 
  "task_id":{"value":"1"},
  "state":"TASK_RUNNING",
  "source":"SOURCE_EXECUTOR",
  "executor_id":{"value":"default"},
  "uuid":"ZTZlZTRlNmMtNzE0Ni00NTAwLWJkZWYtNDc0Yzk2MWNmNGU4" // 
base64-decoded: e6ee4e6c-7146-4500-bdef-474c961cf4e8
}
  }
}
{code}
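
One thing worth noting: the base64 payload above decodes to the 36-character 
textual UUID, not 16 raw bytes, which may be what trips the slave's 
StatusUpdate formatting. A purely illustrative guard for that distinction 
(hypothetical, not Mesos code):
{code}
// Illustrative only. A raw UUID payload is exactly 16 bytes; the
// textual form "e6ee4e6c-..." is 36 characters.
#include <iostream>
#include <string>

bool looksLikeRawUuid(const std::string& decoded)
{
  return decoded.size() == 16;
}

int main()
{
  // The base64 string in the update above decodes to this 36-character
  // textual form rather than 16 raw bytes.
  const std::string decoded = "e6ee4e6c-7146-4500-bdef-474c961cf4e8";

  std::cout << (looksLikeRawUuid(decoded)
                  ? "raw 16-byte UUID"
                  : "not a raw UUID (textual form?)")
            << std::endl;
  return 0;
}
{code}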

> buffer overflow error in slave upon processing status update from executor v1 
> http API
> --
>
> Key: MESOS-5224
> URL: https://issues.apache.org/jira/browse/MESOS-5224
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.28.0
> Environment: {code}
> $ dpkg -l|grep -e mesos
> ii  mesos   0.28.0-2.0.16.ubuntu1404 
> amd64Cluster resource manager with efficient resource isolation
> $ uname -a
> Linux node-3 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 
> x86_64 x86_64 x86_64 GNU/Linux
> {code}
>Reporter: James DeFelice
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> While implementing support for the executor HTTP v1 API in mesos-go:next, I 
> found that my executor can't send status updates because the slave dies upon 
> receiving them. The protobufs were generated from 0.28.1.
> from syslog:
> {code}
> Apr 17 17:53:53 node-1 mesos-slave[4462]: I0417 17:53:53.121467  4489 
> http.cpp:190] HTTP POST for /slave(1)/api/v1/executor from 10.2.0.5:51800 
> with User-Agent='Go-http-client/1.1'
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** buffer overflow detected ***: 
> /usr/sbin/mesos-slave terminated
> Apr 17 17:53:53 node-1 mesos-slave[4462]: === Backtrace: =
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x7338f)[0x7fc53064e38f]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7fc5306e5c9c]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(+0x109b60)[0x7fc5306e4b60]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internallsERSoRKNS0_12StatusUpdateE+0x16a)[0x7fc531cc617a]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN5mesos8internal5slave5Slave12statusUpdateENS0_12StatusUpdateERK6OptionIN7process4UPIDEE+0xe7)[0x7fc531d71837]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZNK5mesos8internal5slave5Slave4Http8executorERKN7process4http7RequestE+0xb52)[0x7fc531d302a2]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0xc754a3)[0x7fc531d4d4a3]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x1295aa8)[0x7fc53236daa8]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x2d1)[0x7fc532375a71]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/local/lib/libmesos-0.28.0.so(+0x129dd77)[0x7fc532375d77]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1bf0)[0x7fc530e85bf0]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc5309a8182]
> Apr 17 17:53:53 node-1 mesos-slave[4462]: 
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc5306d547d]
> ...
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** Aborted at 1460915633 (unix 
> time) try "date -d @1460915633" if you are using GNU date ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: PC: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: *** SIGABRT (@0x116e) received by 
> PID 4462 (TID 0x7fc5275f5700) from PID 4462; stack trace: ***
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5309b0340 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc530611cc9 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306150d8 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc53064e394 (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e5c9c (unknown)
> Apr 17 17:53:53 node-1 mesos-slave[4462]: @ 0x7fc5306e4b60 (unknown)
> Apr 17