[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0rc3 are broken

2015-07-25 Thread Dr. Stefan Schimanski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641581#comment-14641581
 ] 

Dr. Stefan Schimanski commented on MESOS-3136:
--

[~tnachen] I think you hit the nail on the head. The issue I have encountered 
is exactly that: the new docker executor does not yet support task health 
checks, whereas the old one in 0.22 did.

Now the actual problem is that Marathon has supported those COMMAND health 
checks for a long time and people use them a lot. Hence, this looks like a 
regression in 0.23+ for Marathon users.

Would it be feasible to fix that by taking (or refactoring) the 
launcher/executor.cpp code for the docker executor? Or in fact, it would be 
even more awesome (although also kind of incompatible with the old behavior) if 
the new docker executor could execute the health checks inside the 
corresponding container via docker exec. Kubernetes is following this route 
as well.
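
To illustrate the docker exec idea, here is a minimal sketch in C++; the 
container name, the check command, and the helper itself are hypothetical and 
not part of the actual Mesos executor code:
{noformat}
// Minimal sketch (hypothetical): run a COMMAND health check inside a
// running container via `docker exec` and treat the command's exit
// status as the health result.
#include <cstdlib>
#include <iostream>
#include <string>

bool runHealthCheck(const std::string& containerName,
                    const std::string& healthCommand)
{
  // `sh -c` mirrors how a COMMAND check is an arbitrary shell command line.
  const std::string command =
    "docker exec " + containerName + " sh -c '" + healthCommand + "'";

  // The check passes when the command exits 0.
  return std::system(command.c_str()) == 0;
}

int main()
{
  // Container name and check command are made up for illustration; in Mesos
  // they would come from the task and its health check definition.
  const bool healthy = runHealthCheck(
      "mesos-task-abc123", "curl -f http://localhost:8080/health");

  std::cout << (healthy ? "healthy" : "unhealthy") << std::endl;
  return healthy ? 0 : 1;
}
{noformat}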

 COMMAND health checks with Marathon 0.10.0rc3 are broken
 

 Key: MESOS-3136
 URL: https://issues.apache.org/jira/browse/MESOS-3136
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.23.0
Reporter: Dr. Stefan Schimanski

 When deploying Mesos 0.23rc4 with the latest Marathon 0.10.0 RC3, command 
 health checks stop working. Rolling back to Mesos 0.22.1 fixes the problem.
 Containerizer is Docker.
 All packages are from official Mesosphere Ubuntu 14.04 sources.
 The issue must be analyzed further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3119) Remove pthread specific code from Libprocess

2015-07-25 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641585#comment-14641585
 ] 

Joris Van Remoortere commented on MESOS-3119:
-

Hi Anand, the motivation is in the epic :-)

 Remove pthread specific code from Libprocess
 

 Key: MESOS-3119
 URL: https://issues.apache.org/jira/browse/MESOS-3119
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joris Van Remoortere
Assignee: Joris Van Remoortere
  Labels: libprocess, mesosphere, windows
 Fix For: 0.24.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3098) Implement WindowsContainerizer and WindowsDockerContainerizer

2015-07-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641595#comment-14641595
 ] 

haosdent edited comment on MESOS-3098 at 7/25/15 1:55 PM:
--

Is a Windows container actually Linux-based? Or is a Windows container 
Windows-based and only able to run Windows programs?


was (Author: haosd...@gmail.com):
Does WindowsContainer is a Linux actually? Or windows container is a windows 
and only could run windows program?

 Implement WindowsContainerizer and WindowsDockerContainerizer
 -

 Key: MESOS-3098
 URL: https://issues.apache.org/jira/browse/MESOS-3098
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Joseph Wu
Assignee: Alex Clemmer
  Labels: mesosphere

 The MVP for Windows support is a containerizer that (1) runs on Windows, and 
 (2) runs and passes all the tests that are relevant to the Windows platform 
 (_e.g._, not the tests that involve cgroups). To do this we require at least 
 a `WindowsContainerizer` (to be implemented alongside the 
 `MesosContainerizer`), which provides no meaningful (_e.g._) process 
 namespacing (much like the default Unix containerizer). In the long term 
 (hopefully before MesosCon) we also want to support the Windows container 
 API. This will require implementing a separate containerizer, maybe called 
 `WindowsDockerContainerizer`.
 Since the Windows container API is actually officially supported through the 
 Docker interface (_i.e._, MSFT actually ported the Docker engine to Windows, 
 and that is the official API), the interfaces (like the fetcher) shouldn't 
 change much. The tests probably will have to change, as we don't have access 
 to any isolation primitives like cgroups for those tests.
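 As a purely illustrative sketch of a `WindowsContainerizer` living alongside 
 other containerizers behind one interface, the skeleton below uses made-up, 
 heavily simplified names; it is not the actual Mesos Containerizer API.
 {noformat}
 // Hypothetical, heavily simplified sketch; not the real Mesos
 // Containerizer interface. It only shows the structural idea of a
 // Windows containerizer that provides no real process namespacing.
 #include <map>
 #include <string>

 class Containerizer
 {
 public:
   virtual ~Containerizer() {}

   // Launch the given command for a container; returns true on success.
   virtual bool launch(const std::string& containerId,
                       const std::string& command) = 0;

   // Tear the container down.
   virtual void destroy(const std::string& containerId) = 0;
 };

 // "No isolation" containerizer: a real one would simply spawn ordinary
 // processes (e.g. via CreateProcess on Windows); here we only record the
 // launched commands for illustration.
 class WindowsContainerizer : public Containerizer
 {
 public:
   bool launch(const std::string& containerId,
               const std::string& command) override
   {
     commands_[containerId] = command;  // placeholder for actually spawning
     return true;
   }

   void destroy(const std::string& containerId) override
   {
     commands_.erase(containerId);
   }

 private:
   std::map<std::string, std::string> commands_;
 };
 {noformat}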
 Outstanding TODO([~hausdorff]): Flesh out this description when more details 
 are available, regarding:
 * The container API for Windows (when we know them)
 * The nuances of Windows vs Linux (when we know them)
 * etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-527) Command Executors do not have Executor IDs in the master.

2015-07-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641651#comment-14641651
 ] 

haosdent commented on MESOS-527:


Patch: https://reviews.apache.org/r/36814/

I use the executor in mesos-execute directly. [~bmahler] Is this approach acceptable?

 Command Executors do not have Executor IDs in the master.
 -

 Key: MESOS-527
 URL: https://issues.apache.org/jira/browse/MESOS-527
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
  Labels: twitter

 The webui is broken for command executors because the master does not know 
 the executor ID for the tasks using a command executor. This is because the 
 Task protobuf only has the executor_id field, no other field to indicate the 
 presence of the command executor.
 It seems the slave also doesn't set the Task.executor_id for command 
 executors, thus relying on it being optionally set in executorTerminated() to 
 determine whether the task used a command executor.
 This all seems pretty messy; a few things to consider:
 1) Should we simply always set the Task.executor_id for these tasks (see the 
 sketch after this list)? The master could do so currently, but there would be 
 an implicit contract that the slave and master both use the task id as the 
 executor id.
 2) We can add a boolean is_command_executor to Task, so that both the master 
 and slave can set the field, and the slave can use the boolean in 
 executorTerminated() to determine whether the task used a command executor.
 3) Alternatively, we can add a /frameworks/FID/tasks/TID url format for the 
 broken links on the master webui, so that we can search for the task in the 
 slave state to locate its executor.
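
 For illustration, here is a minimal sketch of option (1), using simplified 
 stand-in structs rather than the real Task/TaskInfo protobufs; all names 
 below are hypothetical.
 {noformat}
 #include <cassert>
 #include <string>

 // Simplified stand-ins for the real protobuf messages, for illustration only.
 struct TaskInfo
 {
   std::string taskId;
   bool hasExecutor = false;  // true if the framework supplied its own executor
   std::string executorId;    // only meaningful when hasExecutor is true
 };

 struct Task
 {
   std::string taskId;
   std::string executorId;
 };

 // Option (1): always populate Task.executor_id. For command-executor tasks
 // the executor id is derived from the task id, which is the implicit
 // master/slave contract mentioned above.
 Task createTask(const TaskInfo& info)
 {
   Task task;
   task.taskId = info.taskId;
   task.executorId = info.hasExecutor ? info.executorId : info.taskId;
   return task;
 }

 int main()
 {
   TaskInfo commandTask;
   commandTask.taskId = "task-1";  // no executor supplied => command executor

   // The command-executor task ends up with executor_id == task_id.
   assert(createTask(commandTask).executorId == "task-1");
   return 0;
 }
 {noformat}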



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2857) FetcherCacheTest.LocalCachedExtract is flaky.

2015-07-25 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641500#comment-14641500
 ] 

Bernd Mathiske commented on MESOS-2857:
---

Thx. Will investigate. 



 FetcherCacheTest.LocalCachedExtract is flaky.
 -

 Key: MESOS-2857
 URL: https://issues.apache.org/jira/browse/MESOS-2857
 Project: Mesos
  Issue Type: Bug
  Components: fetcher, test
Reporter: Benjamin Mahler
Assignee: Bernd Mathiske
  Labels: flaky-test, mesosphere

 From jenkins:
 {noformat}
 [ RUN  ] FetcherCacheTest.LocalCachedExtract
 Using temporary directory '/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj'
 I0610 20:04:48.591573 24561 leveldb.cpp:176] Opened db in 3.512525ms
 I0610 20:04:48.592456 24561 leveldb.cpp:183] Compacted db in 828630ns
 I0610 20:04:48.592512 24561 leveldb.cpp:198] Created db iterator in 32992ns
 I0610 20:04:48.592531 24561 leveldb.cpp:204] Seeked to beginning of db in 
 8967ns
 I0610 20:04:48.592545 24561 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 7762ns
 I0610 20:04:48.592604 24561 replica.cpp:744] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0610 20:04:48.593438 24587 recover.cpp:449] Starting replica recovery
 I0610 20:04:48.593698 24587 recover.cpp:475] Replica is in EMPTY status
 I0610 20:04:48.595641 24580 replica.cpp:641] Replica in EMPTY status received 
 a broadcasted recover request
 I0610 20:04:48.596086 24590 recover.cpp:195] Received a recover response from 
 a replica in EMPTY status
 I0610 20:04:48.596607 24590 recover.cpp:566] Updating replica status to 
 STARTING
 I0610 20:04:48.597507 24590 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 717888ns
 I0610 20:04:48.597535 24590 replica.cpp:323] Persisted replica status to 
 STARTING
 I0610 20:04:48.597697 24590 recover.cpp:475] Replica is in STARTING status
 I0610 20:04:48.599165 24584 replica.cpp:641] Replica in STARTING status 
 received a broadcasted recover request
 I0610 20:04:48.599434 24584 recover.cpp:195] Received a recover response from 
 a replica in STARTING status
 I0610 20:04:48.599915 24590 recover.cpp:566] Updating replica status to VOTING
 I0610 20:04:48.600545 24590 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 432335ns
 I0610 20:04:48.600574 24590 replica.cpp:323] Persisted replica status to 
 VOTING
 I0610 20:04:48.600659 24590 recover.cpp:580] Successfully joined the Paxos 
 group
 I0610 20:04:48.600797 24590 recover.cpp:464] Recover process terminated
 I0610 20:04:48.602905 24594 master.cpp:363] Master 
 20150610-200448-3875541420-32907-24561 (dbade881e927) started on 
 172.17.0.231:32907
 I0610 20:04:48.602957 24594 master.cpp:365] Flags at startup: --acls= 
 --allocation_interval=1secs --allocator=HierarchicalDRF 
 --authenticate=true --authenticate_slaves=true --authenticators=crammd5 
 --credentials=/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/credentials 
 --framework_sorter=drf --help=false --initialize_driver_logging=true 
 --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO 
 --quiet=false --recovery_slave_removal_limit=100% 
 --registry=replicated_log --registry_fetch_timeout=1mins 
 --registry_store_timeout=25secs --registry_strict=true 
 --root_submissions=true --slave_reregister_timeout=10mins 
 --user_sorter=drf --version=false 
 --webui_dir=/mesos/mesos-0.23.0/_inst/share/mesos/webui 
 --work_dir=/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/master 
 --zk_session_timeout=10secs
 I0610 20:04:48.603374 24594 master.cpp:410] Master only allowing 
 authenticated frameworks to register
 I0610 20:04:48.603392 24594 master.cpp:415] Master only allowing 
 authenticated slaves to register
 I0610 20:04:48.603404 24594 credentials.hpp:37] Loading credentials for 
 authentication from 
 '/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/credentials'
 I0610 20:04:48.603751 24594 master.cpp:454] Using default 'crammd5' 
 authenticator
 I0610 20:04:48.604928 24594 master.cpp:491] Authorization enabled
 I0610 20:04:48.606034 24593 hierarchical.hpp:309] Initialized hierarchical 
 allocator process
 I0610 20:04:48.606106 24593 whitelist_watcher.cpp:79] No whitelist given
 I0610 20:04:48.607430 24594 master.cpp:1476] The newly elected leader is 
 master@172.17.0.231:32907 with id 20150610-200448-3875541420-32907-24561
 I0610 20:04:48.607466 24594 master.cpp:1489] Elected as the leading master!
 I0610 20:04:48.607481 24594 master.cpp:1259] Recovering from registrar
 I0610 20:04:48.607712 24594 registrar.cpp:313] Recovering registrar
 I0610 20:04:48.608543 24588 log.cpp:661] Attempting to start the writer
 I0610 20:04:48.610231 24588 replica.cpp:477] Replica received implicit 
 promise request with proposal 1
 I0610 20:04:48.611335 24588 leveldb.cpp:306] Persisting metadata (8 

[jira] [Commented] (MESOS-2411) trailing slash in work_dir causes sandbox link issues

2015-07-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641494#comment-14641494
 ] 

haosdent commented on MESOS-2411:
-

Could not reproduce this problem in 0.23. Please check again, [~elmalto].

 trailing slash in work_dir causes sandbox link issues
 -

 Key: MESOS-2411
 URL: https://issues.apache.org/jira/browse/MESOS-2411
 Project: Mesos
  Issue Type: Bug
  Components: webui
Affects Versions: 0.20.1
Reporter: Malte Buecken
Assignee: haosdent
Priority: Trivial
   Original Estimate: 1h
  Remaining Estimate: 1h

 OS: Debian (wheezy-backports 7)
 When you define a work_dir and the work_dir has a trailing /, you cannot open 
 the sandbox in the webui anymore, because the resulting // in the URL produces 
 an angular.js parse error.
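
 As a minimal sketch of the kind of normalization that avoids the double slash 
 (not the actual webui/master fix; the helper below is hypothetical):
 {noformat}
 #include <cassert>
 #include <string>

 // Hypothetical helper: join a directory and a relative path without
 // producing "//" when the directory was configured with a trailing slash.
 std::string joinPath(std::string directory, const std::string& component)
 {
   while (!directory.empty() && directory.back() == '/') {
     directory.pop_back();  // drop trailing slashes, e.g. "/var/lib/mesos/"
   }
   return directory + "/" + component;
 }

 int main()
 {
   // Both spellings of work_dir yield the same sandbox path fragment.
   assert(joinPath("/var/lib/mesos/", "slaves/S1") == "/var/lib/mesos/slaves/S1");
   assert(joinPath("/var/lib/mesos", "slaves/S1") == "/var/lib/mesos/slaves/S1");
   return 0;
 }
 {noformat}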



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.

2015-07-25 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641502#comment-14641502
 ] 

Klaus Ma commented on MESOS-3070:
-

[~jieyu], do you have any more comments on this? Maybe we can just log an 
error message and fail the new task, since it has only been running for a 
second. I'm just not sure whether other cases would also trigger the check 
failure.

Thanks
Klaus
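
The change suggested above might look roughly like this minimal sketch; the 
types and function below are hypothetical, not the actual master code:
{noformat}
#include <iostream>
#include <set>
#include <string>

// Hypothetical sketch: instead of a fatal CHECK when a framework re-uses a
// task id, log an error and reject the new task. Names are illustrative,
// not the actual Mesos master internals.
struct Framework
{
  std::string id;
  std::set<std::string> taskIds;  // task ids currently known for this framework
};

// Returns true if the task was added, false if rejected as a duplicate.
bool addTask(Framework& framework, const std::string& taskId)
{
  if (framework.taskIds.count(taskId) > 0) {
    // Previously this condition was a fatal CHECK; here we log and refuse,
    // which would surface to the framework as a failed task.
    std::cerr << "Duplicate task '" << taskId << "' of framework "
              << framework.id << "; rejecting the new task" << std::endl;
    return false;
  }

  framework.taskIds.insert(taskId);
  return true;
}

int main()
{
  Framework framework{"framework_id", {}};
  addTask(framework, "task_id_1");                   // accepted
  bool accepted = addTask(framework, "task_id_1");   // rejected as duplicate
  return accepted ? 1 : 0;
}
{noformat}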

 Master CHECK failure if a framework uses duplicated task id.
 

 Key: MESOS-3070
 URL: https://issues.apache.org/jira/browse/MESOS-3070
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.22.1
Reporter: Jie Yu

 We observed this in one of our testing clusters.
 One framework (under development) keeps launching tasks using the same 
 task_id. We don't expect the master to crash even if the framework is not 
 doing what it's supposed to do. However, under the following sequence of 
 events, this can happen and keeps crashing the master:
 1) frameworkA launches task 'task_id_1' on slaveA
 2) master fails over
 3) slaveA has not re-registered yet
 4) frameworkA re-registers and launches task 'task_id_1' on slaveB
 5) slaveA re-registers and adds task 'task_id_1' to frameworkA
 6) CHECK failure in addTask
 {noformat}
 I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with 
 resources cpus(*):4; mem(*):32768 on slave 
 20150417-232509-1735470090-5050-48870-S25 (hostname)
 ...
 ...
 F0716 21:52:50.760136 28805 master.hpp:362] Check failed: 
 !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework 
 framework_id
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)