[jira] [Commented] (MESOS-3136) COMMAND health checks with Marathon 0.10.0rc3 are broken
[ https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641581#comment-14641581 ]

Dr. Stefan Schimanski commented on MESOS-3136:
----------------------------------------------

[~tnachen] I think you hit the nail on the head. The issue I have encountered is exactly that: the new docker executor does not yet support task health checks, while the old one in 0.22 did. The actual problem is that Marathon has supported these COMMAND health checks for a long time and people use them heavily. Hence, this looks like a regression in 0.23+ for Marathon users.

Would it be feasible to fix this by taking (or refactoring) the launcher/executor.cpp code for the docker executor? Even better (although somewhat incompatible with the old behavior), the new docker executor could execute the health checks inside the corresponding container via docker exec. Kubernetes is following this route as well.

COMMAND health checks with Marathon 0.10.0rc3 are broken
    Key: MESOS-3136
    URL: https://issues.apache.org/jira/browse/MESOS-3136
    Project: Mesos
    Issue Type: Bug
    Affects Versions: 0.23.0
    Reporter: Dr. Stefan Schimanski

When deploying Mesos 0.23rc4 with the latest Marathon 0.10.0 RC3, COMMAND health checks stop working. Rolling back to Mesos 0.22.1 fixes the problem. The containerizer is Docker. All packages are from the official Mesosphere Ubuntu 14.04 sources. The issue must be analyzed further.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
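The docker exec idea above can be sketched roughly as follows. This is a hypothetical Python sketch, not Mesos's actual executor code; the function name and container name are illustrative assumptions.

```python
def command_health_check_argv(container_id, command):
    """Build the argv a docker executor could use to run a COMMAND
    health check inside the task's own container (via docker exec),
    instead of on the host as the old 0.22 command executor did.
    Illustrative sketch only."""
    # `sh -c` reflects that a COMMAND health check is a shell command
    # line rather than a pre-split argument vector.
    return ["docker", "exec", container_id, "sh", "-c", command]


# Example: probe a web server running inside the container.
# "mesos-task-1234" is a hypothetical container name.
argv = command_health_check_argv(
    "mesos-task-1234",
    "curl -f http://localhost:8080/health")
```

The point of running the check through docker exec is that the probed command sees the container's own network and filesystem namespaces, matching what Kubernetes does for its exec probes.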
[jira] [Commented] (MESOS-3119) Remove pthread specific code from Libprocess
[ https://issues.apache.org/jira/browse/MESOS-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641585#comment-14641585 ]

Joris Van Remoortere commented on MESOS-3119:
---------------------------------------------

Hi Anand, the motivation is in the epic :-)

Remove pthread specific code from Libprocess
    Key: MESOS-3119
    URL: https://issues.apache.org/jira/browse/MESOS-3119
    Project: Mesos
    Issue Type: Improvement
    Components: libprocess
    Reporter: Joris Van Remoortere
    Assignee: Joris Van Remoortere
    Labels: libprocess, mesosphere, windows
    Fix For: 0.24.0
[jira] [Comment Edited] (MESOS-3098) Implement WindowsContainerizer and WindowsDockerContainerizer
[ https://issues.apache.org/jira/browse/MESOS-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641595#comment-14641595 ]

haosdent edited comment on MESOS-3098 at 7/25/15 1:55 PM:
----------------------------------------------------------

Is a Windows container actually Linux underneath? Or is it a Windows container that can only run Windows programs?

was (Author: haosd...@gmail.com): Does WindowsContainer is a Linux actually? Or windows container is a windows and only could run windows program?

Implement WindowsContainerizer and WindowsDockerContainerizer
    Key: MESOS-3098
    URL: https://issues.apache.org/jira/browse/MESOS-3098
    Project: Mesos
    Issue Type: Task
    Components: containerization
    Reporter: Joseph Wu
    Assignee: Alex Clemmer
    Labels: mesosphere

The MVP for Windows support is a containerizer that (1) runs on Windows, and (2) runs and passes all the tests that are relevant to the Windows platform (_e.g._, not the tests that involve cgroups). To do this we require at least a `WindowsContainerizer` (to be implemented alongside the `MesosContainerizer`), which provides no meaningful (_e.g._) process namespacing (much like the default unix containerizer).

In the long term (hopefully before MesosCon) we also want to support the Windows container API. This will require implementing a separate containerizer, perhaps called `WindowsDockerContainerizer`. Since the Windows container API is officially supported through the Docker interface (_i.e._, MSFT actually ported the Docker engine to Windows, and that is the official API), the interfaces (like the fetcher) shouldn't change much. The tests probably will have to change, as we don't have access to isolation primitives like cgroups for those tests.

Outstanding TODO ([~hausdorff]): Flesh out this description when more details are available, regarding:
* The container API for Windows (when we know them)
* The nuances of Windows vs Linux (when we know them)
* etc.
[jira] [Commented] (MESOS-527) Command Executors do not have Executor IDs in the master.
[ https://issues.apache.org/jira/browse/MESOS-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641651#comment-14641651 ]

haosdent commented on MESOS-527:
--------------------------------

Patch: https://reviews.apache.org/r/36814/
I use the executor in mesos-execute directly. [~bmahler] Is this approach acceptable?

Command Executors do not have Executor IDs in the master.
    Key: MESOS-527
    URL: https://issues.apache.org/jira/browse/MESOS-527
    Project: Mesos
    Issue Type: Bug
    Reporter: Benjamin Mahler
    Labels: twitter

The webui is broken for command executors because the master does not know the executor ID for the tasks using a command executor. This is because the Task protobuf only has the executor_id field, and no other field to indicate the presence of the command executor. It seems the slave also doesn't set Task.executor_id for command executors, thus relying on it being optionally set in executorTerminated() to determine whether the task used a command executor. This all seems pretty messy; a few things to consider:

1) Should we simply always set Task.executor_id for these tasks? The master could do so currently, but there would be an implicit contract that the slave and master both use the task id as the executor id.
2) We can add a boolean is_command_executor to Task, so that both the master and slave can set the field, and the slave can use the boolean in executorTerminated() to determine whether the task used a command executor.
3) Alternatively, we can add a /frameworks/FID/tasks/TID url format for the broken links on the master webui, so that we can search for the task in the slave state to locate its executor.
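Options (1) and (2) from the issue description can be sketched together. This is a hypothetical Python model of the Task message, not the actual protobuf or Mesos code; the class and function names are illustrative.

```python
class Task:
    """Toy model of the Task message for illustrating the two options."""

    def __init__(self, task_id, executor_id=None, is_command_executor=False):
        self.task_id = task_id
        # Option 2: an explicit boolean that both master and slave can
        # set, instead of inferring "command executor" from an unset
        # executor_id.
        self.is_command_executor = is_command_executor
        # Option 1: always set executor_id; the implicit contract is
        # that command executors use the task id as the executor id.
        self.executor_id = executor_id if executor_id is not None else task_id


def used_command_executor(task):
    # What an executorTerminated()-style handler could consult directly
    # under option 2, rather than checking whether executor_id was set.
    return task.is_command_executor


t = Task("task-1", is_command_executor=True)
```

Option 2 avoids option 1's implicit naming contract at the cost of one more field, which is why the description calls the current state "messy": the information exists only by convention.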
[jira] [Commented] (MESOS-2857) FetcherCacheTest.LocalCachedExtract is flaky.
[ https://issues.apache.org/jira/browse/MESOS-2857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641500#comment-14641500 ]

Bernd Mathiske commented on MESOS-2857:
---------------------------------------

Thx. Will investigate.

FetcherCacheTest.LocalCachedExtract is flaky.
    Key: MESOS-2857
    URL: https://issues.apache.org/jira/browse/MESOS-2857
    Project: Mesos
    Issue Type: Bug
    Components: fetcher, test
    Reporter: Benjamin Mahler
    Assignee: Bernd Mathiske
    Labels: flaky-test, mesosphere

From jenkins:
{noformat}
[ RUN      ] FetcherCacheTest.LocalCachedExtract
Using temporary directory '/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj'
I0610 20:04:48.591573 24561 leveldb.cpp:176] Opened db in 3.512525ms
I0610 20:04:48.592456 24561 leveldb.cpp:183] Compacted db in 828630ns
I0610 20:04:48.592512 24561 leveldb.cpp:198] Created db iterator in 32992ns
I0610 20:04:48.592531 24561 leveldb.cpp:204] Seeked to beginning of db in 8967ns
I0610 20:04:48.592545 24561 leveldb.cpp:273] Iterated through 0 keys in the db in 7762ns
I0610 20:04:48.592604 24561 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned
I0610 20:04:48.593438 24587 recover.cpp:449] Starting replica recovery
I0610 20:04:48.593698 24587 recover.cpp:475] Replica is in EMPTY status
I0610 20:04:48.595641 24580 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request
I0610 20:04:48.596086 24590 recover.cpp:195] Received a recover response from a replica in EMPTY status
I0610 20:04:48.596607 24590 recover.cpp:566] Updating replica status to STARTING
I0610 20:04:48.597507 24590 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 717888ns
I0610 20:04:48.597535 24590 replica.cpp:323] Persisted replica status to STARTING
I0610 20:04:48.597697 24590 recover.cpp:475] Replica is in STARTING status
I0610 20:04:48.599165 24584 replica.cpp:641] Replica in STARTING status received a broadcasted recover request
I0610 20:04:48.599434 24584 recover.cpp:195] Received a recover response from a replica in STARTING status
I0610 20:04:48.599915 24590 recover.cpp:566] Updating replica status to VOTING
I0610 20:04:48.600545 24590 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 432335ns
I0610 20:04:48.600574 24590 replica.cpp:323] Persisted replica status to VOTING
I0610 20:04:48.600659 24590 recover.cpp:580] Successfully joined the Paxos group
I0610 20:04:48.600797 24590 recover.cpp:464] Recover process terminated
I0610 20:04:48.602905 24594 master.cpp:363] Master 20150610-200448-3875541420-32907-24561 (dbade881e927) started on 172.17.0.231:32907
I0610 20:04:48.602957 24594 master.cpp:365] Flags at startup: --acls= --allocation_interval=1secs --allocator=HierarchicalDRF --authenticate=true --authenticate_slaves=true --authenticators=crammd5 --credentials=/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/credentials --framework_sorter=drf --help=false --initialize_driver_logging=true --log_auto_initialize=true --logbufsecs=0 --logging_level=INFO --quiet=false --recovery_slave_removal_limit=100% --registry=replicated_log --registry_fetch_timeout=1mins --registry_store_timeout=25secs --registry_strict=true --root_submissions=true --slave_reregister_timeout=10mins --user_sorter=drf --version=false --webui_dir=/mesos/mesos-0.23.0/_inst/share/mesos/webui --work_dir=/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/master --zk_session_timeout=10secs
I0610 20:04:48.603374 24594 master.cpp:410] Master only allowing authenticated frameworks to register
I0610 20:04:48.603392 24594 master.cpp:415] Master only allowing authenticated slaves to register
I0610 20:04:48.603404 24594 credentials.hpp:37] Loading credentials for authentication from '/tmp/FetcherCacheTest_LocalCachedExtract_Cwdcdj/credentials'
I0610 20:04:48.603751 24594 master.cpp:454] Using default 'crammd5' authenticator
I0610 20:04:48.604928 24594 master.cpp:491] Authorization enabled
I0610 20:04:48.606034 24593 hierarchical.hpp:309] Initialized hierarchical allocator process
I0610 20:04:48.606106 24593 whitelist_watcher.cpp:79] No whitelist given
I0610 20:04:48.607430 24594 master.cpp:1476] The newly elected leader is master@172.17.0.231:32907 with id 20150610-200448-3875541420-32907-24561
I0610 20:04:48.607466 24594 master.cpp:1489] Elected as the leading master!
I0610 20:04:48.607481 24594 master.cpp:1259] Recovering from registrar
I0610 20:04:48.607712 24594 registrar.cpp:313] Recovering registrar
I0610 20:04:48.608543 24588 log.cpp:661] Attempting to start the writer
I0610 20:04:48.610231 24588 replica.cpp:477] Replica received implicit promise request with proposal 1
I0610 20:04:48.611335 24588 leveldb.cpp:306] Persisting metadata (8
[jira] [Commented] (MESOS-2411) trailing slash in work_dir causes sandbox link issues
[ https://issues.apache.org/jira/browse/MESOS-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641494#comment-14641494 ]

haosdent commented on MESOS-2411:
---------------------------------

Could not reproduce this problem in 0.23. Please check again, [~elmalto].

trailing slash in work_dir causes sandbox link issues
    Key: MESOS-2411
    URL: https://issues.apache.org/jira/browse/MESOS-2411
    Project: Mesos
    Issue Type: Bug
    Components: webui
    Affects Versions: 0.20.1
    Reporter: Malte Buecken
    Assignee: haosdent
    Priority: Trivial
    Original Estimate: 1h
    Remaining Estimate: 1h

OS: Debian (wheezy-backports 7)
When you define a work_dir with a trailing /, you can no longer open the sandbox in the webui, because the resulting URL contains a double slash (//), which produces an angular.js parse error.
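The fix for this class of bug is to normalize the configured work_dir before building sandbox URLs. A minimal sketch, assuming a hypothetical helper name; this is not the actual Mesos webui code:

```python
def normalize_work_dir(work_dir):
    """Drop trailing slashes so a later join like
    work_dir + '/' + subdir cannot yield '//' in the sandbox URL
    (which the webui's angular.js routing fails to parse)."""
    stripped = work_dir.rstrip("/")
    # Keep a bare "/" intact rather than returning an empty string.
    return stripped if stripped else "/"


# A work_dir configured with a trailing slash no longer produces "//".
url_path = normalize_work_dir("/var/lib/mesos/") + "/slaves/S1/frameworks"
```

Normalizing once at flag-parsing time means every downstream path join stays clean, rather than patching each URL construction site individually.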
[jira] [Commented] (MESOS-3070) Master CHECK failure if a framework uses duplicated task id.
[ https://issues.apache.org/jira/browse/MESOS-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641502#comment-14641502 ]

Klaus Ma commented on MESOS-3070:
---------------------------------

[~jieyu], do you have any more comments on this? Maybe we can just log an error message and fail the new task, since it has only been running for a second. I'm just not sure whether other cases would also trigger the CHECK failure. Thanks, Klaus

Master CHECK failure if a framework uses duplicated task id.
    Key: MESOS-3070
    URL: https://issues.apache.org/jira/browse/MESOS-3070
    Project: Mesos
    Issue Type: Bug
    Affects Versions: 0.22.1
    Reporter: Jie Yu

We observed this in one of our testing clusters. One framework (under development) keeps launching tasks using the same task_id. We don't expect the master to crash even if the framework is not doing what it's supposed to do. However, under the following series of events, this can happen and keeps crashing the master:

1) frameworkA launches task 'task_id_1' on slaveA
2) master fails over
3) slaveA has not re-registered yet
4) frameworkA re-registers and launches task 'task_id_1' on slaveB
5) slaveA re-registers and adds task 'task_id_1' to frameworkA
6) CHECK failure in addTask

{noformat}
I0716 21:52:50.759305 28805 master.hpp:159] Adding task 'task_id_1' with resources cpus(*):4; mem(*):32768 on slave 20150417-232509-1735470090-5050-48870-S25 (hostname)
...
...
F0716 21:52:50.760136 28805 master.hpp:362] Check failed: !tasks.contains(task->task_id()) Duplicate task 'task_id_1' of framework framework_id
{noformat}
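Klaus's suggestion of logging an error and failing the task, instead of a fatal CHECK, could look roughly like this. A hedged Python sketch, not the actual master.hpp code; the class and function names are illustrative.

```python
import logging


class Framework:
    """Toy stand-in for the master's per-framework bookkeeping."""

    def __init__(self, framework_id):
        self.framework_id = framework_id
        self.tasks = {}  # task_id -> resources


def add_task(framework, task_id, resources):
    """Instead of CHECK(!tasks.contains(task_id)) crashing the whole
    master, log an error and reject the duplicate; the caller could
    then fail the new task (e.g. send a TASK_ERROR-style update)."""
    if task_id in framework.tasks:
        logging.error("Duplicate task '%s' of framework %s; rejecting",
                      task_id, framework.framework_id)
        return False
    framework.tasks[task_id] = resources
    return True


fw = Framework("fw-1")
first = add_task(fw, "task_id_1", "cpus(*):4; mem(*):32768")
second = add_task(fw, "task_id_1", "cpus(*):4; mem(*):32768")
```

The trade-off is the one Klaus raises: rejecting is safe for this re-registration race, but any other code path that assumes the CHECK invariant would need the same audit before the invariant is relaxed.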