[jira] [Comment Edited] (MESOS-1812) Queued tasks are not actually launched in the order they were queued
[ https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140671#comment-14140671 ]

Tom Arnfeld edited comment on MESOS-1812 at 9/19/14 2:47 PM:
-------------------------------------------------------------

I think there are use cases for it. For example, the modifications I am making to the Hadoop framework. Ultimately I am trying to control how long an executor process lives for, and to be able to trigger it to commit suicide, from the framework. Framework/executor messages are currently not a reliable form of communication over Mesos (as far as I know), and after my tasks are done I need the executor to stay around for a specific amount of time. Perhaps what I really need here is some kind of {{shutdownExecutor}} driver call.

was (Author: tarnfeld):
I think there are use cases for it. For example, the modifications I am making to the Hadoop framework. Ultimately I am trying to control how long an executor process lives for, and to be able to trigger it to commit suicide. Framework messages are currently not a reliable form of communication over Mesos (as far as I know), and after my tasks are done I need the executor to stay around for a specific amount of time. Perhaps what I really need here is some kind of {{shutdownExecutor}} driver call.

Queued tasks are not actually launched in the order they were queued
--------------------------------------------------------------------

                Key: MESOS-1812
                URL: https://issues.apache.org/jira/browse/MESOS-1812
            Project: Mesos
         Issue Type: Bug
         Components: slave
           Reporter: Tom Arnfeld

Even though tasks are assigned and queued in the order in which they are launched (e.g. multiple tasks in reply to one offer), timing issues with the futures can sometimes break that causal order, so the tasks end up not being launched in the order they were queued. Example trace from a slave follows. In this trace the Task_Tracker_10 task should be launched before slots_Task_Tracker_10.
{code}
I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015' from gc
I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework '20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 'executor_Task_Tracker_10' of framework '20140916-233111-3171422218-5050-14295-0015'
I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' for executor executor_Task_Tracker_10 of framework '20140916-233111-3171422218-5050-14295-0015
I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command '/usr/local/libexec/mesos/mesos-fetcher'
I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed age: 5.086371210668750days
I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 'executor_Task_Tracker_10' of framework '20140916-233111-3171422218-5050-14295-0015' in container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 'executor_Task_Tracker_10' of framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 20140916-233111-3171422218-5050-14295-0015
I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 'executor_Task_Tracker_10' of framework 20140916-233111-3171422218-5050-14295-0015 has exited
{code}
[jira] [Comment Edited] (MESOS-1812) Queued tasks are not actually launched in the order they were queued
[ https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140671#comment-14140671 ]

Tom Arnfeld edited comment on MESOS-1812 at 9/19/14 2:55 PM:
-------------------------------------------------------------

I think there are use cases for it. For example, the modifications I am making to the Hadoop framework. Ultimately I am trying to control how long an executor process lives for, and to be able to trigger it to commit suicide, from the framework. Framework/executor messages are currently not a reliable form of communication over Mesos (as far as I know), and after my tasks are done I need the executor to stay around for a specific amount of time.

Currently I am launching two kinds of tasks: one as a controller for the executor (issuing {{killTask}} on this task ID will cause the executor to terminate), then another N tasks for the actual work. I'd like to ensure the first task always launches first. Perhaps what I really need here is some kind of {{shutdownExecutor}} driver call.

was (Author: tarnfeld):
I think there are use cases for it. For example, the modifications I am making to the Hadoop framework. Ultimately I am trying to control how long an executor process lives for, and to be able to trigger it to commit suicide, from the framework. Framework/executor messages are currently not a reliable form of communication over Mesos (as far as I know), and after my tasks are done I need the executor to stay around for a specific amount of time. Perhaps what I really need here is some kind of {{shutdownExecutor}} driver call.
[jira] [Commented] (MESOS-809) External control of the ip that Mesos components publish to zookeeper
[ https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140769#comment-14140769 ]

Anindya Sinha commented on MESOS-809:
-------------------------------------

I have a patch internally which adds public_ip and public_port to the mesos command line args (as optional params). If set, these values are passed over to libprocess via two separate env vars (similar to LIBPROCESS_IP and LIBPROCESS_PORT), say $PUBLIC_IP and $PUBLIC_PORT. mesos would still bind the socket on [$LIBPROCESS_IP:$LIBPROCESS_PORT] (as existing functionality), but for the rest of the cluster (such as when advertising to zookeeper) it would use $PUBLIC_IP:$PUBLIC_PORT. To be specific, __ip__ and __port__ would be set to $PUBLIC_IP and $PUBLIC_PORT in that case.

External control of the ip that Mesos components publish to zookeeper
---------------------------------------------------------------------

                Key: MESOS-809
                URL: https://issues.apache.org/jira/browse/MESOS-809
            Project: Mesos
         Issue Type: Improvement
         Components: framework, master, slave
   Affects Versions: 0.14.2
           Reporter: Khalid Goudeaux
           Priority: Minor

With tools like Docker making containers more manageable, it's tempting to use containers for all software installation. The CoreOS project is an example of this.

When an application is run inside a container it sees a different ip/hostname from the host system running the container. That ip is only valid from inside that host; no other machine can see it. From inside a container, the Mesos master and slave publish that private ip to zookeeper, and as a result they can't find each other if they're on different machines. The --ip option can't help because the public ip isn't available for binding from within a container.

Essentially, from inside the container, mesos processes don't know the ip they're available at (and they may not know the port either). It would be nice to bootstrap the processes with the correct ip for them to publish to zookeeper.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1819) Ignore signals during executor critical startup
Tobias Weingartner created MESOS-1819:
--------------------------------------

            Summary: Ignore signals during executor critical startup
                Key: MESOS-1819
                URL: https://issues.apache.org/jira/browse/MESOS-1819
            Project: Mesos
         Issue Type: Bug
         Components: containerization, isolation, slave
           Reporter: Tobias Weingartner
           Priority: Minor

If the slave receives a SIGTERM between the time that it checkpoints the PID of a new task/container and the time that the container is fully functional, the task will end up getting lost upon recovery. Possibly handle this via a graceful shutdown hook (via a signal handler, or possibly a web endpoint), or defer signals during the critical section.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1820) Log anonymizer
Tobias Weingartner created MESOS-1820:
--------------------------------------

            Summary: Log anonymizer
                Key: MESOS-1820
                URL: https://issues.apache.org/jira/browse/MESOS-1820
            Project: Mesos
         Issue Type: Story
         Components: master, slave
           Reporter: Tobias Weingartner
           Priority: Minor

It would be awesome to have a way to anonymize the logs that the master and slave keep, so that users of the Mesos ecosystem could submit logs in a manner that would keep them safe from divulging too much internal information, such as task names, framework names, slave names, etc. If the anonymization was done in a repeatable fashion, then future interactions with customers could possibly be done in a correlated fashion, but still in a manner that protects the bulk of sensitive information.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule
[ https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140977#comment-14140977 ]

Bernd Mathiske commented on MESOS-1384:
---------------------------------------

[~tstclair] We'll only support absolute, complete paths in the first patch. Good idea to handle lib exts automatically. We'd like to put that into the next patch then.

Add support for loadable MesosModule
------------------------------------

                Key: MESOS-1384
                URL: https://issues.apache.org/jira/browse/MESOS-1384
            Project: Mesos
         Issue Type: Improvement
   Affects Versions: 0.19.0
           Reporter: Timothy St. Clair
           Assignee: Niklas Quarfot Nielsen

I think we should break this into multiple phases.

-(1) Let's get the dynamic library loading via a stout-ified version of https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.- *DONE*

(2) Use (1) to instantiate some classes in Mesos (like an Authenticator and/or isolator) from a dynamic library. This will give us some more experience with how we want to name the underlying library symbol, how we want to specify flags for finding the library, and what types of validation we want when loading a library. *TARGET*

(3) After doing (2) for one or two classes in Mesos, I think we can formalize the approach in a mesos-ified version of https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h. *NEXT*

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140998#comment-14140998 ]

Vinod Kone commented on MESOS-1675:
-----------------------------------

Just curious, what else would they link to if they are depending on the shared lib?

Decouple version of the mesos library from the package release version
----------------------------------------------------------------------

                Key: MESOS-1675
                URL: https://issues.apache.org/jira/browse/MESOS-1675
            Project: Mesos
         Issue Type: Bug
           Reporter: Vinod Kone

This discussion should be rolled into the larger discussion around how to version Mesos (APIs, packages, libraries etc). Some notes from the libtool docs:

http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1821:
-----------------------------------
    Priority: Blocker  (was: Major)

CHECK failure in master.
------------------------

                Key: MESOS-1821
                URL: https://issues.apache.org/jira/browse/MESOS-1821
            Project: Mesos
         Issue Type: Bug
         Components: master
   Affects Versions: 0.21.0
           Reporter: Benjamin Mahler
           Assignee: Benjamin Mahler
           Priority: Blocker

Looks like the recent CHECKs I've added exposed a bug in the framework re-registration logic by which we didn't keep the executors consistent between the Slave and Framework structs:

{noformat:title=Master Log}
I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME)
F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0
*** Check failure stack trace: ***
    @     0x7fd16c81737d  google::LogMessage::Fail()
    @     0x7fd16c8191c4  google::LogMessage::SendToLog()
    @     0x7fd16c816f6c  google::LogMessage::Flush()
    @     0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
    @     0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
    @     0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
    @     0x7fd16c348269  ProtobufProcess::handler4()
    @     0x7fd16c2fc18e  std::_Function_handler::_M_invoke()
    @     0x7fd16c322132  ProtobufProcess::visit()
    @     0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
    @     0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
    @     0x7fd16c7c2502  process::ProcessManager::resume()
    @     0x7fd16c7c280c  process::schedule()
    @     0x7fd16b9c683d  start_thread
    @     0x7fd16a2b626d  clone
{noformat}

This occurs sometime after a failover and indicates that the Slave and Framework structs are not kept in sync. The problem seems to be here: when re-registering a framework on a failed-over master, we only consider executors for which there are tasks stored in the master:

{code}
void Master::_reregisterFramework(
    const UPID& from,
    const FrameworkInfo& frameworkInfo,
    bool failover,
    const Future<Option<Error>>& validationError)
{
  ...
  if (frameworks.registered.count(frameworkInfo.id()) > 0) {
    ...
  } else {
    // We don't have a framework with this ID, so we must be a newly
    // elected Mesos master to which either an existing scheduler or a
    // failed-over one is connecting. Create a Framework object and add
    // any tasks it has that have been reported by reconnecting slaves.
    Framework* framework =
      new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
    framework->reregisteredTime = Clock::now();

    // TODO(benh): Check for root submissions like above!

    // Add any running tasks reported by slaves for this framework.
    foreachvalue (Slave* slave, slaves.registered) {
      foreachkey (const FrameworkID& frameworkId, slave->tasks) {
        foreachvalue (Task* task, slave->tasks[frameworkId]) {
          if (framework->id == task->framework_id()) {
            framework->addTask(task);

            // Also add the task's executor for resource accounting
            // if it's still alive on the slave and we've not yet
            // added it to the framework.
            if (task->has_executor_id() &&
                slave->hasExecutor(framework->id, task->executor_id()) &&
                !framework->hasExecutor(slave->id, task->executor_id())) {
              // XXX: If an executor has no tasks, the executor will not
              // XXX: be added to the Framework struct!
              const ExecutorInfo& executorInfo =
                slave->executors[framework->id][task->executor_id()];
              framework->addExecutor(slave->id, executorInfo);
            }
          }
        }
      }
    }

    // N.B. Need to add the framework _after_ we add its tasks
    // (above) so that we can properly determine the resources it's
    // currently using!
    addFramework(framework);
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141186#comment-14141186 ]

Benjamin Mahler commented on MESOS-1821:
----------------------------------------

https://reviews.apache.org/r/25843/

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141252#comment-14141252 ]

Timothy St. Clair commented on MESOS-1675:
------------------------------------------

So I ran a quick experiment, and it looks like it will require a re-link:

Before:
$ ldd /usr/sbin/mesos-slave
  libmesos-0.20.0.so => /lib64/libmesos-0.20.0.so

After:
$ ldd mesos-slave
  libmesos-0.21.0.so.0 => /home/tstclair/work/spaces/mesos/active/src/src/.libs/libmesos-0.21.0.so.0 (0x7f22f02eb000)

So any frameworks that were linked against a previous version would need to be re-linked.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
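The re-link requirement follows from libtool's versioning scheme, which the ticket's linked docs describe. A hypothetical Makefile.am fragment (not Mesos's actual build file; the target name and numbers are illustrative) shows how the library interface version is declared independently of the release number:

```make
# libtool -version-info is current:revision:age, decoupled from the
# package release version. Per the libtool rules:
#   - source changed, interface unchanged:  revision++
#   - interfaces added:                     current++, revision=0, age++
#   - interfaces removed or changed:        current++, revision=0, age=0
#     (this bumps the soname, so dependents must re-link, as in the
#     ldd experiment above)
libmesos_la_LDFLAGS = -version-info 1:0:0
```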
[jira] [Created] (MESOS-1822) web ui redirection does not work when masters are not publicly reachable
craig mcmillan created MESOS-1822:
----------------------------------

            Summary: web ui redirection does not work when masters are not publicly reachable
                Key: MESOS-1822
                URL: https://issues.apache.org/jira/browse/MESOS-1822
            Project: Mesos
         Issue Type: Bug
   Affects Versions: 0.20.0
           Reporter: craig mcmillan

The issues https://issues.apache.org/jira/browse/MESOS-672 and https://issues.apache.org/jira/browse/MESOS-903 address the problem of web-ui redirection not working when mesos masters are publicly reachable. But if the masters are only accessible through an SSH tunnel, then redirection doesn't work at all: a single master must be chosen when setting up the SSH tunnel, and a redirect means having to manually kill the tunnel and re-point it at the correct leader.

marathon addresses this issue by having non-leader masters proxy to the leader, so an SSH tunnel can be pointed at any of the masters, leader or not: could mesos do the same?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-1821:
-----------------------------------
    Sprint: Mesos Q3 Sprint 5

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (MESOS-1756) Support etcd as an alternative for Zk in Mesos
[ https://issues.apache.org/jira/browse/MESOS-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen closed MESOS-1756.
-------------------------------

Support etcd as an alternative for Zk in Mesos
----------------------------------------------

                Key: MESOS-1756
                URL: https://issues.apache.org/jira/browse/MESOS-1756
            Project: Mesos
         Issue Type: Improvement
           Reporter: Timothy Chen
           Priority: Minor

With the increasing number of etcd users, using Mesos often requires running another zookeeper cluster just for Mesos. It would be ideal if Mesos could run on etcd as well.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141409#comment-14141409 ]

Timothy Chen commented on MESOS-1806:
-------------------------------------

[~tstclair] I don't really have a branch (I do, but it contains 10 lines of code change...). [~Ed Ropple] are you going to start working on this soon?

Substituting etcd or ReplicatedLog for Zookeeper
------------------------------------------------

                Key: MESOS-1806
                URL: https://issues.apache.org/jira/browse/MESOS-1806
            Project: Mesos
         Issue Type: Task
           Reporter: Ed Ropple
           Priority: Minor

adam_mesos: eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one.
-- Consider it filed. =)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (MESOS-1593) Add DockerInfo Configuration
[ https://issues.apache.org/jira/browse/MESOS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen closed MESOS-1593.
-------------------------------

> Add DockerInfo Configuration
> ----------------------------
>
>                 Key: MESOS-1593
>                 URL: https://issues.apache.org/jira/browse/MESOS-1593
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Timothy Chen
>            Assignee: Timothy Chen
>             Fix For: 0.20.0
>
> We want to add a new proto message to encapsulate all Docker related configurations into DockerInfo. Here is the document that describes the design for DockerInfo: https://github.com/tnachen/mesos/wiki/DockerInfo-design
[jira] [Commented] (MESOS-1593) Add DockerInfo Configuration
[ https://issues.apache.org/jira/browse/MESOS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141464#comment-14141464 ]

Timothy Chen commented on MESOS-1593:
-------------------------------------

commit 2057e3fa37f880b52d766feb5ed33a0209f218bc
Author: Timothy Chen tnac...@apache.org
Date:   Thu Aug 14 09:58:11 2014 -0700

    Added explicit DockerInfo within ContainerInfo.

    Added new DockerInfo to explicitly capture Docker options, and allow
    command URIs to be fetched and mapped into sandbox, which gets
    bind-mounted into the container.

    Review: https://reviews.apache.org/r/24475

> Add DockerInfo Configuration
> ----------------------------
>
>                 Key: MESOS-1593
>                 URL: https://issues.apache.org/jira/browse/MESOS-1593
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Timothy Chen
>            Assignee: Timothy Chen
>             Fix For: 0.20.0
>
> We want to add a new proto message to encapsulate all Docker related configurations into DockerInfo. Here is the document that describes the design for DockerInfo: https://github.com/tnachen/mesos/wiki/DockerInfo-design
[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule
[ https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141570#comment-14141570 ]

Bernd Mathiske commented on MESOS-1384:
---------------------------------------

A patch has been submitted: https://reviews.apache.org/r/25848/

> Add support for loadable MesosModule
> ------------------------------------
>
>                 Key: MESOS-1384
>                 URL: https://issues.apache.org/jira/browse/MESOS-1384
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.19.0
>            Reporter: Timothy St. Clair
>            Assignee: Niklas Quarfot Nielsen
>
> I think we should break this into multiple phases.
> -(1) Let's get the dynamic library loading via a stout-ified version of https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.- *DONE*
> (2) Use (1) to instantiate some classes in Mesos (like an Authenticator and/or isolator) from a dynamic library. This will give us some more experience with how we want to name the underlying library symbol, how we want to specify flags for finding the library, and what types of validation we want when loading a library. *TARGET*
> (3) After doing (2) for one or two classes in Mesos, I think we can formalize the approach in a mesos-ified version of https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h. *NEXT*
[jira] [Commented] (MESOS-1081) Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.
[ https://issues.apache.org/jira/browse/MESOS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141612#comment-14141612 ]

Vinod Kone commented on MESOS-1081:
-----------------------------------

https://reviews.apache.org/r/25866/

> Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1081
>                 URL: https://issues.apache.org/jira/browse/MESOS-1081
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Adam B
>              Labels: authentication, master, security
>
> Master should not deactivate an authenticated framework/slave upon receiving a new AuthenticateMessage unless the new authentication succeeds. As it stands now, a malicious user could spoof the pid of an authenticated framework/slave and send an AuthenticateMessage to knock a valid framework/slave off the authenticated list, forcing the valid framework/slave to re-authenticate and re-register. This could be used in a DoS attack. But how should we handle the scenario where the actual authenticated framework/slave sends an AuthenticateMessage that fails authentication?
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-1668:
------------------------------
    Shepherd: Benjamin Mahler

https://reviews.apache.org/r/25867/

> Handle a temporary one-way master --> slave socket closure.
> -----------------------------------------------------------
>
>                 Key: MESOS-1668
>                 URL: https://issues.apache.org/jira/browse/MESOS-1668
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Vinod Kone
>            Priority: Minor
>              Labels: reliability
>
> In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:
> → Master and Slave connected, operating normally.
> → Temporary one-way network failure; the master→slave link breaks.
> → Master marks slave as disconnected.
> → Network restored and health checking continues normally, so the slave is not removed as a result. The slave does not attempt to re-register since it is receiving pings once again.
> → Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!
>
> We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation.
>
> Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.
[jira] [Updated] (MESOS-1081) Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.
[ https://issues.apache.org/jira/browse/MESOS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-1081:
------------------------------
          Sprint: Mesos Q3 Sprint 5
        Assignee: Vinod Kone
    Story Points: 1

> Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1081
>                 URL: https://issues.apache.org/jira/browse/MESOS-1081
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Adam B
>            Assignee: Vinod Kone
>              Labels: authentication, master, security
>
> Master should not deactivate an authenticated framework/slave upon receiving a new AuthenticateMessage unless the new authentication succeeds. As it stands now, a malicious user could spoof the pid of an authenticated framework/slave and send an AuthenticateMessage to knock a valid framework/slave off the authenticated list, forcing the valid framework/slave to re-authenticate and re-register. This could be used in a DoS attack. But how should we handle the scenario where the actual authenticated framework/slave sends an AuthenticateMessage that fails authentication?
[jira] [Commented] (MESOS-1811) Reconcile disconnected/deactivated semantics in the master code
[ https://issues.apache.org/jira/browse/MESOS-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141614#comment-14141614 ]

Vinod Kone commented on MESOS-1811:
-----------------------------------

https://reviews.apache.org/r/25866/

> Reconcile disconnected/deactivated semantics in the master code
> ---------------------------------------------------------------
>
>                 Key: MESOS-1811
>                 URL: https://issues.apache.org/jira/browse/MESOS-1811
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>
> Currently the master code treats a deactivated slave and a disconnected slave similarly, by setting the 'disconnected' variable in the Slave struct. This causes us to disconnect() a slave in cases where we really only want to deactivate() it (e.g., authentication). It would be nice to differentiate these semantics by adding a new variable, 'active', to the Slave struct. We might want to do the same with the Framework struct for consistency.