[jira] [Comment Edited] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-19 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140671#comment-14140671
 ] 

Tom Arnfeld edited comment on MESOS-1812 at 9/19/14 2:47 PM:
-

I think there are use cases for it. For example, the modifications I am making 
to the Hadoop framework.

Ultimately I am trying to control, from the framework, how long an executor 
process lives, and to be able to trigger it to commit suicide. 
Framework/executor messages are currently not a reliable form of communication 
over Mesos (as far as I know), and after my tasks are done I need the executor 
to stay around for a specific amount of time.

Perhaps what I really need here is some kind of {{shutdownExecutor}} driver 
call.


was (Author: tarnfeld):
I think there are use cases for it. For example, the modifications I am making 
to the hadoop framework.

Ultimately I am trying to control how long an Executor process lives for, and 
be able to trigger it to commit suicide. Framework messages are currently not a 
reliable form of communication over mesos (as far as I know) and after my tasks 
are done I need the executor to stay around for a specific amount of time.

Perhaps what I really need here is some kind of {{shutdownExecutor}} driver 
call.

 Queued tasks are not actually launched in the order they were queued
 

 Key: MESOS-1812
 URL: https://issues.apache.org/jira/browse/MESOS-1812
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Tom Arnfeld

 Even though tasks are assigned and queued in the order in which they were 
 launched (e.g. multiple tasks launched in reply to one offer), timing issues 
 with the futures can sometimes break that causality, so the tasks end up not 
 being launched in order.
 Example trace from a slave... In this example the Task_Tracker_10 task should 
 be launched before slots_Task_Tracker_10.
 {code}
 I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 
 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task 
 slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling 
 '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015'
  from gc
 I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task 
 slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 
 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework 
 '20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 
 'executor_Task_Tracker_10' of framework 
 '20140916-233111-3171422218-5050-14295-0015'
 I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 
 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' 
 for executor executor_Task_Tracker_10 of framework 
 '20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command 
 '/usr/local/libexec/mesos/mesos-fetcher'
 I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed 
 age: 5.086371210668750days
 I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 
 'executor_Task_Tracker_10' of framework 
 '20140916-233111-3171422218-5050-14295-0015' in container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 
 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task 
 slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task 
 Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
 I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 
 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015 has exited
 {code}
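The reordering visible in the log above can be reproduced in miniature with a toy dispatch loop. This is purely illustrative: `dispatch`, `run`, and `launchAfter` are invented stand-ins for libprocess-style asynchronous dispatch, not actual Mesos APIs. If one task's launch path takes an extra asynchronous hop (say, waiting on a GC-unschedule future), a task queued after it gets launched first.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// A single-threaded stand-in for an asynchronous dispatch queue.
std::deque<std::function<void()>> loop;
std::vector<std::string> launched;

void dispatch(std::function<void()> f) { loop.push_back(std::move(f)); }

void run() {
  while (!loop.empty()) {
    std::function<void()> f = std::move(loop.front());
    loop.pop_front();
    f();
  }
}

// Launch `task` after `turns` extra trips through the dispatch queue,
// modelling a launch path that waits on `turns` intermediate futures.
void launchAfter(const std::string& task, int turns) {
  if (turns == 0) {
    launched.push_back(task);
  } else {
    dispatch([task, turns]() { launchAfter(task, turns - 1); });
  }
}
```

Queuing Task_Tracker_10 first but with one extra hop, then slots_Task_Tracker_10 with none, yields exactly the inverted launch order seen in the trace.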

[jira] [Comment Edited] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-19 Thread Tom Arnfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140671#comment-14140671
 ] 

Tom Arnfeld edited comment on MESOS-1812 at 9/19/14 2:55 PM:
-

I think there are use cases for it. For example, the modifications I am making 
to the Hadoop framework.

Ultimately I am trying to control, from the framework, how long an executor 
process lives, and to be able to trigger it to commit suicide. 
Framework/executor messages are currently not a reliable form of communication 
over Mesos (as far as I know), and after my tasks are done I need the executor 
to stay around for a specific amount of time.

Currently I am launching two tasks: one as a controller for the executor 
(issuing {{killTask}} on this task ID will cause the executor to terminate), 
then another N tasks for the actual work. I'd like to ensure the first task 
always launches first.

Perhaps what I really need here is some kind of {{shutdownExecutor}} driver 
call.


was (Author: tarnfeld):
I think there are use cases for it. For example, the modifications I am making 
to the hadoop framework.

Ultimately I am trying to control how long an Executor process lives for, and 
be able to trigger it to commit suicide, from the framework. Framework/Executor 
messages are currently not a reliable form of communication over mesos (as far 
as I know) and after my tasks are done I need the executor to stay around for a 
specific amount of time.

Perhaps what I really need here is some kind of {{shutdownExecutor}} driver 
call.


[jira] [Commented] (MESOS-809) External control of the ip that Mesos components publish to zookeeper

2014-09-19 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140769#comment-14140769
 ] 

Anindya Sinha commented on MESOS-809:
-

I have a patch internally which adds public_ip and public_port to the mesos 
command line args (as optional params).
If set, these values are passed over to libprocess via two separate env vars 
(similar to LIBPROCESS_IP and LIBPROCESS_PORT), say $PUBLIC_IP and 
$PUBLIC_PORT. mesos would still bind its socket on 
$LIBPROCESS_IP:$LIBPROCESS_PORT (as with the existing functionality), but for 
the rest of the cluster (such as when advertising to zookeeper) it would use 
$PUBLIC_IP:$PUBLIC_PORT, i.e. to be specific, __ip__ and __port__ would be set 
to $PUBLIC_IP and $PUBLIC_PORT in that case.
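A minimal sketch of the bind-vs-advertise split described above (hypothetical code; the patch is internal, so the function name and the fallback default here are assumptions):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Address to advertise to the rest of the cluster (e.g. to zookeeper).
// $PUBLIC_IP overrides $LIBPROCESS_IP for advertising only; the socket
// would still be bound on $LIBPROCESS_IP.
std::string advertisedIp() {
  const char* publicIp = std::getenv("PUBLIC_IP");
  if (publicIp != nullptr && *publicIp != '\0') {
    return publicIp;
  }
  const char* bindIp = std::getenv("LIBPROCESS_IP");
  return bindIp != nullptr ? bindIp : "127.0.0.1";
}
```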

 External control of the ip that Mesos components publish to zookeeper
 -

 Key: MESOS-809
 URL: https://issues.apache.org/jira/browse/MESOS-809
 Project: Mesos
  Issue Type: Improvement
  Components: framework, master, slave
Affects Versions: 0.14.2
Reporter: Khalid Goudeaux
Priority: Minor

 With tools like Docker making containers more manageable, it's tempting to 
 use containers for all software installation. The CoreOS project is an 
 example of this.
 When an application is run inside a container it sees a different ip/hostname 
 from the host system running the container. That ip is only valid from inside 
 that host, no other machine can see it.
 From inside a container, the Mesos master and slave publish that private ip 
 to zookeeper and as a result they can't find each other if they're on 
 different machines. The --ip option can't help because the public ip isn't 
 available for binding from within a container.
 Essentially, from inside the container, mesos processes don't know the ip 
 they're available at (they may not know the port either).
 It would be nice to bootstrap the processes with the correct ip for them to 
 publish to zookeeper.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1819) Ignore signals during executor critical startup

2014-09-19 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1819:
-

 Summary: Ignore signals during executor critical startup
 Key: MESOS-1819
 URL: https://issues.apache.org/jira/browse/MESOS-1819
 Project: Mesos
  Issue Type: Bug
  Components: containerization, isolation, slave
Reporter: Tobias Weingartner
Priority: Minor


If the slave receives a SIGTERM between the time that it checkpoints the PID of 
a new task/container and the time that the container is fully functional, the 
task will end up getting lost upon recovery.

Possibly handle this via a graceful shutdown hook (a signal handler, or 
possibly a web endpoint), or by deferring signals during the critical section.
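The "defer signals during the critical section" option can be sketched with POSIX signal masking. This is a shape sketch under stated assumptions, not slave code; the wrapper name and the extent of the critical section are invented here:

```cpp
#include <cassert>
#include <functional>
#include <signal.h>

// Run `criticalSection` with SIGTERM blocked: a shutdown request that
// arrives mid-checkpoint is held pending and delivered only after the
// container is fully set up.
void withSigtermDeferred(const std::function<void()>& criticalSection) {
  sigset_t block, old;
  sigemptyset(&block);
  sigaddset(&block, SIGTERM);
  sigprocmask(SIG_BLOCK, &block, &old);     // SIGTERM held pending from here
  criticalSection();                        // checkpoint PID, finish setup...
  sigprocmask(SIG_SETMASK, &old, nullptr);  // pending SIGTERM delivered here
}
```

The graceful-shutdown-hook alternative would instead catch the signal and set a flag the slave checks at safe points.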



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1820) Log anonymizer

2014-09-19 Thread Tobias Weingartner (JIRA)
Tobias Weingartner created MESOS-1820:
-

 Summary: Log anonymizer
 Key: MESOS-1820
 URL: https://issues.apache.org/jira/browse/MESOS-1820
 Project: Mesos
  Issue Type: Story
  Components: master, slave
Reporter: Tobias Weingartner
Priority: Minor


It would be awesome to have a way to anonymize the logs that the master and 
slave keep, so that users of the Mesos ecosystem could submit logs without 
divulging too much internal information, such as task names, framework names, 
slave names, etc.

If the anonymization were done in a repeatable fashion, then future 
interactions with customers could still be correlated, while protecting the 
bulk of the sensitive information.
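The repeatable mapping can be as simple as a stable name-to-token table. An illustrative sketch, not a proposed implementation: tokens below are stable within one run, and a keyed hash (e.g. an HMAC over the name) would give the same repeatability across runs and submissions.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Replace each distinct name with a stable opaque token, so that
// correlations between log lines survive while the names do not.
class Anonymizer {
public:
  std::string anonymize(const std::string& name) {
    auto it = tokens.find(name);
    if (it == tokens.end()) {
      it = tokens.emplace(name, "name-" + std::to_string(tokens.size())).first;
    }
    return it->second;
  }

private:
  std::unordered_map<std::string, std::string> tokens;
};
```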



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-19 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140977#comment-14140977
 ] 

Bernd Mathiske commented on MESOS-1384:
---

[~tstclair] We'll only support absolute, complete paths in the first patch. 
Good idea to handle lib exts automatically. We'd like to put that into the next 
patch then. 
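The "handle lib exts automatically" idea from the comment above might look like this hypothetical helper (Linux `lib<name>.so` naming assumed; the real follow-up patch may differ):

```cpp
#include <cassert>
#include <string>

// Expand a bare module name ("foo") into a shared-library filename
// ("libfoo.so"); anything that already looks like a path or a library
// filename is passed through untouched.
std::string expandLibraryName(const std::string& name) {
  const bool looksLikeFile =
      name.find('/') != std::string::npos ||
      name.find(".so") != std::string::npos;
  return looksLikeFile ? name : "lib" + name + ".so";
}
```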

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140998#comment-14140998
 ] 

Vinod Kone commented on MESOS-1675:
---

Just curious, what else would they link to if they are depending on the shared 
lib?

 Decouple version of the mesos library from the package release version
 --

 Key: MESOS-1675
 URL: https://issues.apache.org/jira/browse/MESOS-1675
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone

 This discussion should be rolled into the larger discussion around how to 
 version Mesos (APIs, packages, libraries etc).
 Some notes from libtool docs.
 http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
 http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1821:
---
Priority: Blocker  (was: Major)

 CHECK failure in master.
 

 Key: MESOS-1821
 URL: https://issues.apache.org/jira/browse/MESOS-1821
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker

 Looks like the recent CHECKs I've added exposed a bug in the framework 
 re-registration logic by which we didn't keep the executors consistent 
 between the Slave and Framework structs:
 {noformat: title=Master Log}
 I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
 I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' 
 with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME)
 F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: 
 hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 
 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0
 *** Check failure stack trace: ***
 @ 0x7fd16c81737d  google::LogMessage::Fail()
 @ 0x7fd16c8191c4  google::LogMessage::SendToLog()
 @ 0x7fd16c816f6c  google::LogMessage::Flush()
 @ 0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
 @ 0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
 @ 0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
 @ 0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
 @ 0x7fd16c348269  ProtobufProcess::handler4()
 @ 0x7fd16c2fc18e  std::_Function_handler::_M_invoke()
 @ 0x7fd16c322132  ProtobufProcess::visit()
 @ 0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
 @ 0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
 @ 0x7fd16c7c2502  process::ProcessManager::resume()
 @ 0x7fd16c7c280c  process::schedule()
 @ 0x7fd16b9c683d  start_thread
 @ 0x7fd16a2b626d  clone
 {noformat}
 This occurs sometime after a failover and indicates that the Slave and 
 Framework structs are not kept in sync.
 Problem seems to be here, when re-registering a framework on a failed over 
 master, we only consider executors for which there are tasks stored in the 
 master:
 {code}
 void Master::_reregisterFramework(
     const UPID& from,
     const FrameworkInfo& frameworkInfo,
     bool failover,
     const Future<Option<Error>>& validationError)
 {
   ...
   if (frameworks.registered.count(frameworkInfo.id()) > 0) {
     ...
   } else {
     // We don't have a framework with this ID, so we must be a newly
     // elected Mesos master to which either an existing scheduler or a
     // failed-over one is connecting. Create a Framework object and add
     // any tasks it has that have been reported by reconnecting slaves.
     Framework* framework =
       new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
     framework->reregisteredTime = Clock::now();

     // TODO(benh): Check for root submissions like above!

     // Add any running tasks reported by slaves for this framework.
     foreachvalue (Slave* slave, slaves.registered) {
       foreachkey (const FrameworkID& frameworkId, slave->tasks) {
         foreachvalue (Task* task, slave->tasks[frameworkId]) {
           if (framework->id == task->framework_id()) {
             framework->addTask(task);

             // Also add the task's executor for resource accounting
             // if it's still alive on the slave and we've not yet
             // added it to the framework.
             if (task->has_executor_id() &&
                 slave->hasExecutor(framework->id, task->executor_id()) &&
                 !framework->hasExecutor(slave->id, task->executor_id())) {
               // XXX: If an executor has no tasks, the executor will not
               // XXX: be added to the Framework struct!
               const ExecutorInfo& executorInfo =
                 slave->executors[framework->id][task->executor_id()];
               framework->addExecutor(slave->id, executorInfo);
             }
           }
         }
       }
     }

     // N.B. Need to add the framework _after_ we add its tasks
     // (above) so that we can properly determine the resources it's
     // currently using!
     addFramework(framework);
   }
 }
 {code}
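A miniature model of the inconsistency flagged by the XXX comments above (the struct and function names here are invented for illustration, not the actual Mesos types): rebuilding the framework's executors from its tasks drops any executor that happens to have no tasks at re-registration time, while rebuilding from the slave's executor map does not.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model of what a slave reports on master failover.
struct SlaveModel {
  std::set<std::string> executors;                    // executors running on the slave
  std::map<std::string, std::string> taskToExecutor;  // task name -> executor name
};

// Buggy rebuild: only executors reachable through a task are recovered.
std::set<std::string> rebuildFromTasks(const SlaveModel& slave) {
  std::set<std::string> result;
  for (const auto& entry : slave.taskToExecutor) {
    result.insert(entry.second);
  }
  return result;
}

// Fix direction: recover every executor the slave reports, with or
// without tasks.
std::set<std::string> rebuildFromExecutors(const SlaveModel& slave) {
  return slave.executors;
}
```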



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141186#comment-14141186
 ] 

Benjamin Mahler commented on MESOS-1821:


https://reviews.apache.org/r/25843/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version

2014-09-19 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141252#comment-14141252
 ] 

Timothy St. Clair commented on MESOS-1675:
--

So I ran a quick experiment, and it looks like it will require a re-link:

Before: 
ldd /usr/sbin/mesos-slave
libmesos-0.20.0.so => /lib64/libmesos-0.20.0.so

After: 
ldd mesos-slave
libmesos-0.21.0.so.0 => 
/home/tstclair/work/spaces/mesos/active/src/src/.libs/libmesos-0.21.0.so.0 
(0x7f22f02eb000)

So if you had previous frameworks that were linked against the old library 
name, they would need to be re-linked.


 Decouple version of the mesos library from the package release version
 --

 Key: MESOS-1675
 URL: https://issues.apache.org/jira/browse/MESOS-1675
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone

 This discussion should be rolled into the larger discussion around how to 
 version Mesos (APIs, packages, libraries etc).
 Some notes from libtool docs.
 http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
 http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1822) web ui redirection does not work when masters are not publicly reachable

2014-09-19 Thread craig mcmillan (JIRA)
craig mcmillan created MESOS-1822:
-

 Summary: web ui redirection does not work when masters are not 
publicly reachable
 Key: MESOS-1822
 URL: https://issues.apache.org/jira/browse/MESOS-1822
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.20.0
Reporter: craig mcmillan


issues:

https://issues.apache.org/jira/browse/MESOS-672
https://issues.apache.org/jira/browse/MESOS-903

address the problem of web-ui redirection not working when the mesos masters 
are publicly reachable. But if the masters are only accessible through an SSH 
tunnel, then redirection doesn't work at all: a single master must be chosen 
when setting up the SSH tunnel, and a redirect means having to manually kill 
the tunnel and re-point it at the correct leader.

marathon addresses this issue by having non-leader masters proxy to the 
leader, so an SSH tunnel can be pointed at any of the masters, leader or not: 
could mesos do the same?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1821:
---
Sprint: Mesos Q3 Sprint 5




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MESOS-1756) Support etcd as an alternative for Zk in Mesos

2014-09-19 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen closed MESOS-1756.
---

 Support etcd as an alternative for Zk in Mesos
 --

 Key: MESOS-1756
 URL: https://issues.apache.org/jira/browse/MESOS-1756
 Project: Mesos
  Issue Type: Improvement
Reporter: Timothy Chen
Priority: Minor

 With the increasing number of etcd users, adopting Mesos often requires them 
 to run another ZooKeeper cluster just for Mesos.
 It would be ideal if Mesos could run on etcd as well.





[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2014-09-19 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141409#comment-14141409
 ] 

Timothy Chen commented on MESOS-1806:
-

[~tstclair] I don't really have a branch (I do, but it contains < 10 lines of 
code change...).
[~Ed Ropple] are you going to start working on this soon? 

 Substituting etcd or ReplicatedLog for Zookeeper
 

 Key: MESOS-1806
 URL: https://issues.apache.org/jira/browse/MESOS-1806
 Project: Mesos
  Issue Type: Task
Reporter: Ed Ropple
Priority: Minor

 adam_mesos   eropple: Could you also file a new JIRA for Mesos to drop ZK 
 in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
 that one.
 --
 Consider it filed. =)





[jira] [Closed] (MESOS-1593) Add DockerInfo Configuration

2014-09-19 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen closed MESOS-1593.
---

 Add DockerInfo Configuration
 

 Key: MESOS-1593
 URL: https://issues.apache.org/jira/browse/MESOS-1593
 Project: Mesos
  Issue Type: Task
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.20.0


 We want to add a new proto message to encapsulate all Docker related 
 configurations into DockerInfo.
 Here is the document that describes the design for DockerInfo:
 https://github.com/tnachen/mesos/wiki/DockerInfo-design





[jira] [Commented] (MESOS-1593) Add DockerInfo Configuration

2014-09-19 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141464#comment-14141464
 ] 

Timothy Chen commented on MESOS-1593:
-

commit 2057e3fa37f880b52d766feb5ed33a0209f218bc
Author: Timothy Chen tnac...@apache.org
Date:   Thu Aug 14 09:58:11 2014 -0700

Added explicit DockerInfo within ContainerInfo.

Added new DockerInfo to explicitly capture Docker options, and allow
command URIs to be fetched and mapped into sandbox, which gets
bind-mounted into the container.

Review: https://reviews.apache.org/r/24475


 Add DockerInfo Configuration
 

 Key: MESOS-1593
 URL: https://issues.apache.org/jira/browse/MESOS-1593
 Project: Mesos
  Issue Type: Task
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.20.0


 We want to add a new proto message to encapsulate all Docker related 
 configurations into DockerInfo.
 Here is the document that describes the design for DockerInfo:
 https://github.com/tnachen/mesos/wiki/DockerInfo-design





[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-19 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141570#comment-14141570
 ] 

Bernd Mathiske commented on MESOS-1384:
---

A patch has been submitted: https://reviews.apache.org/r/25848/

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us more experience 
 with how we want to name the underlying library symbol, how we want to 
 specify flags for finding the library, and what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*





[jira] [Commented] (MESOS-1081) Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.

2014-09-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141612#comment-14141612
 ] 

Vinod Kone commented on MESOS-1081:
---

https://reviews.apache.org/r/25866/

 Master should not deactivate authenticated framework/slave on new 
 AuthenticateMessage unless new authentication succeeds.
 -

 Key: MESOS-1081
 URL: https://issues.apache.org/jira/browse/MESOS-1081
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Adam B
  Labels: authentication, master, security

 Master should not deactivate an authenticated framework/slave upon receiving 
 a new AuthenticateMessage unless new authentication succeeds. As it stands 
 now, a malicious user could spoof the pid of an authenticated framework/slave 
 and send an AuthenticateMessage to knock a valid framework/slave off the 
 authenticated list, forcing the valid framework/slave to re-authenticate and 
 re-register. This could be used in a DoS attack.
 But how should we handle the scenario when the actual authenticated 
 framework/slave sends an AuthenticateMessage that fails authentication?





[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.

2014-09-19 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1668:
--
Shepherd: Benjamin Mahler

https://reviews.apache.org/r/25867/

 Handle a temporary one-way master -- slave socket closure.
 ---

 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Assignee: Vinod Kone
Priority: Minor
  Labels: reliability

 In MESOS-1529, we realized that it's possible for a slave to remain 
 disconnected in the master if the following occurs:
 → Master and Slave connected operating normally.
 → Temporary one-way network failure, master→slave link breaks.
 → Master marks slave as disconnected.
 → Network restored and health checking continues normally, slave is not 
 removed as a result. Slave does not attempt to re-register since it is 
 receiving pings once again.
 → Slave remains disconnected according to the master, and the slave does not 
 try to re-register. Bad!
 We were originally thinking of using a failover timeout in the master to 
 remove these slaves that don't re-register. However, it can be dangerous when 
 ZooKeeper issues are preventing the slave from re-registering with the 
 master; we do not want to remove a ton of slaves in this situation.
 Rather, when the slave is health checking correctly but does not re-register 
 within a timeout, we could send a registration request from the master to the 
 slave, telling the slave that it must re-register. This message could also be 
 used when receiving status updates (or other messages) from slaves that are 
 disconnected in the master.





[jira] [Updated] (MESOS-1081) Master should not deactivate authenticated framework/slave on new AuthenticateMessage unless new authentication succeeds.

2014-09-19 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1081:
--
  Sprint: Mesos Q3 Sprint 5
Assignee: Vinod Kone
Story Points: 1

 Master should not deactivate authenticated framework/slave on new 
 AuthenticateMessage unless new authentication succeeds.
 -

 Key: MESOS-1081
 URL: https://issues.apache.org/jira/browse/MESOS-1081
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Adam B
Assignee: Vinod Kone
  Labels: authentication, master, security

 Master should not deactivate an authenticated framework/slave upon receiving 
 a new AuthenticateMessage unless new authentication succeeds. As it stands 
 now, a malicious user could spoof the pid of an authenticated framework/slave 
 and send an AuthenticateMessage to knock a valid framework/slave off the 
 authenticated list, forcing the valid framework/slave to re-authenticate and 
 re-register. This could be used in a DoS attack.
 But how should we handle the scenario when the actual authenticated 
 framework/slave sends an AuthenticateMessage that fails authentication?





[jira] [Commented] (MESOS-1811) Reconcile disconnected/deactivated semantics in the master code

2014-09-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141614#comment-14141614
 ] 

Vinod Kone commented on MESOS-1811:
---

https://reviews.apache.org/r/25866/

 Reconcile disconnected/deactivated semantics in the master code
 ---

 Key: MESOS-1811
 URL: https://issues.apache.org/jira/browse/MESOS-1811
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone

 Currently the master code treats a deactivated and a disconnected slave 
 similarly, by setting the 'disconnected' variable in the Slave struct. This 
 causes us to disconnect() a slave in cases where we really only want to 
 deactivate() the slave (e.g., authentication).
 It would be nice to differentiate these semantics by adding a new variable 
 'active' to the Slave struct.
 We might want to do the same with the Framework struct for consistency.


