> On Nov. 3, 2014, 1:59 p.m., Vinod Kone wrote: > > src/master/master.cpp, lines 5186-5230 > > <https://reviews.apache.org/r/27531/diff/1/?file=747568#file747568line5186> > > > > Hmm. This is rather unweildy. > > > > Exposing the cross product of status and source and reason in the > > metrics is a bit strange, considering most reasons are only related to > > TASK_LOST. IOW, most of those metrics will be 0s which I find weird. > > > > AFAICT, this is what we have: > > > > TASK_STAGING, TASK_STARTING, TASK_RUNNING and TASK_FINISHED are sent by > > the master (reconciliation) or executor. Either way, there is no reason > > associated with them. > > > > TASK_FAILED is generated by master (reconciliation) or slave (oom or > > command executor failed) or executor. I'll comment on the slave aspect in > > the dependent review, because i realized it sets incorrect reason. > > > > TASK_KILLED can be generated by master (reconciliation, pending), slave > > (pending, framework removed) or executor. > > > > TASK_LOST can be generated by master or slave or executor and can > > contain any of the reasons. > > > > > > Given the above, I would rather we use explicit combinations of status, > > source and reason metrics to capture these semantics, rather than using a > > vector of vector of vector. > > Dominic Hamon wrote: > All task status can have reason 'reconciliation' or 'None', so there > needs to be a reason for everything. > > I'll code up the explicit combination, but it will likely be as unwieldy > in terms of tracking which are valid combinations. It also means that changes > to sources/reasons will require API changes whereas this covers every > possible future combination and is a complete API. > > I'm starting to think I should pull the metrics out into a separate file > (as per a TODO) to avoid churn. Maybe as a dependent review. What do you > think? > > Vinod Kone wrote: > ``` > It also means that changes to sources/reasons will require API changes > ``` > I don't think I follow. Why? > > Another thing to consider is to reduce the scope of this work for 0.21.0. > Maybe just include reasons for TASK_LOST and nothing else? Would that make it > less unweildy for now? > > Dominic Hamon wrote: > if we define the explicit set of task state/source/reason tuples and a > reason changes for a given state/source, we'd have to deprecate the old > combination to introduce the new one. concrete example: > > TASK_FAILED/SOURCE_SLAVE/REASON_EXECUTOR_TERMINATED will become > TASK_FAILED/SOURCE_SLAVE/REASON_OOM and > TASK_FAILED/SOURCE_SLAVE/REASON_INVALID_COMMAND in a future patch > > we'd then need to deprecate the REASON_EXECUTOR_TERMINATED metric through > a cycle as it would then be unused. > > Actually, thinking about it more, we'd have to do that anyway and the > explicit list of metrics makes it easier! > > let me code up the explicit combinations and see what it looks like.
i was thinking about this too hard. making it simpler.. incoming patch. sorry for the pointless review. - Dominic ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27531/#review59613 ----------------------------------------------------------- On Nov. 4, 2014, 8:58 a.m., Dominic Hamon wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/27531/ > ----------------------------------------------------------- > > (Updated Nov. 4, 2014, 8:58 a.m.) > > > Review request for mesos, Tobias Weingartner and Vinod Kone. > > > Bugs: MESOS-1830 > https://issues.apache.org/jira/browse/MESOS-1830 > > > Repository: mesos-git > > > Description > ------- > > Update metrics in Master to match the source and reason split for task > statuses. > > > Diffs > ----- > > src/master/master.hpp 687f1789712dcd867b411badd85f4a12ae8f16d7 > src/master/master.cpp d914786a2517677ee7dd4a3130b4b191ef416c26 > src/slave/slave.hpp 6c183f827ec2521052f8b088d4583f13e661db2c > src/slave/slave.cpp b8935173bae9c124b8d08db590893334d7a45a23 > src/tests/master_tests.cpp 2e525749247626c05efb2f54a707599facb114b6 > src/tests/slave_tests.cpp d2cbaf82391411bdc5d85327478d7ca8072048af > > Diff: https://reviews.apache.org/r/27531/diff/ > > > Testing > ------- > > make check > run master and check endpoint: > > { > ... > master/reconciliation_states/task_error: 0, > master/reconciliation_states/task_failed: 0, > master/reconciliation_states/task_finished: 0, > master/reconciliation_states/task_killed: 0, > master/reconciliation_states/task_lost: 0, > master/reconciliation_states/task_running: 0, > master/reconciliation_states/task_staging: 0, > master/reconciliation_states/task_starting: 0, > master/recovery_slave_removals: 0, > master/slave_registrations: 1, > master/slave_removals: 0, > master/slave_reregistrations: 0, > master/slaves_active: 1, > master/slaves_connected: 1, > master/slaves_disconnected: 0, > master/slaves_inactive: 0, > master/tasks_failed: 0, > master/tasks_finished: 0, > master/tasks_killed: 0, > master/tasks_lost: 0, > master/tasks_lost/reason_executor_terminated: 0, > master/tasks_lost/reason_executor_unregistered: 0, > master/tasks_lost/reason_framework_removed: 0, > master/tasks_lost/reason_gc_error: 0, > master/tasks_lost/reason_invalid_command: 0, > master/tasks_lost/reason_invalid_frameworkid: 0, > master/tasks_lost/reason_invalid_offers: 0, > master/tasks_lost/reason_master_disconnected: 0, > master/tasks_lost/reason_memory_limit: 0, > master/tasks_lost/reason_reconciliation: 0, > master/tasks_lost/reason_slave_disconnected: 0, > master/tasks_lost/reason_slave_removed: 0, > master/tasks_lost/reason_slave_restarted: 0, > master/tasks_lost/reason_slave_unknown: 0, > master/tasks_lost/reason_task_invalid: 0, > master/tasks_lost/reason_task_unauthorized: 0, > master/tasks_lost/reason_task_unknown: 0, > ... > } > > and slave: > > { > ... > slave/tasks_lost: 0, > slave/tasks_lost/reason_executor_terminated: 0, > slave/tasks_lost/reason_executor_unregistered: 0, > slave/tasks_lost/reason_framework_removed: 0, > slave/tasks_lost/reason_gc_error: 0, > slave/tasks_lost/reason_invalid_command: 0, > slave/tasks_lost/reason_invalid_frameworkid: 0, > slave/tasks_lost/reason_invalid_offers: 0, > slave/tasks_lost/reason_master_disconnected: 0, > slave/tasks_lost/reason_memory_limit: 0, > slave/tasks_lost/reason_reconciliation: 0, > slave/tasks_lost/reason_slave_disconnected: 0, > slave/tasks_lost/reason_slave_removed: 0, > slave/tasks_lost/reason_slave_restarted: 0, > slave/tasks_lost/reason_slave_unknown: 0, > slave/tasks_lost/reason_task_invalid: 0, > slave/tasks_lost/reason_task_unauthorized: 0, > slave/tasks_lost/reason_task_unknown: 0, > slave/tasks_running: 0, > slave/tasks_staging: 0, > slave/tasks_starting: 0, > ... > } > > > Thanks, > > Dominic Hamon > >