Re: Mesos Executor Failing

2017-05-22 Thread Chawla,Sumit
Hi Joseph

I am using 0.27.0.  Is there any diagnostic tool or command line that I can
run to ascertain why it's happening?

Regards
Sumit Chawla


On Fri, May 19, 2017 at 2:31 PM, Joseph Wu  wrote:

> What version of Mesos are you using?  (Just based on the word "slave" in
> that error message, I'm guessing 0.28 or older.)
>
> The "Failed to synchronize" error is something that can occur while the
> agent is launching the executor.  During the launch, the agent will create
> a pipe to the executor subprocess; and the executor makes a blocking read
> on this pipe.  The agent will write a value to the pipe to signal the
> executor to proceed.  If the agent restarts or the pipe breaks at this
> point in the launch, then you'll see this error message.
>
> On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit 
> wrote:
>
>> Hi
>>
>> I am facing a peculiar issue on one of the slave nodes of our cluster.  I
>> have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
>> with exit code 0.
>>
>> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
>> exited caused by one of the running tasks) Reason: Unknown executor exit
>> code (0)
>>
>>
>> I cannot seem to find anything in mesos-slave.logs, and there is nothing
>> being written to stdout/stderr.  Are there any debugging utilities that I
>> can use to figure out what might be going wrong on that particular slave?
>>
>> I tried running the following but got stuck at:
>>
>>
>> /mesos-containerizer launch \
>>   --command='{"environment":{},"shell":true,"value":"ls -ltr"}' \
>>   --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f \
>>   --help=false --pipe_read=0 --pipe_write=0 --user=smi
>>
>> Failed to synchronize with slave (it's probably exited)
>>
>>
>> Would appreciate any pointers to debugging methods/documentation for
>> diagnosing these kinds of problems.
>>
>> Regards
>> Sumit Chawla
>>
>>
>
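For context, a bare-bones POSIX sketch of the pipe handshake described above
(illustrative only, not Mesos's actual containerizer code): the parent plays
the agent's role and writes one byte to unblock the child; if the parent goes
away before writing, the child's blocking read fails, which is the situation
behind the "Failed to synchronize" message. Running mesos-containerizer by
hand with --pipe_read=0/--pipe_write=0 likely hits the same path, since stdin
is then not a pipe with an agent on the other end, so the error from the
manual run is probably expected rather than the root cause.

#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  int pipefd[2];
  if (pipe(pipefd) != 0) {
    perror("pipe");
    return 1;
  }

  pid_t pid = fork();
  if (pid == 0) {
    // Child ("executor"): block until the parent signals us to proceed.
    close(pipefd[1]);
    char buf;
    if (read(pipefd[0], &buf, 1) <= 0) {
      // The parent exited or the pipe broke before the signal arrived;
      // this is the moment the real executor would report
      // "Failed to synchronize with slave (it's probably exited)".
      fprintf(stderr, "Failed to synchronize with parent\n");
      _exit(1);
    }
    // ... proceed with the real work ...
    _exit(0);
  }

  // Parent ("agent"): finish setup, then write one byte to let the child go.
  close(pipefd[0]);
  char go = 1;
  (void)write(pipefd[1], &go, 1);
  close(pipefd[1]);

  int status = 0;
  waitpid(pid, &status, 0);
  return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}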


Coverity Scan: Analysis completed for Mesos

2017-05-22 Thread scan-admin

Your request for analysis of Mesos has been completed successfully.
The results are available at 
https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRZ-2B0hUmbDL5L44V5w491gwG_yCAaqzzx-2F-2BA2mRMpk03t3x9hscHw355FKzcsrEtTtpF7iS3qVoJLJfkqKQTSbFixD5MRIB80LvD8g5e7afrDAjh7Z9vPmdjlPb9UPgvo8F-2FKeec-2B3ImRrbIAfuZlu75PkLJNi2QjdbP66labLvm6e9GkKzlgCS6h9FZzSkXLK-2Bbi-2BBAEbPK5gxWZAHYA2zmXufrdvAkm-2FQkTSSWQJaZxG96MjLEa82hWm9oYTc3AyZM-3D

Analysis Summary:
   New defects found: 0
   Defects eliminated: 0



Re: Isolating metrics collection from master/agent slowness

2017-05-22 Thread Zhitao Li
Thanks for the feedback, James.

Replying to your points inline:

On Mon, May 22, 2017 at 10:56 AM, James Peach  wrote:

>
> > On May 19, 2017, at 11:35 AM, Zhitao Li  wrote:
> >
> > Hi,
> >
> > I'd like to start a conversation about metrics collection endpoints
> > (especially `/metrics/snapshot`) behavior.
> >
> > Right now, these endpoints are served from the same master/agent's
> > libprocess, and extensively use `gauge` to chain further callbacks to
> > collect various metrics (the DRF allocator specifically adds several metrics
> > per role).
> >
> > This brings a problem when the system is under load: when the
> > master/allocator libprocess becomes busy, stats collection itself becomes
> > slow too. Flying blind when the system is under load is especially painful
> > for an operator.
>
> Yes, sampling metrics should approach zero cost.
>
> > I would like to explore the direction of isolating metric collection even
> > when the master is slow. A couple of ideas:
> >
> > - (short term) reduce usage of gauge and prefer counter (since I believe
> > they are less affected);
>
> I'd rather not squash the semantics for performance reasons. If a metric
> has gauge semantics, I don't think we should represent that as a Counter.
>

I recall from a previous conversation with @bmahler that he thought certain
gauges could be expressed as the difference of two counters.

We definitely cannot express a gauge as a Counter, because a gauge's value can
decrease, while a counter should always be treated as monotonically increasing
until the process restarts.
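To illustrate what I mean (purely a sketch, not actual Mesos code, and the
names are made up): a "currently running" gauge can be derived at sampling
time from two monotonic counters that are cheap to bump on the hot path.

#include <atomic>
#include <cstdint>
#include <iostream>

// Two monotonically increasing counters, bumped on the hot path with cheap
// atomic increments (no locking, no dispatch onto the master's actor queue).
std::atomic<uint64_t> tasks_launched{0};
std::atomic<uint64_t> tasks_terminated{0};

// The gauge-like value (tasks currently running) is computed at sampling
// time as the difference of the two counters. Each counter only ever grows,
// but their difference can go up and down, like a gauge.
int64_t tasks_running()
{
  return static_cast<int64_t>(tasks_launched.load(std::memory_order_relaxed)) -
         static_cast<int64_t>(tasks_terminated.load(std::memory_order_relaxed));
}

int main()
{
  tasks_launched += 3;
  tasks_terminated += 1;
  std::cout << "tasks_running = " << tasks_running() << std::endl;  // prints 2
  return 0;
}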


>
> > - alternative implementation of `gauge` which does not contend on
> > master/allocator's event queue;
>
> This is doable in some circumstances, but not always. For example,
> Master::_uptime_secs() doesn't need to run on the master queue, but
> Master::_outstanding_offers arguably does. The latter could be implemented
> by sampling a variable that is updated, but that's not very generic, so we
> should try to think of something better.
>

I agree that this will not be a trivial cut. The fact that this is not
trivially achievable is the primary reason I want to start this
conversation with the general dev community. We can absorb some work to
optimize certain hot paths (I suspect the role-specific ones in the allocator
are one of them for us), but maintaining this in the long term will
definitely require all contributors to help.

w.r.t. examples, it seems that Master::_outstanding_offers simply calls
hashmap::size() on a hashmap object, so if the underlying container type
conforms to C++11's thread-safety requirements
(http://en.cppreference.com/w/cpp/container#Thread_safety), we should at
least be able to call size() with the understanding that we might get a
slightly stale value?

I think a more interesting example is Master::_task_starting(): not only is
it not calculated from a simple const method, but the result is actually
generated by iterating over all tasks registered with the master. This means
the cost of calculating it is linear in the number of tasks in the cluster.
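One way to make such a gauge O(1) to sample would be to have the master
maintain the count itself as tasks come and go; a minimal sketch of what I
have in mind (hypothetical names, not actual Mesos code):

#include <atomic>
#include <cstdint>

// Hypothetical bookkeeping: the master bumps this counter wherever it
// inserts or erases tasks in its internal map. Sampling the gauge then
// becomes a single relaxed atomic load -- O(1), and it does not have to
// dispatch onto the (possibly backlogged) master event queue.
class TaskCounter
{
public:
  void onTaskAdded()   { count_.fetch_add(1, std::memory_order_relaxed); }
  void onTaskRemoved() { count_.fetch_sub(1, std::memory_order_relaxed); }

  // Safe to call from the metrics endpoint's thread; the value may be
  // slightly stale, which is usually acceptable for monitoring.
  uint64_t sample() const { return count_.load(std::memory_order_relaxed); }

private:
  std::atomic<uint64_t> count_{0};
};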



>
> > - serving metrics collection from a different libprocess routine.
>
> See MetricsProcess. One (mitigation?) approach would be to sample the
> metrics at a fixed rate and then serve the cached samples from the
> MetricsProcess. I expect most installations have multiple clients sampling
> the metrics, so this would at least decouple the sample rate from the
> metrics request rate.
>

That sounds like a very good idea to start with. I still think certain code
should be augmented to maintain gauge values more efficiently, but for code
that is harder to rewrite, this can definitely improve the situation.
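Roughly what I picture for the fixed-rate cache you suggest (again only a
sketch with hypothetical names; the real change would presumably live in
libprocess's MetricsProcess):

#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical cache: a background loop samples all metrics at a fixed
// interval and stores the results; requests to /metrics/snapshot are served
// from this cache, so the request rate is decoupled from the sampling rate
// (and from how busy the master/allocator actors are).
class CachedMetrics
{
public:
  void update(std::map<std::string, double> samples)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_ = std::move(samples);
    updated_ = std::chrono::steady_clock::now();
  }

  std::map<std::string, double> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;  // Possibly a few seconds stale, but always fast to serve.
  }

private:
  mutable std::mutex mutex_;
  std::map<std::string, double> cache_;
  std::chrono::steady_clock::time_point updated_;
};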


>
> >
> > Any thoughts on these?
> >
> > --
> > Cheers,
> >
> > Zhitao Li
>
>


-- 
Cheers,

Zhitao Li


[GitHub] mesos pull request #206: Update contributors.yaml

2017-05-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/mesos/pull/206




Re: Added task status update reason for health checks

2017-05-22 Thread Alex Rukletsov
James,

We are more than happy to write a comment if folks think it is useful. Do
you have anything specific in mind that you want captured there? For me,
the reason's name is self-explanatory.

Alex.

On 22 May 2017 17:32, "James Peach"  wrote:

>
> > On May 22, 2017, at 5:28 AM, Andrei Budnik 
> wrote:
> >
> > Hi All,
> >
> > The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
> > The corresponding ticket is https://issues.apache.org/jira/browse/MESOS-6905
>
> Is there any documentation about how executors ought to use this reason?
> Even a comment in the proto files would help executor authors use this
> consistently.
>
> J


Re: Isolating metrics collection from master/agent slowness

2017-05-22 Thread James Peach

> On May 19, 2017, at 11:35 AM, Zhitao Li  wrote:
> 
> Hi,
> 
> I'd like to start a conversation about metrics collection endpoints
> (especially `/metrics/snapshot`) behavior.
> 
> Right now, these endpoints are served from the same master/agent's
> libprocess, and extensively use `gauge` to chain further callbacks to
> collect various metrics (the DRF allocator specifically adds several metrics
> per role).
> 
> This brings a problem when the system is under load: when the
> master/allocator libprocess becomes busy, stats collection itself becomes
> slow too. Flying blind when the system is under load is especially painful
> for an operator.

Yes, sampling metrics should approach zero cost.

> I would like to explore the direction of isolating metric collection even
> when the master is slow. A couple of ideas:
> 
> - (short term) reduce usage of gauge and prefer counter (since I believe
> they are less affected);

I'd rather not squash the semantics for performance reasons. If a metric has 
gauge semantics, I don't think we should represent that as a Counter.

> - alternative implementation of `gauge` which does not contend on
> master/allocator's event queue;

This is doable in some circumstances, but not always. For example, 
Master::_uptime_secs() doesn't need to run on the master queue, but 
Master::_outstanding_offers arguably does. The latter could be implemented by 
sampling a variable that is updated, but that's not very generic, so we should
try to think of something better.

> - serving metrics collection from a different libprocess routine.

See MetricsProcess. One (mitigation?) approach would be to sample the metrics 
at a fixed rate and then serve the cached samples from the MetricsProcess. I 
expect most installations have multiple clients sampling the metrics, so this 
would at least decouple the sample rate from the metrics request rate.

> 
> Any thoughts on these?
> 
> -- 
> Cheers,
> 
> Zhitao Li



Use of ACLs.RegisterAgent.agent

2017-05-22 Thread Alexander Rojas
Hey guys,

We just noted that there was an error when the `RegisterAgent` ACL was
introduced. Namely, its object field is listed as `agent` when by convention we
have used the plural, so it should be `agents`. This ACL hasn't been part of any
released version of Mesos, so if no one is using it I will try to push for a
rename without going through any deprecation cycle.

The big question is: are any of you using this particular ACL in production
right now?

Alexander Rojas
alexan...@mesosphere.io






Re: Added task status update reason for health checks

2017-05-22 Thread James Peach

> On May 22, 2017, at 5:28 AM, Andrei Budnik  wrote:
> 
> Hi All,
> 
> The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
> The corresponding ticket is https://issues.apache.org/jira/browse/MESOS-6905

Is there any documentation about how executors ought to use this reason? Even a 
comment in the proto files would help executor authors use this consistently.

J

Re: GPU Users -- Deprecation of GPU_RESOURCES capability

2017-05-22 Thread Zhitao Li
Hi Kevin,

Thanks for engaging with the community on this. My 2 cents:

1. I feel that this capability has a particularly useful semantic which is
lacking in the current reservation system: reserving some scarce resource
for a *dynamic list of multiple roles*:

Right now, any reservation (static or dynamic) can only express the
semantic of "reserving this resource for the given role R". However, in a
complex cluster, it is possible that we have roles [R1, R2, ..., RN] which
want to share the scarce resource among themselves, while another set of
roles should never see the given resource.

The new hierarchical roles (and/or multi-role support?) might be able to
provide a better solution, but until that's widely available and adopted,
the capability-based hack is the only thing I know of that can solve the
problem.

In fact, if we are going to go with the `--filter-gpu-resources` path, I
think we should make the filter more powerful (i.e., able to handle all
known framework <-> resource/host constraints and more types of scarce
resources) instead of applying piecewise patches to a specific use case.

Happy to chat more on this topic.
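For concreteness, the reservation-based alternative today amounts to
statically reserving an agent's GPUs for a single role via the agent flags,
roughly like the following (the role name, master address, and resource
amounts are only illustrative):

mesos-agent \
  --master=zk://master.example.com:2181/mesos \
  --isolation="cgroups/devices,gpu/nvidia" \
  --resources="gpus(gpu-role):4;cpus:32;mem:131072;disk:1048576"

which is exactly where the limitation above shows up: the reservation names a
single role (`gpu-role` here), not a dynamic list of roles.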

On Sat, May 20, 2017 at 6:45 PM, Kevin Klues  wrote:

> Hello GPU users,
>
> We are currently considering deprecating the requirement that frameworks
> register with the GPU_RESOURCES capability in order to receive offers that
> contain GPUs. Going forward, we will recommend that users rely on Mesos's
> builtin `reservation` mechanism to achieve similar results.
>
> Before deprecating it, we wanted to get a sense from the community if
> anyone is currently relying on this capability and would like to see it
> persist. If not, we will begin deprecating it in the next Mesos release and
> completely remove it in Mesos 2.0.
>
> As background, the original motivation for this capability was to keep
> “legacy” frameworks from inadvertently scheduling jobs that don’t require
> GPUs on GPU capable machines and thus starving out other frameworks that
> legitimately want to place GPU jobs on those machines. The assumption here
> was that most machines in a cluster won't have GPUs installed on them, so
> some mechanism was necessary to keep legacy frameworks from scheduling jobs
> on those machines. In essence, it provided an implicit reservation of GPU
> machines for "GPU aware" frameworks, bypassing the traditional
> `reservation` mechanism already built into Mesos.
>
> In such a setup, legacy frameworks would be free to schedule jobs on
> non-GPU machines, and "GPU aware" frameworks would be free to schedule GPU
> jobs on GPU machines and other types of jobs on other machines (or mix and
> match them however they please).
>
> However, the problem comes when *all* machines in a cluster contain GPUs
> (or even if most of the machines in a cluster contain them). When this is
> the case, we have the opposite problem we were trying to solve by
> introducing the GPU_RESOURCES capability in the first place. We end up
> starving out jobs from legacy frameworks that *don’t* require GPU resources
> because there are not enough machines available that don’t have GPUs on
> them to service those jobs. We've actually seen this problem manifest in
> the wild at least once.
>
> An alternative to completely deprecating the GPU_RESOURCES flag would be to
> add a new flag to the mesos master called `--filter-gpu-resources`. When
> set to `true`, this flag will cause the mesos master to continue to
> function as it does today. That is, it would filter offers containing GPU
> resources and only send them to frameworks that opt into the GPU_RESOURCES
> framework capability. When set to `false`, this flag would cause the master
> to *not* filter offers containing GPU resources, and indiscriminately send
> them to all frameworks whether they set the GPU_RESOURCES capability or
> not.
>
> For users who currently rely on the GPU_RESOURCES capability, this flag
> would allow them to keep relying on it without disruption.
>
> We'd prefer to deprecate the capability completely, but would consider
> adding this flag if people are currently relying on the GPU_RESOURCES
> capability and would like to see it persist.
>
> We welcome any feedback you have.
>
> Kevin + Ben
>



-- 
Cheers,

Zhitao Li


Re: GPU Users -- Deprecation of GPU_RESOURCES capability

2017-05-22 Thread Olivier Sallou


On 05/21/2017 03:45 AM, Kevin Klues wrote:
> Hello GPU users,
>
> We are currently considering deprecating the requirement that frameworks
> register with the GPU_RESOURCES capability in order to receive offers that
> contain GPUs. Going forward, we will recommend that users rely on Mesos's
> builtin `reservation` mechanism to achieve similar results.
>
> Before deprecating it, we wanted to get a sense from the community if
> anyone is currently relying on this capability and would like to see it
> persist. If not, we will begin deprecating it in the next Mesos release and
> completely remove it in Mesos 2.0.
Well, I am using it for the GoDocker framework, where jobs can specify
whether or not to use GPUs.
>
> As background, the original motivation for this capability was to keep
> “legacy” frameworks from inadvertently scheduling jobs that don’t require
> GPUs on GPU capable machines and thus starving out other frameworks that
> legitimately want to place GPU jobs on those machines. The assumption here
> was that most machines in a cluster won't have GPUs installed on them, so
> some mechanism was necessary to keep legacy frameworks from scheduling jobs
> on those machines. In essence, it provided an implicit reservation of GPU
> machines for "GPU aware" frameworks, bypassing the traditional
> `reservation` mechanism already built into Mesos.
>
> In such a setup, legacy frameworks would be free to schedule jobs on
> non-GPU machines, and "GPU aware" frameworks would be free to schedule GPU
> jobs on GPU machines and other types of jobs on other machines (or mix and
> match them however they please).
>
> However, the problem comes when *all* machines in a cluster contain GPUs
> (or even if most of the machines in a cluster contain them). When this is
> the case, we have the opposite problem we were trying to solve by
> introducing the GPU_RESOURCES capability in the first place. We end up
> starving out jobs from legacy frameworks that *don’t* require GPU resources
> because there are not enough machines available that don’t have GPUs on
> them to service those jobs. We've actually seen this problem manifest in
> the wild at least once.
>
> An alternative to completely deprecating the GPU_RESOURCES flag would be to
> add a new flag to the mesos master called `--filter-gpu-resources`. When
> set to `true`, this flag will cause the mesos master to continue to
> function as it does today. That is, it would filter offers containing GPU
> resources and only send them to frameworks that opt into the GPU_RESOURCES
> framework capability. When set to `false`, this flag would cause the master
> to *not* filter offers containing GPU resources, and indiscriminately send
> them to all frameworks whether they set the GPU_RESOURCES capability or not.
>
> For users who currently rely on the GPU_RESOURCES capability, this flag
> would allow them to keep relying on it without disruption.
>
> We'd prefer to deprecate the capability completely, but would consider
> adding this flag if people are currently relying on the GPU_RESOURCES
> capability and would like to see it persist.
>
> We welcome any feedback you have.
>
> Kevin + Ben
>

-- 
Olivier Sallou
IRISA / University of Rennes 1
Campus de Beaulieu, 35000 RENNES - FRANCE
Tel: 02.99.84.71.95

gpg key id: 4096R/326D8438  (keyring.debian.org)
Key fingerprint = 5FB4 6F83 D3B9 5204 6335  D26D 78DC 68DB 326D 8438



Added task status update reason for health checks

2017-05-22 Thread Andrei Budnik
Hi All,

The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
The corresponding ticket is https://issues.apache.org/jira/browse/MESOS-6905


Best,
Andrei Budnik