On augmenting TLS configuration options in libprocess

2019-05-24 Thread Alex Rukletsov
Folks,

We reviewed TLS configuration options in libprocess and came up with the
following proposal [1] to allow for certificate verification in client mode
only.

In short, the proposal suggests adding two flags to libprocess so that it
can be configured to:
* always require presence and verify server certificates,
* never request client certificates,
* validate hostname using OpenSSL calls.
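
For readers unfamiliar with these settings, here is a minimal client-side sketch. It uses Python's standard `ssl` module (a wrapper around OpenSSL) rather than libprocess, purely as an illustration; the proposed libprocess flag names are not reproduced here, and `make_client_context` and `ca_file` are hypothetical names.

```python
import ssl


def make_client_context(ca_file=None):
    """Build a client-side TLS context that always verifies the server.

    `ca_file` is a hypothetical path to a CA bundle; when it is None the
    platform's default trust store is used instead.
    """
    # PROTOCOL_TLS_CLIENT turns on certificate and hostname checks by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        context.load_verify_locations(cafile=ca_file)
    else:
        context.load_default_certs()
    # Stated explicitly so the intent is visible in the code.
    context.verify_mode = ssl.CERT_REQUIRED  # server certificate is mandatory
    context.check_hostname = True            # hostname validated by OpenSSL
    return context
```

No client certificate is loaded in this sketch, so the client has nothing to present; the "never request client certificates" item concerns the server side and is not shown.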

Please review.

[1]
https://docs.google.com/document/d/1O3q7UOXVGNw81xOkRNFPzrtbC__D-N_D_mwV6D--y0k/edit


Re: '*.json' endpoints removed in 1.7

2019-05-11 Thread Alex Rukletsov
Before we decide, I'd like to offer another angle. Thanks to the removal
of the endpoint aliases, the widely used but only occasionally maintained
mesos-dns has been updated, and a newer version will be released soon [1]
(thanks to jdef). I understand the frustration when software stops working
after an update for silly reasons like removing endpoint aliases, but at
the same time it can be an incentive to update other components in the
ecosystem as well, not just switching from one endpoint to another but
bringing other changes into the release too.

[1] https://github.com/mesosphere/mesos-dns/releases/tag/v0.7.0-rc2

On Fri, May 10, 2019 at 5:03 PM Vinod Kone  wrote:

> I propose that we revert this change and keep the ".json" endpoints in
> master branch and 1.8.x
>
> My reasoning is that, we have ecosystem components (e.g., mesos-dns which
> is yet to have a release with fix) and anecdotally a bunch of custom
> tooling at user sites that depend on these ".json" endpoints (esp.
> /state.json). The amount of techdebt that we saved or consistency we
> achieved in the codebase by doing this is not worth the tradeoff of
> breaking some user/tooling, in my opinion. We could revisit this if and
> when we do a Mesos 2.0.
>
> On Wed, Aug 8, 2018 at 9:25 AM Alex Rukletsov  wrote:
>
> > Folks,
> >
> > The long ago deprecated '*.json' endpoints will be removed in Mesos 1.7.0.
> > Please use their non-'.json' counterparts instead.
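
For custom tooling that builds endpoint paths, the migration can be sketched as a small helper. This assumes, as the announcement implies, that each alias differs from its counterpart only by the '.json' suffix; `strip_json_alias` is a hypothetical name, and the helper handles a bare path only (no query string).

```python
def strip_json_alias(path):
    """Map a deprecated '*.json' endpoint path to its non-'.json' form.

    Paths without the suffix are returned unchanged.
    """
    suffix = ".json"
    if path.endswith(suffix):
        return path[: -len(suffix)]
    return path
```

For example, `strip_json_alias("/state.json")` yields `/state`.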
> >
> > Commit:
> > https://github.com/apache/mesos/commit/42551cb5290b7b04101f7d800b4b8fd573e47b91
> > JIRA ticket: https://issues.apache.org/jira/browse/MESOS-4509
> >
> > Alex.
> >
>


Re: [VOTE] Release Apache Mesos 1.8.0 (rc3)

2019-04-30 Thread Alex Rukletsov
Modulo Jorge's comment (hope he'll come back soon),

+1 (binding).

This rc has been deployed on a cluster internally by us at Mesosphere and
has been running without noticeable issues for a couple of days now.

Alex.

On Mon, Apr 29, 2019 at 10:05 PM Benno Evers  wrote:

> Hi Jorge,
>
> I'm admittedly not too familiar with CUDA and tensorflow but the error
> message you describe sounds to me more like a build issue, i.e. it sounds
> like the version of the nvidia driver is different between the docker image
> and the host system?
>
> Maybe you could continue investigating to see if this is related to the
> release itself or caused by some external cause, and create a JIRA ticket
> to capture your findings?
>
> Thanks,
> Benno
>
> On Fri, Apr 26, 2019 at 9:55 PM Jorge Machado  wrote:
>
> > Hi all,
> >
> > Has anyone tested it on Ubuntu 18.04 + nvidia-docker2? We are having
> > some issues using the CUDA 10+ images when doing real processing. We
> > still need to check some things, but basically we get:
> >
> > kernel version 418.56.0 does not match DSO version 410.48.0 -- cannot
> > find working devices in this configuration
> >
> >
> > Logs:
> >
> > I0424 13:27:14.00058630 executor.cpp:726] Forked command at 73
> > Preparing rootfs at
> '/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b'
> > Marked '/' as rslave
> > Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpuacct"],"shell":false,"value":"ln"}'
> > Executing pre-exec command
> '{"arguments":["ln","-s","/sys/fs/cgroup/cpu,cpuacct","/data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b/sys/fs/cgroup/cpu"],"shell":false,"value":"ln"}'
> > Changing root to
> /data0/mesos/work/provisioner/containers/548d3cae-30b5-4530-a8db-f94b00215718/backends/overlay/rootfses/e1ceb89e-3abc-4587-a87c-d63037b7ae8b
> > 2019-04-24 13:27:18.346994: I
> tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports
> instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
> > 2019-04-24 13:27:18.352203: E
> tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit:
> CUDA_ERROR_UNKNOWN: unknown error
> > 2019-04-24 13:27:18.352243: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA
> diagnostic information for host: __host__
> > 2019-04-24 13:27:18.352252: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: __host__
> > 2019-04-24 13:27:18.352295: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported
> version is: 410.48.0
> > 2019-04-24 13:27:18.352329: I
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported
> version is: 418.56.0
> > 2019-04-24 13:27:18.352338: E
> tensorflow/stream_executor/cuda/cuda_diagnostics.cc:306] kernel version
> 418.56.0 does not match DSO version 410.48.0 -- cannot find working
> devices in this configuration
> > 2019-04-24 13:27:18.374940: I
> tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency:
> 259392 Hz
> > 2019-04-24 13:27:18.378793: I
> tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f41e10
> executing computations on platform Host. Devices:
> > 2019-04-24 13:27:18.378821: I
> tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device
> (0): , 
> > W0424 13:27:18.385210 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263:
> colocate_with (from tensorflow.python.framework.ops) is deprecated and will
> be removed in a future version.
> > Instructions for updating:
> > Colocations handled automatically by placer.
> > W0424 13:27:18.399287 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:129:
> conv2d (from tensorflow.python.layers.convolutional) is deprecated and will
> be removed in a future version.
> > Instructions for updating:
> > Use keras.layers.conv2d instead.
> > W0424 13:27:18.433226 140191267731200 deprecation.py:323] From
> /user/tf-benchmarks-113/scripts/tf_cnn_benchmarks/convnet_builder.py:261:
> max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and
> will be removed in a future version.
> > Instructions for updating:
> > Use keras.layers.max_pooling2d instead.
> > W0424 13:27:20.197937 140191267731200 deprecation.py:323] From
> /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:209:
> to_float (from tensorflow.python.ops.math_ops) is deprecated and will be
> removed in a future version.
> > Instructions for updating:
> > Use tf.cast instead.
> > W0424 

Re: [VOTE] Release Apache Mesos 1.4.3 (rc1)

2019-01-28 Thread Alex Rukletsov
This will be the last official 1.4.x release. Even though we agreed to keep
the branch and occasionally backport fixes to it after the last release,
maybe it makes sense to include all pending patches in 1.4.3? I see, for
example, that Gilbert added the fix for MESOS-9532 [1]. We were also
considering backporting other test fixes [2] to the 1.4.x branch.

[1] https://github.com/apache/mesos/commits/1.4.x
[2] https://gist.github.com/rukletsov/a2a7bedad58010ab8adf209cdc5eef0c

On Fri, Jan 25, 2019 at 11:12 PM Meng Zhu  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.4.3.
>
> 1.4.3 includes the following:
>
> 
> https://issues.apache.org/jira/issues/?filter=12345433
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.3-rc1
>
> 
>
> The candidate for Mesos 1.4.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz
>
> The tag to be voted on is 1.4.3-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.3-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.sha512
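
Voters who want to check the tarball against the published checksum could do so along these lines. This is a sketch under one stated assumption: the first whitespace-separated token of the `.sha512` file is the hex digest. The function names are hypothetical.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, checksum_text):
    """Compare a file against the text of its .sha512 file.

    Assumption: the first token of `checksum_text` is the hex digest.
    """
    tokens = checksum_text.split()
    if not tokens:
        return False
    return sha512_of_file(path) == tokens[0].lower()
```

This verifies integrity only; the PGP signature still has to be checked separately against the KEYS file.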
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1244
>
> Please vote on releasing this package as Apache Mesos 1.4.3!
>
> The vote is open until Mon Jan 30th 14:02:55 PST 2019 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.4.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Meng
>


Re: full Zookeeper authentication

2018-12-24 Thread Alex Rukletsov
Made you a contributor and assigned the issue to you. Thanks!

Joseph, will you shepherd this?

On Fri, Dec 21, 2018 at 4:32 PM Kishchukov, Dmitrii (NIH/NLM/NCBI) [C] <
dmitrii.kishchu...@nih.gov> wrote:

> I created a JIRA account. Username is dkishchukov
>
> --
>
> Dmitrii Kishchukov.
> Leading software developer
> Submission Portal Team
>
>
> On 12/21/18, 4:02 AM, "Alex Rukletsov"  wrote:
>
> Dmitrii—
>
> here we go: MESOS-9499 [1]. I've noticed you don't have an Apache JIRA
> account, I'd suggest you create one so that you can assign the ticket to
> you and hence get credit properly. Hope it is not your last contribution
> to Apache projects : ).
>
> [1] https://issues.apache.org/jira/browse/MESOS-9499
>
>


Re: full Zookeeper authentication

2018-12-21 Thread Alex Rukletsov
Dmitrii—

here we go: MESOS-9499 [1]. I've noticed you don't have an Apache JIRA
account. I'd suggest you create one so that you can assign the ticket to
yourself and hence get proper credit. Hope it is not your last contribution
to Apache projects : ).

[1] https://issues.apache.org/jira/browse/MESOS-9499

On Thu, Dec 20, 2018 at 6:40 PM Kishchukov, Dmitrii (NIH/NLM/NCBI) [C] <
dmitrii.kishchu...@nih.gov> wrote:

> The patch turned out to be quite simple. I changed only the Authentication
> and URL classes.
>
> Should I create a JIRA ticket for this, or will someone create it for me?
>
>
> --
>
> Dmitrii Kishchukov.
> Leading software developer
> Submission Portal Team
>
>
> On 12/10/18, 3:01 PM, "Joseph Wu"  wrote:
>
> There are two options for contributing:
> 1) You can make a pull request against the GitHub mirror:
> https://github.com/apache/mesos .  We generally only use PRs for minor
> changes, like typos, documentation, or uploading binaries.  See
> http://mesos.apache.org/documentation/latest/beginner-contribution/
> 2) For larger changes, or more involved/impactful changes, we prefer
> https://reviews.apache.org/ instead.  See
> http://mesos.apache.org/documentation/latest/advanced-contribution/
>
> I suspect this ZK Auth feature will be a fairly significant change, so I
> recommend option (2).
>
> On Mon, Dec 10, 2018 at 11:47 AM Kishchukov, Dmitrii (NIH/NLM/NCBI) [C] <
> dmitrii.kishchu...@nih.gov> wrote:
>
> > I have a working version. How should I make the patch? A branch in the
> > git repository? Do I need to get permissions?
> >
> > --
> >
> > Dmitrii Kishchukov.
> > Leading software developer
> > Submission Portal Team
> >
> >
> > On 12/6/18, 12:56 PM, "Vinod Kone"  wrote:
> >
> > Dmitrii.
> >
> > That approach sounds reasonable. Would you like to work on this? Are you
> > looking for a reviewer/shepherd?
> >
> > On Thu, Dec 6, 2018 at 11:28 AM Kishchukov, Dmitrii (NIH/NLM/NCBI) [C] <
> > dmitrii.kishchu...@nih.gov> wrote:
> >
> > > > Mesos allows only the digest authentication scheme for Zookeeper,
> > > > which is limiting because Zookeeper has quite a flexible security
> > > > model. It is easy to make your own authenticator with its own scheme
> > > > name.
> > > >
> > > > To fully support Zookeeper authentication, Mesos has to pass two items
> > > > to Zookeeper: scheme and credentials.
> > > > Credentials can have a different format depending on the
> > > > authentication scheme. For the digest scheme it is 'login:password'.
> > > >
> > > > All Mesos should do is pass the scheme and credentials to Zookeeper.
> > > >
> > > > Another improvement might be to configure credentials via a file
> > > > instead of the URI.
> > > >
> > > > For example, it can be two command-line options:
> > > > --zk_auth_scheme and --zk_auth_credentials
> > > >
> > > > It can be used like this:
> > > > --zk_auth_scheme=some_custom_scheme --zk_auth_credentials=filename
> > > >
> > > > --zk_auth_credentials can just take the entire contents of the file as
> > > > the credentials string.
> > > >
> > > > The Authentication class in Mesos already contains all that we need.
> > > > The problem is what Mesos passes to the constructor.
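
The file-based option described above can be sketched as follows. This is not the Mesos patch: `load_zk_auth` is a hypothetical helper, and stripping one trailing newline from the file is an assumption of the sketch (the proposal takes the file contents as-is).

```python
def load_zk_auth(scheme, credentials_file):
    """Return the (scheme, credentials) pair to hand to ZooKeeper.

    `scheme` is passed through untouched, so custom authenticators work.
    The credentials are the file contents; a single trailing newline is
    dropped, which is an assumption of this sketch.
    """
    with open(credentials_file, "r", encoding="utf-8") as handle:
        credentials = handle.read()
    if credentials.endswith("\n"):
        credentials = credentials[:-1]
    return scheme, credentials
```

With the digest scheme, a file containing `login:password` would yield `("digest", "login:password")`; any other scheme name is forwarded the same way.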
> > >
> > >
> > > --
> > >
> > > Dmitrii Kishchukov.
> > >
> > >
> >
> >
> >
>
>
>


Re: Propose to create a Kubernetes framework for Mesos

2018-11-23 Thread Alex Rukletsov
I'm in favour of the proposal, Cameron. Building a bridge between Mesos and
Kubernetes will be beneficial for both communities. The Virtual Kubelet
effort looks promising indeed and is definitely a worthwhile approach to
building the bridge.

While we will need some sort of a scheduler when implementing a provider
for Mesos, we don't need to implement and use a "default" one: a simple
mesos-go based scheduler will be fine for a start. We can of course
consider building a default scheduler, but this will significantly increase
the size of the project.

An exercise we will have to do here is determine which parts of a
Kubernetes task specification can be "converted" and hence launched on a
Mesos cluster. Once we have a working prototype we can start testing and
collecting data.

Do you want to come up with a plan and maybe a more detailed proposal?

Best,
Alex


Re: Join us at MesosCon 2018 next week!

2018-11-07 Thread Alex Rukletsov
I'd like to thank everyone involved in organising this MesosCon, and
especially Gastón, Jörg, and Andy. I enjoyed the laid-back "underground"
style this year; it was easy to engage in conversations with users and
Mesos developers. Looking forward to the next MesosCon!

Alex

On Thu, Nov 1, 2018 at 10:07 PM Vaibhav Khanduja 
wrote:

> Thank You,
>
> I am looking at the schedule of events. There is a hackathon on Wednesday;
> are there more details available? When to register etc?
>
> On Thu, Nov 1, 2018 at 11:37 AM Gastón Kleiman 
> wrote:
>
> > You can pick up your ticket at 30% off here (source tweet).
> >
> > On Thu, Nov 1, 2018 at 10:33 AM Vaibhav Khanduja <
> > vaibhavkhand...@gmail.com> wrote:
> >
> >> Thanks for the email.
> >>
> >> Is there any promotional code available for enterprises?
> >>
> >> On Wed, Oct 31, 2018 at 5:06 PM Gastón Kleiman 
> >> wrote:
> >>
> >>> MesosCon 2018 is taking place next week! Join us and celebrate the 5th
> >>> anniversary of MesosCon November 5th-7th, at The Village (969 Market
> >>> St, San Francisco).
> >>>
> >>> MesosCon North America is an annual conference organized by the Apache
> >>> Mesos community, bringing together users and developers to share and
> >>> learn about the Apache Mesos project, containers, DevOps, and automation.
> >>>
> >>> What to expect
> >>>
> >>> MesosCon will include tracks focused on case studies and architecture of
> >>> modern, containerized applications, fast data tools like Spark,
> >>> Cassandra, and TensorFlow, and about Mesos itself. Attendees can expect
> >>> engaging keynotes, technical breakout sessions, and collaborative town
> >>> hall sessions to include Mesos and the broader ecosystem. Attendees can
> >>> expect to:
> >>>
> >>>
> >>>- Learn how to design and build their own custom frameworks
> >>>- Discover how easy it is to build, deploy, and scale your applications
> >>>- Dive deep into Mesos internals, storage, security, and networking
> >>>- Network with the community and share best practices and lessons
> >>>  learned
> >>>
> >>>
> >>> Check out the schedule and register at http://mesoscon2018.org.
> >>>
> >>> Cheers,
> >>>
> >>> The MesosCon 2018 organization team
> >>>
> >
>


Re: Request for Comments - Health Check API Proposal

2018-10-18 Thread Alex Rukletsov
Why do we need to resolve this note now? It is obvious that health
interpretation must be part of the API. I'm not sure I understand what
concerns you have, Vinod.

On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone  wrote:

> I understand and am in agreement that `HealthCheckStatusInfo` will have
> more information than `CheckStatusInfo`.
>
> I would like us to put a little more thought into what that would look like
> to be doubly sure that what we are introducing today will be evolvable into
> that envisioned future. We have to live with API changes for a long time,
> so I would like to see more rigor here (e.g., has the note on top of the
> `HealthCheckStatusInfo` in the doc
> <https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v>
> been discussed/resolved?) to avoid costly changes/deprecations.
>
> On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov 
> wrote:
>
> > Thanks for the thoughts, Vinod! Answers inlined.
> >
> > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone  wrote:
> >
> > > One of the things we discussed when we added `CheckInfo` and
> > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
> > > field (inside `TaskStatus`) consistent with the new `Check` format.
> > >
> > Correct.
> >
> > >
> > > IIRC, some of the changes we wanted to do were
> > >
> > >- Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
> > >
> > Correct.
> >
> > >- The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
> > >should be named differently in `HealthCheckInfo` (e.g., `Http`)
> > >
> > Likely, yes.
> >
> > >- Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> > >`HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> > >
> > Correct.
> >
> > >
> > > Right now, the proposal seems to only address the last point without
> > > addressing the first two, which feels weird to me. I would prefer to see
> > > them addressed in one shot.
> > >
> > Can you please explain why? Is there any problem you foresee if we do it
> > step by step? Introducing `HealthCheckStatusInfo` now solves an important
> > problem and does not seem to introduce new issues.
> >
> > >
> > > Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
> > > different from `CheckStatusInfo`. Is that intentional? I hope we are not
> > > thinking of deprecating it again when we come around to fix `HealthCheck`
> > > proto to be consistent with `CheckInfo` ?
> > >
> > What do you think it should look like? Why will we deprecate it?
> >
> > Health checks are different from checks in the way the result of a check
> > is interpreted on the agent. In other words, a health check is an extra
> > step on top of a check. We might include `CheckStatusInfo` or its contents
> > into `HealthCheckStatusInfo`, but... should we think about this now? It is
> > nice to have lower level info from the check in the health status update,
> > but it also means more data to transfer. But interpretation (health) we
> > definitely need.
> >
> > Greg, I'm +1 on your proposal.
> >
> > >
> > > Thanks,
> > >
> > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann  wrote:
> > >
> > > > Hi all,
> > > > Some users have recently reported issues with our current
> > > > implementation of health checks. See this ticket
> > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction
> > > > to the issue.
> > > >
> > > > To summarize: we currently use a single 'optional bool healthy' field
> > > > within the 'TaskStatus' message to indicate the result of a health
> > > > check. This allows us to expose 3 health states to users:
> > > > 1) 'healthy' field is unset = no health check specified, or health
> > > > check failed but grace period has not yet elapsed, or health check has
> > > > not yet been attempted
> > > > 2) 'healthy' field is set to 'false' = a health check is specified and
> > > > it returned 'false'
> > > > 3) 'healthy' field is set to 'true' = a health check is specified and
> > > > it returned 'true'
> > > >
> > > > The issue is that some users need to distinguish

Re: Request for Comments - Health Check API Proposal

2018-10-18 Thread Alex Rukletsov
Thanks for the thoughts, Vinod! Answers inlined.

On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone  wrote:

> One of the things we discussed when we added `CheckInfo` and
> `CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
> field (inside `TaskStatus`) consistent with the new `Check` format.
>
Correct.

>
> IIRC, some of the changes we wanted to do were
>
>- Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
>
Correct.

>- The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
>should be named differently in `HealthCheckInfo` (e.g., `Http`)
>
Likely, yes.

>- Deprecate `bool healthy` in TaskStatusInfo and introduce a new
>`HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
>
Correct.

>
> Right now, the proposal seems to only address the last point without
> addressing the first two, which feels weird to me. I would prefer to see
> them addressed in one shot.
>
Can you please explain why? Is there any problem you foresee if we do it
step by step? Introducing `HealthCheckStatusInfo` now solves an important
problem and does not seem to introduce new issues.

>
> Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
> different from `CheckStatusInfo`. Is that intentional? I hope we are not
> thinking of deprecating it again when we come around to fix `HealthCheck`
> proto to be consistent with `CheckInfo` ?
>
What do you think it should look like? Why will we deprecate it?

Health checks are different from checks in the way the result of a check is
interpreted on the agent. In other words, a health check is an extra step on
top of a check. We might include `CheckStatusInfo` or its contents into
`HealthCheckStatusInfo`, but... should we think about this now? It is nice
to have lower level info from the check in the health status update, but it
also means more data to transfer. But interpretation (health) we definitely
need.
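
To make the "extra step on top of a check" concrete, here is a minimal sketch. The raw check yields an HTTP status code and the health check turns it into a verdict. `interpret_http_check` is a hypothetical name, and treating codes 200-399 as healthy is an assumption of this sketch, not something stated in this thread.

```python
from typing import Optional


def interpret_http_check(status_code: Optional[int]) -> Optional[bool]:
    """Turn a raw HTTP check result into a health verdict.

    None means the check has produced no result yet, so no verdict is
    given. The 200-399 "healthy" range is an assumption of this sketch.
    """
    if status_code is None:
        return None
    return 200 <= status_code < 400
```

The status code is the lower-level check info; the boolean is the interpretation that consumers such as load balancers actually need.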

Greg, I'm +1 on your proposal.

>
> Thanks,
>
> On Wed, Oct 17, 2018 at 1:26 PM Greg Mann  wrote:
>
> > Hi all,
> > Some users have recently reported issues with our current implementation
> > of health checks. See this ticket
> > <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
> > the issue.
> >
> > To summarize: we currently use a single 'optional bool healthy' field
> > within the 'TaskStatus' message to indicate the result of a health check.
> > This allows us to expose 3 health states to users:
> > 1) 'healthy' field is unset = no health check specified, or health check
> > failed but grace period has not yet elapsed, or health check has not yet
> > been attempted
> > 2) 'healthy' field is set to 'false' = a health check is specified and it
> > returned 'false'
> > 3) 'healthy' field is set to 'true' = a health check is specified and it
> > returned 'true'
> >
> > The issue is that some users need to distinguish between the three
> > scenarios in #1: no health check is specified, OR the task is not yet
> > healthy but we are in the grace period. An example use case would be a
> > load balancer which needs to wait for a healthy status to route traffic,
> > but which immediately routes traffic to tasks which have no health check
> > defined.
> >
> > This issue was recognized during the design of Mesos generalized checks;
> > for those checks, we use the presence of the 'check_status' field to
> > indicate whether or not a check is defined for the task. While consumers
> > could make use of generalized checks as a workaround, this does not allow
> > them to both detect the presence of a check AND achieve the task-killing
> > behavior that health checks provide.
> >
> > In order to address this, I would like to propose the following new
> > message, and an addition to the 'TaskStatus' message:
> >
> > message HealthCheckStatusInfo {
> >   enum Status {
> >     UNKNOWN = 0;
> >     HEALTHY = 1;
> >     UNHEALTHY = 2;
> >   }
> >
> >   required Status status = 1;
> > }
> >
> > message TaskStatus {
> >   . . .
> >
> >   optional HealthCheckStatusInfo health_check_status = 17;
> >
> >   . . .
> > }
> >
> > The semantics of these fields would be as follows:
> >
> > 'health_check_status' field:
> > - If set, a health check has been set
> > - If unset, a health check has not been set
> >
> > 'health_check_status.status' field:
> > - UNKNOWN: The task has not become healthy but is still within its grace
> > period (this state is also used if an internal error prevents us from
> > running the health check successfully)
> > - HEALTHY: The health check indicates the task is healthy
> > - UNHEALTHY: The health check indicates the task is not healthy
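
The load balancer use case mentioned earlier in this message can be sketched against the proposed semantics. The enum mirrors the proposed `Status` values; `should_route_traffic` is a hypothetical consumer-side helper, and `None` stands in for an unset status field.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    """Mirror of the proposed HealthCheckStatusInfo.Status values."""
    UNKNOWN = 0
    HEALTHY = 1
    UNHEALTHY = 2


def should_route_traffic(health_check_status: Optional[Status]) -> bool:
    """Decide whether a load balancer may send traffic to a task.

    None models an unset status field, meaning no health check is
    defined for the task, so traffic is routed immediately.
    """
    if health_check_status is None:
        return True
    # UNKNOWN (grace period or internal error) and UNHEALTHY both wait.
    return health_check_status is Status.HEALTHY
```

With the old 'optional bool healthy' field, the first two branches are indistinguishable, which is the gap the proposal closes.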
> >
> > This change would also involve deprecating the existing 'healthy' field.
> > In accordance with our deprecation policy, I believe we could not remove
> > the deprecated field until we have a new major release (2.x).
> >
> > I'd love to hear feedback on this proposal, thanks in advance! I'll also
> > add this as an agenda item to our upcoming 

On committer candidate nomination

2018-10-16 Thread Alex Rukletsov
Folks,

A seemingly complex and long path to become a committer can drive away
potential candidates shortly after they start contributing to the project.
Around a year ago Jim Jagielski raised a concern about the high entry bar
we have in the project. We heard the feedback and decided to liberalize our
process for nominating new committers by simplifying the committer
checklist.

1) We have relaxed our committer candidate guidelines, see the updated
version [1].
2) Committer checklist [2] is a thing of the past: candidates are no longer
supposed to fill it in.
3) Nominators are encouraged to use template [3] when proposing new
candidates.

Alex on behalf of Mesos PMC.

[1]
https://github.com/apache/mesos/blob/07ab5abb1db91fda3fa118083dc15065f314a3fd/docs/committer-candidate-guidelines.md
[2]
https://github.com/apache/mesos/blob/69f3744f3b2f8e2a8116f023020696950af573ad/docs/committer-candidate-checklist.md
[3]
https://docs.google.com/document/d/1RBShT_kSqWqvG7HOzQhpNINGd17ZkJXGY7vMyxTZZXg/edit


Re: [VOTE] Release Apache Mesos 1.7.0 (rc3)

2018-09-14 Thread Alex Rukletsov
+1 (binding)

Mesosphere's internal CI ran with the aforementioned tag. Observed 4 flaky
tests, 3 of which are known:
https://issues.apache.org/jira/browse/MESOS-5048
https://issues.apache.org/jira/browse/MESOS-8260
https://issues.apache.org/jira/browse/MESOS-8951

One has been introduced as part of adding GC to nested containers
(MESOS-7947), which is disabled in the release:
https://issues.apache.org/jira/browse/MESOS-9217


On Tue, Sep 11, 2018 at 8:09 PM, Gastón Kleiman 
wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>
>
> 1.7.0 includes the following:
> 
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through
> RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF
> sorter
>
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc3
> 
> 
>
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz
>
> The tag to be voted on is 1.7.0-rc3:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc3
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1234
>
> Please vote on releasing this package as Apache Mesos 1.7.0!
>
> The vote is open until Fri Sep 14 11:06:30 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> Chun-Hung & Gastón
>


Re: [VOTE] Release Apache Mesos 1.7.0 (rc1)

2018-08-22 Thread Alex Rukletsov
MESOS-9177 has been filed today. It is very likely a regression introduced
by one of the state.json improvements. We are still investigating, but it
is obviously a

-1 (binding)

for rc1.

Alex.


On Wed, Aug 22, 2018 at 4:34 AM, Chun-Hung Hsiao  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>
>
> 1.7.0 includes the following:
> 
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through
> RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF
> sorter
> * Bug fixes: 84 bugs fixed, including 20 critical ones.
>
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc1
> 
> 
>
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz
>
> The tag to be voted on is 1.7.0-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/service/local/repositories/orgapachemesos-1232/
>
> Please vote on releasing this package as Apache Mesos 1.7.0!
>
> The vote is open until Fri Aug 24 19:16:39 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Chun-Hung & Gaston
>


Re: [VOTE] Release Apache Mesos 1.4.2 (rc1)

2018-08-20 Thread Alex Rukletsov
+1 binding (make check on Mac OS 10.13.5)

On Mon, Aug 20, 2018 at 8:28 PM, Kapil Arya  wrote:

> +1 binding (internal CI).
>
> The Apache CI failures reported by Vinod are all known flaky tests. I have
> inserted the details inline.
>
> Best,
> Kapil
>
> On Tue, Aug 14, 2018 at 11:03 AM Vinod Kone  wrote:
>
>> I see some flaky tests in ASF CI, that I don't see already reported.
>>
>> @Kapil Arya Can you take a look at
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/53 and see
>> if the flaky tests are due to bugs in test code and not source?
>>
>> *Revision*: 612ec2c63a68b4d5b60d1d864e6703fde1c2a023
>>
>>- refs/tags/1.4.2-rc1
>>
>> Configuration Matrix gcc clang
>> centos:7 --verbose --enable-libevent --enable-ssl autotools
>> [image: Failed] (gcc)
>>
>
> Failed due to the timeout being too short -- leveldb took 11 seconds to
> respond while the timeout expired at 10 seconds. It also looks like the
> previous operation took longer than expected, potentially due to some
> machine load at the time.
>
> E0814 00:51:13.001557  8738 registrar.cpp:575] Registrar aborting: Failed to 
> update registry: Failed to perform store within 10secs
> ../../src/tests/registrar_tests.cpp:331: Failure
> (registrar.apply( Owned( new MarkSlaveUnreachable(info1, 
> protobuf::getCurrentTime().failure(): Failed to update registry: Failed 
> to perform store within 10secs
> I0814 00:51:18.990106  8743 leveldb.cpp:341] Persisting action (218 bytes) to 
> leveldb took 11.656345772secs
>
>
>> ubuntu:14.04 --verbose --enable-libevent --enable-ssl autotools
>> [image: Failed] (gcc)
>> [image: Failed] (clang)
>>
>
>> --verbose autotools
>> [image: Failed] (clang)
>>
>
> Failures because of known double-free corruption in test code due to
> parallel manipulation of signal and control handler:
> https://issues.apache.org/jira/browse/MESOS-8084
>
>
>
>> cmake
>> [image: Failed] (gcc)
>>
>
> Failure due to a known flaky test:
> https://issues.apache.org/jira/browse/MESOS-7028
>
> On Mon, Aug 13, 2018 at 7:41 PM Benjamin Mahler 
> wrote:
>
>>
>> > +1 (binding)
>> >
>> > make check passes on macOS 10.13.6 with Apple LLVM version 9.1.0
>> > (clang-902.0.39.2).
>> >
>> > Thanks Kapil!
>> >
>> > On Wed, Aug 8, 2018 at 3:06 PM, Kapil Arya  wrote:
>> >
>> > > Hi all,
>> > >
>> > > Please vote on releasing the following candidate as Apache Mesos 1.4.2.
>> > >
>> > > 1.4.2 is a bug fix release. The CHANGELOG for the release is available at:
>> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.2-rc1
>> > >
>> > > The candidate for Mesos 1.4.2 release is available at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz
>> > >
>> > > The tag to be voted on is 1.4.2-rc1:
>> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.2-rc1
>> > >
>> > > The SHA512 checksum of the tarball can be found at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz.sha512
>> > >
>> > > The signature of the tarball can be found at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz.asc
>> > >
>> > > The PGP key used to sign the release is here:
>> > > https://dist.apache.org/repos/dist/release/mesos/KEYS
>> > >
>> > > The JAR is in a staging repository here:
>> > > https://repository.apache.org/content/repositories/orgapachemesos-1231
>> > >
>> > > Please vote on releasing this package as Apache Mesos 1.4.2!
>> > >
>> > > The vote is open until Sat Aug 11 11:59:59 PDT 

'*.json' endpoints removed in 1.7

2018-08-08 Thread Alex Rukletsov
Folks,

The long ago deprecated '*.json' endpoints will be removed in Mesos 1.7.0.
Please use their non-'.json' counterparts instead.

Commit:
https://github.com/apache/mesos/commit/42551cb5290b7b04101f7d800b4b8fd573e47b91
JIRA ticket: https://issues.apache.org/jira/browse/MESOS-4509

Alex.


Apache Mesos repo migration

2018-08-06 Thread Alex Rukletsov
Folks,

the official Mesos git repository has moved over to gitbox. Don't forget to
update your git remotes and tooling! See the updated docs [1].

[1]
https://github.com/apache/mesos/commit/9127c792c5a20aee2f31a2e56854c2490a9ca608


Alex.


Re: Is upgradeStrategy respected during redeployment

2018-07-23 Thread Alex Rukletsov
Feng,

I'd suggest contacting the Marathon developers directly; check out
https://github.com/mesosphere/marathon#help.

On Mon, Jul 23, 2018 at 12:01 PM, Feng LI  wrote:

> I think this might be worth a discussion here.
>
> Feng
> ---------- Forwarded message ---------
> From: Feng LI 
> Date: Thu, 19 Jul 2018 at 17:26
> Subject: Is upgradeStrategy respected during redeployment
> To: 
>
>
> Hello guys,
>
> We had a very similar issue as this one:
> https://jira.mesosphere.com/browse/MARATHON-2340
>
> Basically the upgradeStrategy is not respected during the rollback, which
> results in 0 healthy instances.
>
> I'm wondering what happens if I do a deployment of the old version instead
> of a rollback, will it respect the upgradeStrategy in this case. More
> precisely:
>
> 1. Assume we have 4 healthy instances
> 2. A new deployment with minimumHealthyCapacity set to 0.5, where the new
> version ends up unhealthy. So we'll end up with 2 healthy instances and 2
> unhealthy instances
> 3. If we roll back this deployment, we'll end up with 0 healthy instances
> (say the old version fails its health check due to a startup/warmup
> issue)
> 4. The question is, what if we don't roll back, but do a deployment with the
> old version? Will Marathon guarantee the upgradeStrategy in this case?
>
> Thanks,
> Feng
>


Re: [VOTE] Release Apache Mesos 1.3.3 (rc1)

2018-07-20 Thread Alex Rukletsov
MPark—

what's the decision regarding the 1.3.3 release?

On Mon, Jul 9, 2018 at 8:52 PM, Michael Park  wrote:

> I'm considering simply abandoning the 1.3.3 release and bringing the 1.3.x
> branch to end of life.
> If anyone really wants a 1.3.3, I'm certainly willing to finish the
> release portion of this
> but I don't have time to dig into the CI issue that Vinod pointed out. If
> someone feels compelled
> to investigate the issue and wants 1.3.3 released, please speak up.
>
> I'll wait for some time (say, a week) to gauge the interest and take
> corresponding action.
>
> Thanks,
>
> MPark
>
> On Thu, May 31, 2018 at 11:55 AM Vinod Kone  wrote:
>
>> -1 (binding).
>>
>>
>> Ran it in ASF CI and found an issue worth investigating. The other 3 issues
>> look to be related to known flaky tests and/or a known core dump issue (that
>> has been fixed in later versions).
>>
>> *Revision*: c78e56e4ea217878dd604de638623be166a18db0
>>
>>- refs/tags/1.3.3-rc1
>>
>> Configuration Matrix:
>>
>> OS            Configuration                             Build      gcc      clang
>> centos:7      --verbose --enable-libevent --enable-ssl  autotools  Failed   Not run
>> centos:7      --verbose --enable-libevent --enable-ssl  cmake      Success  Not run
>> centos:7      --verbose                                 autotools  Success  Not run
>> centos:7      --verbose                                 cmake      Success  Not run
>> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools  Success  Failed
>> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake      Success  Failed
>> ubuntu:14.04  --verbose                                 autotools  Failed   Success
>> ubuntu:14.04  --verbose                                 cmake      Success  Success
>>
>>
>> 1) Segfault in HTTP Test.
>> 

Re: Backport Policy

2018-07-16 Thread Alex Rukletsov
Back porting as little as possible is the ultimate goal for me. My reasons
are closely aligned with what Andrew wrote above.

If we agree on this strategy, the next question is how to enforce it. My
intuition is that committers will lean towards back porting their patches
in arguable cases, because humans tend to overestimate the importance of
their personal work. Delegating the decision in such cases to a release
manager will, in my opinion, help us enforce the strategy of a minimal number
of backports. As a bonus, the release manager will have a much better
understanding of what's going on with the release, keyword: "more
ownership".

On Sat, Jul 14, 2018 at 12:07 AM, Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> I believe I fall somewhere between Alex and Ben.
>
> As for deciding what to backport or not, I lean toward Alex's view of
> backporting as little as possible (and agree with his criteria). My
> reasoning is that all changes can have unforeseen consequences, which I
> believe is something to be actively avoided in already released versions.
> The reason for backporting patches to fix regressions is the same as the
> reason to avoid backporting as much as possible: keep behavior consistent
> (and safe) within a release. With that as the goal of a branch in
> maintenance mode, it makes sense to fix regressions, and make exceptions to
> fix CVEs and other critical/blocking issues.
>
> As for who should decide what to backport, I lean toward Ben's view of the
> burden being on the committer. I don't think we should add more work for
> release managers, and I think the committer/shepherd obviously has the most
> understanding of the context around changes proposed for backport.
>
> Here's an example of a recent bugfix which I backported:
> https://reviews.apache.org/r/67587/ (for MESOS-3790)
>
> While normally I believe this change falls under "avoid due to unforeseen
> consequences," I made an exception as the bug was old, circa 2015,
> (indicating it had been an issue for others), and was causing recurring
> failures in testing. The fix itself was very small, meaning it was easier
> to evaluate for possible side effects, so I felt a little safer in that
> regard. The effect of not having the fix was a fatal and undesired crash,
> which furthermore left troublesome side effects on the system (you couldn't
> bring the agent back up). And lastly, a dependent project (DC/OS) wanted it
> in their next bump, which necessitated backporting to the release they were
> pulling in.
>
> I think in general we should backport only as necessary, and leave it on
> the committers to decide if backporting a particular change is necessary.
>
>
> On 07/13/2018 12:54 am, Alex Rukletsov wrote:
>
>> This is exactly where our views differ, Ben : )
>>
>> Ideally, I would like a release manager to have more ownership and less
>> manual work. In my imagination, a release manager has more power and
>> control over dates, features, backports and everything that is related to
>> "their" branch. I would also like us to back port as little as possible,
>> to simplify testing and releasing patch versions.
>>
>> On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler 
>> wrote:
>>
>>> +user, it would probably be good to hear from users as well.
>>>
>>> Please see the original proposal as well as Alex's proposal and let us
>>> know
>>> your thoughts.
>>>
>>> To continue the discussion from where Alex left off:
>>>
>>> > Other bugs and significant improvements, e.g., performance, may be back
>>> ported,
>>> the release manager should ideally be the one who decides on this.
>>>
>>> I'm a little puzzled by this, why is the release manager involved? As we
>>> already document, backports occur when the bug is fixed, so this happens
>>> in
>>> the steady state of development, not at release time. The release manager
>>> only comes in at the time of the release itself, at which point all
>>> backports have already happened and the release manager handles the
>>> release
>>> process. Only blocker level issues can stop the release and while the
>>> release manager has a strong say, we should generally agree on what
>>> constitutes a release blocking issue.
>>>
>>> Just to clarify my workflow, I generally backport every bug fix I commit
>>> that applies cleanly, right after I commit it to master (with the
>>> exceptions I listed below).
>>>
>>> On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov 
>>> wrote:
>>>
>>> > I would like to back port as litt

Re: Backport Policy

2018-07-13 Thread Alex Rukletsov
This is exactly where our views differ, Ben : )

Ideally, I would like a release manager to have more ownership and less
manual work. In my imagination, a release manager has more power and
control over dates, features, backports and everything that is related to
"their" branch. I would also like us to back port as little as possible, to
simplify testing and releasing patch versions.

On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler  wrote:

> +user, it would probably be good to hear from users as well.
>
> Please see the original proposal as well as Alex's proposal and let us know
> your thoughts.
>
> To continue the discussion from where Alex left off:
>
> > Other bugs and significant improvements, e.g., performance, may be back
> ported,
> the release manager should ideally be the one who decides on this.
>
> I'm a little puzzled by this, why is the release manager involved? As we
> already document, backports occur when the bug is fixed, so this happens in
> the steady state of development, not at release time. The release manager
> only comes in at the time of the release itself, at which point all
> backports have already happened and the release manager handles the release
> process. Only blocker level issues can stop the release and while the
> release manager has a strong say, we should generally agree on what
> constitutes a release blocking issue.
>
> Just to clarify my workflow, I generally backport every bug fix I commit
> that applies cleanly, right after I commit it to master (with the
> exceptions I listed below).
>
> On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov 
> wrote:
>
> > I would like to back port as little as possible. I suggest the following
> > criteria:
> >
> > * By default, regressions are back ported to existing release branches. A
> > bug is considered a regression if the functionality is present in the
> > previous minor or patch version and is not affected by the bug there.
> >
> > * Critical and blocker issues, e.g., a CVE, can be back ported.
> >
> > * Other bugs and significant improvements, e.g., performance, may be back
> > ported, the release manager should ideally be the one who decides on
> this.
> >
> > On Thu, Jul 12, 2018 at 12:25 AM, Vinod Kone 
> wrote:
> >
> > > Ben, thanks for the clarification. I'm in agreement with the points you
> > > made.
> > >
> > > Once we have consensus, would you mind updating the doc?
> > >
> > > On Wed, Jul 11, 2018 at 5:15 PM Benjamin Mahler 
> > > wrote:
> > >
> > > > I realized recently that we aren't all on the same page with
> > backporting.
> > > > We currently only document the following:
> > > >
> > > > "Typically the fix for an issue that is affecting supported releases
> > > lands
> > > > on the master branch and is then backported to the release
> branch(es).
> > In
> > > > rare cases, the fix might directly go into a release branch without
> > > landing
> > > > on master (e.g., fix / issue is not applicable to master)." [1]
> > > >
> > > > This leaves room for interpretation about what lies outside of
> > "typical".
> > > > Here's the simplest way I can explain what I stick to, and I'd like
> to
> > > hear
> > > > what others have in mind:
> > > >
> > > > * By default, bug fixes at any level should be backported to existing
> > > > release branches if they affect those releases. Especially important:
> > > > crashes, bugs in non-experimental features.
> > > >
> > > > * Exceptional cases that can omit backporting: difficult to backport
> > > fixes
> > > > (especially if the bugs are deemed of low priority), bugs in
> > experimental
> > > > features.
> > > >
> > > > * Exceptional non-bug cases that can be backported: performance
> > > > improvements.
> > > >
> > > > I realize that there is a ton of subtlety here (even in terms of
> which
> > > > things are defined as bugs). But I hope we can lay down a policy that
> > > gives
> > > > everyone the right mindset for common cases and then discuss corner
> > cases
> > > > on-demand in the future.
> > > >
> > > > [1] http://mesos.apache.org/documentation/latest/versioning/
> > > >
> > >
> >
>


Re: Backport Policy

2018-07-12 Thread Alex Rukletsov
I would like to back port as little as possible. I suggest the following
criteria:

* By default, regressions are back ported to existing release branches. A
bug is considered a regression if the functionality is present in the
previous minor or patch version and is not affected by the bug there.

* Critical and blocker issues, e.g., a CVE, can be back ported.

* Other bugs and significant improvements, e.g., performance, may be back
ported, the release manager should ideally be the one who decides on this.
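
To make the three criteria above concrete, here is a minimal C++ sketch of how a fix could be triaged. The `Fix`, `Severity` and `Decision` names are hypothetical and exist only for illustration; they do not correspond to anything in the Mesos codebase or in JIRA.

```cpp
// Hypothetical severity levels, loosely mirroring issue priorities.
enum class Severity { Minor, Major, Critical, Blocker };

// Hypothetical description of a fix that is a backport candidate.
struct Fix
{
  // True if the functionality worked in the previous minor or patch
  // version, i.e. the bug is a regression.
  bool isRegression;
  Severity severity;
};

enum class Decision
{
  BackportByDefault,     // Regressions.
  CanBackport,           // Critical and blocker issues, e.g. a CVE.
  ReleaseManagerDecides  // Other bugs and significant improvements.
};

// Applies the criteria in the order they are listed above.
Decision triage(const Fix& fix)
{
  if (fix.isRegression) {
    return Decision::BackportByDefault;
  }

  if (fix.severity == Severity::Critical ||
      fix.severity == Severity::Blocker) {
    return Decision::CanBackport;
  }

  return Decision::ReleaseManagerDecides;
}
```

For example, `triage(Fix{false, Severity::Major})` falls through to `Decision::ReleaseManagerDecides`, which is the case where I would like the release manager to have the final say.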

On Thu, Jul 12, 2018 at 12:25 AM, Vinod Kone  wrote:

> Ben, thanks for the clarification. I'm in agreement with the points you
> made.
>
> Once we have consensus, would you mind updating the doc?
>
> On Wed, Jul 11, 2018 at 5:15 PM Benjamin Mahler 
> wrote:
>
> > I realized recently that we aren't all on the same page with backporting.
> > We currently only document the following:
> >
> > "Typically the fix for an issue that is affecting supported releases
> lands
> > on the master branch and is then backported to the release branch(es). In
> > rare cases, the fix might directly go into a release branch without
> landing
> > on master (e.g., fix / issue is not applicable to master)." [1]
> >
> > This leaves room for interpretation about what lies outside of "typical".
> > Here's the simplest way I can explain what I stick to, and I'd like to
> hear
> > what others have in mind:
> >
> > * By default, bug fixes at any level should be backported to existing
> > release branches if they affect those releases. Especially important:
> > crashes, bugs in non-experimental features.
> >
> > * Exceptional cases that can omit backporting: difficult to backport
> fixes
> > (especially if the bugs are deemed of low priority), bugs in experimental
> > features.
> >
> > * Exceptional non-bug cases that can be backported: performance
> > improvements.
> >
> > I realize that there is a ton of subtlety here (even in terms of which
> > things are defined as bugs). But I hope we can lay down a policy that
> gives
> > everyone the right mindset for common cases and then discuss corner cases
> > on-demand in the future.
> >
> > [1] http://mesos.apache.org/documentation/latest/versioning/
> >
>


Re: Proposing change to the allocatable check in the allocator

2018-06-12 Thread Alex Rukletsov
Instead of a master flag, why not a master API call? This would allow
updating the value without restarting the master.

Another thought is that we should explain to operators how and when to use
this knob. For example, if they observe a behavioural pattern A, then it
means B is happening, and tuning the knob to C might help.

On Tue, Jun 12, 2018 at 7:36 AM, Jie Yu  wrote:

> I would suggest we also consider the possibility of adding per framework
> control on `min_allocatable_resources`.
>
> If we want to consider supporting per-framework setting, we should probably
> model this as a protobuf, rather than a free form JSON. The same protobuf
> can be reused for both master flag, framework API, or even supporting
> Resource Request in the future. Something like the following:
>
> // Field numbers below are illustrative only.
> message ResourceQuantityPredicate {
>   enum Type {
>     SCALAR_GE = 1;
>   }
>   optional Type type = 1;
>   optional Value.Scalar scalar = 2;
> }
>
> message ResourceRequirement {
>   required string resource_name = 1;
>   oneof predicates {
>     ResourceQuantityPredicate quantity = 2;
>   }
> }
>
> message ResourceRequirementList {
>   // All requirements MUST be met.
>   repeated ResourceRequirement requirements = 1;
> }
>
> // Resource request API.
> message Request {
>   repeated ResourceRequirementList accepted = 1;
> }
>
> // `allocatable()`
> message MinimalAllocatableResources {
>   repeated ResourceRequirementList accepted = 1;
> }
>
> On Mon, Jun 11, 2018 at 3:47 PM, Meng Zhu  wrote:
>
> > Hi:
> >
> > The allocatable check (allocator/mesos/hierarchical.cpp#L2471-L2479) in
> > the allocator (shown below) was originally introduced to help alleviate
> > the situation where a framework receives some resources, but no
> > cpu/memory, thus cannot launch a task.
> >
> >
> > constexpr double MIN_CPUS = 0.01;
> > constexpr Bytes MIN_MEM = Megabytes(32);
> >
> > bool HierarchicalAllocatorProcess::allocatable(
> >     const Resources& resources)
> > {
> >   Option<double> cpus = resources.cpus();
> >   Option<Bytes> mem = resources.mem();
> >
> >   return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
> >          (mem.isSome() && mem.get() >= MIN_MEM);
> > }
> >
> >
> > Issues
> >
> > However, there have been a couple of issues surfacing lately surrounding
> > the check.
> >
> > - MESOS-8935 Quota limit "chopping" can lead to cpu-only and
> >   memory-only offers.
> >
> > We introduced fine-grained quota allocation (MESOS-7099) in Mesos 1.5.
> > When we allocate resources to a role, we'll "chop" the available resources
> > of the agent up to the quota limit for the role. However, this has the
> > unintended consequence of creating cpu-only and memory-only offers, even
> > though there might be other agents with both cpu and memory resources
> > available in the cluster.
> >
> > - MESOS-8626 The 'allocatable' check in the allocator is problematic with
> >   multi-role frameworks.
> >
> > Consider roleA reserved cpu/memory on an agent and roleB reserved disk on
> > the same agent. A framework under both roleA and roleB will not be able to
> > get the reserved disk due to the allocatable check. With the introduction
> > of resource providers, similar situations will become more common.
> >
> > Proposed change
> >
> > Instead of hardcoding a one-size-fits-all value in Mesos, we are proposing
> > to add a new master flag, min_allocatable_resources. It specifies one or
> > more scalar resource quantities that define the minimum allocatable
> > resources for the allocator. The allocator will only offer resources that
> > contain at least one of the specified resource quantities. The default
> > behavior *is backward compatible*, i.e. by default the flag is set to
> > “cpus:0.01|mem:32”.
> >
> > Usage
> >
> > The flag takes either a simple text of resource(s) delimited by a bar
> > (|) or a JSON array of JSON-formatted resources. Note, the input should be
> > “pure” scalar quantities, i.e. the specified resource(s) should only have
> > the name, type (set to scalar) and scalar fields set.
> >
> > Examples:
> >
> > - To eliminate cpu-only or memory-only offers due to the quota chopping,
> >   we could set the flag to “cpus:0.01;mem:32”
> > - To enable disk-only offers, we could set the flag to “disk:32”
> > - For both, we could set the flag to “cpus:0.01;mem:32|disk:32”.
> >   Then the allocator will only offer resources that at least contain
> >   “cpus:0.01;mem:32” OR resources that at least contain “disk:32”.
> >
> >
> > Let me know what you think! Thanks!
> >
> >
> > -Meng
> >
> >
>


On filtering protobuf messages in the test harness

2018-05-18 Thread Alex Rukletsov
Folks, I was thinking how our test harness can be improved to allow for
simpler, more reliable tests (captured as MESOS-8922).

One thing, MESOS-8923, comes from an observation that sometimes an
expectation in a test is satisfied by a similar but actually irrelevant
message / call. Let me give some context here.

Logically, I see two scenarios.

1. A test does not care about messages / calls of the same type but with
different fields, nor about their ordering. All it cares about is that a
specific message is eventually observed, e.g.,
  EXPECT_CALL(
      *scheduler,
      update(_, AllOf(
          TaskStatusUpdateTaskIdEq(taskInfo1),
          TaskStatusUpdateStateEq(v1::TASK_STARTING))));

2. A test cares about messages of the same type, their ordering, possibly
unexpected messages, and so on. This is achieved by intercepting _all_
messages and _then_ checking expectations, e.g.
   EXPECT_CALL(sched, statusUpdate(, _))
  .WillOnce(FutureArg<1>())
  .WillOnce(FutureArg<1>())
  .WillOnce(FutureArg<1>());

Note: Difference between these scenarios is vague. Ideally, 2. is expressed
in terms of 1., but this can be too verbose [1] or not supported [2].

While we can express filtering, order, etc. when using gtest macros like
EXPECT_CALL, we are much more limited with our own macros, like
FUTURE_PROTOBUF, DROP_PROTOBUF, FUTURE_MESSAGE and the like.

I have a POC [3] to address a part of this problem with an application
example [4]. Another application besides [2] will be some
StorageLocalResourceProviderTests that have to intercept a non-interesting
message:
  Future<UpdateSlaveMessage> updateSlave2 =
    FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);
  Future<UpdateSlaveMessage> updateSlave1 =
    FUTURE_PROTOBUF(UpdateSlaveMessage(), _, _);

  AWAIT_READY(updateSlave1);
  AWAIT_READY(updateSlave2);
  ASSERT_TRUE(updateSlave2->has_resource_providers());
  ASSERT_EQ(1, updateSlave2->resource_providers().providers_size());

The POC can be extended to other FUTURE_* and DROP_* macros.
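
To illustrate the idea without pulling in libprocess or gmock, here is a minimal standalone C++ sketch of a capture that is satisfied only by a message of the right type that also passes a matcher. It is not the POC code from [3]; `Message` and `MatchedCapture` are hypothetical stand-ins for a protobuf and for a filtered FUTURE_PROTOBUF.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a protobuf such as `UpdateSlaveMessage`.
struct Message
{
  std::string type;
  bool hasResourceProviders;
};

// Captures the first message that has the expected type AND satisfies
// the matcher. Similar but irrelevant messages are ignored instead of
// satisfying the expectation by accident.
class MatchedCapture
{
public:
  MatchedCapture(
      std::string type,
      std::function<bool(const Message&)> matcher)
    : type_(std::move(type)), matcher_(std::move(matcher)) {}

  // Returns true if this particular message satisfied the expectation.
  bool observe(const Message& message)
  {
    if (captured_ || message.type != type_ || !matcher_(message)) {
      return false;
    }

    captured_ = true;
    value_ = message;
    return true;
  }

  bool ready() const { return captured_; }

  // Only meaningful once `ready()` returns true.
  const Message& get() const { return value_; }

private:
  std::string type_;
  std::function<bool(const Message&)> matcher_;
  bool captured_ = false;
  Message value_{};
};
```

With such a capture, the test above would not need `updateSlave1` at all: a single capture with a matcher on `hasResourceProviders` skips the non-interesting first message and becomes ready on the second one.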

Do folks think it is a useful addition? Are there more cases where we can
benefit from filtering messages / protobufs?

[1]
https://github.com/apache/mesos/blob/c020a130afd55dd3f5702a23b13f8234e0ace391/src/tests/default_executor_tests.cpp#L288-L320
[2]
https://github.com/apache/mesos/blob/c020a130afd55dd3f5702a23b13f8234e0ace391/src/tests/master_slave_reconciliation_tests.cpp#L417-L428
[3] https://github.com/rukletsov/mesos/commits/alexr/matched-protobuf
[4]
https://github.com/rukletsov/mesos/commit/9883222ba9b6eaaeb04ed4d606aaf5013d858c7b

Alex


Re: Update the *Minimum Linux Kernel version* supported on Mesos

2018-04-08 Thread Alex Rukletsov
This does not seem like a disruptive change to me, so I'm +1.

On Thu, Apr 5, 2018 at 6:36 PM, Jie Yu  wrote:

> User namespaces require >= 3.12 (November 2013). Can we make that the
>> minimum?
>
>
> No, we need to support CentOS7 which uses 3.10 (some variant)
>
> - Jie
>
> On Thu, Apr 5, 2018 at 8:56 AM, James Peach  wrote:
>
>>
>>
>> > On Apr 5, 2018, at 5:00 AM, Andrei Budnik 
>> wrote:
>> >
>> > Hi All,
>> >
>> > We would like to update minimum supported Linux kernel from 2.6.23 to
>> > 2.6.28.
>> > Linux kernel supports cgroups v1 starting from 2.6.24, but `freezer`
>> cgroup
>> > functionality was merged into 2.6.28, which supports nested containers.
>>
>> User namespaces require >= 3.12 (November 2013). Can we make that the
>> minimum?
>>
>> J
>
>
>


Re: Release policy and 1.6 release schedule

2018-03-26 Thread Alex Rukletsov
I would like us to do monthly releases and support 10 branches at a time.
Ideally, releasing that often reduces the burden for the release manager,
because there are fewer changes and fewer new features. However, we lack
automation to support this pace: our release guide [1] is several pages
long and includes quite a few non-trivial steps. It would be great to find
some time (maybe during the next Mesos hackathon?) and revisit our release
procedures, but until then I'm +1 for quarterly.

[1] https://mesos.apache.org/documentation/latest/release-guide/
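
The trade-off between cadence and maintenance burden is simple arithmetic, sketched below in C++. The `branchesToMaintain` helper is hypothetical, and it assumes every minor release is supported for the same fixed window.

```cpp
#include <cassert>

// Number of release branches alive at the same time if every minor
// release is supported for `supportMonths` and a new one is cut every
// `releaseIntervalMonths`. Rounds up, since a partially elapsed
// interval still means one more branch to maintain.
int branchesToMaintain(int supportMonths, int releaseIntervalMonths)
{
  assert(supportMonths >= 0 && releaseIntervalMonths > 0);
  return (supportMonths + releaseIntervalMonths - 1) / releaseIntervalMonths;
}
```

With a 9 month support window, `branchesToMaintain(9, 1)` gives 9 branches for monthly releases, while `branchesToMaintain(9, 3)` gives 3 for quarterly ones, which is in line with the numbers in Jie's message below.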

On Sat, Mar 24, 2018 at 5:48 AM, Vinod Kone  wrote:

> I’m +1 for quarterly.
>
> Most importantly I want us to adhere to a predictable cadence.
>
> Sent from my phone
>
> On Mar 23, 2018, at 9:21 PM, Jie Yu  wrote:
>
> It's a burden for supporting multiple releases.
>
> 1.2 was released March, 2017 (1 year ago), and I know that some users are
> still on that version
> 1.3 was released June, 2017 (9 months ago), and we're still maintaining it
> (we were still backporting patches several days ago, which some users asked
> for)
> 1.4 was released Sept, 2017 (6 months ago).
> 1.5 was released Feb, 2018 (1 month ago).
>
> As you can see, users expect a release to be supported 6-9 months (e.g.,
> backports are still needed for 1.3 release, which is 9 months old). If we
> were to do monthly minor release, we'll probably need to maintain 6-9
> release branches? That's too much of an ask for committers and maintainers.
>
> I also agree with folks that there're benefits doing releases more
> frequently. Given the historical data, I'd suggest we do quarterly
> releases, and maintain three release branches.
>
> - Jie
>
> On Fri, Mar 23, 2018 at 10:03 AM, Greg Mann  wrote:
>
>> The best motivation I can think of for a shorter release cycle is this: if
>> the release cadence is fast enough, then developers will be less likely to
>> rush a feature into a release. I think this would be a real benefit, since
>> rushing features in hurts stability. *However*, I'm not sure if every two
>> months is fast enough to bring this benefit. I would imagine that a
>> two-month wait is still long enough that people wouldn't want to wait an
>> entire release cycle to land their feature. Just off the top of my head, I
>> might guess that a release cadence of 1 month or shorter would be often
>> enough that it would always seem reasonable for a developer to wait until
>> the next release to land a feature. What do y'all think?
>>
>> Other motivating factors that have been raised are:
>> 1) Many users upgrade on a longer timescale than every ~2 months. I think
>> that this doesn't need to affect our decision regarding release timing -
>> since we guarantee compatibility of all releases with the same major
>> version number, there is no reason that a user needs to upgrade minor
>> releases one at a time. It's fine to go from 1.N to 1.(N+3), for example.
>> 2) Backporting will be a burden if releases are too short. I think that in
>> practice, backporting will not take too much longer. If there was a
>> conflict back in the tree somewhere, then it's likely that after resolving
>> that conflict once, the same diff can be used to backport the change to
>> previous releases as well.
>> 3) Adhering strictly to a time-based release schedule will help users plan
>> their deployments, since they'll be able to rely on features being
>> released
>> on-schedule. However, if we do strict time-based releases, then it will be
>> less certain that a particular feature will land in a particular release,
>> and users may have to wait a release cycle to get the feature.
>>
>> Personally, I find the idea of preventing features from being rushed into
>> a release very compelling. From that perspective, I would love to see
>> releases every month. However, if we're not going to release that often,
>> then I think it does make sense to adjust our release schedule to
>> accommodate the features that community members want to land in a
>> particular release.
>>
>>
>> Jie, I'm curious why you suggest a *minimal* interval between releases.
>> Could you elaborate a bit on your motivations there?
>>
>> Cheers,
>> Greg
>>
>>
>> On Fri, Mar 16, 2018 at 2:01 PM, Jie Yu  wrote:
>>
>> > Thanks Greg for starting this thread!
>> >
>> >
>> >> My primary motivation here is to bring our documented policy in line
>> >> with our practice, whatever that may be
>> >
>> >
>> > +100
>> >
>> > Do people think that we should attempt to bring our release cadence more
>> >> in line with our current stated policy, or should the policy be changed
>> >> to reflect our current practice?
>> >
>> >
>> > I think a minor release every 2 months is probably too aggressive. I
>> don't
>> > have concrete data, but my feeling is that the frequency that folks
>> upgrade
>> > Mesos is low. I know 

Re: On disabled tests

2018-03-26 Thread Alex Rukletsov
Okay, Windows gets a special treatment : ). The aforementioned suggestion does
not apply to `TEST_F_TEMP_DISABLED_ON_WINDOWS`.

On Mon, Mar 26, 2018 at 3:21 AM, Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> Beware a large number of tickets from the Windows side... ;)
>
>
> On 03/22/2018 12:22 am, Alex Rukletsov wrote:
>
>> I think such policy would help us discover and act on forgotten disabled
>> tests. The reason I am reluctant to propose this as an official policy is
>> because I don't know how to enforce it.
>>
>> On 21 Mar 2018 6:00 pm, "Vinod Kone" <vinodk...@gmail.com> wrote:
>>
>> Thanks for doing this Alex! Are you proposing a policy that every disabled
>>> test should’ve an associated ticket that is linked in the comment above
>>> the
>>> test? I’m all for it.
>>>
>>> Sent from my phone
>>>
>>> > On Mar 21, 2018, at 9:42 AM, Alex Rukletsov <a...@mesosphere.io>
>>> wrote:
>>> >
>>> > Folks,
>>> >
>>> > to increase visibility into disabled tests, I've added a
>>> "disabled-test"
>>> > label. Whenever you disable a test, please add this label. A TODO
>>> comment
>>> > before the test mentioning the corresponding jira helps too.
>>> >
>>> > At the moment we have 20+ disabled tests in 18 tickets [1]. Some tests
>>> were
>>> > disabled for a "brief period of time" before the release and stayed in
>>> that
>>> > state for years. It would be great to audit all of them and either fix
>>> and
>>> > re-enable or remove altogether. Any help is appreciated and volunteers
>>> are
>>> > sought!
>>> >
>>> > [1] https://issues.apache.org/jira/issues/?filter=12343497
>>>
>>>


Re: On disabled tests

2018-03-22 Thread Alex Rukletsov
I think such a policy would help us discover and act on forgotten disabled
tests. The reason I am reluctant to propose this as an official policy is
that I don't know how to enforce it.

On 21 Mar 2018 6:00 pm, "Vinod Kone" <vinodk...@gmail.com> wrote:

> Thanks for doing this Alex! Are you proposing a policy that every disabled
> test should’ve an associated ticket that is linked in the comment above the
> test? I’m all for it.
>
> Sent from my phone
>
> > On Mar 21, 2018, at 9:42 AM, Alex Rukletsov <a...@mesosphere.io> wrote:
> >
> > Folks,
> >
> > to increase visibility into disabled tests, I've added a "disabled-test"
> > label. Whenever you disable a test, please add this label. A TODO comment
> > before the test mentioning the corresponding jira helps too.
> >
> > At the moment we have 20+ disabled tests in 18 tickets [1]. Some tests
> were
> > disabled for a "brief period of time" before the release and stayed in
> that
> > state for years. It would be great to audit all of them and either fix
> and
> > re-enable or remove altogether. Any help is appreciated and volunteers
> are
> > sought!
> >
> > [1] https://issues.apache.org/jira/issues/?filter=12343497
>


On disabled tests

2018-03-21 Thread Alex Rukletsov
Folks,

to increase visibility into disabled tests, I've added a "disabled-test"
label. Whenever you disable a test, please add this label. A TODO comment
before the test mentioning the corresponding JIRA ticket helps too.

At the moment we have 20+ disabled tests in 18 tickets [1]. Some tests were
disabled for a "brief period of time" before the release and stayed in that
state for years. It would be great to audit all of them and either fix and
re-enable or remove altogether. Any help is appreciated and volunteers are
sought!

[1] https://issues.apache.org/jira/issues/?filter=12343497


Re: Reconsidering `allocatable` check in the allocator

2018-03-07 Thread Alex Rukletsov
If we are about to offer some of the resources from a particular agent, why
would we filter anything at all? I doubt we should be concerned about the
size of the offer representation travelling through the network. If
available resources are "cpus:0.001,gpus:1" and we want to allocate GPU,
what is the benefit of filtering CPU?

What about the following:
allocatable(R)
{
  return true iff (there exists r in R for which size(r) > MIN(type(r)))
}
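
To make the rule above concrete, here is a minimal C++ sketch. It is not the allocator's code: `Resources` is a plain name-to-amount map standing in for the real class, `minimumFor` is a hypothetical helper, and the minimum values are illustrative assumptions. Resource types without a configured minimum are assumed to have a minimum of 0, so any positive amount of, say, GPUs makes the whole set allocatable.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the real Resources class:
// one scalar amount per resource name.
using Resources = std::map<std::string, double>;

// Assumed per-type minimums; the values are illustrative only.
// Types with no entry fall back to a minimum of 0.
inline double minimumFor(const std::string& type)
{
  static const std::map<std::string, double> minimums = {
    {"cpus", 0.01},
    {"mem", 32.0},
  };

  auto it = minimums.find(type);
  return it == minimums.end() ? 0.0 : it->second;
}

// Returns true iff at least one resource exceeds the minimum for its type.
// Nothing is filtered out of the input; the whole set is either
// offered or skipped.
inline bool allocatable(const Resources& resources)
{
  for (const auto& entry : resources) {
    if (entry.second > minimumFor(entry.first)) {
      return true;
    }
  }

  return false;
}
```

With these assumed minimums, "cpus:0.001,gpus:1" is allocatable as a whole (including the tiny CPU share), while "cpus:0.001,mem:1" is not.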

On Wed, Mar 7, 2018 at 2:41 AM, Qian Zhang  wrote:

> So if the input resources are "cpus:0.001,disk:1024", the `allocatable`
> method will return "disk:1024"? This seems not compatible with the existing
> behavior: with the current implementation of `allocatable`, the same input
> resources will be just skipped because we think "cpus:0.001" is too small
> for framework to launch a task.
>
> allocatable = input
> > foreach known resource type t: do
> >   r = resources of type t from the input
> >   if r is less than the min resource of type t; then
> > allocatable -= r
> >   fi
> > done
> > return allocatable
> >
>
> Are we going to define min amount for each known resource type (including
> disk and gpu)?
>
>
> Regards,
> Qian Zhang
>
> On Wed, Mar 7, 2018 at 6:10 AM, Jie Yu  wrote:
>
> > Chatted with BenM offline on this. There's another option that both of us
> > agreed is probably better than any of the ones mentioned above.
> >
> > The idea is to make `allocatable` return the portion of the input resources
> > that is allocatable, and strip the unallocatable portion.
> >
> > For example:
> > 1) If the input resources are "cpus:0.001,gpus:1", the `allocatable`
> > method will return "gpus:1".
> > 2) If the input resources are "cpus:1,mem:1", the `allocatable` method
> > will return "cpus:1".
> > 3) If the input resources are "cpus:0.001,mem:1", the `allocatable`
> > method will return an empty Resources object.
> >
> > Basically, the algorithm is like the following:
> >
> > allocatable = input
> > foreach known resource type t: do
> >   r = resources of type t from the input
> >   if r is less than the min resource of type t; then
> > allocatable -= r
> >   fi
> > done
> > return allocatable
> >
> > Let me know what do you guys think!
> >
> > Thanks!
> > - Jie
> >
> > On Fri, Mar 2, 2018 at 4:44 PM, Benjamin Mahler 
> > wrote:
> >
> > > I think (2) would need to be:
> > >
> > > bool HierarchicalAllocatorProcess::allocatable(
> > > const Resources& resources)
> > > {
> > >   if (something outside {cpu, mem, disk} is present) return true
> > >   else return true iff at least one of {cpu, mem, disk} has >=
> {MIN_CPU,
> > > MIN_MEM, MIN_DISK}
> > > }
> > >
> > > Otherwise, 1 GPU would be offered but 1GPU + 0.001 CPU would not?
> > >
> > > On Fri, Mar 2, 2018 at 9:27 AM, Jie Yu  wrote:
> > >
> > > > Hi,
> > > >
> > > > > The allocatable
> > > > > <...ator/mesos/hierarchical.cpp#L2471-L2479>
> > > > > check in the allocator (shown below) was originally introduced to help
> > > > > alleviate the situation where a framework receives some resources, but
> > > > > no cpu/memory, thus cannot launch a task.
> > > >
> > > > > bool HierarchicalAllocatorProcess::allocatable(
> > > > >     const Resources& resources)
> > > > > {
> > > > >   Option<double> cpus = resources.cpus();
> > > > >   Option<Bytes> mem = resources.mem();
> > > > >
> > > > >   return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
> > > > >          (mem.isSome() && mem.get() >= MIN_MEM);
> > > > > }
> > > >
> > > > > As pointed out by Benjamin in MESOS-7398, it now seems to mainly
> > > > > help to minimize the performance overhead from too many small offers
> > > > > (instead, too small resource amounts are kept out of the offer pool
> > > > > until they have accumulated into larger resources).
> > > >
> > > > > This check does cause issues when new resource types are introduced.
> > > > > For instance, this check does prevent GPU resources alone from being
> > > > > allocated to a framework. There are some other issues we discovered in
> > > > > MESOS-8626.
> > > >
> > > > There are several proposals:
> > > >
> > > > > (1) *Completely remove this check*. This check is a heuristic anyway,
> > > > > and only applies to a subset of resources (cpu/memory). However, there
> > > > > might be some implication of that change since it's also leveraged to
> > > > > prevent too many small offers. *If you are concerned about this
> > > > > approach, please raise your voice.*
> > > >
> > > > > (2) *Consider adjusting the check to the following.*
> > > > >
> > > > > bool HierarchicalAllocatorProcess::allocatable(
> > > > >     const Resources& resources)
> > > > > {
> > > > >   Option<double> cpus = resources.cpus();
> > > > >   Option<Bytes> mem = resources.mem();
> > > >
> > > >   if 

Re: Soliciting Hackathon Ideas

2018-02-12 Thread Alex Rukletsov
Judith —

we have newbie and newbie++ labels [1]. To help people land their changes
at the end of a hackathon, we should find shepherds for issues before
giving them out to folks. Shepherds should have time for reviews and an
idea about the approach.

[1]
https://issues.apache.org/jira/browse/MESOS-8338?jql=labels%20in%20(newbie%2C%20%22newbie%2B%2B%22)%20AND%20project%20%3D%20MESOS%20AND%20status%20!%3D%20Resolved%20

On Fri, Feb 9, 2018 at 10:07 PM, Judith Malnick 
wrote:

> Hi all, these are great! Are they currently captured in Jira tickets or
> issues of some kind? If we had a beginner label we might be able to
> advertise those issues in other hackathons like Hacktober too :) I'd be
> happy to create tickets for the issues but I don't want to accidentally
> create a ton of duplicates if they already exist.
>
> On Wed, Feb 7, 2018 at 6:31 PM, Andrew Schwartzmeyer <
> and...@schwartzmeyer.com> wrote:
>
> > Thanks all for the ideas! (And keep them coming if you have more, it's
> not
> > for another couple weeks.) I'll make sure to put together a list and run
> it
> > by a few of you before I fly out.
> >
> >
> > On 02/07/2018 3:22 pm, Benjamin Mahler wrote:
> >
> >> -list to bcc
> >>
> >> Hey Tim! Sorry that this fell through the cracks, Vinod and I can
> shepherd
> >> this.
> >>
> >> What time zone are you in? We can set up a hangout to go over it.
> >>
> >> Ben
> >>
> >> On Wed, Feb 7, 2018 at 8:13 AM, Timothy Anderegg <
> >> timothy.ander...@gmail.com
> >>
> >>> wrote:
> >>>
> >>
> >> I've been looking for a new shepherd for that for a while, if there are
> >>> any
> >>> takers I'm happy to rebase against the latest code!
> >>>
> >>> Tim
> >>>
> >>> On Wed, Feb 7, 2018 at 11:10 AM James Peach  wrote:
> >>>
> >>> >
> >>> >
> >>> > > On Feb 6, 2018, at 11:21 PM, Benjamin Mahler 
> >>> wrote:
> >>> > >
> >>> > > +1 Versioned documentation would be heroic!
> >>> >
> >>> > Based on https://reviews.apache.org/r/52064/ ?
> >>> >
> >>> > >
> >>> > > On Tue, Feb 6, 2018 at 5:49 PM Vinod Kone 
> >>> wrote:
> >>> > >
> >>> > >> Versioned documentation!
> >>> > >>
> >>> > >> Sent from my iPhone
> >>> > >>
> >>> > >>> On Feb 6, 2018, at 4:37 PM, Benjamin Mahler 
> >>> > wrote:
> >>> > >>>
> >>> > >>> A couple of ideas from the performance related working group:
> >>> > >>>
> >>> > >>> -Use protobuf arenas for all non-trivial outbound master messages
> >>> > (easy)
> >>> > >>> This can be done piecemeal.
> >>> > >>> -Use move semantics (take a Message&&) in all of the master
> message
> >>> > >>> handlers to reduce copying (medium) This one can be done
> piecemeal.
> >>> For
> >>> > >>> example Master::statusUpdate would be a good one to start with.
> >>> > >>> -Audit the Registrar code to use move semantics to reduce copying
> >>> > >> (medium)
> >>> > >>>
> >>> > >>> If there are any UI programmers:
> >>> > >>>
> >>> > >>> -Consider a webui "refresh", try to find a new set of fonts and
> >>> style,
> >>> > >>> could be fun.
> >>> > >>>
> >>> > >>> On Fri, Feb 2, 2018 at 12:47 PM, Andrew Schwartzmeyer <
> >>> > >>> and...@schwartzmeyer.com> wrote:
> >>> > >>>
> >>> >  Hello all,
> >>> > 
> >>> >  Next month I'll be attending HackIllinois (
> >>> https://hackillinois.org/)
> >>> > >> as
> >>> >  an open-source mentor. It's a huge student-run hackathon at the
> >>> > >> University
> >>> >  of Illinois at Urbana-Champaign, running from February 23rd to
> the
> >>> > 25th.
> >>> >  Students from a multitude of schools will be attending (they
> even
> >>> bus
> >>> > >> them
> >>> >  in). The hackathon has an open-source focus, and while there
> will
> >>> be
> >>> > >> many
> >>> >  projects for the students to work on, I want to make sure Mesos
> >>> gets
> >>> > >> some
> >>> >  attention too.
> >>> > 
> >>> >  I am asking you all for open issues and new ideas for small,
> >>> >  beginner-friendly projects that could fit a two-day Hackathon
> >>> project.
> >>> > >> For
> >>> >  Mesos, I'm looking through our open issues labeled "easyfix",
> >>> > >> "beginner",
> >>> >  or "newbie", which actually returns 74 results! If you have
> >>> anything
> >>> > in
> >>> >  particular that you think would be a good fit, please let me
> know.
> >>> I'd
> >>> > >> like
> >>> >  to go with a list of vetted issues so I don't accidentally start
> >>> some
> >>> >  students in on a giant can of worms. Our excellent new Beginner
> >>> > >> Contributor
> >>> >  Guide will be a huge help too.
> >>> > 
> >>> >  Thanks,
> >>> > 
> >>> >  Andy
> >>> > 
> >>> >  P.S. If any of you also want to attend, let me know, and I'll
> get
> >>> you
> >>> > in
> >>> >  touch with their director.
> >>> > 
> >>> > >>
> >>> >
> >>> >
> >>>
> >>>
>
>
> --
> Judith Malnick
> Community Manager
> 310-709-1517 <(310)%20709-1517>
>


Re: [VOTE] C++14 Upgrade

2018-02-12 Thread Alex Rukletsov
+1 (binding)

The Mesos codebase seems to be ready for the upgrade (tested on Mesosphere's
internal CI). I think the beginning of 2018 is the right time for this.

In addition to the technical reasons mentioned by MPark, I'd add one more:
modernising the codebase fosters learning and fun, and makes it a more
attractive project to contribute to.

A.

On 12 Feb 2018 9:41 am, "Michael Park"  wrote:

On Sun, Feb 11, 2018 at 6:00 PM James Peach  wrote:

>
>
> > On Feb 9, 2018, at 9:28 PM, Michael Park  wrote:
> >
> > I'm going to put this up for a vote. My plan is to bump us to C++14 on
> Feb
> > 21.
> >
> > The following are the proposed changes:
> >  - Minimum GCC *4.8.1* => *5*.
> >  - Minimum Clang *3.5* => *3.6*.
> >  - Minimum Apple Clang *8* => *9*.
> >
> > We'll have a standard voting, at least 3 binding votes, and no -1s.
>
> +0
>
> What’s the user benefit of this change?
>

Some of the features I've described in MESOS-7949 are:

   - Generic lambdas
   - New lambda captures (Proper move captures!)
   - SFINAE result_of (We can remove stout/result_of.hpp)
   - Variable templates
   - Relaxed constexpr functions
   - Simple utilities such as std::make_unique
   - Metaprogramming facilities such as decay_t, index_sequence
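
For readers less familiar with these features, here is a small self-contained sketch that exercises several of them. It is illustrative only and not taken from the Mesos codebase; the names `isIntegral`, `factorial`, and `demo` are made up for the example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Variable template (hypothetical name).
template <typename T>
constexpr bool isIntegral = std::is_integral<T>::value;

static_assert(isIntegral<int>, "int is an integral type");

// Metaprogramming alias added in C++14.
static_assert(
    std::is_same<std::decay_t<const int&>, int>::value,
    "decay_t strips the reference and the const qualifier");

// Relaxed constexpr: loops and local mutation are allowed.
constexpr int factorial(int n)
{
  int result = 1;
  for (int i = 2; i <= n; ++i) {
    result *= i;
  }
  return result;
}

static_assert(factorial(5) == 120, "evaluated at compile time");

inline std::string demo()
{
  // Generic lambda: works for any type that supports operator+.
  auto twice = [](auto x) { return x + x; };

  // std::make_unique plus an init capture that moves the pointer
  // into the closure (a proper move capture).
  auto owned = std::make_unique<std::string>("mesos");
  auto length = [s = std::move(owned)]() { return s->size(); };

  return twice(std::string("ab")) +
         std::to_string(twice(21)) +
         std::to_string(length());
}
```

Here `demo()` concatenates "abab", "42", and "5", which touches the generic lambda, the move capture, and `std::make_unique` in one place.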

J


Re: Soliciting Hackathon Ideas

2018-02-06 Thread Alex Rukletsov
Andrew, here is my selection based on implementation difficulty, impact,
and code locality.

https://issues.apache.org/jira/browse/MESOS-5824 — augmenting string
representation of Resource for debuggability. Coding is trivial, but
requires finding a compromise between conciseness and clarity.

https://issues.apache.org/jira/browse/MESOS-7606 — optimization in the
allocator. Not a trivial fix, but touches allocation: fun!

https://issues.apache.org/jira/browse/MESOS-7191 — a very impactful change,
though there is no agreement on how to proceed here (hardcoded number?
configurable at master start? per request?). Requires some discussion
before it can be given out at a Hackathon.

https://issues.apache.org/jira/browse/MESOS-8329 — libprocess HTTP
processing fix. Likely an easy fix, may be combined with some refactoring.
https://issues.apache.org/jira/browse/MESOS-7773 — refactoring of HTTP
pipeline in libprocess. Might be too much for a hackathon, but mentioning
it here since it is related to the ticket above.

https://issues.apache.org/jira/browse/MESOS-7241 — reconciling status /
exit codes on Linux and Windows. Not a fast and easy one, but it is your area,
which will definitely help in achieving the result.

Alex.



On Fri, Feb 2, 2018 at 9:48 PM, Bruce Campbell <
bruce.campb...@microsoft.com.invalid> wrote:

> Mesos modules for windows?
>
> -Original Message-
> From: Andrew Schwartzmeyer [mailto:and...@schwartzmeyer.com]
> Sent: Friday, February 2, 2018 12:47 PM
> To: dev@mesos.apache.org
> Subject: Soliciting Hackathon Ideas
>
> Hello all,
>
> Next month I'll be attending HackIllinois (https://hackillinois.org/) as
> an open-source mentor. It's a huge
> student-run hackathon at the University of Illinois at Urbana-Champaign,
> running from February 23rd to the 25th. Students from a multitude of
> schools will be attending (they even bus them in). The hackathon has an
> open-source focus, and while there will be many projects for the students
> to work on, I want to make sure Mesos gets some attention too.
>
> I am asking you all for open issues and new ideas for small,
> beginner-friendly projects that could fit a two-day Hackathon project.
> For Mesos, I'm looking through our open issues labeled "easyfix",
> "beginner", or "newbie", which actually returns 74 results! If you have
> anything in particular that you think would be a good fit, please let me
> know. I'd like to go with a list of vetted issues so I don't accidentally
> start some students in on a giant can of worms. Our excellent new Beginner
> Contributor Guide will be a huge help too.
>
> Thanks,
>
> Andy
>
> P.S. If any of you also want to attend, let me know, and I'll get you in
> touch with their director.
>


Re: Flaky executor tests on ARM

2017-11-23 Thread Alex Rukletsov
Might be libtool wrappers. Have a look at [1] and commits [2, 3, 4].

[1] https://issues.apache.org/jira/browse/MESOS-7500
[2]
https://github.com/apache/mesos/commit/d863620e5cb82b7f22cade0da0a0d18afbdf9136
[3]
https://github.com/apache/mesos/commit/74121798f24fca372180b8c4bc00b4df07d46240
[4]
https://github.com/apache/mesos/commit/cd516ab65b03045dba7f1cbfd40e72a1d5267539

On Thu, Nov 23, 2017 at 11:39 AM, Tomek Janiszewski 
wrote:

> Hi
>
> I found following 5 tests are flaky. They fail when run together but pass
> alone.
>
> CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
> DefaultExecutorCheckTest.CommandCheckDeliveredAndReconciled
> DefaultExecutorCheckTest.CommandCheckStatusChange
> DefaultExecutorCheckTest.CommandCheckSeesParentsEnv
> DefaultExecutorCheckTest.CommandCheckSharesWorkDirWithTask
>
> Before I start debugging it, have you got any ideas what could be a
> problem?
>
> Thanks
> Tomek
>


Re: DC/OS (Mesos) portability

2017-11-21 Thread Alex Rukletsov
I think Tomas means Mesos dependencies, like libcurl, and not libmesos. If
I understand him correctly, he is saying that some of the Mesos dependencies
are not distributed with the Mesos binaries and, if not included in a
distribution, might complicate the installation process.

On Fri, Nov 3, 2017 at 8:54 PM, Joseph Wu  wrote:

> It isn't clear to me how DC/OS would benefit from (ongoing) work to
> create/push Mesos packages.  DC/OS downloads and builds all of its
> component parts from source.
>
> Also, we (Mesos devs) are hoping to get more frameworks to move away from
> using libmesos (including the API shims), in favor of using the HTTP APIs
> instead.  So we have a dis-incentive to provide a libmesos bundle.
>
> On Fri, Nov 3, 2017 at 8:23 AM, Tomas Barton 
> wrote:
>
> > Hi,
> >
> > I'd like to contribute to DC/OS with a Debian/Suse/... support.
> > Surprisingly on Debian most of the compatibility issues could be solved
> by
> > a sequence of symlinks.
> >
> > Why Mesos dev list? :)
> >
> > Currently the biggest issue is connected to distributing libmesos-bundle
> > tar archive, which contain the libmesos.so library and several others.
> The
> > library is dynamically linked with certain libcurl,  libssl, libsvn etc.
> > that might differ between distributions.
> >
> > I can think of a few solutions:
> >  1. Compile Mesos (master and agent) using static build (which as I
> > understood aren't currently fully supported/propagated).
> >  2. Generate bundle during automatic builds for certain supported
> > distributions.
> >  3. Include libmesos in standard distribution channels - rpm, deb
> packages
> > (that might take same time).
> >
> > The last solution would be the best, but Mesos release cycle is very
> > different from distributions release cycle. It might be complicated to
> > synchronize.
> >
> > I couldn't find scripts for generating libmesos-bundle, but it's an archive
> > with libraries from the build server, e.g.
> > https://downloads.mesosphere.io/libmesos-bundle/libmesos-bundle-1.10-1.4-63e0814.tar.gz
> > (32MB).
> >
> > So the question is, whether Mesos website could provide prebuild libmesos
> > bundle for each release and platform, that could be afterwards used e.g.
> in
> > DC/OS packages?
> >
> > Last issue might be connected to an executor that eventually might need
> OS
> > family ENV variable with OS release version, so that it can fetch
> > corresponding libbundle archive. Such information is typically parsed
> from
> > `uname -a` or `lsb_release -sri` (if available). This way DC/OS could be
> > running on a cluster with diverse OS versions/distributions.
> >
> > Thanks for your time! I'd like to hear your opinion.
> >
> > Regards,
> > Tomas Barton
> >
>


Re: Mesos schedulers

2017-11-21 Thread Alex Rukletsov
What do you mean by "the regular mesos scheduler"?

On Tue, Nov 21, 2017 at 6:44 AM, Trevor Powell 
wrote:

> YoYo what up!
>
> We have been running into resource fragmentation across our clusters. We
> have several small tasks and several large tasks.  And sometimes, the small
> tasks take away enough resources from a node, that a large task can not be
> placed there.  I believe bin packing should do the trick.
>
> Is the regular mesos scheduler modifiable to support this? Or are there
> other scheduler options?
>
> I know Netflix’s Fenzo does this. Others?
>
>
>
> Thanks gang.
>
>
>
> —
>
> *Trevor Alexander Powell*
>
> Product Owner, Release+Platform Engineering
>
> 7575 Gateway Blvd. Newark, CA 94560
> 
>
> M: +1.650.325.7467 <(650)%20325-7467>
>
>
>
> https://github.com/tpowell-rms
>
> https://www.linkedin.com/in/trevorapowell
>
> http://www.rms.com
>


Re: mesos health checks

2017-11-03 Thread Alex Rukletsov
+ dev list for visibility and history.

Okay, let's dig into this a little bit : ).

First, it is true that Marathon and Mesos HTTP health checks are not
equivalent. It's not just 1xx status codes; for example, you can't have
multiple Mesos health checks. I don't understand why you say that the operator
should know that failed is an expected response. It is not! Health checks
do not have a concept of "not ready yet"; the grace period serves this purpose.
The health check has failed because the contract has been violated: 111 is
considered a failure. If you think that 1xx codes should be treated as
success, let's have this discussion separately, probably on the dev list
(btw, k8s does the same
<https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request>
).

Second, are you sure about the status code in the second case? The error
does not say anything about an empty body, only about an empty reply. From what
I can see
<https://stackoverflow.com/questions/41290792/my-curl-post-gets-empty-reply-from-server>,
(52) means a misbehaving server. If you're convinced that your server
returned a proper HTTP response with some status code but with an empty body,
please file a bug report in the Mesos JIRA.
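
To spell out the contract discussed here, below is a simplified sketch of how a single HTTP check result could be classified. It is a model for discussion, not the checker's actual code: the 200 to 399 success range comes from this thread, while `CheckResult`, `isSuccessCode`, `evaluate`, and the way the grace period is applied are assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical outcome of a single HTTP health check.
enum class CheckResult { Success, IgnoredFailure, Failure };

// Status codes between 200 and 399 (inclusive) count as success,
// so 1xx codes such as 111 are failures.
inline bool isSuccessCode(int statusCode)
{
  return statusCode >= 200 && statusCode <= 399;
}

// Simplified model: a failure that happens while the task is still
// within its grace period is ignored instead of being reported.
inline CheckResult evaluate(
    int statusCode,
    double elapsedSeconds,
    double gracePeriodSeconds)
{
  if (isSuccessCode(statusCode)) {
    return CheckResult::Success;
  }

  if (elapsedSeconds < gracePeriodSeconds) {
    return CheckResult::IgnoredFailure;
  }

  return CheckResult::Failure;
}
```

Under this model a service that answers 111 while starting up is covered by the grace period, but it must switch to a 2xx or 3xx code once it is ready.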

On Fri, Nov 3, 2017 at 2:20 PM, Alex Rukletsov <a...@mesosphere.com> wrote:

> Tomas, can I reply to you and cc devlist to have our discussion logged
> publicly?
>
>
> On Fri, Nov 3, 2017 at 10:43 AM, Tomas Barton <barton.to...@gmail.com>
> wrote:
>
>> Hi Alex,
>>
>> I'm quite ok with the current contract, treat "codes between 200 and 399
>> as success" seems reasonable for me. We're using code < 200 for "not
>> ready yet" and >= 500 for error states.
>>
>> But that's not really the problem. While Marathon's implementation only
>> checked the HTTP code, curl tends to be too smart. Meaning that going from
>> Marathon healthcheck to MESOS based might introduce some incompatibility.
>>
>> For example:
>>
>> (2017-11-02 19:31:25) [INFO] Request: 127.0.0.1:44172 0x1fcc44f0
>> HTTP/1.1 GET /health
>> (2017-11-02 19:31:25) [INFO] Response: 0x1fcc44f0 /health 111 0
>> I1102 19:31:25.548070 23822 checker_process.cpp:959] HTTP health check
>> for task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' returned: 111
>> W1102 19:31:25.548195 23822 health_checker.cpp:317] HTTP health check for
>> task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' failed: Unexpected
>> HTTP response code: 111
>>
>> This is sort of ok, the operator should know that "failed: Unexpected
>> HTTP response code: 111" isn't really a failure but an expected response.
>>
>> But in order to get this we had to hack into HTTP server and introduce
>> some "special" HTTP codes.
>>
>> Another component, where health checks on Marathon were responding as
>> expected, behaves funny with MESOS_HTTP:
>>
>> W1102 10:50:38.637907 6 health_checker.cpp:307] HTTP health check for
>> task 'xxx' failed: curl exited with status 52: curl: (52) Empty reply from
>> server
>> I1102 10:50:38.637949 6 health_checker.cpp:333] Ignoring failure of
>> HTTP health check for task 'xxx': still in grace period
>>
>> In this case the response code was either 100 or 111. Hard to tell from
>> the logs as the return code is not logged. The problem is, that the
>> component is written in Java, where some library for creating simple
>> webserver responds to /health endpoint is using underneath pretty standard
>> Jetty server. And Jetty decided that responses with code 1xx doesn't have
>> to send body response. On the other side curl thinks that HTTP response
>> with 1xx should have body response, thus the error code (52) Empty reply
>> from server. Maybe we should simply respond with HTTP 418 I'm a teapot,
>> meaning that the tea is not ready yet :)
>>
>> So, the question is, could be curl configured in a way where it doesn't
>> check for body content? And if body is present include it in logs?
>>
>> Or should I file bug reports to all web servers to include Mesos
>> compatible http responses? :)
>>
>> Thanks!
>> Tomas
>>
>>
>> On 2 November 2017 at 19:58, Alex Rukletsov <a...@mesosphere.com> wrote:
>>
>>> Hi Tomas!
>>>
>>> I wanted to make health checks as simple as possible. I had looked at
>>> what aws, k8s, and nomad do and decided that I will not support
>>> customization for return codes unless someone shows me a very good reason
>>> to do so. Such customization is not easy, once you start it, people will
>>> want mor

On the current CI state

2017-10-23 Thread Alex Rukletsov
Folks,

the CI state (both Apache and the internal one we have at Mesosphere) has
recently degraded to the point where people no longer look at its failures.
This defeats the primary purpose of a CI: to produce a reliable signal when a
change breaks something.

You might have seen a bunch of commits fixing flaky tests and bugs over the
past two weeks — this is the beginning of our effort to bring the CI back
to the green state. To track the effort, there exists a swim lane in our
tech debt board [1] and a flow diagram [2]. I believe that some of the
older tickets are no longer relevant, I will do a cleanup at some point
when I get a better feeling of the actual state.

If you would like to help, watch out for new flakiness that new changes might
introduce. Apache CI apparently has a quirk where a test run can pause for
15+s, leading to arbitrary test failures. This is a false positive, but the
pattern is easily recognizable in the logs.

We also have a dedicated channel in Apache Mesos slack: #ci-back-to-green

If you would like to participate, here is the list of the biggest offenders
that are not triaged yet: MESOS-7519, MESOS-7082, MESOS-7434, MESOS-7512,
MESOS-7742, MESOS-7028, MESOS-7425, MESOS-7106, MESOS-7337, MESOS-7273,
MESOS-6724, MESOS-8112, MESOS-6949, MESOS-8000, MESOS-8047

Alex.

[1]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=151=detail=MESOS-8005
[2]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=204=reporting=cumulativeFlowDiagram=501=774=775=776=7


Re: [Proposal] Updating levels for verbose logging

2017-10-09 Thread Alex Rukletsov
Ben, I understand why you question whether libprocess should log starting
from a specific level. Still, I think it is not illogical for a library to use
lower priority levels. I can see this change being helpful for any user of
libprocess, not just Mesos.
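
One way to reconcile both views is to make the starting level a configuration option, as Ben suggested. The sketch below only illustrates the arithmetic; `VerbosityConfig`, `libraryBaseLevel`, and `libraryShouldLog` are hypothetical names, not existing glog or libprocess options.

```cpp
#include <cassert>

// Hypothetical configuration: the process-wide verbosity (analogous to
// the value passed via GLOG_v) and the level at which the library's
// first verbose level starts.
struct VerbosityConfig
{
  int globalLevel;
  int libraryBaseLevel;
};

// Maps a library-relative level (1, 2, 3, ...) onto the global scale:
// library level 1 is emitted at `libraryBaseLevel`, level 2 at
// `libraryBaseLevel + 1`, and so on.
inline bool libraryShouldLog(const VerbosityConfig& config, int libraryLevel)
{
  return config.libraryBaseLevel + (libraryLevel - 1) <= config.globalLevel;
}
```

With a base level of 3, a process running at verbosity 2 sees no library logs at all, while a user who embeds the library elsewhere could set the base level to 1 and get the same messages at the usual levels.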

On Mon, Oct 9, 2017 at 6:34 PM, Benjamin Mahler  wrote:

> >
> >2. Changing the libprocess verbose logs to start at level 3. Not just
> >due to an ordering between Mesos and libprocess logs, but also because
> >libprocess is a low-level library.
>
>
> 2. is the part that is concerning. It seems arbitrary to me to have
> libprocess start at a particular level since it's a library. Can you make
> it a configuration option as I mentioned earlier?
>
> The /logging integration for per-module logging sounds great!
>
> On Mon, Oct 9, 2017 at 11:02 AM, Armand Grillet 
> wrote:
>
> > Thanks for your input Benjamin. After having looked at per-module verbose
> > level, here are the changes I would like to apply:
> >
> >    1. Changing the Mesos common events verbose logs so that they use
> >    VLOG(2) instead of 1. The original commit
> >    <https://github.com/apache/mesos/commit/fa6ffdfcd22136c171b43aed2e7949a07fd263d7>
> >    that started using VLOG(1) for the allocator does not state why this
> >    level was chosen and the periodic messages such as "No allocations
> >    performed" should be displayed at a higher level to simplify debugging.
> >    2. Changing the libprocess verbose logs to start at level 3. Not just
> >    due to an ordering between Mesos and libprocess logs, but also because
> >    libprocess is a low-level library.
> >    3. Adding support for the GLOG vmodule flag and add it as an option in
> >    /toggle/logging (as suggested in
> >    https://issues.apache.org/jira/browse/MESOS-5784). However, this would
> >    not allow us to have a per-component logging verbosity control that
> >    should be added afterwards.
> >
> >
> > 2017-10-07 1:47 GMT+02:00 Benjamin Mahler :
> >
> > > It seems unfortunate to establish an ordering between different
> > component's
> > > verbosity levels, how is libprocess to know which level to start at? I
> > > suppose you can tell it, but it's not clear that the first level of
> > > verbosity in libprocess should come after the max level of verbosity in
> > > mesos.
> > >
> > > This seems to surface a need for per-module logging verbosity control.
> > Have
> > > you looked into the '--vmodule' flag?
> > >
> > > On Wed, Oct 4, 2017 at 12:59 PM, Armand Grillet <
> agril...@mesosphere.io>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > We currently use three levels of verbose logging via the VLOG macro.
> I
> > > > propose to add two levels and change how we use the current ones to
> > make
> > > > debugging easier for Mesos developers.
> > > >
> > > > The current situation is:
> > > >
> > > >- VLOG(1) is used for Mesos and libprocess events such as the
> > > >admission of an agent by a master. It is also used for a few Mesos
> > > > common
> > > >events, e.g. the allocation of resources on an agent.
> > > >- VLOG(2) is used for Mesos and libprocess common events, e.g. the
> > > >reception of an offer by a Mesos scheduler.
> > > >- VLOG(3) is used when a Mesos scheduler process saves the PID
> > > >associated with each slave and for libprocess events related to
> > > timers,
> > > >clocks, and waiter processes.
> > > >
> > > > As an example, running GLOG_v= ./mesos-tests --gtest_filter="
> > > > OversubscriptionTest.UpdateAllocatorOnSchedulerFailover" --verbose
> > > > returns:
> > > >
> > > >- 212 lines of logs with level = 1.
> > > >- 695 lines of logs with level = 2.
> > > >- 782 lines of logs with level = 3.
> > > >
> > > > The logs at level 2 are quite noisy. This is mainly due to the number
> > of
> > > > messages regarding libprocess recurring events such as process
> > > resumptions:
> > > > https://github.com/apache/mesos/blob/d863620e5cb82b7f22cade0da0a0d18afbdf9136/3rdparty/libprocess/src/process.cpp#L3245
> > > >
> > > > To improve the situation, I suggest having five levels:
> > > >
> > > >- VLOG(1), used for Mesos events.
> > > >- VLOG(2), used for Mesos common/recurring events.
> > > >- VLOG(3), used for libprocess events.
> > > >- VLOG(4), used for libprocess common/recurring events.
> > > >- VLOG(5), used for libprocess events related to timers, clocks,
> and
> > > >waiter processes.
> > > >
> > > > This change would allow us to read the Mesos verbose logs without
> > having
> > > to
> > > > see the ones concerning libprocess, a use case that seems reasonable
> > for
> > > > Mesos developers. The new log levels would make it possible to have
> the
> > > > same logs as before when necessary.
> > > >
> > > > What do you think about this? 

Re: [Proposal] Updating levels for verbose logging

2017-10-06 Thread Alex Rukletsov
I support the effort, Armand.

On Wed, Oct 4, 2017 at 3:59 PM, Armand Grillet 
wrote:

> Hi all,
>
> We currently use three levels of verbose logging via the VLOG macro. I
> propose to add two levels and change how we use the current ones to make
> debugging easier for Mesos developers.
>
> The current situation is:
>
>- VLOG(1) is used for Mesos and libprocess events such as the
>admission of an agent by a master. It is also used for a few Mesos
> common
>events, e.g. the allocation of resources on an agent.
>- VLOG(2) is used for Mesos and libprocess common events, e.g. the
>reception of an offer by a Mesos scheduler.
>- VLOG(3) is used when a Mesos scheduler process saves the PID
>associated with each slave and for libprocess events related to timers,
>clocks, and waiter processes.
>
> As an example, running GLOG_v= ./mesos-tests --gtest_filter="
> OversubscriptionTest.UpdateAllocatorOnSchedulerFailover" --verbose
> returns:
>
>- 212 lines of logs with level = 1.
>- 695 lines of logs with level = 2.
>- 782 lines of logs with level = 3.
>
> The logs at level 2 are quite noisy. This is mainly due to the number of
> messages regarding libprocess recurring events such as process resumptions:
> https://github.com/apache/mesos/blob/d863620e5cb82b7f22cade0da0a0d18afbdf9136/3rdparty/libprocess/src/process.cpp#L3245
>
> To improve the situation, I suggest having five levels:
>
>- VLOG(1), used for Mesos events.
>- VLOG(2), used for Mesos common/recurring events.
>- VLOG(3), used for libprocess events.
>- VLOG(4), used for libprocess common/recurring events.
>- VLOG(5), used for libprocess events related to timers, clocks, and
>waiter processes.
>
> This change would allow us to read the Mesos verbose logs without having to
> see the ones concerning libprocess, a use case that seems reasonable for
> Mesos developers. The new log levels would make it possible to have the
> same logs as before when necessary.
>
> What do you think about this? Please feel free to share your thoughts and
> comments.
>
> --
> Armand Grillet
> Software Engineer, Mesosphere
>
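
To illustrate the five-level split proposed above, here is a minimal sketch
in plain C++. It does not use glog: SKETCH_VLOG, verbosity, sink, emitAll and
the message texts are all hypothetical stand-ins. The only behaviour assumed
from glog is that a message at level n is emitted when n is less than or
equal to the configured verbosity.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the verbosity normally set via GLOG_v.
static int verbosity = 0;

// Hypothetical sink so the output can be inspected; glog writes to its
// own log destinations instead.
static std::ostringstream sink;

// Emits the message only if its level does not exceed `verbosity`,
// mirroring the rule assumed for VLOG(n).
#define SKETCH_VLOG(level) \
  if ((level) <= verbosity) sink

// Emits one message per proposed level and returns what was logged.
std::string emitAll(int v)
{
  verbosity = v;
  sink.str("");
  SKETCH_VLOG(1) << "mesos: agent admitted\n";
  SKETCH_VLOG(2) << "mesos: no allocations performed\n";
  SKETCH_VLOG(3) << "libprocess: process spawned\n";
  SKETCH_VLOG(4) << "libprocess: process resumed\n";
  SKETCH_VLOG(5) << "libprocess: timer fired\n";
  return sink.str();
}
```

Under these assumptions, emitAll(2) returns only the two Mesos lines, which
is the use case the proposal aims for, while emitAll(5) returns all five.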


Re: About the Mesos authorization

2017-09-15 Thread Alex Rukletsov
Look for "Implementing an Authorizer" in [1].

[1] https://mesos.apache.org/documentation/latest/authorization/

On Thu, Sep 14, 2017 at 4:01 AM, j...@is-land.com.tw <j...@is-land.com.tw>
wrote:

>
>
> On 2017-09-14 02:46, Alex Rukletsov <a...@mesosphere.io> wrote:
> > Mesos provides an API which you can use to build any authz you like. But
> > that does not necessarily mean that all those implementations should be
> > part of the core Mesos. I'd suggest searching around, maybe you will find
> > something you can use. For example, internally we have a Kerberos authz
> > module, but it is proprietary.
> >
> > Alex.
> >
> > On 12 Sep 2017 4:52 am, "j...@is-land.com.tw" <j...@is-land.com.tw>
> wrote:
> >
> > > Hi all:
> > > Why does Mesos authorization not support the LDAP or Kerberos?
> > >
> > > I am thinking to implement the Mesos module for authorization.
> > >
> > >
> > > Thank you.
> > >
>
> Hi,
> Thank you for your reply.
>
> How can I use the Mesos API to build authz?
>
>


Re: About the Mesos authorization

2017-09-13 Thread Alex Rukletsov
Mesos provides an API which you can use to build any authz you like. But that
does not necessarily mean that all those implementations should be part of
the core Mesos. I'd suggest searching around, maybe you will find something
you can use. For example, internally we have a Kerberos authz module, but it
is proprietary.

Alex.

On 12 Sep 2017 4:52 am, "j...@is-land.com.tw"  wrote:

> Hi all:
> Why does Mesos authorization not support the LDAP or Kerberos?
>
> I am thinking to implement the Mesos module for authorization.
>
>
> Thank you.
>


[RESULT][VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-31 Thread Alex Rukletsov
Hi all,

The vote for Mesos 1.1.3 (rc2) has passed with the
following votes.

+1 (Binding)
--
Alex R
Till Tönshoff
Vinod Kone

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.3

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.3

The mesos-1.1.3.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Till & Alex


Re: [VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-31 Thread Alex Rukletsov
+1

Tested on internal CI and additionally `make check` on Fedora 25 and Mac OS
10.11.6.

On Thu, Aug 31, 2017 at 2:50 AM, Till Toenshoff <toensh...@me.com> wrote:

> +1
>
> Tested on internal CI as well as on macOS 10.12 and macOS 10.13 DP 8 using
> Apple’s clang (Xcode 8.3.3 and Xcode 9.0.0 beta 6).
>
> > On Aug 27, 2017, at 8:33 PM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> > +1 (binding)
> >
> > Tested on ASF CI. The only red build was the known perf core dump issue.
> >
> > Revision: ce77d91bd3a59227d5684ce0783b460c54ea311f
> > refs/tags/1.1.3-rc2
> > Configuration Matrix  gcc clang
> > centos:7  --verbose --enable-libevent --enable-ssl  autotools
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >
> > cmake
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%
> 7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >
> > --verbose autotools
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_
> exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >
> > cmake
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%
> 3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> > cmake
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(
> docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=-
> -verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(
> docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> > --verbose autotools
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,
> label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,
> label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> > cmake
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%
> 3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >  <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/40/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=-
> -verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%
> 3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> >
> > On Fri, Aug 25, 2017 at 7:48 AM, Alex Rukletsov <a...@mesosphere.com
> <mailto:a...@mesosphere.com>> wrote:
> > Folks,
> >
> > Please vote on releasing the fo

[VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-25 Thread Alex Rukletsov
Folks,

Please vote on releasing the following candidate as Apache Mesos 1.1.3.
Note that this will be the last 1.1.x release.

1.1.3 includes the following:

** Bug
  * [MESOS-5187] - The filesystem/linux isolator does not set the
permissions of the host_path.
  * [MESOS-6743] - Docker executor hangs forever if `docker stop` fails.
  * [MESOS-6950] - Launching two tasks with the same Docker image
simultaneously may cause a staging dir never cleaned up.
  * [MESOS-7540] - Add an agent flag for executor re-registration timeout.
  * [MESOS-7569] - Allow "old" executors with half-open connections to be
preserved during agent upgrade / restart.
  * [MESOS-7689] - Libprocess can crash on malformed request paths for
libprocess messages.
  * [MESOS-7690] - The agent can crash when an unknown executor tries to
register.
  * [MESOS-7581] - Fix interference of external Boost installations when
using some unbundled dependencies.
  * [MESOS-7703] - Mesos fails to exec a custom executor when no shell is
used.
  * [MESOS-7728] - Java HTTP adapter crashes JVM when leading master
disconnects.
  * [MESOS-7770] - Persistent volume might not be mounted if there is a
sandbox volume whose source is the same as the target of the persistent
volume.
  * [MESOS-] - Agent failed to recover due to mount namespace leakage
in Docker 1.12/1.13.
  * [MESOS-7796] - LIBPROCESS_IP isn't passed on to the fetcher.
  * [MESOS-7830] - Sandbox_path volume does not have ownership set
correctly.
  * [MESOS-7863] - Agent may drop pending kill task status updates.
  * [MESOS-7865] - Agent may process a kill task and still launch the task.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.3-rc2


The candidate for Mesos 1.1.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz

The tag to be voted on is 1.1.3-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.3-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1208

Please vote on releasing this package as Apache Mesos 1.1.3!

The vote is open until Wed Aug 28 23:59:59 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.3
[ ] -1 Do not release this package because ...

Thanks,
Alex & Till


Re: build tests/scheduler_http_api_tests.cpp error

2017-08-25 Thread Alex Rukletsov
Yes, we need to `#include `. Thanks for the report! It is fixed in
the master now: 9735aebcdaa02d6577636b2c9ea45d986af747c9

On Tue, Aug 22, 2017 at 11:22 AM, Yanjun Shen  wrote:

> Hi,
>   I have now finished `make install`, but when I build the
> scheduler_http_api_tests.cpp file, I get the following errors:
>
> In file included from /usr/local/include/internal/devolve.hpp:27:0,
>  from /usr/local/include/master/master.hpp:75,
>  from scheduler_http_api_tests.cpp:39:
> /usr/local/include/mesos/executor/executor.hpp:28:61: error: ‘Call’ does not name a type
>  inline std::ostream& operator<<(std::ostream& stream, const Call::Type&
> type)
>  ^
> /usr/local/include/mesos/executor/executor.hpp:28:71: error: expected
> unqualified-id before ‘&’ token
>  inline std::ostream& operator<<(std::ostream& stream, const Call::Type&
> type)
>^
> /usr/local/include/mesos/executor/executor.hpp:28:71: error: expected ‘)’
> before ‘&’ token
> /usr/local/include/mesos/executor/executor.hpp:28:73: error: expected
> initializer before ‘type’
>  inline std::ostream& operator<<(std::ostream& stream, const Call::Type&
> type)
>  ^
> /usr/local/include/mesos/executor/executor.hpp:34:61: error: ‘Event’ does not name a type
>  inline std::ostream& operator<<(std::ostream& stream, const Event::Type&
> type)
>  ^
> /usr/local/include/mesos/executor/executor.hpp:34:72: error: expected
> unqualified-id before ‘&’ token
>  inline std::ostream& operator<<(std::ostream& stream, const Event::Type&
> type)
> ^
> /usr/local/include/mesos/executor/executor.hpp:34:72: error: expected ‘)’
> before ‘&’ token
> /usr/local/include/mesos/executor/executor.hpp:34:74: error: expected
> initializer before ‘type’
>  inline std::ostream& operator<<(std::ostream& stream, const Event::Type&
> type)
>
> My build command:
> g++ -std=c++0x  -o test_scheduler_http.bin  -I/usr/local/include
> -L/usr/local/lib scheduler_http_api_tests.cpp -Lmesos-1.3.0
>
> I found the type "Call" in scheduler.pb.h in the build dir:
> [root@mytest mesos-1.3.0]# find . -name scheduler.pb.h
> ./build/include/mesos/scheduler/scheduler.pb.h
> ./build/include/mesos/v1/scheduler/scheduler.pb.h
>
> So, could you tell me which Mesos header directories need to be included?
>
> Thanks.
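
The diagnostics quoted above are what a compiler typically reports when a
header uses a type before its declaration is visible. A minimal sketch of
that situation follows. The namespace sketch, the hand-written Call struct
and the stringify helper are hypothetical stand-ins for illustration; the
sketch does not claim which Mesos header was missing.

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace sketch {

// Stand-in for the declaration that a generated header would provide.
// Removing this struct should make the operator below fail to compile
// with a diagnostic similar to the one in the report.
struct Call
{
  enum Type { SUBSCRIBE, UPDATE };
};

// Compiles only because Call is declared above this point.
inline std::ostream& operator<<(std::ostream& stream, const Call::Type& type)
{
  return stream << (type == Call::SUBSCRIBE ? "SUBSCRIBE" : "UPDATE");
}

// Helper for checking what the operator prints.
inline std::string stringify(Call::Type type)
{
  std::ostringstream out;
  out << type;
  return out.str();
}

} // namespace sketch
```

The point of the sketch is only the ordering: a header that declares such an
operator has to include whatever declares the type itself, rather than rely
on the including file to have done so.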


Re: [Proposal] Use jemalloc as default memory allocator for Mesos

2017-08-19 Thread Alex Rukletsov
I'm also for making jemalloc the default, with an opt-out option added to
the CMake and autotools build scripts.

On Sat, Aug 19, 2017 at 3:23 AM, Benjamin Mahler  wrote:

> This will be a big win Benno, thanks for driving it!
>
> Nice to see that the heap profiling overhead is really low, I'd love to be
> able to just hit an endpoint on the master or agent and get a memory
> profile.
>
> I'm a +1 for making it the default, however, I seem to recall hearing that
> there were some issues with JNI?
>
> Ben
>
> On Fri, Aug 18, 2017 at 3:49 AM, Benno Evers 
> wrote:
>
> > Hi all,
> >
> > I would like to propose bundling jemalloc as a new dependency
> > under `3rdparty/`, and to link Mesos against this new memory
> > allocator by default.
> >
> >
> > # Motivation
> >
> > The Mesos master and agent binaries are, ideally, very long-running
> > processes. This makes them susceptible to memory issues, because
> > even small leaks have a chance to build up over time to the point
> > where they become problematic.
> >
> > We have seen several such issues on our internal Mesos installations,
> > for example https://issues.apache.org/jira/browse/MESOS-7748
> > or https://issues.apache.org/jira/browse/MESOS-7800.
> >
> > I imagine any organization running Mesos for an extended period
> > of time has had its share of similar issues, so I expect this
> > proposal to be useful for the whole community.
> >
> >
> > # Why jemalloc?
> >
> > Given that memory issues tend to be most visible after a given
> > process has been running for a long time, it would be great to
> > have the option to enable heap tracking and profiling at runtime,
> > without having to restart the process. (This ability could then
> > be connected to a Mesos endpoint, similar to how we can adjust
> > the log level at runtime)
> >
> > The two production-quality memory allocators that have this
> > ability currently seem to be tcmalloc and jemalloc. Of these,
> > jemalloc does produce in our experience better and more
> > detailed statistics.
> >
> >
> > # What is the impact on users who do not need this feature?
> >
> > Naturally, not every single user of Mesos will have a need
> > for this feature. To ensure these users would not experience serious
> > performance regressions as a result of this change, we conducted
> > a preliminary set of benchmarks whose results are collected
> > under https://issues.apache.org/jira/browse/MESOS-7876
> >
> > It turns out that we could probably even expect a small speedup (1% - 5%)
> > as a nice side-effect of this change.
> >
> > Users who compile Mesos themselves would of course have the option
> > to disable jemalloc at configuration time or replace it with their
> > memory allocator of choice.
> >
> >
> >
> > I'm looking forward to hear any thoughts and comments.
> >
> >
> > Thanks,
> > --
> > Benno Evers
> > Software Engineer, Mesosphere
> >
>


Re: Mesos 1.1.3 release

2017-08-17 Thread Alex Rukletsov
We have two more issues that I would like to have in 1.1.3 because it's the
last 1.1.x release:
https://issues.apache.org/jira/browse/MESOS-7865
https://issues.apache.org/jira/browse/MESOS-7863

They are in review and will be back ported soon.

On Tue, Jul 25, 2017 at 11:28 AM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> MESOS-7643 is still unresolved. I am moving the cut date for one more
> week, because this is the last patch release for 1.1.x.
>
> On Fri, Jul 14, 2017 at 6:34 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Folks,
>>
>> We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
>> you have any patch that needs to get into 1.1.3, please make sure that
>> either it is already in the 1.1.x branch or the corresponding ticket has
>> a target version including 1.1.3.
>>
>> The release dashboard:
>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463
>>
>> Till & Alex.
>>
>> On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> Folks,
>>>
>>> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
>>> and MESOS-7569). Since this will be the last 1.1.x release, we are
>>> delaying it for 3 more weeks to leave more time for people to include
>>> critical bug fixes.
>>>
>>> Till & Alex.
>>>
>>
>>
>


Re: Mesos 1.1.3 release

2017-07-25 Thread Alex Rukletsov
MESOS-7643 is still unresolved. I am moving the cut date for one more week,
because this is the last patch release for 1.1.x.

On Fri, Jul 14, 2017 at 6:34 PM, Alex Rukletsov <a...@mesosphere.com> wrote:

> Folks,
>
> We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
> you have any patch that needs to get into 1.1.3, please make sure that
> either it is already in the 1.1.x branch or the corresponding ticket has
> a target version including 1.1.3.
>
> The release dashboard:
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463
>
> Till & Alex.
>
> On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Folks,
>>
>> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
>> and MESOS-7569). Since this will be the last 1.1.x release, we are
>> delaying it for 3 more weeks to leave more time for people to include
>> critical bug fixes.
>>
>> Till & Alex.
>>
>
>


Re: Mesos 1.1.3 release

2017-07-14 Thread Alex Rukletsov
Folks,

We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
you have any patch that needs to get into 1.1.3, please make sure that
either it is already in the 1.1.x branch or the corresponding ticket has a
target version including 1.1.3.

The release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463

Till & Alex.

On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> Folks,
>
> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
> and MESOS-7569). Since this will be the last 1.1.x release, we are
> delaying it for 3 more weeks to leave more time for people to include
> critical bug fixes.
>
> Till & Alex.
>


Re: RFC: removing process implementations from common headers

2017-06-28 Thread Alex Rukletsov
I'm in favor of the suggestion. Do you guys plan to do a single sweep or
document the pattern somewhere and apply it only for new and refactored
code?

On Wed, Jun 28, 2017 at 12:19 AM, Yan Xu  wrote:

> This sounds reasonable to me. Do others have comments?
>
> ---
> @xujyan 
>
> On Fri, Jun 23, 2017 at 4:23 PM, James Peach  wrote:
>
> > Hi all,
> >
> > There is a common Mesos pattern where a subsystem is implemented by a
> > facade class that forwards calls to an internal Process class, eg.
> Fetcher
> > and FetcherProcess, or zookeeper::Group and zookeeper::GroupProcess.
> Since
> > the Process is an internal implementation detail, I'd like to propose
> that
> > we adopt a general policy that it should not be exposed in the primary
> > header file. This has the following benefits:
> >
> > - reduces the number of symbols exposed to clients including the primary
> > header file
> > - reduces the number of header files needed in the primary header file
> > - reduces the number of rebuilt dependencies when the process
> > implementation changes
> >
> > Although each individual case of this practice may not improve build
> > times, I think it is likely that over time, consistent application of
> this
> > will help.
> >
> > In many cases, when FooProcess is only used by Foo, both the declaration
> > and definitions of Foo can be inlined into "foo.cpp", which is already
> our
> > common practice. If the implementation of the Process class is needed
> > outside the facade (eg. for testing), the pattern I would propose is:
> >
> > foo.hpp - Primary API for Foo, forward declares FooProcess
> > foo_process.hpp - Declarations for FooProcess
> > foo_process.cpp - Definitions of FooProcess
> >
> > The "checks/checker.hpp" interface almost follows this pattern, but gives
> > up the build benefits by including "checker_process.hpp" in
> "checker.hpp".
> > This should be simple to fix however.
> >
> > thanks,
> > James
>
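
The header layout proposed above can be sketched in a single file as follows.
This is an illustration under stated assumptions, not the actual Mesos code:
Foo, FooProcess and fetch are hypothetical names, the file boundaries are
marked by comments only, and the facade forwards with a plain member call
where real code would go through libprocess.

```cpp
#include <memory>
#include <string>

// ---- foo.hpp: primary API; only forward declares FooProcess. ----
class FooProcess;  // Incomplete type; no process headers needed here.

class Foo
{
public:
  Foo();
  ~Foo();  // Defined in foo.cpp, where FooProcess is complete.

  std::string fetch(const std::string& uri);

private:
  std::unique_ptr<FooProcess> process;
};

// ---- foo_process.hpp: declarations, included by foo.cpp and tests. ----
class FooProcess
{
public:
  std::string fetch(const std::string& uri);
};

// ---- foo_process.cpp: definitions of FooProcess. ----
std::string FooProcess::fetch(const std::string& uri)
{
  return "fetched:" + uri;
}

// ---- foo.cpp: the facade forwards every call to the process. ----
Foo::Foo() : process(new FooProcess()) {}
Foo::~Foo() = default;

std::string Foo::fetch(const std::string& uri)
{
  return process->fetch(uri);
}
```

Clients that include only the foo.hpp part never see the members of
FooProcess, so changing them should not force those clients to rebuild.
Declaring the destructor in the header and defining it in foo.cpp is what
allows std::unique_ptr to hold the incomplete type.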


On Apache Mesos release process

2017-06-17 Thread Alex Rukletsov
Folks,

For more than a year, Apache Mesos releases have been done according to our
"then new" release policy [1]. It seems to work quite well, but today I would
like to address things that can be improved.

Let's start with pain points:
* A minor bug can cancel a release vote, even for a patch release.
* More canceled votes lead to more RCs and hence create more work for
committers and voters.
* Little motivation to vote on a release candidate unless other people vote.
* Releases often run behind schedule.

I would like to suggest some improvements to the process:

1. Stricter time-based releases. The next release should go into planning
(with release managers elected) right after the current one is cut. Feature
owners work with the release managers prior to the cut to track progress (k8s
community aims for 2-3 meetings per week discussing blockers and schedule).
This way release managers should have a satisfactory understanding of which
new features are going in and what can slow down the release several days
before the cut.

2. Written guideline for which issues can '-1' the release. Though it is up
to the voter how to vote, a clear guideline will set reasonable
expectations and hopefully help us decrease the number of RCs. Regressions
(security, performance, compatibility, functional) can cause -1.
Regressions of experimental features cannot cause -1. Patch releases can be
-1'd in exceptional cases, e.g., critical bug fix missing in the last patch
release. New features cannot block a release.

Note: We love reasonable -1 votes! It is so much better to defer a release
than discover a critical regression from a production user report!

3. Release managers decide what is back ported to the RC branch once it is
cut (same for patch releases). Feature owners and committers are encouraged
to update the release managers in a timely manner on the status and
importance of features and bug fixes.

And of course, I encourage everyone using Mesos to test & vote on release
candidates! Identical cluster configurations are rare, so each new setup
helps to find bugs and hence to build better software.

[1] https://github.com/apache/mesos/blob/master/docs/versioning.md

Alex.


Re: Easing the Pain of Code Formatting in Mesos

2017-06-15 Thread Alex Rukletsov
+1. Having an enforceable rule is sometimes more important than the rule
itself (e.g., 4 vs. 2 spaces indent).

On Thu, Jun 15, 2017 at 9:59 AM, Alexander Rojas 
wrote:

> +1 It is always frustrating to rely on clang-format only to realize it
> generates the wrong style, even for old Mesos contributors
>
> Alexander Rojas
> alexan...@mesosphere.io
>
>
>
>
> > On 15. Jun 2017, at 04:32, Michael Park  wrote:
> >
> > I'm increasingly hearing that many contributors who want to contribute to
> > Mesos find that it's often difficult to do so. One of the big issues is
> > our formatting rules, which are not automated. As a result, we've had many
> > reviews that are overwhelming in style comments with only a couple of
> > comments on functionality.
> >
> > This is very frustrating for contributors, and is also a large burden on
> > the committers to
> > remember, review and explain the formatting sections of the style guide.
> >
> > I introduced *ClangFormat* a long time ago as our formatting tool, but it
> > was only
> > a supplementary tool since it didn't yet conform fully to the style
> guide.
> > We've done a lot of
> > work to narrow this gap and the gap is actually quite small at this
> point.
> > However, the existence
> > of such a gap is enough to stir discussions and render the tool useless
> for
> > some people.
> >
> > I think we should close this gap by adopting ClangFormat as our
> formatting
> > guideline.
> >
> > I don't have a fully fleshed out plan just yet. I'd like to push for this
> > effort again,
> > as I find it to be very important.
> >
> > I'm just seeking for +1s if you'd like to see a fleshed out plan for
> this.
> >
> > Thanks,
> >
> > MPark
>
>


Mesos 1.1.3 release

2017-06-14 Thread Alex Rukletsov
Folks,

there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
and MESOS-7569). Since this will be the last 1.1.x release, we are delaying
it for 3 more weeks to leave more time for people to include critical bug
fixes.

Till & Alex.


Re: [VOTE] Release Apache Mesos 1.2.1 (rc1)

2017-06-12 Thread Alex Rukletsov
PortMapping tests are indeed in bad shape. There are JIRAs already, have a
look before filing new ones:
MESOS-4646, MESOS-5687, MESOS-2765, MESOS-5690, MESOS-5688, MESOS-5689,
MESOS-4643, MESOS-4644, MESOS-5309

On Sat, Jun 10, 2017 at 10:58 AM, Adam Bordelon  wrote:

> +1 (binding) Good enough for me.
>
> Ran `make check` (or equivalent) on the Mesosphere internal Jenkins CI.
> Lots of green (all tests passed) on Mac, CentOS7, Debian8, Fedora23 and
> Ubuntu 12.04.
> Three sets of yellow configs yielded 10 unique but mostly known
> failing/flaky tests.
> (Grey means untested)
> [image: Inline image 1]
>
> * Ubuntu {14.04|16.04|16.10} - {Plain|SSL|CMake|Clang}
>   PerfTest.Version (always) -
>   https://issues.apache.org/jira/browse/MESOS-7160
>   ExamplesTest.PythonFramework (sometimes) -
>   https://issues.apache.org/jira/browse/MESOS-7218
>
> * Centos 6 - {Plain|SSL}
>   DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes -
>   https://issues.apache.org/jira/browse/MESOS-7510
>
> * Fedora 23 - Network_Isolator
>   PortMappingIsolatorTest.ROOT_NC_HostToContainerUDP -
>   https://issues.apache.org/jira/browse/MESOS-5690
>   PortMappingIsolatorTest.ROOT_ContainerICMPExternal -
>   https://issues.apache.org/jira/browse/MESOS-5689
>   PortMappingIsolatorTest.ROOT_DNS -
>   https://issues.apache.org/jira/browse/MESOS-5688
>   PortMappingIsolatorTest.ROOT_NC_SmallEgressLimit -
>   https://issues.apache.org/jira/browse/MESOS-5687
>   PortMappingIsolatorTest.ROOT_NC_PortMappingStatistics - ?
>   PortMappingMesosTest.CGROUPS_ROOT_RecoverMixedContainers - ?
>   PortMappingMesosTest.CGROUPS_ROOT_RecoverMixedKnownAndUnKnownOrphans - ?
>
> Anybody have any ideas on the last three? Seems like these PortMapping
> tests are generally in a bad shape, or the network isolator is seriously
> broken. I'll file JIRAs.
>
> P.S. AgentAPIStreamingTest.AttachInputToNestedContainerSession Vinod saw
> on ASF CI is flaky according to
> https://issues.apache.org/jira/browse/MESOS-7159 (added the log gist link
> there)
>
> P.P.S.  CI results at
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/1215
> for those with access. We're still working on exposing our CI to the
> public. Waiting is.
>
>
> On Thu, Jun 8, 2017 at 4:23 PM, Benjamin Mahler 
> wrote:
>
>> +1 (binding)
>>
>> make check passed on macOS 10.12.4
>>
>> The ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession
>> passed for me. Kevin, I captured the logs to the failed run vinod pointed
>> to here:
>>
>> https://gist.github.com/bmahler/5ae340b4de3341f3c1f072250006dc64
>>
>> Does that look like a flaky test or a bug?
>>
>> On Thu, Jun 8, 2017 at 4:07 PM, Benjamin Mahler 
>> wrote:
>>
>>> Vinod I think that's the getenv issue from:
>>> https://issues.apache.org/jira/browse/MESOS-6985
>>>
>>> On Wed, May 17, 2017 at 5:57 PM, Till Toenshoff 
>>> wrote:
>>>
 +1

 Ran it through DC/OS builds and integration tests;
 https://github.com/dcos/dcos/pull/1530 => all green

 On May 17, 2017, at 10:01 PM, Vinod Kone  wrote:

 Ran it on ASF CI and saw some issues.

 Segfault in "MasterTest.MultipleExecutors" in two builds [1] [2], which is
 concerning. Is this a known issue?

 "ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession"
 test failed.




 On Sun, May 14, 2017 at 12:55 AM, tommy xiao  wrote:

> +1
>
> 2017-05-12 7:33 GMT+08:00 Adam Bordelon :
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.2.1.
> >
> > 1.2.1 is a bug fix release. The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.2.1-rc1
> >
> > The candidate for Mesos 1.2.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.2.1-rc1/mesos-1.2.1.tar.gz
> >
> > The tag to be voted on is 

Re: Questions about Mesos starting procedure in the source code

2017-06-07 Thread Alex Rukletsov
Wenzhao,

I am sure you have read some docs about Mesos; have you seen the overview
pages [1, 2, 3]? I think they will help you better understand the big picture:
what the moving parts are and how they interact.

To your questions,

1. There are two types of schedulers: those using the "legacy" aka "libprocess"
API and those using the HTTP API. They differ in how the scheduler connects
to the master; the actual communication process is the same.
Simplified, it looks like "connect to the master -> get registered -> get
some resource offers -> decline resource offers / use some resources to
launch tasks -> get updates about launched tasks' statuses". Have a look at
[4] to get a better understanding of what the scheduler<->master
communication protocol looks like.

File "execute.cpp" is a very simple scheduler, not an executor. It relies on
one of the built-in executors to run its tasks. If you want to have a look,
[5] is one such executor.

2. Framework info is not directly distributed to slaves (we prefer to call
them agents). Once a framework decides to launch a task on offered
resources, the master forwards the task specification to the respective agent.
All the agent cares about is the instructions for launching the task, which
include which executor to use (if omitted, a built-in executor is used).
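
As a rough illustration of the flow in (1), here is a toy C++ sketch of
"get registered -> get offers -> decline or launch -> get status updates".
It is not Mesos code: `ToyScheduler`, `Offer`, and all method names are
invented for this example and do not correspond to the scheduler driver or
the HTTP API. It also assumes a single task that only needs some CPUs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are invented for illustration and are NOT
// part of the Mesos scheduler API.
struct Offer {
  std::string agentId;
  double cpus;
};

class ToyScheduler {
public:
  explicit ToyScheduler(double cpusNeeded) : cpusNeeded_(cpusNeeded) {}

  // The master acknowledges the subscription ("get registered").
  void subscribed() {
    registered_ = true;
    log.push_back("registered");
  }

  // For each offer, either launch the single task or decline the offer.
  void offers(const std::vector<Offer>& offered) {
    assert(registered_);  // Offers only arrive after registration.
    for (const Offer& offer : offered) {
      if (!launched_ && offer.cpus >= cpusNeeded_) {
        launched_ = true;
        log.push_back("launch on " + offer.agentId);
      } else {
        log.push_back("decline " + offer.agentId);
      }
    }
  }

  // Status updates for the launched task flow back to the scheduler.
  void update(const std::string& state) {
    log.push_back("status " + state);
  }

  std::vector<std::string> log;  // Actions taken, in order.

private:
  double cpusNeeded_;
  bool registered_ = false;
  bool launched_ = false;
};
```

In this sketch the scheduler launches on the first offer that is large
enough and declines everything else; a real scheduler is free to use any
policy here.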

Alex.

[1] https://mesos.apache.org/documentation/latest/architecture/
[2]
https://mesos.apache.org/documentation/latest/app-framework-development-guide/
[3] https://mesos.apache.org/documentation/latest/scheduler-http-api/
[4]
https://github.com/apache/mesos/blob/master/include/mesos/scheduler/scheduler.proto
[5] https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp

On Fri, Jun 2, 2017 at 12:11 AM, Wenzhao Zhang  wrote:

> Hi, All:
>
> I have just started working on the Mesos source code for a research project.
> I am confused about the starting procedure and need some help.
> I'm talking about the procedure of using "mesos-execute" to
> execute a docker image.
>
> 1. How are resources offered to the framework (docker) by the master?
> In Master::offer(), I see that a "ResourceOffersMessage" is sent.
> I searched the source code and found that only
> "mesos-1.2.0/src/sched/*sched.cpp*" has a function to receive this message,
> and this function finally invokes a scheduler driver to finish the task.
> But I believe this is not how resources are offered to the docker image,
> as I don't see any logic in "mesos-1.2.0/src/cli/*execute.cpp*" using
> "*sched.cpp*"; and according to the documentation,
> "Mesos provides a simple executor that can execute shell commands and
> Docker containers on behalf of the framework scheduler".
>
>    In "*execute.cpp*", I see an "offers()" function, which finally executes
> some executors. But I don't see where this function is called from the
> master.
>    How does this simple executor execute shell commands and Docker
> containers on behalf of the framework scheduler?
>    How is "*sched.cpp*" used in the source code?
>
>
> 2. After "execute.cpp" subscribes to the master, "framework" information is
> created in the master.
>    But how is this "framework" info distributed to the slaves? I am
> confused about this procedure.
>
> Could anyone kindly give some suggestions?
> Thanks very much
>
> Wenzhao
>


Re: Added task status update reason for health checks

2017-05-22 Thread Alex Rukletsov
James,

We are more than happy to write a comment if folks think it is useful. Do
you have anything specific in mind you want to be captured there? For me,
the reason's name is self-explanatory.

Alex.

On 22 May 2017 17:32, "James Peach"  wrote:

>
> > On May 22, 2017, at 5:28 AM, Andrei Budnik 
> wrote:
> >
> > Hi All,
> >
> > The new reason is REASON_TASK_HEALTH_CHECK_STATUS_UPDATED.
> > The corresponding ticket is
> > https://issues.apache.org/jira/browse/MESOS-6905
>
> Is there any documentation about how executors ought to use this reason?
> Even a comment in the proto files would help executor authors use this
> consistently.
>
> J


[RESULT][VOTE] Release Apache Mesos 1.1.2 (rc2)

2017-05-19 Thread Alex Rukletsov
Hi all,

The vote for Mesos 1.1.2 (rc2) has passed with the following votes.

+1 (Binding)
--
Vinod Kone
Till Tönshoff
Alex Rukletsov

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2

The mesos-1.1.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Alex & Till


[VOTE] Release Apache Mesos 1.1.2 (rc2)

2017-05-12 Thread Alex Rukletsov
Folks,

Please vote on releasing the following candidate as Apache Mesos 1.1.2.

1.1.2 includes the following:

** Bug
  * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
  * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
  * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
Redirect 3xx urls.
  * [MESOS-6327] - Large docker images causes container launch failures:
Too many levels of symbolic links.
  * [MESOS-7057] - Consider using the relink functionality of libprocess in
the executor driver.
  * [MESOS-7119] - Mesos master crash while accepting inverse offer.
  * [MESOS-7152] - The agent may be flapping after the machine reboots due
to provisioner recover.
  * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
  * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
--docker_mesos_image.
  * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
argument" error.
  * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
into sandbox logs.
  * [MESOS-7350] - Failed to pull image from Nexus Registry due to
signature missing.
  * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
persistent volume content.
  * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
  * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
data to agent log.
  * [MESOS-7471] - Provisioner recover should not always assume 'rootfses'
dir exists.
  * [MESOS-7482] - #elif does not match #ifdef when checking the platform.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc2


The candidate for Mesos 1.1.2 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz

The tag to be voted on is 1.1.2-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1194

Please vote on releasing this package as Apache Mesos 1.1.2!

The vote is open until Wed May 17 17:17:17 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.2
[ ] -1 Do not release this package because ...

Thanks,
Till & Alex


Re: [VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-12 Thread Alex Rukletsov
Vinod, the failure you've observed is a known flaky test:
https://issues.apache.org/jira/browse/MESOS-6724

MESOS-7471 <https://issues.apache.org/jira/browse/MESOS-7471> has been
backported. We don't have any other blockers, so I'll be cutting a new rc soon.

On Wed, May 10, 2017 at 6:03 PM, Alex Rukletsov <a...@mesosphere.io> wrote:

> This vote is cancelled. Vinod, I'll look into the failure and report back.
> After that, I'll start a new vote.
>
> On 9 May 2017 10:07 am, "Jie Yu" <yujie@gmail.com> wrote:
>
>> -1
>>
>> I suggest we include this fix in 1.1.2
>> https://issues.apache.org/jira/browse/MESOS-7471
>>
>> On Thu, May 4, 2017 at 12:07 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.1.2.
>>>
>>> 1.1.2 includes the following:
>>> 
>>> 
>>> ** Bug
>>>   * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
>>>   * [MESOS-5028] - Copy provisioner cannot replace directory with
>>> symlink.
>>>   * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
>>> Redirect 3xx urls.
>>>   * [MESOS-6327] - Large docker images causes container launch failures:
>>> Too many levels of symbolic links.
>>>   * [MESOS-7057] - Consider using the relink functionality of libprocess
>>> in the executor driver.
>>>   * [MESOS-7119] - Mesos master crash while accepting inverse offer.
>>>   * [MESOS-7152] - The agent may be flapping after the machine reboots
>>> due to provisioner recover.
>>>   * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
>>>   * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
>>> --docker_mesos_image.
>>>   * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
>>> argument" error.
>>>   * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
>>> into sandbox logs.
>>>   * [MESOS-7350] - Failed to pull image from Nexus Registry due to
>>> signature missing.
>>>   * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
>>> persistent volume content.
>>>   * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
>>>   * [MESOS-7422] - Docker containerizer should not leak possibly
>>> sensitive data to agent log.
>>>
>>> The CHANGELOG for the release is available at:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1
>>> 
>>> 
>>>
>>> The candidate for Mesos 1.1.2 release is available at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz
>>>
>>> The tag to be voted on is 1.1.2-rc1:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1
>>>
>>> The MD5 checksum of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5
>>>
>>> The signature of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc
>>>
>>> The PGP key used to sign the release is here:
>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>
>>> The JAR is up in Maven in a staging repository here:
>>> https://repository.apache.org/content/repositories/orgapachemesos-1188
>>>
>>> Please vote on releasing this package as Apache Mesos 1.1.2!
>>>
>>> The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Mesos 1.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> Thanks,
>>> Alex & Till
>>>
>>
>>


Re: [VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-10 Thread Alex Rukletsov
This vote is cancelled. Vinod, I'll look into the failure and report back.
After that, I'll start a new vote.

On 9 May 2017 10:07 am, "Jie Yu" <yujie@gmail.com> wrote:

> -1
>
> I suggest we include this fix in 1.1.2
> https://issues.apache.org/jira/browse/MESOS-7471
>
> On Thu, May 4, 2017 at 12:07 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.1.2.
>>
>> 1.1.2 includes the following:
>> 
>> 
>> ** Bug
>>   * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
>>   * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
>>   * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
>> Redirect 3xx urls.
>>   * [MESOS-6327] - Large docker images causes container launch failures:
>> Too many levels of symbolic links.
>>   * [MESOS-7057] - Consider using the relink functionality of libprocess
>> in the executor driver.
>>   * [MESOS-7119] - Mesos master crash while accepting inverse offer.
>>   * [MESOS-7152] - The agent may be flapping after the machine reboots due
>> to provisioner recover.
>>   * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
>>   * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
>> --docker_mesos_image.
>>   * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
>> argument" error.
>>   * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
>> into sandbox logs.
>>   * [MESOS-7350] - Failed to pull image from Nexus Registry due to
>> signature missing.
>>   * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
>> persistent volume content.
>>   * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
>>   * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
>> data to agent log.
>>
>> The CHANGELOG for the release is available at:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1
>> 
>> 
>>
>> The candidate for Mesos 1.1.2 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz
>>
>> The tag to be voted on is 1.1.2-rc1:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1
>>
>> The MD5 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1188
>>
>> Please vote on releasing this package as Apache Mesos 1.1.2!
>>
>> The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.1.2
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Alex & Till
>>
>
>


[VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-04 Thread Alex Rukletsov
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.2.

1.1.2 includes the following:

** Bug
  * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
  * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
  * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
Redirect 3xx urls.
  * [MESOS-6327] - Large docker images causes container launch failures:
Too many levels of symbolic links.
  * [MESOS-7057] - Consider using the relink functionality of libprocess in
the executor driver.
  * [MESOS-7119] - Mesos master crash while accepting inverse offer.
  * [MESOS-7152] - The agent may be flapping after the machine reboots due
to provisioner recover.
  * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
  * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
--docker_mesos_image.
  * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
argument" error.
  * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
into sandbox logs.
  * [MESOS-7350] - Failed to pull image from Nexus Registry due to
signature missing.
  * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
persistent volume content.
  * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
  * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
data to agent log.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1


The candidate for Mesos 1.1.2 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz

The tag to be voted on is 1.1.2-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1188

Please vote on releasing this package as Apache Mesos 1.1.2!

The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.2
[ ] -1 Do not release this package because ...

Thanks,
Alex & Till


Mesos 1.1.2 release

2017-04-24 Thread Alex Rukletsov
Folks,

We are planning to cut the 1.1.2 release later this week. If you have any
patch that needs to get into 1.1.2, please make sure that either it is
already in the 1.1.x branch or the corresponding ticket has a target
version including 1.1.2.

The release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331212

AlexR & Till.


Re: CMake and (eventually) deprecating the autotools build

2017-04-02 Thread Alex Rukletsov
A tip for ccache users. Recent versions of CMake allow setting
CMAKE_{CXX|C}_COMPILER_LAUNCHER to a compiler launcher such as ccache. For
example, `cmake -DCMAKE_CXX_COMPILER_LAUNCHER=/usr/local/bin/ccache
-DCMAKE_C_COMPILER_LAUNCHER=/usr/local/bin/ccache`.

On Tue, Mar 14, 2017 at 6:15 PM, Joseph Wu  wrote:

> Hi Devs!
>
> The CMake build system for Mesos is now complete enough for wider
> consumption.  The plan is to review all the differences between the
> CMake and Autotools build systems and eventually deprecate the
> Autotools build system.
>
> A few of us are already using CMake exclusively for development.  But
> we'd like to have more developers using it *before we start talking
> about deprecation*.
>
>
> Here is a summary of the known differences:
>
> Missing features:
> * CMake does not build Java artifacts at the moment.  Since the most
> widely-used frameworks (Aurora, Marathon, etc) rely on this, we will
> prioritize getting this done.
> * CMake currently does not let you specify the exact system dependency
> to use.  i.e. --with-ssl=... --with-boost=... etc.  Instead, CMake
> either uses the bundled versions or automatically finds the system
> locations.  This is a blocker for CMake adoption by DC/OS.
> * CMake does not have an install target at the moment.  One of the top
> priority things to get done.
> * CMake does not build the port isolator module at the moment.
> * CMake does not have an option to install the module dependencies at
> the moment.
> * CMake does not work on FreeBSD at the moment.
>
> Features left out on purpose:
> * CMake does not generate artifacts for Python.  We feel the Autotools
> deprecation will likely run near/alongside the push towards using the
> V1 HTTP APIs.  And there is already an HTTP API library for Python:
> https://github.com/douban/pymesos
> * CMake does not build the old CLI executables (src/cli/mesos.cpp and
> src/cli/resolve.cpp) under the assumption that we will replace those
> in the near future.
> * CMake does not support installing test binaries, because the feature
> appears to be unused.
>
> New features:
> * CMake builds on Windows!
> * CMake supports packaging sources.  For example, you can do `cmake ..
> && make package_source` to generate the autotools equivalent of `make
> dist`.
> * CMake supports packaging binaries.  For example:
>
>   * To generate debs and rpms: `cmake .. -DCPACK_BINARY_DEB=1
> -DCPACK_BINARY_RPM=1 && make package`
>   * On Windows, to build a graphical installer: `cmake ..
> -DCPACK_BINARY_NSIS=1 && make package`
>   * On OSX, to build .dmg and interactive installers: `cmake ..
> -DCPACK_BINARY_OSXX11=1` and `-DCPACK_BINARY_DRAGNDROP=1 && make
> package`
>
> * More granular build targets.  For example, if you're working on
> libprocess, you can use `make libprocess-tests` instead of babysitting
> `make check`.
> * [Upcoming] Precompiled headers, which should speed up the build
> dramatically.
> * [Upcoming] We will be combining some aspects of Mesosphere's OSS
> packaging repo [1] so that binary packages will contain service
> definitions, as well as binaries.
>
>
> Please let us know if you have any comments, concerns, or requests!
>
> And please do try it out:
> cmake .. && cmake --build .
>
> The JIRA tracking the CMake build system is here:
> https://issues.apache.org/jira/browse/MESOS-898
>
> Thanks!
> ~Joseph
>
>
> [1] https://github.com/mesosphere/mesos-deb-packaging
>


[RESULT][VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-03-14 Thread Alex Rukletsov
 Hi folks,

The vote for Mesos 1.1.1 (rc2) has passed with the following votes.

+1 (Binding)
--
*** AlexR
*** Till Tönshoff
*** Vinod Kone

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.1

The mesos-1.1.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Alex & Till


Re: [VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-03-14 Thread Alex Rukletsov
The vote has been open for more than two weeks now and there are no -1's.
I'll go ahead and vote myself:

+1 (binding)

Tested on internal CI with several known issues.

On Tue, Mar 7, 2017 at 6:08 PM, Till Toenshoff <toensh...@me.com> wrote:

> +1
>
> Tested on:
> - macOS 10.12.4 Beta (16E175b): ok
> - centos 6: mostly ok, MESOS-4736
> - centos 7: internal CI issues on capabilities tests, otherwise fine
> - debian 8: mostly ok, MESOS-7213
> - fedora 23: ok
> - ubuntu 12.04: mostly ok, MESOS-7218
> - ubuntu 14.04: mostly ok, MESOS-7218
> - ubuntu 16.04: mostly ok, MESOS-7218
>
>
> On Mar 4, 2017, at 1:09 AM, Vinod Kone <vinodk...@apache.org> wrote:
>
> +1 (binding)
>
> Since the perf issue I reported earlier doesn't seem to be a blocker.
>
> On Fri, Mar 3, 2017 at 12:14 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Was this perf issue introduced by one of the fixes included in 1.1.1-rc2?
>> If not, I would suggest we vote for 1.1.1-rc2 and back port the perf fix
>> into 1.1.2. IIUC, time based patch releases should *not be worse*, hence
>> if
>> the perf issue was already in 1.1.0 it is *fine* to fix it in 1.1.2. I
>> would like to avoid postponing already belated 1.1.1 for even longer.
>>
>> On Wed, Mar 1, 2017 at 8:02 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>> > Tested on ASF CI.
>> >
>> > Saw 2 configurations fail with
>> > https://issues.apache.org/jira/browse/MESOS-7160
>> >
>> > I think @jpeach and @bbannier were looking into this. Not sure about the
>> > severity of the issue, so withholding my vote.
>> >
>> >
>> > *Revision*: b9d8202a7444d0d1e49476bfc9817eb4583beaff
>> >
>> >- refs/tags/1.1.1-rc2
>> >
>> > Configuration Matrix (status as gcc / clang), from
>> > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/30/
>> >
>> > centos:7, --verbose --enable-libevent --enable-ssl
>> >   autotools: Success / Not run
>> >   cmake:     Success / Not run
>> > centos:7, --verbose
>> >   autotools: Success / Not run
>> >   cmake:     Success / Not run
>> > ubuntu:14.04, --verbose --enable-libevent --enable-ssl
>> >   autotools: Success / Failed
>> >   cmake:     Success (gcc)

[VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-02-27 Thread Alex Rukletsov
 Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.1.

1.1.1 includes the following:

** Bug
  * [MESOS-6002] - The whiteout file cannot be removed correctly using aufs
backend.
  * [MESOS-6010] - Docker registry puller shows decode error "No response
decoded".
  * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
  * [MESOS-6360] - The handling of whiteout files in provisioner is not
correct.
  * [MESOS-6411] - Add documentation for CNI port-mapper plugin.
  * [MESOS-6526] - `mesos-containerizer launch --environment` exposes
executor env vars in `ps`.
  * [MESOS-6571] - Add "--task" flag to mesos-execute.
  * [MESOS-6597] - Include v1 Operator API protos in generated JAR and
python packages.
  * [MESOS-6606] - Reject optimized builds with libcxx before 3.9.
  * [MESOS-6621] - SSL downgrade path will CHECK-fail when using both
temporary and persistent sockets.
  * [MESOS-6624] - Master WebUI does not work on Firefox 45.
  * [MESOS-6676] - Always re-link with scheduler during re-registration.
  * [MESOS-6848] - The default executor does not exit if a single task pod
fails.
  * [MESOS-6852] - Nested container's launch command is not set correctly
in docker/runtime isolator.
  * [MESOS-6917] - Segfault when the executor sets an invalid UUID when
sending a status update.
  * [MESOS-7008] - Quota not recovered from registry in empty cluster.
  * [MESOS-7133] - mesos-fetcher fails with openssl-related output.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.1-rc2


The candidate for Mesos 1.1.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz

The tag to be voted on is 1.1.1-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.1-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1182

Please vote on releasing this package as Apache Mesos 1.1.1!

The vote is open until Thu Mar  2 23:59:59 CET 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.1
[ ] -1 Do not release this package because ...

Thanks,
Till & Alex


[Design Doc] Arbitrary task checks in Mesos

2017-01-05 Thread Alex Rukletsov
We've recently been working on a design for arbitrary task checks [1] in
Mesos (currently called probes, but this will likely change). Please have a
look and leave comments on the doc or start a high-level discussion on this
thread.

Alex.

[1]
https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU


Mesos 1.1.1 release dashboard

2016-12-22 Thread Alex Rukletsov
Folks,

We are planning to cut the 1.1.1 release early next week. If you have any
patches that need to get into 1.1.1, please make sure that either they are
already in the 1.1.x branch or the corresponding tickets have a target
version including 1.1.1 *by Monday* Dec 26.

The release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12329892

AlexR & Till.


Re: Order of includes

2016-12-21 Thread Alex Rukletsov
Yes!

https://issues.apache.org/jira/browse/MESOS-6827

On Mon, Dec 19, 2016 at 8:49 AM, Yan Xu  wrote:

> The example is helpful. Thanks!
>
> I have no objection to sticking to the new rule then. But then we have to:
>
> - For contributors and committers, start using the new style when creating
> new files today.
> - Fix the existing include order hopefully with the help of tools like
> clang-tidy and have it enforce the style going forward.
>
> Agreed?
>
> ---
> @xujyan 
>
> On Fri, Dec 16, 2016 at 8:54 PM, Benjamin Bannier <
> benjamin.bann...@mesosphere.io> wrote:
>
> > Hi,
> >
> > > How does putting your own header at the top (vs. ~the bottom) help ensure
> > > "a header file always includes all symbols it requires"?
> >
> >
> > Given an incomplete header
> >
> > // foo.hpp
> > std::string f();
> >
> > // foo.cpp
> > #include "foo.hpp"
> > #include <string>
> >
> > std::string f() { return {}; }
> >
> > I get
> >
> > % clang++ -fsyntax-only foo.cpp --std=c++11
> > In file included from foo.cpp:1:
> > ./foo.hpp:1:1: error: use of undeclared identifier 'std'
> > std::string f();
> > ^
> > 1 error generated.
> >
> > Swapping the include order makes this pass as `#include` is just textual
> > replacement, and the `#include <string>` in `foo.cpp` would declare the
> > symbol used in `foo.hpp`.
> >
> >
> > Cheers,
> >
> > Benjamin
>


Re: [webui] Started show wrong time

2016-12-13 Thread Alex Rukletsov
This looks like a bug. Tomek, could you please file a JIRA?

On Tue, Dec 13, 2016 at 1:02 PM, Tomek Janiszewski 
wrote:

> Hi
>
> When a task has a Mesos health check enabled, the start time in the UI can
> be wrong. This happens because the UI assumes that the first status is the
> task start [0]. This is not always true: Mesos keeps only recent task
> statuses [1], so when a health check updates the task status, it can
> override the task start time displayed in the webui.
>
> Best
> Tomek
>
> [0]
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140
> [1]
> https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265
>
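
The failure mode described above can be reproduced with a small toy model. This is only a sketch: the rule that a new update replaces the last stored status when the state is unchanged is an assumption made for illustration, and `apply_update` and `displayed_start_time` are hypothetical helpers, not Mesos or webui functions.

```python
def apply_update(statuses, update):
    """Store a status update, replacing the last entry if the state is unchanged.

    The replacement rule is an assumption for illustration; it is the simplest
    way to model "only recent statuses are kept".
    """
    if statuses and statuses[-1]["state"] == update["state"]:
        statuses[-1] = update
    else:
        statuses.append(update)
    return statuses


def displayed_start_time(statuses):
    """Mimic a UI that treats the first retained status as "task started"."""
    return statuses[0]["timestamp"] if statuses else None


history = []
apply_update(history, {"state": "TASK_RUNNING", "timestamp": 100.0})
# A later health check result arrives as another TASK_RUNNING update.
apply_update(history, {"state": "TASK_RUNNING", "timestamp": 160.0, "healthy": True})

print(displayed_start_time(history))  # prints 160.0, not the original 100.0
```

Under this model the original start timestamp is gone once the health check update arrives, so a UI that reads the first status cannot recover it.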


Re: Command healthcheck failed but status KILLED

2016-12-12 Thread Alex Rukletsov
Technically the task has not failed but was killed by the executor
(because it failed a health check).

On Fri, Dec 9, 2016 at 11:27 AM, Tomek Janiszewski 
wrote:

> Hi
>
> What is the desired behavior when a command health check fails? On Mesos
> 1.0.2, when a health check fails the task has state KILLED instead of FAILED
> with a reason specifying it was killed due to a failing health check.
>
> Thanks
> Tomek
>


Re: Adding a reload end-point to `network/cni` isolator

2016-12-11 Thread Alex Rukletsov
Whatever the solution turns out to be, it would be great to stay consistent.
It looks like updating configuration during the agent's lifetime is a typical
task, hence having a "standard approach" would be great. Agent whitelist and
ACLs come to mind.

On 8 Dec 2016 7:24 am, "Avinash Sridharan"  wrote:

> Valid point.
>
> Looks like this solution is turning out to be much cleaner than I
> expected :).
>
> Thanks Daniel, Qian and Vinod.
>
> This was helpful.
>
> On Wed, Dec 7, 2016 at 6:22 PM, Qian Zhang  wrote:
>
> > Why does "delete" need an agent restart? I think operators can just delete
> > the CNI network configuration file from "--network_cni_config_dir" at any
> > time they want, and later when a framework tries to launch a container on
> > that deleted CNI network, the CNI isolator will find that the network is in
> > its cache but not on disk, so it can fail the framework's request and remove
> > that CNI network from its cache. So it is a kind of lazy delete in the cache.
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Thu, Dec 8, 2016 at 8:12 AM, Avinash Sridharan  >
> > wrote:
> >
> > > On Wed, Dec 7, 2016 at 4:07 PM, Daniel Osborne  wrote:
> > >
> > > > For the record, we already support a). Qian explains it here:
> > > > https://issues.apache.org/jira/browse/MESOS-6567?focusedCommentId=15652501=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15652501
> > > >
> > > > You are correct. We don't store the config in-memory, just the `name`. So
> > > we will be reading the config every time we launch a new container. So it
> > > looks like "delete" is the only operation that will need an agent restart.
> > >
> > > >
> > > > On Wed, Dec 7, 2016 at 4:02 PM, Avinash Sridharan <
> > avin...@mesosphere.io
> > > >
> > > > wrote:
> > > >
> > > > > Thinking about the solution of treating the CNI config as an in-memory
> > > > > cache and doing disk reads on failures, I see two problems:
> > > > > a) We won't be able to support modifications to CNI networks, since
> > > > > modifications to existing networks won't generate a miss.
> > > > > b) We won't be able to support deletion of CNI networks.
> > > > >
> > > > > The two operations above will still need an agent restart.
> > > > >
> > > > > On Wed, Dec 7, 2016 at 3:40 PM, Avinash Sridharan <
> > > avin...@mesosphere.io
> > > > >
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Dec 7, 2016 at 3:31 PM, Avinash Sridharan <
> > > > avin...@mesosphere.io
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >>
> > > > > >>
> > > > > >> On Wed, Dec 7, 2016 at 3:17 PM, Daniel Osborne 
> > > wrote:
> > > > > >>
> > > > > >>> Chiming in since I raised an identical issue a few weeks back:
> > > > > >>> https://issues.apache.org/jira/browse/MESOS-6567
> > > > > >>>
> > > > > > >>> The proposed endpoint solution sounds plausible. However I'd like to
> > > > > > >>> explore if it solves the use case I raised my issue for. I was trying to
> > > > > > >>> create a Mesos framework that adds new CNI networks. But [IIRC] the Agent
> > > > > > >>> API can't be reached from a Mesos Executor instance since the Agent could
> > > > > > >>> be listening on a non-default port, or on any of its IPs. The executor
> > > > > > >>> instance doesn't know that information, so after it installs the plugin, it
> > > > > > >>> won't know how to reach that new reload endpoint.
> > > > > >>>
> > > > > >>
> > > > > > >> Just trying to understand the problem you are alluding to here. The
> > > > > > >> executor needs to register with the agent in order to launch the container,
> > > > > > >> so it should have reachability to the agent, and hence the endpoint?
> > > > > >>
> > > > > >>
> > > > > > >>> - Is there a reliable way to reach the reload endpoint from a default
> > > > > > >>> executor instance?
> > > > > > >>> - Why not scan the config directory every time? Are you trying to avoid the
> > > > > > >>> speed hit from disk reads?
> > > > > >>>
> > > > > > >> By "scan the config directory every time", do you mean run a timer that
> > > > > > >> will periodically scan the config directory and keep updating the configs?
> > > > > > >> This is feasible. The only problem is that the point at which the operator
> > > > > > >> writes the config and the point at which the network will be available for
> > > > > > >> container launch will not be deterministic. The behavior would be much
> > > > > > >> cleaner if we can make it deterministic.
> > > > > >>
> > > > > >
> > > > > > > Daniel, ignore this comment. I think you were referring to using the disk
> > > > > > > as a cache as Vinod had pointed out. I misread your suggestion.
> > > > > >
> > > > > >> Best,
> > > > > >>> -Dan
> > > > > >>>
> > > > 

Re: Duplicate task IDs

2016-12-11 Thread Alex Rukletsov
I'm fine with prohibiting non-unique IDs, but why do you plan to keep the
most recent in case of a conflict? I'd expect any duplicate (that we can
find out) is rejected / killed / banned / unchurched.

On 9 Dec 2016 8:13 pm, "Joris Van Remoortere"  wrote:

> Hey Neil,
>
> I concur that using duplicate task IDs is bad practice and asking for
> trouble.
>
> Could you please clarify *why* you want to use a hashmap? Is your goal to
> remove duplicate task IDs or is this just a side-effect and you have a
> different reason (e.g. performance) for using a hashmap?
>
> I'm wondering why a multi-hashmap is not sufficient. This would be clear if
> you were explicitly *trying* to get rid of duplicates of course :-)
>
> Thanks,
> Joris
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Dec 9, 2016 at 7:08 AM, Neil Conway  wrote:
>
> > Folks,
> >
> > The master stores a cache of metadata about recently completed tasks;
> > for example, this information can be accessed via the "/tasks" HTTP
> > endpoint or the "GET_TASKS" call in the new Operator API.
> >
> > The master currently stores this metadata using a list; this means
> > that duplicate task IDs are permitted. We're considering [1] changing
> > this to use a hashmap instead. Using a hashmap would mean that
> > duplicate task IDs would be discarded: if two completed tasks have the
> > same task ID, only the metadata for the most recently completed task
> > would be retained by the master.
> >
> > If this behavior change would cause problems for your framework or
> > other software that relies on Mesos, please let me know.
> >
> > (Note that if you do have two completed tasks with the same ID, you'd
> > need an unambiguous way to tell them apart. As a recommendation, I
> > would strongly encourage framework authors to never reuse task IDs.)
> >
> > Neil
> >
> > [1] https://reviews.apache.org/r/54179/
> >
>
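
The behavior change under discussion can be illustrated with plain containers. This is a sketch with hypothetical helper names, not the master's data structures: a list keeps both entries for a reused task ID, a map keeps only the most recent one, and a third variant shows the stricter alternative of rejecting duplicates.

```python
def record_in_list(completed, task_id, metadata):
    """List storage: duplicate task IDs are all retained."""
    completed.append((task_id, metadata))


def record_in_map(completed, task_id, metadata):
    """Map storage: a later task with the same ID overwrites the earlier one."""
    completed[task_id] = metadata


def record_unique(completed, task_id, metadata):
    """Stricter alternative: refuse a task ID that was already recorded."""
    if task_id in completed:
        raise ValueError("duplicate task ID: " + task_id)
    completed[task_id] = metadata


as_list, as_map = [], {}
for task_id, metadata in [("task-1", {"finished": 10}), ("task-1", {"finished": 20})]:
    record_in_list(as_list, task_id, metadata)
    record_in_map(as_map, task_id, metadata)

print(len(as_list), len(as_map))  # prints: 2 1
print(as_map["task-1"])           # prints: {'finished': 20}
```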


Re: Quota

2016-12-11 Thread Alex Rukletsov
Granularity in the allocator is a single agent. Hence even though you set
quota for 0.0001 CPU, at least one agent is "blocked". This is probably the
reason why Marathon is not getting offers. You can turn on verbose master
logs and check allocator messages to confirm.

Alex.

On 10 Dec 2016 2:14 am, "Vijay"  wrote:

> The dispatcher needs 1cpu and 1G memory.
>
> Regards,
> Vijay
>
> Sent from my iPhone
>
> > On Dec 9, 2016, at 4:51 PM, Vinod Kone  wrote:
> >
> > And how many resources does spark need?
> >
> >> On Fri, Dec 9, 2016 at 4:05 PM, Vijay Srinivasaraghavan <
> vijikar...@yahoo.com> wrote:
> >> Here is the slave state info. I see marathon is registered as "slave_public"
> >> role and is configured with "default_accepted_resource_roles" as "*"
> >>
> >> "slaves":[
> >>   {
> >>  "id":"69356344-e2c4-453d-baaf-22df4a4cc430-S0",
> >>  "pid":"slave(1)@xxx.xxx.xxx.100:5051",
> >>  "hostname":"xxx.xxx.xxx.100",
> >>  "registered_time":1481267726.19244,
> >>  "resources":{
> >> "disk":12099.0,
> >> "mem":14863.0,
> >> "gpus":0.0,
> >> "cpus":4.0,
> >> "ports":"[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
> >>  },
> >>  "used_resources":{
> >> "disk":0.0,
> >> "mem":0.0,
> >> "gpus":0.0,
> >> "cpus":0.0
> >>  },
> >>  "offered_resources":{
> >> "disk":0.0,
> >> "mem":0.0,
> >> "gpus":0.0,
> >> "cpus":0.0
> >>  },
> >>  "reserved_resources":{
> >>
> >>  },
> >>  "unreserved_resources":{
> >> "disk":12099.0,
> >> "mem":14863.0,
> >> "gpus":0.0,
> >> "cpus":4.0,
> >> "ports":"[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
> >>  },
> >>  "attributes":{
> >>
> >>  },
> >>  "active":true,
> >>  "version":"1.0.1"
> >>   }
> >>],
> >>
> >> Regards
> >> Vijay
> >> On Friday, December 9, 2016 3:48 PM, Vinod Kone 
> wrote:
> >>
> >>
> >> How many resources does the agent register with the master? How many
> >> resources does spark task need?
> >>
> >> I'm guessing marathon is not registered with "test" role so it is only
> >> getting un-reserved resources which are not enough for spark task?
> >>
> >> On Fri, Dec 9, 2016 at 2:54 PM, Vijay Srinivasaraghavan <
> vijikar...@yahoo.com> wrote:
> >> I have a standalone DCOS setup (Single node Vagrant VM running DCOS
> >> v.1.9-dev build + Mesos 1.0.1 + Marathon 1.3.0). Both master and agent are
> >> running on same VM.
> >>
> >> Resource: 4 CPU, 16GB Memory, 20G Disk
> >>
> >> I have created a quota using new V1 API which creates a role "test" with
> >> resource constraints of 0.5 CPU and 1G Memory.
> >>
> >> When I try to deploy Spark package, Marathon receives the request but the
> >> task is in "waiting" state since it did not receive any offers from Master
> >> though I don't see any resource constraints from the hardware perspective.
> >>
> >> However, when I deleted the quota, Marathon is able to move forward with
> >> the deployment and Spark was deployed/up and running. I could see from the
> >> Mesos master logs that it had sent an offer to the Marathon framework.
> >>
> >> To debug the issue, I was trying to create a quota but this time did not
> >> provide any CPU and Memory (0 cpu and 0 mem). After this, when I try to
> >> deploy Spark from DCOS UI, I could see Marathon getting offer from Master
> >> and able to deploy Spark without the need to delete the quota this time.
> >>
> >> Did anyone notice similar behavior?
> >>
> >> Regards
> >> Vijay
> >>
> >>
> >>
> >
>
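
The agent-granularity explanation at the top of this thread can be made concrete with a deliberately simplified toy model. It is not the Mesos allocator: `agents_offerable_to_other_roles` is a hypothetical function, only CPUs are considered, and the rule "hold back a whole agent whenever offering it would leave too little for unsatisfied quota" is an assumption chosen to mirror the description.

```python
def agents_offerable_to_other_roles(agent_cpus, unsatisfied_quota_cpus):
    """Toy model: each agent is either offered whole or held back whole.

    agent_cpus maps an agent name to its free CPUs. An agent is offered to
    roles without quota only if the CPUs left on the remaining agents still
    cover the unsatisfied quota.
    """
    remaining = sum(agent_cpus.values())
    offerable = []
    for agent, cpus in agent_cpus.items():
        if remaining - cpus >= unsatisfied_quota_cpus:
            offerable.append(agent)
            remaining -= cpus
        # Otherwise the whole agent is held back for the quota role.
    return offerable


# Single 4-CPU agent, quota of 0.5 CPU: the only agent is held back.
print(agents_offerable_to_other_roles({"S0": 4.0}, 0.5))  # prints: []
# Same agent, quota of 0 CPU: nothing needs to be held back.
print(agents_offerable_to_other_roles({"S0": 4.0}, 0.0))  # prints: ['S0']
```

Under these assumptions the model reproduces both observations in the thread: no offers reach Marathon with a 0.5 CPU quota on a single-agent cluster, and offers resume when the quota is set to zero.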


Re: [GitHub] mesos issue #190: Ensure curl is present on Ubuntu

2016-11-30 Thread Alex Rukletsov
Curl is also a prerequisite for mesos-native HTTP health checks (from Mesos
1.2). We will remove it eventually but likely not in the near future.

On 29 Nov 2016 19:08, "jieyu"  wrote:

> Github user jieyu commented on the issue:
>
> https://github.com/apache/mesos/pull/190
>
> Yeah, curl is currently a dependency if people want to use container
> image support in the unified containerizer. There is a plan to remove this
> dependency.
>
>
> ---
> If your project is set up for it, you can reply to this email and have your
> reply appear on GitHub as well. If your project does not have this feature
> enabled and wishes so, or if the feature is enabled but not working, please
> contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
> with INFRA.
> ---
>


Re: [VOTE] Release Apache Mesos 0.28.3 (rc1)

2016-11-30 Thread Alex Rukletsov
Joseph—

Thank you for investigating. I'm
+1 (binding)
make check passes on CentOS 7, Fedora 23, Ubuntu 14, 15 modulo known flaky
tests, including LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystem.

On 29 Nov 2016 21:21, "Joseph Wu" <jos...@mesosphere.io> wrote:

> AlexR,
>
> Thanks for pointing out those test failures.  As of 0.28, the
> LinuxFilesystemIsolatorTests were notoriously flaky on distributions with
> "large" root filesystems.  The test would essentially copy the root
> filesystem, leading to timeouts in multiple places in the tests.  CentOS 7
> was known to have at least twice as much stuff to copy compared to the
> other distributions (not sure about Fedora 23).
>
> Looking at your logs (and logs you didn't attach), we see that a couple of
> the tests that exercise the same code path did in fact pass, while others
> timed out.  I wouldn't consider that a regression.
>
> On Mon, Nov 28, 2016 at 12:54 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> +1 (binding)
>>
>> Tested on ASF CI.
>>
>>
>> *Revision*: 52a0b0a41482da35dc736ec2fd445b6099e7a4e7
>>
>>- refs/tags/0.28.3-rc1
>>
>> Configuration Matrix gcc clang
>> centos:7 --verbose --enable-libevent --enable-ssl autotools
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Not run]
>> cmake
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Not run]
>> --verbose autotools
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Not run]
>> cmake
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Not run]
>> ubuntu:14.04 --verbose --enable-libevent --enable-ssl autotools
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> cmake
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> --verbose autotools
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> cmake
>> [image: Success]
>> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/25/BUILDTOOL=cmake,COMPILER=gcc,CONFIG

Re: [VOTE] Release Apache Mesos 0.28.3 (rc1)

2016-11-28 Thread Alex Rukletsov
I see LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystem failing on
CentOS 7 and Fedora 23, see e.g., [1]. I don't see any backports touching
[2]. Could it be a regression, or is this test known to be problematic in
0.28.x?

[1] http://pastebin.com/c5PzfGF8
[2]
https://github.com/apache/mesos/blob/0.28.x/src/tests/containerizer/filesystem_isolator_tests.cpp

On Thu, Nov 24, 2016 at 12:07 AM, Anand Mazumdar  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 0.28.3.
>
>
> 0.28.3 includes the following:
> 
> 
>
> ** Bug
>   * [MESOS-2043] - Framework auth fail with timeout error and never
> get authenticated
>   * [MESOS-4638] - Versioning preprocessor macros.
>   * [MESOS-5073] - Mesos allocator leaks role sorter and quota role
> sorters.
>   * [MESOS-5330] - Agent should backoff before connecting to the master.
>   * [MESOS-5390] - v1 Executor Protos not included in maven jar
>   * [MESOS-5543] - /dev/fd is missing in the Mesos containerizer
> environment.
>   * [MESOS-5571] - Scheduler JNI throws exception when the major
> versions of JAR and libmesos don't match.
>   * [MESOS-5576] - Masters may drop the first message they send
> between masters after a network partition.
>   * [MESOS-5673] - Port mapping isolator may cause segfault if it bind
> mount root does not exist.
>   * [MESOS-5691] - SSL downgrade support will leak sockets in CLOSE_WAIT
> status.
>   * [MESOS-5698] - Quota sorter not updated for resource changes at agent.
>   * [MESOS-5723] - SSL-enabled libprocess will leak incoming links to
> forks.
>   * [MESOS-5740] - Consider adding `relink` functionality to libprocess.
>   * [MESOS-5748] - Potential segfault in `link` when linking to a
> remote process.
>   * [MESOS-5763] - Task stuck in fetching is not cleaned up after
> --executor_registration_timeout.
>   * [MESOS-5913] - Stale socket FD usage when using libevent + SSL.
>   * [MESOS-5927] - Unable to run "scratch" Dockerfiles with Unified
> Containerizer.
>   * [MESOS-5943] - Incremental http parsing of URLs leads to decoder error.
>   * [MESOS-5986] - SSL Socket CHECK can fail after socket receives EOF.
>   * [MESOS-6104] - Potential FD double close in libevent's
> implementation of `sendfile`.
>   * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
>   * [MESOS-6152] - Resource leak in libevent_ssl_socket.cpp.
>   * [MESOS-6233] - Master CHECK fails during recovery while relinking
> to other masters.
>   * [MESOS-6234] - Potential socket leak during Zookeeper network changes.
>   * [MESOS-6246] - Libprocess links will not generate an ExitedEvent
> if the socket creation fails.
>   * [MESOS-6299] - Master doesn't remove task from pending when it is
> invalid.
>   * [MESOS-6457] - Tasks shouldn't transition from TASK_KILLING to
> TASK_RUNNING.
>   * [MESOS-6502] - _version uses incorrect
> MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
>   * [MESOS-6527] - Memory leak in the libprocess request decoder.
>   * [MESOS-6621] - SSL downgrade path will CHECK-fail when using both
> temporary and persistent sockets
>
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.28.3-rc1
> 
> 
>
> The candidate for Mesos 0.28.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/mesos-0.28.3.tar.gz
>
> The tag to be voted on is 0.28.3-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.28.3-rc1
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/mesos-0.28.3.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/mesos-0.28.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1170
>
> Please vote on releasing this package as Apache Mesos 0.28.3!
>
> The vote is open until Sat Nov 26 14:59:10 PST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 0.28.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Anand & Joseph
>


Re: [3/3] mesos git commit: Enabled multiple field based authorization in the authorizer interface.

2016-11-23 Thread Alex Rukletsov
Fixed in
https://github.com/apache/mesos/commit/d2ab4b49d3cc0b86bacc5ec3400b46cfa70c3a7b

On Fri, Nov 18, 2016 at 4:48 AM, Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

> Hi,
>
> This introduces a possibly uninitialized member `weight_info` which
> Coverity immediately detected. I filed MESOS-6604 for that. Could you
> please take that on @Alexander?
>
>
> Cheers,
>
> Benjamin
>
> > On Nov 16, 2016, at 6:00 PM, m...@apache.org wrote:
> >
> > Enabled multiple field based authorization in the authorizer interface.
> >
> > Updates the authorizer interfaces as well as the local authorizer,
> > such that all actions which were limited to use a _role_ or a
> > _principal_ as an object, are able to use whole protobuf messages
> > as objects. This change enables more sophisticated authorization
> > mechanisms.
> >
> > Review: https://reviews.apache.org/r/52600/
> >
> >
> > Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> > Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/bc0e6d7b
> > Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/bc0e6d7b
> > Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/bc0e6d7b
> >
> > Branch: refs/heads/master
> > Commit: bc0e6d7b0b367e5ff67dd5f395e1e06938b02399
> > Parents: 40c2e5f
> > Author: Alexander Rojas 
> > Authored: Tue Nov 15 19:04:25 2016 -0800
> > Committer: Adam B 
> > Committed: Wed Nov 16 01:55:03 2016 -0800
> >
> > --
> > include/mesos/authorizer/authorizer.hpp   |   6 +-
> > include/mesos/authorizer/authorizer.proto |  54 
> > src/authorizer/local/authorizer.cpp   | 115
> +
> > 3 files changed, 157 insertions(+), 18 deletions(-)
> > --
> >
> >
> > http://git-wip-us.apache.org/repos/asf/mesos/blob/bc0e6d7b/include/mesos/authorizer/authorizer.hpp
> > --
> > diff --git a/include/mesos/authorizer/authorizer.hpp b/include/mesos/authorizer/authorizer.hpp
> > index cb365c7..7217600 100644
> > --- a/include/mesos/authorizer/authorizer.hpp
> > +++ b/include/mesos/authorizer/authorizer.hpp
> > @@ -61,7 +61,9 @@ public:
> > task_info(object.has_task_info() ? &object.task_info() : nullptr),
> > executor_info(
> > object.has_executor_info() ? &object.executor_info() : nullptr),
> > -quota_info(object.has_quota_info() ? &object.quota_info() : nullptr) {}
> > +quota_info(object.has_quota_info() ? &object.quota_info() : nullptr),
> > +weight_info(object.has_weight_info() ? &object.weight_info() : nullptr),
> > +resource(object.has_resource() ? &object.resource() : nullptr) {}
> >
> > const std::string* value;
> > const FrameworkInfo* framework_info;
> > @@ -69,6 +71,8 @@ public:
> > const TaskInfo* task_info;
> > const ExecutorInfo* executor_info;
> > const quota::QuotaInfo* quota_info;
> > +const WeightInfo* weight_info;
> > +const Resource* resource;
> >   };
> >
> >   /**
> >
> > http://git-wip-us.apache.org/repos/asf/mesos/blob/bc0e6d7b/include/mesos/authorizer/authorizer.proto
> > --
> > diff --git a/include/mesos/authorizer/authorizer.proto b/include/mesos/authorizer/authorizer.proto
> > index b6a9f14..0696a62 100644
> > --- a/include/mesos/authorizer/authorizer.proto
> > +++ b/include/mesos/authorizer/authorizer.proto
> > @@ -46,11 +46,17 @@ message Object {
> >   optional TaskInfo task_info = 4;
> >   optional ExecutorInfo executor_info = 5;
> >   optional quota.QuotaInfo quota_info = 6;
> > +  optional WeightInfo weight_info = 7;
> > +  optional Resource resource = 8;
> > }
> >
> >
> > // List of authorizable actions supported in Mesos.
> > +// NOTE: Values in this enum should be kept in
> > +// numerical order to prevent accidental aliasing.
> > enum Action {
> > +  option allow_alias = true;
> > +
> >   // This must be the first enum value in this list, to
> >   // ensure that if 'type' is not set, the default value
> >   // is UNKNOWN. This enables enum values to be added
> > @@ -58,19 +64,67 @@ enum Action {
> >   UNKNOWN = 0;
> >
> >   // Actions named *_WITH_foo may set a foo in `Object.value`.
> > +
> > +  // `REGISTER_FRAMEWORK` will have an object with `FrameworkInfo` set.
> > +  // The `_WITH_ROLE` alias is deprecated and will be removed after
> > +  // Mesos 1.2's deprecation cycle ends. The `value` field will continue
> > +  // to be set until that time.
> > +  REGISTER_FRAMEWORK = 1;
> >   REGISTER_FRAMEWORK_WITH_ROLE = 1;
> >
> >   // `RUN_TASK` will have an object with `FrameworkInfo` and `TaskInfo` set.
> >   RUN_TASK = 2;
> >
> > +  // `TEARDOWN_FRAMEWORK` will have an object with `FrameworkInfo` set.
> > +  // The `_WITH_PRINCIPAL` alias is deprecated and will be removed after
> > +  // Mesos 1.2's deprecation cycle ends. The `value` 

Re: [VOTE] Release Apache Mesos 1.0.2 (rc3)

2016-11-10 Thread Alex Rukletsov
+1 (binding)

Tested in internal CI.

On Mon, Nov 7, 2016 at 8:24 PM, Vinod Kone  wrote:

> Hi all,
>
>
> Please vote on releasing the following candidate as Apache Mesos 1.0.2.
>
>
> This is a bug fix release.
>
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.0.2-rc3
>
> 
> 
>
>
> The candidate for Mesos 1.0.2 release is available at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc3/mesos-1.0.2.tar.gz
>
>
> The tag to be voted on is 1.0.2-rc3:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.0.2-rc3
>
>
> The MD5 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc3/mesos-1.0.2.tar.gz.md5
>
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.0.2-rc3/mesos-1.0.2.tar.gz.asc
>
>
> The PGP key used to sign the release is here:
>
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
>
> The JAR is up in Maven in a staging repository here:
>
> https://repository.apache.org/content/repositories/orgapachemesos-1168
>
>
> Please vote on releasing this package as Apache Mesos 1.0.2!
>
>
> The vote is open until Thu Nov 10 11:22:30 PST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
>
> [ ] +1 Release this package as Apache Mesos 1.0.2
>
> [ ] -1 Do not release this package because ...
>
>
> Thanks,
>


Re: [VOTE] Release Apache Mesos 1.1.0 (rc3)

2016-11-10 Thread Alex Rukletsov
+1 (binding)

make check locally + manually tested health checks with the locally
modified mesos-execute.

On Wed, Nov 9, 2016 at 6:13 PM, Zhitao Li  wrote:

> +1 (non-binding)
>
> Tested with ROOT and docker on debian jessie.
>
> On Mon, Nov 7, 2016 at 2:19 PM, Vinod Kone  wrote:
>
> > +1 (binding)
> >
> > Tested on ASF CI.
> >
> > *Revision*: a44b077ea0df54b77f05550979e1e97f39b15873
> >
> >- refs/tags/1.1.0-rc3
> >
> > Configuration Matrix (build status per compiler):
> >
> > OS            Configuration                             Build tool  gcc      clang
> > centos:7      --verbose --enable-libevent --enable-ssl  autotools   Success  Not run
> > centos:7      --verbose --enable-libevent --enable-ssl  cmake       Success  Not run
> > centos:7      --verbose                                 autotools   Success  Not run
> > centos:7      --verbose                                 cmake       Success  Not run
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools   Success  Success
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake       Success  Success
> > ubuntu:14.04  --verbose                                 autotools   Success  Success
> > ubuntu:14.04  --verbose                                 cmake       Success  Success
> >
> > On Mon, Nov 7, 2016 at 7:49 AM, Evers Benno 
> wrote:
> >
> > > +1 (non-binding)
> > >
> > > Built and installed on Ubuntu 10.04 + 14.04
> > >
> > > Configured with --disable-java --enable-python --disable-bundled-pip
> > > --disable-python-dependency-install
> > >
> > > On 04.11.2016 14:15, Till Toenshoff wrote:
> > > > Hi all,
> > > >
> > > > Please vote on releasing the following candidate as Apache Mesos 1.1.0.
> > > >
> > > >
> > > > 1.1.0 includes the following:
> > > > 
> > > 
> > > >   * [MESOS-2449] - **Experimental** support for launching a group of
> > > tasks
> > > > via a new `LAUNCH_GROUP` Offer 

Re: [2/2] mesos git commit: Added MESOS-6142 to CHANGELOG for 1.1.1.

2016-11-10 Thread Alex Rukletsov
If I read https://github.com/apache/mesos/blob/master/docs/release-guide.md
correctly, that is actually what we do: create 1.1.x branch before 1.1.0.
If the vote does not pass, we cherry-pick and fix in that branch.

With this commit I haven't created any branches, *just* a placeholder for a
future changelog. Does this make sense?

On Thu, Nov 10, 2016 at 2:19 AM, Benjamin Mahler  wrote:

> We wouldn't normally create the 1.1.x branch before 1.1.0 is released. If
> 1.1.0 doesn't go through, this branch needs to be rebased and force pushed?
>
> On Tue, Nov 8, 2016 at 1:56 PM,  wrote:
>
>> Added MESOS-6142 to CHANGELOG for 1.1.1.
>>
>>
>> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/550d13c3
>> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/550d13c3
>> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/550d13c3
>>
>> Branch: refs/heads/1.1.x
>> Commit: 550d13c3dffe4d3b86c74ca40fb539796cefd848
>> Parents: 0023d38
>> Author: Alexander Rukletsov 
>> Authored: Tue Nov 8 22:57:24 2016 +0100
>> Committer: Alexander Rukletsov 
>> Committed: Tue Nov 8 22:57:24 2016 +0100
>>
>> --
>>  CHANGELOG | 4 
>>  1 file changed, 4 insertions(+)
>> --
>>
>>
>> http://git-wip-us.apache.org/repos/asf/mesos/blob/550d13c3/CHANGELOG
>> --
>> diff --git a/CHANGELOG b/CHANGELOG
>> index a305da1..28bc827 100644
>> --- a/CHANGELOG
>> +++ b/CHANGELOG
>> @@ -2,6 +2,10 @@ Release Notes - Mesos - Version 1.1.1 (WIP)
>>  ---
>>  * This is a bug fix release.
>>
>> +All Issues:
>> +** Bug
>> +  * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
>> +
>>
>>  Release Notes - Mesos - Version 1.1.0
>>  -
>>
>>
>


Re: Test failures in Apache Jenkins

2016-11-04 Thread Alex Rukletsov
Yes, those were direct links. The source of *some* failures is probably VM
lags, which were reported in
https://issues.apache.org/jira/browse/INFRA-12852. I suggest we wait for a
resolution from Infra and see whether it helps and to what extent.

On Fri, Nov 4, 2016 at 12:42 AM, Benjamin Mahler <bmah...@apache.org> wrote:

> Hm.. these links are all broken, were you linking to jenkins logs directly?
> They get garbage collected rather quickly.
>
> On Mon, Oct 31, 2016 at 1:47 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
> > Folks,
> >
> > I observe a lot of flaky tests in Apache Jenkins. They seem rather random
> > and not tied to particular machines (saw failures on H1 and on H2).
> > Moreover, there are no tickets for them and I haven't seen any of those
> > failures in our internal CI.
> >
> > Does anyone have an idea about any recent changes in test harness,
> > libprocess or whatever that could lead to this? It's probably not related
> > to MESOS-6180 <https://issues.apache.org/jira/browse/MESOS-6180>,
> because
> > not all failures are future timeout induced.
> >
> > For example, in the last day I saw these guys failing:
> > ReconciliationTest.RecoveredAgent [1]
> > MasterTest.TaskLabels [2]
> > RoleTest.ImplicitRoleRegister [3]
> > ReconciliationTest.ImplicitTerminalTask [4]
> > ReservationTest.BadACLDropReserve [5]
> > ReservationTest.CompatibleCheckpointedResources [6]
> > ContentType/SchedulerHttpApiTest.Subscribe/0 [7]
> >
> > [1] https://goo.gl/cs88BD
> > [2] https://goo.gl/gTzKUV
> > [3] https://goo.gl/7pGaQG
> > [4] https://goo.gl/ccq38D
> > [5] https://goo.gl/0R1eOO
> > [6] https://goo.gl/xKQzUt
> > [7] https://goo.gl/HZmiGJ
> >
>


Re: On increasing visibility into experimental features.

2016-11-03 Thread Alex Rukletsov
Every experimental feature graduating from experimental should be
explicitly called out at the top of the log. We probably haven't been
consistent in the past, but it should be easier for a release manager to
remember when adjusting the list of still experimental features.

On Wed, Nov 2, 2016 at 7:06 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> (speaking from both a contributor and user perspective)
>
> Definitely +1 for improving visibility of experimental features, and I think
> the proposal is definitely helpful for people to read it.
>
> In terms of expectation management, one thing from the user perspective is
> when an experimental feature will graduate into stable, because a
> responsible user might hold-off actual (or full) adoption until that
> happens, and they need to plan accordingly. Having a way to track what items
> the owner considers necessary for a feature to become stable is quite
> important in such a case (not everyone has the luxury to comb through JIRA
> boards for context, or talk to the direct owner of the feature).
>
> On Tue, Nov 1, 2016 at 5:28 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
> > Folks,
> >
> > In addition to the "known bugs" proposal in a parallel thread, we think
> > that maintaining a list of still-experimental features for each minor
> > release will significantly help users adjust their expectations.
> >
> > Our suggestion is to include a new section into the CHANGELOG called
> > "Experimental Features" starting with the upcoming 1.1.0 release.
> > Populating this section should be relatively easy: take the contents of
> > this section from the previous minor release, remove features declared
> > stable, and add new experimental features.
> >
> > With this change users will have a complete overview of experimental
> > functionality per release, without searching the CHANGELOG for when and
> > whether a certain feature became production-ready.
> >
> > What do you think?
> >
> > AlexR.
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>


On increasing visibility into experimental features.

2016-11-01 Thread Alex Rukletsov
Folks,

In addition to the "known bugs" proposal in a parallel thread, we think
that maintaining a list of still-experimental features for each minor
release will significantly help users adjust their expectations.

Our suggestion is to include a new section into the CHANGELOG called
"Experimental Features" starting with the upcoming 1.1.0 release.
Populating this section should be relatively easy: take the contents of
this section from the previous minor release, remove features declared
stable, and add new experimental features.

With this change users will have a complete overview of experimental
functionality per release, without searching the CHANGELOG for when and
whether a certain feature became production-ready.
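
The bookkeeping described above amounts to simple set arithmetic. A minimal
sketch in Python, assuming features are tracked as plain strings; the function
name and the feature names are made up for illustration and are not taken from
any actual CHANGELOG:

```python
def next_experimental(previous, graduated, added):
    """Return the sorted "Experimental Features" list for the next minor release."""
    # Start from the previous release's list, drop features declared stable,
    # then add features that are newly experimental in this release.
    return sorted((set(previous) - set(graduated)) | set(added))


# Hypothetical feature names, for illustration only.
previous = ["feature-a", "feature-b"]
graduated = ["feature-a"]
added = ["feature-c"]

print(next_experimental(previous, graduated, added))
```

A release manager would only need the previous release's section plus the two
deltas; nothing else in the CHANGELOG has to be searched.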

What do you think?

AlexR.


On increasing visibility into known bugs and issues.

2016-11-01 Thread Alex Rukletsov
Folks,

There have been several suggestions recently about how we can help people
understand which known issues are in certain Mesos releases (thanks Joris
for kicking this off!). With a time-based release strategy we may not have
enough time to fix all the known issues prior to cutting an RC. However, we
definitely want to be honest and tell users about those issues before they
run into something obscure and start searching for this on JIRA.

Hence, we suggest including a new section in the CHANGELOG called "Known
Issues", starting with the upcoming 1.1.0 release.

From now on, release managers will have to find and triage all blocker and
critical bugs and call them out in the CHANGELOG; a JIRA query which may be
helpful: `project = Mesos AND type = bug AND status != Resolved AND
priority IN (blocker, critical)`.
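
To make the query easy to share, it can be turned into a link. A small sketch
in Python, assuming the ASF JIRA issue navigator accepts the query through a
`jql` URL parameter under `/issues/`; the helper name `search_url` is
hypothetical:

```python
from urllib.parse import urlencode

# Base URL of the ASF JIRA instance, as used by the issue links in this thread.
JIRA_BASE = "https://issues.apache.org/jira"

# The query from the paragraph above, verbatim.
JQL = ("project = Mesos AND type = bug AND status != Resolved "
       "AND priority IN (blocker, critical)")


def search_url(jql, base=JIRA_BASE):
    """Build a link that opens the issue navigator with the given JQL query."""
    # urlencode percent-escapes the spaces, '=', '!' and parentheses in the query.
    return base + "/issues/?" + urlencode({"jql": jql})


print(search_url(JQL))
```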

To help release managers, we would like to encourage all committers,
shepherds, and seasoned contributors to classify and set target version
properly on every bug with critical or blocker priority they open.

Right now there are 33 such bugs, with the oldest one filed in April 2014.
Shepherds, please look into your issues from this list and
classify/close/re-target them accordingly.

AlexR.


Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose --enable-libevent --enable-ssl,GLOG_v=1 MESOS_VERBOSE=1,centos:7,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6) #2852

2016-11-01 Thread Alex Rukletsov
Filed https://issues.apache.org/jira/browse/INFRA-12852

On Mon, Oct 31, 2016 at 4:26 PM, Neil Conway  wrote:

> I spent a little while looking into this. The
> "PersistentVolumeEndpointsTest.OfferCreateThenEndpointRemove" test
> fails on the following expectations:
>
> https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1562
> https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1564
> https://github.com/apache/mesos/blob/1e57459b7d3f571bdf18fec29b070e78ce719319/src/tests/persistent_volume_endpoints_tests.cpp#L1573
>
> Which all seem quite innocent: similar or identical preamble code
> occurs in many test cases. Looking at the log, it seems the scheduler
> begins the authentication process but authentication times out:
>
> 12:27:56.527899 31618 sched.cpp:226] Version: 1.2.0
> 12:27:56.528548 31638 sched.cpp:330] New master detected at
> master@172.17.0.2:48653
> 12:27:56.528661 31638 sched.cpp:396] Authenticating with master
> master@172.17.0.2:48653
> 12:27:56.528681 31638 sched.cpp:403] Using default CRAM-MD5 authenticatee
> 12:28:01.529717 31637 sched.cpp:526] Authentication timed out
> 12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
> master master@172.17.0.2:48653: Authentication discarded
>
> In the scheduler driver, we fail the "authenticating" future at
> 12:28:01, but it is ~9 seconds before the associated `onAny` callback
> is invoked to schedule retrying authentication; by the time the retry
> backoff timeout expires, we've exceeded the 15 second Future timeout
> in the test case.
>
> Note that between 12:28:01.5 and 12:28:10.8, there is essentially
> nothing happening:
>
> W1031 12:28:01.529717 31637 sched.cpp:526] Authentication timed out
> W1031 12:28:01.529752 31645 master.cpp:6789] Authentication timed out
> I1031 12:28:10.794798 31652 status_update_manager.cpp:203] Recovering
> status update manager
> W1031 12:28:10.795033 31645 master.cpp:6769] Failed to authenticate
> scheduler-877be3e9-9dc1-4de1-bf3e-a19b77b3d124@172.17.0.2:48653:
> Authentication discarded
> I1031 12:28:10.794939 31647 authenticator.cpp:432] Authentication
> session cleanup for crammd5-authenticatee(655)@172.17.0.2:48653
> I1031 12:28:10.795253 31637 sched.cpp:466] Failed to authenticate with
> master master@172.17.0.2:48653: Authentication discarded
>
> So I think the most likely culprit is VM lag.
>
> We can try to work around this by increasing some of the timeouts for
> the test expectation futures, but of course that is just a kludge: if
> we're going to experience random ~9.5 second VM-wide pauses, the tests
> are likely to continue to be flaky unless we make more widespread
> changes (e.g., increasing ALL expectation futures timeouts).
>
> Neil
>
>
> On Mon, Oct 31, 2016 at 8:34 AM, Apache Jenkins Server
>  wrote:
> > See  COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%
> 20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=
> centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(
> !ubuntu-6)/2852/changes>
> >
> > Changes:
> >
> > [alexr] Updated the stale comment in agent flags.
> >
> > --
> > [...truncated 219320 lines...]
> > W1031 12:32:10.921492 31618 backend.cpp:76] Failed to create 'aufs'
> backend: AufsBackend requires root privileges, but is running as user mesos
> > W1031 12:32:10.921664 31618 backend.cpp:76] Failed to create 'bind'
> backend: BindBackend requires root privileges
> > I1031 12:32:10.925060 31647 slave.cpp:208] Mesos agent started on (635)@
> 172.17.0.2:48653
> > I1031 12:32:10.925091 31647 slave.cpp:209] Flags at startup: --acls=""
> --appc_simple_discovery_uri_prefix="http://"
> --appc_store_dir="/tmp/mesos/store/appc"
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true"
> --authenticatee="crammd5" --authentication_backoff_factor="1secs"
> --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos" 
> --container_disk_watch_interval="15secs"
> --containerizers="mesos" --credential="/tmp/Endpoint_SlaveEndpointTest_
> AuthorizedRequest_1_j6HfxC/credential" --default_role="*"
> --disk_watch_interval="1mins" --docker="docker"
> --docker_kill_orphans="true" --docker_registry="https://
> registry-1.docker.io" --docker_remove_delay="6hrs"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
> --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_
> dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs"
> 

Transition TASK_KILLING -> TASK_RUNNING

2016-10-31 Thread Alex Rukletsov
We've recently discovered a bug that may lead to a task being transitioned
from killing to running state. More information about it in MESOS-6457 [1].
We plan to fix it in 1.2.0 and will backport it to all supported versions.

[1] https://issues.apache.org/jira/browse/MESOS-6457


Test failures in Apache Jenkins

2016-10-31 Thread Alex Rukletsov
Folks,

I observe a lot of flaky tests in Apache Jenkins. They seem rather random
and not tied to particular machines (saw failures on H1 and on H2).
Moreover, there are no tickets for them and I haven't seen any of those
failures in our internal CI.

Does anyone have an idea about any recent changes in test harness,
libprocess or whatever that could lead to this? It's probably not related
to MESOS-6180, because
not all failures are future timeout induced.

For example, in the last day I saw these guys failing:
ReconciliationTest.RecoveredAgent [1]
MasterTest.TaskLabels [2]
RoleTest.ImplicitRoleRegister [3]
ReconciliationTest.ImplicitTerminalTask [4]
ReservationTest.BadACLDropReserve [5]
ReservationTest.CompatibleCheckpointedResources [6]
ContentType/SchedulerHttpApiTest.Subscribe/0 [7]

[1] https://goo.gl/cs88BD
[2] https://goo.gl/gTzKUV
[3] https://goo.gl/7pGaQG
[4] https://goo.gl/ccq38D
[5] https://goo.gl/0R1eOO
[6] https://goo.gl/xKQzUt
[7] https://goo.gl/HZmiGJ


Re: [VOTE] Release Apache Mesos 1.1.0 (rc1)

2016-10-25 Thread Alex Rukletsov
This vote is cancelled. We'll cut RC2 later this week after the blockers
are resolved.

On Tue, Oct 25, 2016 at 5:48 AM, Zameer Manji  wrote:

> I'm going to -1 (non binding) for the same reason as David Robinson.
>
> I would classify the FD leak as serious and a violation of the isolation
> that the agent provides.
>
> It should be back ported to 1.1.0 just like how it was backported to 1.0.2
>
> On Mon, Oct 24, 2016 at 5:37 PM, David Robinson 
> wrote:
>
>> -1
>>
>> Can the fix for MESOS-6420 be backported? The Mesos agent leaks sockets
>> when the port mapping network isolator is enabled, the leaked sockets are
>> passed to the executor (the close-on-exec flag is not set) and that can
>> cause problems for certain frameworks. The Aurora executor uses Kazoo (the
>> python ZooKeeper library) for service announcement, Kazoo uses Python's
>> select() call for polling its file descriptors and Python's select() chokes
>> when there's > 1024 file descriptors. The end result for Aurora is that
>> after an agent runs > 1024 tasks any new tasks will fail to announce (will
>> not be registered in ZooKeeper) and will therefore be unknown to other
>> services.
>>
>> On Tue, Oct 18, 2016 at 1:01 PM, Till Toenshoff  wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.1.0.
>>>
>>>
>>> 1.1.0 includes the following:
>>> 
>>> 
>>>   * [MESOS-2449] - **Experimental** support for launching a group of
>>> tasks
>>> via a new `LAUNCH_GROUP` Offer operation. Mesos will guarantee that
>>> either
>>> all tasks or none of the tasks in the group are delivered to the
>>> executor.
>>> Executors receive the task group via a new `LAUNCH_GROUP` event.
>>>
>>>   * [MESOS-2533] - **Experimental** support for HTTP and HTTPS health
>>> checks.
>>> Executors may now use the updated `HealthCheck` protobuf to implement
>>> HTTP(S) health checks. Both default executors (command and docker)
>>> leverage
>>> `curl` binary for sending HTTP(S) requests and connect to
>>> `127.0.0.1`,
>>> hence a task must listen on all interfaces. On Linux, For BRIDGE and
>>> USER
>>> modes, docker executor enters the task's network namespace.
>>>
>>>   * [MESOS-3421] - **Experimental** Support sharing of resources across
>>> containers. Currently persistent volumes are the only resources
>>> allowed to
>>> be shared.
>>>
>>>   * [MESOS-3567] - **Experimental** support for TCP health checks.
>>> Executors
>>> may now use the updated `HealthCheck` protobuf to implement TCP
>>> health
>>> checks. Both default executors (command and docker) connect to
>>> `127.0.0.1`,
>>> hence a task must listen on all interfaces. On Linux, For BRIDGE and
>>> USER
>>> modes, docker executor enters the task's network namespace.
>>>
>>>   * [MESOS-4324] - Allow access to persistent volumes as read-only or
>>> read-write
>>> by tasks. Mesos doesn't allow persistent volumes to be created as
>>> read-only
>>> but in 1.1 it starts allow tasks to use the volumes as read-only.
>>> This is
>>> mainly motivated by shared persistent volumes but applies to regular
>>> persistent volumes as well.
>>>
>>>   * [MESOS-5275] - **Experimental** support for linux capabilities.
>>> Frameworks
>>> or operators now have fine-grained control over the capabilities
>>> that a
>>> container may have. This allows a container to run as root, but not
>>> have all
>>> the privileges associated with the root user (e.g., CAP_SYS_ADMIN).
>>>
>>>   * [MESOS-5344] -- **Experimental** support for partition-aware Mesos
>>> frameworks. In previous Mesos releases, when an agent is partitioned
>>> from
>>> the master and then reregisters with the cluster, all tasks running
>>> on the
>>> agent are terminated and the agent is shutdown. In Mesos 1.1,
>>> partitioned
>>> agents will no longer be shutdown when they reregister with the
>>> master. By
>>> default, tasks running on such agents will still be killed (for
>>> backward
>>> compatibility); however, frameworks can opt-in to the new
>>> PARTITION_AWARE
>>> capability. If they do this, their tasks will not be killed when a
>>> partition
>>> is healed. This allows frameworks to define their own policies for
>>> how to
>>> handle partitioned tasks. Enabling the PARTITION_AWARE capability
>>> also
>>> introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED,
>>> TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states
>>> are
>>> intended to eventually replace the TASK_LOST state.
>>>
>>>   * [MESOS-6077] - **Experimental** A new default executor is introduced
>>> which
>>> frameworks can use to launch task groups as nested containers. All
>>> the
>>> nested containers share resources likes cpu, memory, network and
>>> volumes.

Re: Parallel test runner added

2016-10-13 Thread Alex Rukletsov
This is great, Benjamin!

I've used it the whole day today and it is awesome. (It will become
insanely great once MESOS-6387 is resolved.)

Thanks to everyone who made this happen, also on behalf of my employer : )

Alex.

On Thu, Oct 13, 2016 at 11:28 PM, Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

>
> Hi,
>
> Since most tests in the Mesos, libprocess, and stout test suites can
> be executed in parallel (the exception being some `ROOT` tests with
> global side effects in Mesos), we recently added a parallel test
> runner `support/mesos-gtest-runner.py`. This has the potential to
> significantly speed up test suite runs.
>
> To enable automatic parallel execution of tests for test targets
> executed during `make check`, configure Mesos with the option
> `--enable-parallel-test-execution`. This will configure the test runner
> to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
> be run in a separate, sequential step.
>
> * * *
>
> We use the environment variable `TEST_DRIVER` to drive parallel test
> execution. By setting this variable to an empty string you can
> temporarily disable configured parallel execution, e.g.,
>
> % make check TEST_DRIVER=
>
> By setting this environment variable you have control over the test
> runner itself and its arguments, even without enabling parallel test
> execution at `./configure` time. Be aware that many `ROOT` tests cannot be
> run in parallel.
>
>
> The current settings oversubscribe the machine by running `#cores*1.5`
> parallel jobs. This was driven by the observation that currently our
> tests by and large do not make extended use of even a single core.
> The number of parallel jobs can be controlled with the `-j` flag of
> the test runner.
>
> Since making more use of the machine will likely increase machine load
> during test execution, running tests in parallel might expose test
> flakiness. Tests might also fail to run in parallel if testcases e.g.,
> write data to hardcoded locations or use hardcoded ports. Please file
> JIRA tickets for such tests if they do not yet exist.
>
>
> There is still some work needed to improve reporting from parallel
> tests. We currently use a very silent mode if tests are running
> without failures, and just report the logs of failed jobs in case of
> failure. MESOS-6387 sketches out possible future improvements in this
> area.
>
>
> Happy testing,
>
> Benjamin with help from Kevin & Till
>
>


On Mesos versioning and deprecation policy

2016-10-12 Thread Alex Rukletsov
Folks,

There have been a bunch of online [1, 2] and offline discussions about our
deprecation and versioning policy. I found that people—including
myself—read the versioning doc [3] differently; moreover some aspects are
not captured there. I would like to start a discussion around this topic by
sharing my confusions and suggestions. This will hopefully help us stay on
the same page and have similar expectations. The second goal is to
eliminate ambiguities from the versioning doc (thanks Vinod for
volunteering to update it).

1. API vs. semantic changes.
The current versioning guide treats features (e.g. flags, metrics, endpoints)
and the API differently: incompatible changes for the former are allowed after
a six-month deprecation cycle, while for the latter they require bumping the
major version. I suggest we consolidate these policies.

We should also define and clearly explain what changes require bumping the
major version. I have no strong opinion here and would love to hear what
people think. The original motivation for maintaining backwards
compatibility is to make sure vN schedulers can correctly work with vN API
without being updated. But what about semantic changes that do not touch
the API? For example, what if we decide to send fewer task health updates to
schedulers based on some health policy? It influences the flow of task
status updates; should such a change be considered compatible? Taking it to
an extreme, we may not even be able to fix some bugs because someone may
already rely on this behaviour!

Another tightly related thing we should explicitly call out is
upgradability and rollback capabilities inside a major release. Committing
to this may significantly limit what we can change within a major release;
on the other hand, it will give users more time and a better experience
using and maintaining Mesos clusters.

2. Versioned vs. unversioned protobufs.
Currently we have v1 and unnamed protobufs, which simultaneously mean v0,
v2, and internal. I am sometimes confused about what is the right way to
update or introduce a field or message there, do people feel the same? How
about splitting the unnamed version into explicit v0, v2, and internal?

Food for thought. It would be great if we can only maintain "diffs" to the
internal protobufs in the code, instead of duplicating them altogether.

3. API and feature labelling.
I suggest introducing explicit labels for APIs and features, to ensure
users have the right assumptions about their lifetime while engineers
have the ability to change a WIP feature in a non-compatible way. I
propose the following:
API: stable, non-stable, pure (not used by Mesos components)
Feature: experimental, normal.

Looking forward to your thoughts and suggestions.
AlexR

[1] https://www.mail-archive.com/user@mesos.apache.org/msg08025.html
[2] https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html
[3]
https://github.com/apache/mesos/blob/b2beef37f6f85a8c75e968136caa7a1f292ba20e/docs/versioning.md


Re: How to shutdown mesos-agent gracefully?

2016-10-12 Thread Alex Rukletsov
To make sure: you are aware of SIGUSR1?

On Tue, Oct 11, 2016 at 5:37 PM, tommy xiao  wrote:

> Hi Ma,
>
> could you please provide more background on why the Maintenance feature is
> not the best option for your request?
>
> 2016-10-11 14:47 GMT+08:00 haosdent :
>
> > gracefully means not affect running tasks?
> >
> > On Tue, Oct 11, 2016 at 2:36 PM, Klaus Ma 
> wrote:
> >
> >> It seems there's not a way to shutdown mesos-agent gracefully.
> >> The Maintenance feature expects the agents to re-register in the future.
> >>
> >> Thanks
> >> Klaus
> >> --
> >>
> >> Regards,
> >> 
> >> Da (Klaus), Ma (马达), PMP® | Software Architect
> >> IBM Platform Development & Support, STG, IBM GCG
> >> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >>
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: 1.1.0 release

2016-10-12 Thread Alex Rukletsov
Folks,

we have 23 unresolved tickets targeted for Mesos 1.1.0 release, including 7
blockers and 3 epics (MESOS-5344, MESOS-3421, MESOS-2449), which turns 23
into 55. Obviously, we can’t make a cut today.

Shepherds, please either commit your blockers by Thu EOD PST or declare them
as non-blockers. For unfinished epics, please transition all unresolved
tickets to a new epic (see previous email) or retarget the epic. Make sure
CHANGELOG is in good shape.

We strive to cut the release on Fri Oct 14 around 13:00 CEST. At that time
we will bulk-transition all unresolved tickets to 1.2.

Rigorously,
Alex & Till

On Tue, Oct 11, 2016 at 5:30 PM, Alex Rukletsov <a...@mesosphere.io> wrote:

> Folks,
>
> in preparation for Mesos 1.1.0 release we would like to ask people who
> have worked on features in 1.1.0 to either:
> * update the CHANGELOG and declare the feature implemented or
> experimental, make sure documentation is updated as well;
> * postpone to 1.2 and update the related epic;
> * promote an experimental feature to stable if necessary.
>
> If you think you need to land something in 1.1.0, please mark the
> respective JIRA as a blocker and set the target version to 1.1.0. Bear in
> mind the release will be cut *tomorrow*, Oct 12 2016.
>
> For experimental features, consider creating a separate epic and moving
> all unresolved tickets there, while marking the original epic as resolved
> for 1.1.0. For example, see MESOS-2449 (pods) and MESOS-6355
> (pods-improvements).
>
> Below is the list of candidates for the CHANGELOG update with their
> respective owners:
> MESOS-6014  CNI port-mapping                     Avinash, Jie
> MESOS-2449  Pods (subtopics: nested containers,
>             nested isolators, default executor)  Vinod
> MESOS-5676  New Mesos CLI                        Kevin
> MESOS-4697  Unified Cgroups isolator             Haosdent, Jie
> MESOS-6007  v1 API                               Anand, Vinod
> MESOS-3302  - // -
> MESOS-4855  - // -
> MESOS-4791  - // -
> MESOS-4766  Allocator performance                BenM
> MESOS-4936  Container security                   Jie
> MESOS-4936  Capabilities and container security  Benjamin Bannier, Jie
> MESOS-3421  Shared resources                     Yan Xu
> MESOS-5344  Partition awareness                  Neil
>
> Below is the list of features marked as experimental in 1.0. Are they
> ready to be promoted and called out in the CHANGELOG?
> MESOS-4312  Power PC                Vinod
> MESOS-4828  XFS disk isolator       Yan Xu
> MESOS-4641  Network CNI isolator    Qian, Jie
> MESOS-3094  Mesos tasks on Windows  Joseph
> MESOS-4355  Docker volume isolator  Guangya, Qian, Jie
>
> This one has never even been called experimental. Joseph, is it time to do
> so?
> MESOS-898   CMake (never even declared experimental)  Joseph
>
> Thanks in advance for cooperation,
> Till and AlexR
>
> On Fri, Oct 7, 2016 at 7:47 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> I think you need to clean up the JIRA a bit.
>>
>> 1) Make sure unresolved tickets do not have fix version (1.1.0) set.
>> 2) Move "Fix version 1.1.0" to "Target version 1.1.0".
>>
>> 2) might obviate the need for 1).
>>
>>
>>
>> On Fri, Oct 7, 2016 at 7:24 AM, Till Toenshoff <toensh...@me.com> wrote:
>>
>>> Hi everyone!
>>>
>>> It's us who will be the Release Managers for 1.1.0 - Alex and Till!
>>>
>>> We are planning to cut the next release (1.1.0) within three workdays -
>>> that would be Wednesday next week. So, if you have any patches that need to
>>> get into 1.1.0, make sure that either they are already in the master branch or the
>>> corresponding ticket has a target version set to 1.1.0.
>>>
>>> The release dashboard:
>>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12329720
>>>
>>> Alex & Till
>>>
>>
>>
>


Re: LIBPROCESS_IP

2016-10-12 Thread Alex Rukletsov
>
> Also, I think libprocess should always bind to 0.0.0.0, rather than doing a
> hostname lookup and bind to the IP found for the hostname.
> LIBPROCESS_ADVERTISE_IP can be used to overwrite the ip address it wants to
> advertise to peers. If that's not specified, it'll try to do a hostname
> lookup to guess a routable ip.
>

I'm +1 for this change. Here is one more argument.

A master or an agent always has a single unique UPID, which is tied to a
specific IP, obtained either via a hostname lookup or set up manually.
However, the way the IP is obtained influences how a master or an agent binds
to network interfaces: to a single one if LIBPROCESS_IP is set and to *all*
available interfaces otherwise. This leads to confusion: sometimes you
can use any interface on the master machine to query a master endpoint, but
sometimes you cannot (e.g. if you set the --ip master flag), while agents
always communicate using one specific interface.
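
To make the difference concrete, here is a rough model in Python of the two
behaviours discussed in this thread. It is a sketch of the description above,
not of the libprocess source: the function names are hypothetical, and
`lookup_hostname_ip` is a stand-in for whatever hostname lookup is used to
guess a routable IP.

```python
def current_bind_address(env):
    """Behaviour as described above: LIBPROCESS_IP if set, else all interfaces."""
    return env.get("LIBPROCESS_IP", "0.0.0.0")


def proposed_addresses(env, lookup_hostname_ip):
    """Proposed behaviour: always bind to 0.0.0.0; only the advertised IP varies.

    `lookup_hostname_ip` is a hypothetical callable returning the IP found
    for the local hostname.
    """
    bind_ip = "0.0.0.0"
    # An explicitly configured advertise IP wins over the hostname lookup.
    advertise_ip = env.get("LIBPROCESS_ADVERTISE_IP") or lookup_hostname_ip()
    return bind_ip, advertise_ip


def fake_lookup():
    # Made-up address, for illustration only.
    return "10.0.0.5"


print(current_bind_address({}))
print(current_bind_address({"LIBPROCESS_IP": "192.168.1.10"}))
print(proposed_addresses({"LIBPROCESS_ADVERTISE_IP": "203.0.113.7"}, fake_lookup))
print(proposed_addresses({}, fake_lookup))
```

In the proposed model the set of interfaces that accept connections no longer
depends on how the IP was obtained, which is the source of the confusion
described above.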

Some related links to the code:
https://github.com/apache/mesos/blob/c9b707aa86d55714ec419ad10190db22ec38108b/3rdparty/libprocess/src/process.cpp#L976
https://github.com/apache/mesos/blob/c9b707aa86d55714ec419ad10190db22ec38108b/3rdparty/libprocess/src/process.cpp#L899
https://github.com/apache/mesos/blob/c9b707aa86d55714ec419ad10190db22ec38108b/src/master/main.cpp#L233
https://github.com/apache/mesos/blob/c9b707aa86d55714ec419ad10190db22ec38108b/3rdparty/libprocess/src/process.cpp#L3282


Re: Separate Compilation of Tests

2016-09-26 Thread Alex Rukletsov
Michael,

I'm doing this wrong too and I have an expensive laptop as well. I don't know
any better solution than interleaving compilation with other work. This is
not always productive, hence

+1 for this change.

As a side note, we should probably revive the effort of a) splitting huge
.cpps into smaller ones and b) moving non-template method implementations
into .cpps.

On Sun, Sep 25, 2016 at 3:56 PM, Michael Park  wrote:

> Hello,
>
> I would like to propose a change in our build to help us improve developer
> efficiency.
> The proposal is to support separate compilation of unit tests.
>
> Currently, we have the old approach of invoking `make check -j N
> GTEST_FILTER=""`, or a newer option of doing `make tests -j N`. From what
> I've heard the latter is slightly faster. However, when someone is
> developing a specific feature, it's likely that they would like to make
> changes and iterate on a single test file. In this workflow, having to
> compile (virtually) __all__ of the tests is very burdensome. This situation
> is not so bad if you're working in a very isolated part of the codebase,
> but it gets to be pretty bad if you're experimenting with parts that are
> widely used.
>
> An example of a workflow I'm aiming for would look something like:
>
>    1. write some code...
>    2. `make check master_tests`  // compile and test `src/tests/master_tests.cpp`
>    3. fix compilation errors...
>    4. `make check master_tests`  // compile and test `src/tests/master_tests.cpp`
>    5. change some stuff...
>    6. `make check master_tests`  // compile and test `src/tests/master_tests.cpp`
>    7. debug...
>    8. `make check master_tests`  // compile and test `src/tests/master_tests.cpp`
>    9. alright, looks good. `make check`
>
> I have 0 attachment to the `make check master_tests` syntax. It'll be a
> different syntax for CMake anyways. I just think that the ability to
> perform separate compilation of tests will be immensely useful.
>
> Some numbers to justify what I'm proposing:
>
>    - `make -j 8` on my laptop takes roughly 10 mins.
>    - `make tests -j 8` takes about 30 mins.
>
> Of course, not every change you make triggers all of the tests to
> recompile. But if you change components that are widely used, it does end
> up recompiling virtually everything. Under these circumstances, I would
> love for each iteration to be 11 mins (10 mins + __at most__ 1 min to
> compile the single test), rather than 30 mins.
>
> NOTE: My laptop is expensive... some people even use machines with 64 cores
> or whatever to compile Mesos. Not everyone has access to this kind of
> equipment. We should strive for something better than "throw more money at
> it".
>
> The goal of this thread for me is the following:
>   (1) Capture whether this is something most people experience, or whether
>       I'm just doing it wrong
>   (2) If most people do experience this inefficiency and would like this
>       change to be made, I would like to recruit 1 or 2 people to help me
>       deliver this, since I'm neither an automake nor a CMake expert.
>


Re: Support HTTP(s)/TCP Health Check in Mesos

2016-09-05 Thread Alex Rukletsov
Aaron—

we do use some Boost libraries.

I think supporting HTTP/2 is a great idea and we should definitely create a
JIRA epic to evaluate and track the work. My intuition is that we will have
to implement it ourselves in libprocess. Would you like to open a ticket?

However, let's not hijack this thread for this : ).

Everyone, are there any Mesos users with custom executors that use the HTTP
part of the HealthCheck protobuf?


On Fri, Sep 2, 2016 at 6:49 PM, Aaron Wood  wrote:

> That's great!
>
> I would be interested in seeing support for HTTP/2 as I think the benefits
> of header compression and connection multiplexing could provide some nice
> improvements in certain environments.
>
> What do you (or anyone else here) think about this? There's no use of Boost
> anywhere in this project, right? I'm not sure what good libraries there are
> to provide this for C++ 11.
>
> On Fri, Sep 2, 2016 at 12:46 PM, haosdent  wrote:
>
> > Just tested with curl 7.50.1; HTTP/2 is supported.
> >
> > On Sat, Sep 3, 2016 at 12:32 AM, haosdent  wrote:
> >
> > > The current implementation of the HTTP(s) health check is based on curl.
> > > According to the curl documentation:
> > >
> > > >Since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS
> > > connections.
> > >
> > > So I think it should be supported if the curl version on your Mesos
> > > agent is higher than 7.47, but I have not tried this yet.
> > >
> > > On Sat, Sep 3, 2016 at 12:23 AM, Aaron Wood 
> > wrote:
> > >
> > >> Since you mentioned that you're working on supporting HTTPS health
> > >> checks I'm curious if there are any plans to support HTTP/2 over TLS
> > >> (or even over plain HTTP). I would think that using HTTP/2 for any
> > >> communication that happens in Mesos would provide a nice improvement
> > >> in heavy load situations.
> > >>
> > >> On Fri, Sep 2, 2016 at 10:59 AM, haosdent  wrote:
> > >>
> > >> > Hi, dear friends. @alexr and I are working on supporting HTTP(s)/TCP
> > >> > health checks in Mesos.
> > >> > We have finished and committed some initial work. However, if you used
> > >> > the old protobuf definition of `HealthCheck` to implement an HTTP health
> > >> > check in your custom executor, our recent changes will break it.
> > >> >
> > >> > The change to the protobuf definition of `HealthCheck` is:
> > >> >
> > >> > ```
> > >> >  message HealthCheck {
> > >> >  +  enum Type {
> > >> >  +    UNKNOWN = 0;
> > >> >  +    COMMAND = 1;
> > >> >  +    HTTP = 2;
> > >> >  +    TCP = 3;
> > >> >  +  }
> > >> >  +
> > >> >  -  message HTTP {
> > >> >  +  message HTTPCheckInfo {
> > >> >  +    optional string scheme = 1;
> > >> >  -    required uint32 port = 1;
> > >> >  +    required uint32 port = 2;
> > >> >  -    optional string path = 2 [default = "/"];
> > >> >  +    optional string path = 3;
> > >> >  -    repeated uint32 statuses = 4;
> > >> >     }
> > >> >     ...
> > >> >  +  optional Type type = 8;
> > >> >  -  // HTTP health check - not yet recommended for use, see MESOS-2533.
> > >> >  -  optional HTTP http = 1;
> > >> >  +  optional HTTPCheckInfo http = 1;
> > >> >     ...
> > >> >  }
> > >> > ```
> > >> >
> > >> > Note that we added a `type` field to specify the health check type and
> > >> > now use `HTTPCheckInfo` instead of `HTTP`.
> > >> > As far as I know, Mesos did not support HTTP health checks before 1.0,
> > >> > so the old message was not supposed to be used.
> > >> >
> > >> > However, as @swsnider's recent issue report shows, users may have
> > >> > implemented custom executors with HTTP health checks. So I am writing
> > >> > this email to check whether anyone else has implemented HTTP health
> > >> > checks in a custom executor, like @swsnider did, and whether you depend
> > >> > heavily on the old protobuf definition of `HealthCheck`.
> > >> > If so, how many months do you need for the deprecation cycle?
> > >> >
> > >> > Any concerns and questions are appreciated, thanks a lot!
> > >> >
> > >> > --
> > >> > Best Regards,
> > >> > Haosdent Huang
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> > >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>


Re: [VOTE] Release Apache Mesos 1.0.1 (rc1)

2016-08-12 Thread Alex Rukletsov
+1 (binding)

make check on Mac OS 10.11.6 with apple clang-703.0.31.

DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky (MESOS-4570), but
this does not seem to be a regression or a blocker.

On Fri, Aug 12, 2016 at 10:30 PM, Radoslaw Gruchalski 
wrote:

> I am trying to build Mesos 1.0.1 for Centos 7 in a Docker container but
> I'm hitting this: https://issues.apache.org/jira/browse/MESOS-5925.
>
> Kind regards,
>
> Radek Gruchalski
> ra...@gruchalski.com
> +4917685656526
>
> *Confidentiality:*
> This communication is intended for the above-named person and may be
> confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Thu, Aug 11, 2016 at 2:32 AM, Vinod Kone  wrote:
>
>> Hi all,
>>
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.0.1.
>>
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.0.1-rc1
>>
>> 
>> 
>>
>>
>> The candidate for Mesos 1.0.1 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz
>>
>>
>> The tag to be voted on is 1.0.1-rc1:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.0.1-rc1
>>
>>
>> The MD5 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.md5
>>
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.asc
>>
>>
>> The PGP key used to sign the release is here:
>>
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>>
>> The JAR is up in Maven in a staging repository here:
>>
>> https://repository.apache.org/content/repositories/orgapachemesos-1155
>>
>>
>> Please vote on releasing this package as Apache Mesos 1.0.1!
>>
>>
>> The vote is open until Mon Aug 15 17:29:33 PDT 2016 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Mesos 1.0.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> Thanks,
>>
>
>


On creating actor instances in Mesos

2016-07-22 Thread Alex Rukletsov
Folks,

I've noticed recently that some actors do not specify a distinguishable
actor ID. As a result, it may be hard to match output to a specific actor.
For example, here is an excerpt from the "__processes__" endpoint:

[{"events":[],"id":"(10)"},{"events":[],"id":"(11)"},{"events":[],"id":"(1279859)"},{"events":[],"id":"(15)"},
... ]

Every time you create an actor, i.e., a `ProcessBase` instance, you most
probably want to give it a meaningful id. Consider `StatusUpdateManager`.
Currently, the code does not specify the ID:

StatusUpdateManagerProcess::StatusUpdateManagerProcess(const Flags& _flags)
  : flags(_flags), paused(false) {}

Instead, a preferred way of calling the `StatusUpdateManagerProcess` c-tor
would be:

StatusUpdateManagerProcess::StatusUpdateManagerProcess(const Flags& _flags)
  : ProcessBase(process::ID::generate("status-update-manager")),
    flags(_flags),
    paused(false) {}
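
To illustrate the idea outside of the Mesos tree, here is a minimal,
self-contained sketch. It does not use libprocess: `generateId`, `ActorBase`
and `StatusUpdateManagerActor` are hypothetical stand-ins, and the
"prefix(N)" format is an assumption modeled on the "(10)"-style IDs in the
excerpt above.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for an ID generator; this is NOT the libprocess
// implementation. It keeps one counter per prefix and returns "prefix(N)".
std::string generateId(const std::string& prefix)
{
  assert(!prefix.empty());

  static std::map<std::string, int> counters;
  static std::mutex mutex;

  std::lock_guard<std::mutex> lock(mutex);
  const int id = ++counters[prefix];
  return prefix + "(" + std::to_string(id) + ")";
}

// Hypothetical minimal actor base class that only stores its ID.
class ActorBase
{
public:
  explicit ActorBase(const std::string& id) : id_(id) {}
  virtual ~ActorBase() = default;

  const std::string& id() const { return id_; }

private:
  const std::string id_;
};

// The derived actor passes a descriptive prefix to the base class, so
// every instance gets a recognizable ID instead of a bare number.
class StatusUpdateManagerActor : public ActorBase
{
public:
  StatusUpdateManagerActor()
    : ActorBase(generateId("status-update-manager")) {}
};
```

With this sketch, two `StatusUpdateManagerActor` instances would report the
IDs "status-update-manager(1)" and "status-update-manager(2)", which are much
easier to spot in logs and endpoint output than "(10)" or "(11)".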

Best,
Alex


Re: MESOS-4694

2016-07-18 Thread Alex Rukletsov
Dario,

but this is true only for framework sorters, right? The total kept in the
role sorter is changed not on allocations, but when an agent joins or
leaves the cluster. Maintaining a priority queue for roles can make sense,
but may decrease the performance for framework sorters.
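
To make the trade-off concrete, here is a minimal, self-contained sketch of a
sorter that caches its order behind a dirty flag. It is not the Mesos
`DRFSorter`: `ToySorter`, its members, and the single scalar resource are all
assumptions made for illustration. Any change to the total or to an
allocation invalidates the cached order, so the next `sort()` call pays the
full O(n log n) cost again.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy sorter; NOT the Mesos DRFSorter.
class ToySorter
{
public:
  // Registers a client (a framework or a role) with zero allocation.
  void addClient(const std::string& name)
  {
    allocations[name] = 0.0;
    dirty = true;
  }

  // Grows the total pool, e.g. when an agent joins. With several resource
  // kinds this can change which share is dominant, so the sketch
  // conservatively invalidates the cached order.
  void addTotal(double amount)
  {
    assert(amount >= 0.0);
    total += amount;
    dirty = true;
  }

  // Records an allocation to a client; shares change, so mark dirty.
  void allocated(const std::string& name, double amount)
  {
    allocations.at(name) += amount;
    dirty = true;
  }

  // Returns clients by ascending share; re-sorts only if something changed.
  const std::vector<std::string>& sort()
  {
    if (dirty) {
      order.clear();
      for (const auto& entry : allocations) {
        order.push_back(entry.first);
      }
      std::stable_sort(
          order.begin(),
          order.end(),
          [this](const std::string& a, const std::string& b) {
            return share(a) < share(b);
          });
      dirty = false;
      ++sorts;
    }
    return order;
  }

  int sorts = 0; // Number of real sorts performed so far.

private:
  double share(const std::string& name) const
  {
    return total > 0.0 ? allocations.at(name) / total : 0.0;
  }

  std::map<std::string, double> allocations;
  std::vector<std::string> order;
  double total = 0.0;
  bool dirty = true;
};
```

In this sketch, calling `sort()` twice in a row sorts only once, while any
`addTotal()` or `allocated()` call in between forces a fresh sort. A sorter
whose inputs change rarely benefits from the cache; one whose inputs change
on every allocation does not.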

What is the ratio of frameworks to roles in your clusters?

On Fri, Jul 8, 2016 at 6:37 PM, Dario Rexin <dre...@apple.com> wrote:

> Hi Alex,
>
> thanks for your input. We originally thought about that, too, but the
> problem is, that every time resources are allocated to a framework, this
> method will be called:
>
> void DRFSorter::add(const SlaveID& slaveId, const Resources& resources)
>
> It will add the passed resources to the total resources of the sorter and
> therefore invalidate the whole sorting (i.e. set dirty=true). So we would
> still have to actually sort the frameworks almost every time. In fact,
> frameworks are already kept sorted as long as possible, it’s just not
> possible to keep them sorted for very long because of the call to said
> function ;).
>
> --
>  Dario
>
> > On Jul 8, 2016, at 6:50 AM, Alex Rukletsov <a...@mesosphere.com> wrote:
> >
> > I was not involved in the conversations around this issue, so maybe you
> > have discussed this already (in that case, is the outcome of the
> > discussion documented somewhere?).
> >
> > Though the patch seems good to me, it assumes that frameworks SUPPRESS
> > when they don't need offers. This is not always the case. Since we now
> > have a real-world use case with ~6k frameworks, the "right thing to do"
> > seems to be to maintain a heap of roles and of frameworks in each role,
> > and avoid sorting.
> >
> > On Thu, Jul 7, 2016 at 7:20 PM, Dario Rexin <dre...@apple.com> wrote:
> >
> >> A bit more context:
> >>
> >> We have a very high number of frameworks on our clusters. In some cases
> >> ~6k. The biggest problem is the sort method, which has a complexity of
> >> O(n log n) and is called n*m times, where n = number of agents and m =
> >> number of roles. So in total we have a complexity of O(n^3 log n). I think
> >> reducing n is the most promising optimization here. We have been running
> >> this patch in production for quite a while now and have seen huge
> >> improvements in general allocation time and also in failover times.
> >>
> >> Also, if we were to add a parameterized version of SUPPRESS, what
> >> problems do you see with just differentiating between the two cases?
> >>
> >> Thanks,
> >> --
> >>  Dario
> >>
> >>> On Jul 7, 2016, at 8:40 AM, Dario Rexin <dre...@apple.com> wrote:
> >>>
> >>> Hi Joris,
> >>>
> >>> I still don't really understand why we would parameterize SUPPRESS, to
> >>> me that sounds like a case for filters. The idea of SUPPRESS was to
> >>> completely stop getting offers.
> >>>
> >>> Could you please explain why you think the patch is a hack? To me it
> >>> just seems logical to not sort frameworks that don't need to be
> >>> considered in the allocator.
> >>>
> >>> Thanks,
> >>> Dario
> >>>
> >>>> On 07.07.2016, at 7:38 AM, Joris Van Remoortere <jo...@mesosphere.io>
> >> wrote:
> >>>>
> >>>> The reason that SUPPRESS doesn't just deactivate is because the intent
> >>>> was to be able to parameterize this call. At that point the change
> >>>> wouldn't work without turning this into 2 cases.
> >>>>
> >>>> I have asked to look at what a parameterized suppress would look like and
> >>>> understand the performance impact of that before we do this.
> >>>> Have we reached consensus that there's no way to implement a generic
> >>>> parameterized suppress that is performant?
> >>>>
> >>>> There are some refactorings that we had discussed with James, Jacob, and
> >>>> Ian that seem like lower hanging fruit. After those are made it might be
> >>>> worth reconsidering whether we need to do this hack.
> >>>>
> >>>>
> >>>>
> >>>> —
> >>>> *Joris Van Remoortere*
> >>>> Mesosphere
> >>>>
> >>>>> On Thu, Jul 7, 2016 at 10:15 AM, Guangya Liu <gyliu...@gmail.com>
> >> wrote:
> >>>>>
> >>>>> Hi Ben and Dario,
> >>>>>

Re: [VOTE] Release Apache Mesos 1.0.0 (rc2)

2016-07-15 Thread Alex Rukletsov
Haosdent investigated the issue, and it seems that health checks do work
for docker executor. Hence I retract my negative vote.

On Fri, Jul 15, 2016 at 12:57 PM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> -1 (binding): MESOS-5848
> <https://issues.apache.org/jira/browse/MESOS-5848>. The fix is on the way.
>
> On Wed, Jul 13, 2016 at 1:19 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> +1 (nonbinding)
>>
>> Tested by 1) running all tests on Mac OS and 2) performing an upgrade and
>> downgrade on a small test cluster for both master and slave.
>>
>>
>>
>> On Mon, Jul 11, 2016 at 10:13 AM, Kapil Arya <ka...@mesosphere.io> wrote:
>>
>>> None of the stable builds have SSL yet. The first SSL-enabled stable
>>> build will be 1.0.0. Sorry for the confusion.
>>>
>>> Kapil
>>>
>>> On Mon, Jul 11, 2016 at 1:03 PM, Zhitao Li <zhitaoli...@gmail.com>
>>> wrote:
>>>
>>> > Hi Kapil,
>>> >
>>> > Do you mean that the stable builds from
>>> > http://open.mesosphere.com/downloads/mesos are using the new
>>> > configuration?
>>> >
>>> > On Sun, Jul 10, 2016 at 10:07 AM, Kapil Arya <ka...@mesosphere.io>
>>> wrote:
>>> >
>>> >> The binary rpm/deb packages can be found here:
>>> >> http://open.mesosphere.com/downloads/mesos-rc/#apache-mesos-1.0.0-rc2
>>> >>
>>> >> Please note that starting with the 1.0.0 release (including RCs and
>>> >> recent nightly builds), Mesos is configured with SSL and 3rdparty
>>> >> module dependency installation. Here is the configure command line:
>>> >> ./configure --enable-libevent --enable-ssl
>>> >> --enable-install-module-dependencies
>>> >>
>>> >> As always, the stable builds are available at:
>>> >> http://open.mesosphere.com/downloads/mesos
>>> >>
>>> >> The instructions for nightly builds are available at:
>>> >> http://open.mesosphere.com/downloads/mesos-nightly/
>>> >>
>>> >> Best,
>>> >> Kapil
>>> >>
>>> >>
>>> >> On Thu, Jul 7, 2016 at 9:35 PM, Vinod Kone <vinodk...@apache.org>
>>> wrote:
>>> >> >
>>> >> > Hi all,
>>> >> >
>>> >> >
>>> >> > Please vote on releasing the following candidate as Apache Mesos
>>> >> > 1.0.0.
>>> >> >
>>> >> >
>>> >> > 1.0.0 includes the following:
>>> >> >
>>> >> >   * Scheduler and Executor v1 HTTP APIs are now considered stable.
>>> >> >
>>> >> >   * [MESOS-4791] - **Experimental** support for v1 Master and Agent
>>> >> >     APIs. These APIs let operators and services (monitoring, load
>>> >> >     balancers) send HTTP requests to '/api/v1' endpoint on master or
>>> >> >     agent. See `docs/operator-http-api.md` for details.
>>> >> >
>>> >> >   * [MESOS-4828] - **Experimental** support for a new `disk/xfs`
>>> >> >     isolator has been added to isolate disk resources more efficiently.
>>> >> >     Please refer to docs/mesos-containerizer.md for more details.
>>> >> >
>>> >> >   * [MESOS-4355] - **Experimental** support for Docker volume plugin.
>>> >> >     We added a new isolator 'docker/volume' which allows users to use
>>> >> >     external volumes
>
