Re: Cgroups v2 + Python 3 + Upstreaming X/Twitter Patches

2024-01-13 Thread Qian Zhang
Thanks Ben! I have let Shane know to hold off on moving the project into the
attic, and I'm happy to help with the cgroups v2 support :-)


Regards,
Qian Zhang


On Sat, Jan 13, 2024 at 8:48 AM Samuel Marks  wrote:

> On latest commit 5109b9069b62510ab6d0cc0c78a9ed8327d3986a `fd -epy | wc -l`
> shows 73 files.
>
> mesos/src/python/cli/src/mesos/http.py requires a trivial change to support
> Python 2 & 3. Everything else in that dir is fine.
>
> mesos/src/python/cli_new/lib/cli/util.py requires two lines to be
> conditioned on Python 2/3 to resolve the now-removed `imp` stdlib module.
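For illustration, a sketch of the kind of conditional that resolves this (the `load_module` helper and its call shape are assumptions, not util.py's actual code):

```python
import sys

if sys.version_info[0] >= 3:
    import importlib.util

    def load_module(name, path):
        # importlib-based replacement for the `imp` module
        # (removed entirely in Python 3.12).
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module
else:
    import imp

    def load_module(name, path):
        # Legacy branch for Python 2, where `imp` still exists.
        return imp.load_source(name, path)
```

Either branch returns a loaded module object, so call sites stay unchanged.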
>
> mesos/src/python/cli/src/mesos/futures.py needs cancel_futures implemented
> or stubbed to raise a `NotImplementedError`.
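The stub approach might look like this (the class name `StubExecutor` is hypothetical; the real executor in futures.py differs):

```python
from concurrent.futures import Executor

class StubExecutor(Executor):
    """Hypothetical executor illustrating the cancel_futures stub."""

    def shutdown(self, wait=True, cancel_futures=False):
        # Python 3.9 added the cancel_futures keyword to Executor.shutdown();
        # until it is really implemented, refuse it loudly rather than
        # silently ignoring it.
        if cancel_futures:
            raise NotImplementedError("cancel_futures is not supported")
        super().shutdown(wait=wait)
```

Callers passing `cancel_futures=True` then fail fast instead of silently leaking futures.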
>
> Then you're down to 17 files left incompatible with Python 2 & 3.
>
> mesos/support has 15 of these files; quickly reading `rg -Ftpy import`
> shows nothing interesting. Should be mostly trivial. Take care, though, to
> remove distutils, as it was removed in Python 3.12.
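As an example of one common distutils migration (assuming the support scripts use `distutils.spawn.find_executable`; other distutils usages need their own substitutes):

```python
import shutil
import sys

def find_executable(name):
    # shutil.which() is the stdlib replacement for
    # distutils.spawn.find_executable(); distutils itself was
    # removed in Python 3.12.
    if sys.version_info >= (3, 3):
        return shutil.which(name)
    from distutils.spawn import find_executable as _legacy_find
    return _legacy_find(name)
```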
>
> mesos/src/examples/python contains the final 2 Python files. At a cursory
> glance, I can't see anything incompatible with Python 2 & 3.
>
> (happy to send the PR/patch or you can)
>
> Samuel Marks
> Charity <https://sydneyscientific.org> | consultancy <https://offscale.io>
> | open-source <https://github.com/offscale> | LinkedIn
> <https://linkedin.com/in/samuelmarks>
>
>
> On Fri, Jan 12, 2024 at 6:46 PM Shatil Rafiullah 
> wrote:
>
> > Do you require both Python 2 and 3 bindings to work, or can we make a
> > clean migration to 3? I ask because there are breaking changes in C/C++
> > bindings going from 2 to 3, and supporting both will require conditionals
> > rather than replacements of code blocks here and there. Dependencies may
> > also end up incompatible between the two Python versions since some
> > dropped Python 2.7 support a long time ago.
> >
> > On Fri, Jan 12, 2024 at 3:18 PM Samuel Marks  wrote:
> >
> >> The Python 3 upgrade shouldn't be too difficult; happy to help out. Make
> >> sure you duplicate your configuration so that CMake still works (happy to
> >> help there also). Just message (privately or publicly).
> >>
> >> Good to see interest in Mesos remaining,
> >>
> >> Samuel Marks
> >> Charity <https://sydneyscientific.org> | consultancy <https://offscale.io>
> >> | open-source <https://github.com/offscale> | LinkedIn
> >> <https://linkedin.com/in/samuelmarks>
> >>
> >>
> >> On Fri, Jan 12, 2024 at 6:01 PM Benjamin Mahler 
> >> wrote:
> >>
> >>> +user@
> >>>
> >>> On Fri, Jan 12, 2024 at 5:55 PM Benjamin Mahler 
> >>> wrote:
> >>>
> >>> > As part of upgrading to CentOS 9 at X/Twitter, Shatil / Devin (cc'ed)
> >>> will
> >>> > be working on:
> >>> >
> >>> > * Upgrading to Python 3
> >>> > * Cgroups v2 support
> >>> >
> >>> > We will attempt to upstream this work for the benefit of other users.
> >>> >
> >>> > In addition, we have several long-standing internal patches that
> >>> > should have been upstreamed but weren't. We currently deploy from an
> >>> > internal 1.9.x branch with these additional patches on top of OSS
> >>> > 1.9.x. To simplify the above work, since the code changes will be
> >>> > substantial (especially for cgroups v2), we'll try to work off master
> >>> > (or 1.11.x) and upstream our long-standing internal patches to
> >>> > minimize the delta between OSS and our internal branch.
> >>> >
> >>> > Let me know if folks have any questions. IMO it would be good to
> >>> > hold off on moving the project into the attic so that we can at least
> >>> > land this work.
> >>> >
> >>> > Ben
> >>> >
> >>>
> >>
>


Re: Can you help provide a future for Apache Mesos?

2024-01-13 Thread Qian Zhang
Hi Shane,

Please hold off on moving Mesos into the attic; Ben and his team members
will contribute some substantial code changes to Mesos. Thanks, Ben, to you
and your team!


Regards,
Qian Zhang


On Tue, Nov 14, 2023 at 3:17 PM Charles-François Natali 
wrote:

> Still using Mesos - it's stable, boring - in a good way - and great for
> our specific use case.
>
> I'm a committer, happy to continue reviewing and merging small changes or
> address security issues.
>
> Cheers,
>
>
>
>
>
> On Mon, Nov 13, 2023, 13:34 Andreas Peters  wrote:
>
>> I'm still with Mesos and can keep doing what I've done over the last few
>> years: keep an eye on issues and give support via Slack and Jira.
>>
>>
>> Am 13.11.23 um 13:51 schrieb Shane Curcuru:
>>
>> As a mature and successful project, Mesos hasn't seen much new
>> development in the past couple of years.  The question now for everyone on
>> these lists is:
>>
>> - Is Mesos still a maintained project, where even if no new features are
>> developed, there's at least a group to respond to security issues and make
>> new releases?  Or is it time to 'deprecate' Mesos, and move the project as
>> a whole to the Apache Attic? [1]
>>
>> It feels like there are still plenty of users who rely on Mesos; what we
>> need now is for enough people here to step up and volunteer to stick around
>> and be available to fix security issues in the future.
>>
>> Thanks to Qian for raising this question in March [2], where several
>> people did speak up.  I'd like to clarify what the ASF board's requirements
>> are for an 'active' Apache project.
>>
>> We don't actually need people doing active development on a project.
>> What's really needed is at least three PMC members who are monitoring the
>> project's lists and issues, and who could be available in the future *if* a
>> serious security issue or other major bug were discovered.
>>
>> So we're not looking for people with time to do active development - just
>> enough reliable volunteers who could monitor for major issues that are
>> reported, and make a new release if security fixes are needed.  Does that
>> help, and does that make sense?
>>
>> We will also be running a Roll Call of the PMC [3] now, so the board can
>> understand how many PMC members (who have access to security issue details,
>> for example) could still stick around to monitor lists.  Along with that
>> roll call, we'll also be reminding the existing PMC that they can vote in
>> any existing committers who will also step up and volunteer.
>>
>> * What can you do?
>>
>> If you are still using Mesos, have enough time to check the mailing
>> lists/issue tracker periodically for any security or giant breaking bugs
>> (i.e. not small bugs), and might be able to help someday with a fix or
>> making a new release of Mesos, then speak up now!  Be sure to say what
>> specific kinds of tasks you might be able to take on if they arise.
>>
>> Remember: we don't need active development, just some folks making sure
>> any security bugs are addressed in the future (if they come up).
>>
>>


Re: Next steps for Mesos

2023-11-11 Thread Qian Zhang
Hi Alexander,

Thanks for your mail; I'm happy to know you are also using Mesos.

So my question is, is there an option to lift the requirement of 3 active
> contributors for the specific case of Mesos and postpone the decision
> on moving this project to the attic?


I think we may not be able to lift this requirement since it is an Apache
guideline for all its open source projects, and I agree with Andreas that
it is not possible to continue to maintain Mesos since we do not have
enough active committers/contributors.

Regards,
Qian Zhang


On Fri, Nov 10, 2023 at 8:41 PM Alexander Ushakov <
alex.ushakov.offic...@gmail.com> wrote:

> I am happy to know that clusterd exists. Some good news in this world.
> Thank you. I will not spam more here.
>
> On Fri, Nov 10, 2023, 15:33 Andreas Peters  wrote:
>
>> Hi Alexander,
>>
>> It seems almost impossible to maintain Mesos with a small group of people
>> under Apache. That's why I decided to create a fork of Mesos
>> (https://github.com/m3scluster/clusterd). Have a look at the changelog;
>> maybe some changes are interesting for you. You (and of course everyone
>> else) are always welcome.
>>
>>
>> Regarding M3s: as far as I know, there is only the K8s framework for
>> DC/OS. If M3s does not match your requirements, I would be happy if you
>> sent me an email (or pinged me via Matrix or Slack) with the reasons.
>> Feedback is always welcome.
>>
>>
>> Cheers,
>> Andreas
>>
>>
>>
>> Am 10.11.23 um 11:31 schrieb Alexander Sibiryakov:
>>
>> Hi Qian and others,
>>
>> We're also a Mesos user, with 4-5 Mesos clusters running 1.4.1. Some of
>> them power our Scrapy Cloud product using 3K CPUs, with around 6-10K jobs
>> executing in parallel and several hundred starting per second.
>>
>> I think the situation with Mesos is that its purpose has changed over
>> time. Initially it was widely adopted as a container orchestration
>> platform to run a variety of applications. Currently there are mature,
>> feature-rich and battle-tested alternatives like Kubernetes, especially
>> the managed offerings. So Mesos, due to its simplicity and focus mainly
>> on orchestration, is out of the competition in this market.
>>
>> But there are still very few frameworks for building cloud systems. As
>> stated on the main page:
>>
>> Apache Mesos abstracts CPU, memory, storage, and other compute resources 
>> away from machines (physical or virtual), enabling fault-tolerant and 
>> elastic distributed systems to easily be built and run effectively.
>>
>> These days Mesos is still a good solution for those who need to
>> orchestrate a specific workload at scale. Its operational model with
>> customisable schedulers receiving resource offers fits perfectly when
>> there is a queue of jobs which needs to be executed on a limited
>> amount of hardware. Kubernetes isn't really designed for that. But this
>> is a concern of cloud builders, which is a rare occupation today. Due to
>> the complexity of cloud systems, very few people on the planet can design
>> and build them. This means you should not expect the
>> same amount of active contributors as for other projects. So my
>> question is, is there an option to lift the requirement of 3 active
>> contributors for the specific case of Mesos and postpone the decision
>> on moving this project to the attic?
>>
>> A.
>>
>> On 2023/03/18 01:57:00 Qian Zhang wrote:
>>
>> Hi all,
>>
>> I'd like to restart the discussion around the future of the Mesos project.
>> As you may already be aware, the Mesos community has been inactive for the
>> last few years, there were only 3 contributors last year, that's obviously
>> not enough to keep the project moving forward. I think we need at least 3
>> active committers/PMC members and some active contributors to keep the
>> project alive, or we may have to move it to the attic
>> <https://attic.apache.org/>.
>>
>> Call for action: If you are the current committer/PMC member and still have
>> the capacity to maintain the project, or if you are willing to actively
>> contribute to the project as a contributor, please reply to this email,
>> thanks!
>>
>>
>> Regards,
>> Qian Zhang
>>
>>
>>


Re: Next steps for Mesos

2023-03-27 Thread Qian Zhang
>
> Qian, it might be worth having a more explicit email asking users to chime
> in as this email was tailored more for contributors.


Yes, we can send such an email, but I think it does not affect whether we
should move Mesos to the attic or not, since the most important factor is
whether we have enough active committers. Quoting from
https://attic.apache.org/:
https://attic.apache.org/:

When should a project move to the Attic?

Projects whose PMC are unable to muster 3 votes for a release, who have no
active committers or are unable to fulfill their reporting duties to the
board are all good candidates for the Attic.

I am happy to see there are still some companies using Mesos, but in this
mail thread so far there are just several contributors interested in
contributing and only one committer, Benjamin Mahler (with no guaranteed
time). I think that's not enough to keep this project going, so we may have
to move Mesos to the attic.


Regards,
Qian Zhang


On Tue, Mar 21, 2023 at 2:56 AM Benjamin Mahler  wrote:

> Also if you are still a user of mesos, please chime in.
> Qian, it might be worth having a more explicit email asking users to chime
> in as this email was tailored more for contributors.
>
> Twitter is still using mesos heavily, we upgraded from a branch based off
> of 1.2.x to 1.9.x in 2021, but haven't upgraded to 1.11.x yet. We do have a
> lot of patches carried on our branch that have not been upstreamed. I would
> like to upstream them to avoid relying on many custom patches and to
> get closer to HEAD, but it will take time and quite a bit of work, and it's
> not a priority at the moment.
>
> On the contribution side, at this point if I were to continue contributing,
> it would be on a volunteer basis, and I can't guarantee having enough time
> to do so.
>
> On Fri, Mar 17, 2023 at 9:57 PM Qian Zhang  wrote:
>
> > Hi all,
> >
> > I'd like to restart the discussion around the future of the Mesos
> > project. As you may already be aware, the Mesos community has been
> > inactive for the last few years; there were only 3 contributors last
> > year, which is obviously not enough to keep the project moving forward.
> > I think we need at least 3 active committers/PMC members and some active
> > contributors to keep the project alive, or we may have to move it to the
> > attic <https://attic.apache.org/>.
> >
> > Call for action: If you are the current committer/PMC member and still
> > have the capacity to maintain the project, or if you are willing to
> > actively contribute to the project as a contributor, please reply to this
> > email, thanks!
> >
> >
> > Regards,
> > Qian Zhang
> >
>


Next steps for Mesos

2023-03-17 Thread Qian Zhang
Hi all,

I'd like to restart the discussion around the future of the Mesos project.
As you may already be aware, the Mesos community has been inactive for the
last few years; there were only 3 contributors last year, which is obviously
not enough to keep the project moving forward. I think we need at least 3
active committers/PMC members and some active contributors to keep the
project alive, or we may have to move it to the attic
<https://attic.apache.org/>.

Call for action: if you are a current committer/PMC member and still have
the capacity to maintain the project, or if you are willing to actively
contribute to the project as a contributor, please reply to this email,
thanks!


Regards,
Qian Zhang


Re: Future transfer of MesosCon 2015 videos

2023-03-14 Thread Qian Zhang
Thanks Dave for taking care of this!

I'd prefer to host the MesosCon 2015 videos together with the videos of
other MesosCons so we will have a single place for all videos.


Regards,
Qian Zhang


On Tue, Mar 14, 2023 at 2:20 AM Dave Lester  wrote:

> I’m looking into options for moving video of MesosCon 2015 presentations
> (the project's community conference that took place in Seattle with 700+
> attendees) from their current stand-alone YouTube channel (
> https://www.youtube.com/@mesoscon881) to a larger channel that’s more
> official.
>
> Motivation
> Apache Mesos became a top-level ASF project 10 years ago and I’d like these
> presentation videos to be accessible for many more years. To ensure that
> they aren’t lost or accidentally deleted I believe they should be saved to
> an official channel. Migration becomes less likely and more difficult with
> time so I’m prioritizing this now.
>
> Anticipated Impact
> Minimal. In the options I’m currently exploring the links will change but
> videos will remain discoverable via YouTube search. Previous YouTube view
> counts will likely be lost. Once the videos are uploaded to a new channel
> the previous MesosCon channel will be deleted.
>
> Paths Being Explored
> We’ve been given the green light to use The Apache Software Foundation
> YouTube channel and would likely post videos in a separate event playlist.
> I also plan to contact The Linux Foundation who managed the event to see if
> they’d prefer to host the videos themselves (the LF already hosts video
> from MesosCon 2016 and 2017).
>
> Your feedback is welcome! I’ll continue to work on this in the coming weeks
> and will report back with an updated link once the transfer is complete.
>
> Best,
> Dave Lester
> Apache Mesos PMC Member and Co-chair of MesosCon 2015
>


Re: marathon bug(?)

2021-06-22 Thread Qian Zhang
Hi Marc,

Here are the issue tracker and support info of Marathon:
https://github.com/mesosphere/marathon/issues
https://mesosphere.github.io/marathon/support.html

You may want to report issues there, but I am not sure if the D2IQ folks
still maintain Marathon.


Regards,
Qian Zhang


On Tue, Jun 22, 2021 at 3:50 PM Marc  wrote:

>
> Where should marathon bugs be reported?
>
> I have switched from marathon 1.9 to 1.11.24
>
> I have a task with a dedicated IP address, and I used to restart such a
> task via the GUI by 'changing' the config and clicking something like
> 'change and deploy'. This 'change and deploy' is not working any more; it
> does nothing.
>
> PS. Choosing restart does not work because the restart first creates a
> new task and then kills the old one, which of course never works with a
> dedicated IP. (I reported this before)
>
>
>


Re: State of the Project

2021-05-31 Thread Qian Zhang
>
> Yes I think it's important to mention, in response to Javi's point,
> that one doesn't need to be a hard-core C++ dev to contribute.


Exactly! To be honest, I did not have much C++ programming experience when I
started contributing to Mesos. But when I read the Mesos code, I found it
easy to understand and very well designed. Although Mesos runs
multi-threaded, you actually do not need to take care of locking/race
conditions in most cases (thanks to the actor model in libprocess). So I'd
encourage everyone to read the Mesos code and let us know which area you'd
like to contribute to :)
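For readers unfamiliar with the actor model, here is a toy illustration in plain Python (this is not libprocess itself, just the serialization idea it provides): each actor drains its mailbox on a single thread, so the actor's state needs no locks.

```python
import queue
import threading

class Actor:
    """Toy actor: messages run serially on one dedicated thread,
    so state owned by the actor needs no locking."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            func = self._mailbox.get()
            if func is None:  # shutdown sentinel
                return
            func()

    def dispatch(self, func):
        # Enqueue work to run on the actor's own thread,
        # analogous in spirit to libprocess's dispatch().
        self._mailbox.put(func)

    def terminate(self):
        self._mailbox.put(None)
        self._thread.join()

class Counter(Actor):
    def __init__(self):
        super().__init__()
        self.value = 0  # only ever touched from the actor thread

    def increment(self):
        # No lock needed: all increments are serialized by the mailbox.
        self.dispatch(lambda: setattr(self, "value", self.value + 1))
```

In libprocess, `dispatch()` plays the analogous role in C++, routing calls onto the target process's serialized execution context.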


Regards,
Qian Zhang


On Mon, May 31, 2021 at 3:08 AM Charles-François Natali 
wrote:

> Le dim. 30 mai 2021 à 16:09, Qian Zhang  a écrit :
> > [...] So given the active committers and contributors that we have in
> > the community, I do not think we can do anything big in the short term;
> > instead we should do small things to gradually activate the community.
> > Here is what I have in mind:
> > 1. Review and merge the outstanding PRs.
> > 2. Review the tickets in JIRA and select some high-priority ones to work
> > on.
> > 3. Add at least one new committer.
>
> Yes I think it's important to mention, in response to Javi's point,
> that one doesn't need to be a hard-core C++ dev to contribute.
> The code base is actually very clean and easy to read; the main
> problem is the use of the libprocess/actor model, which takes some
> getting used to, especially for people who are more used to reactor,
> green-thread, etc. models. The libprocess doc [1] gives a good
> overview. The stout doc [2] is also worth a read, although nothing
> surprising about its design.
>
> But in any case I think there's a lot of valuable work which doesn't
> require any C++, for example as Qian mentions going through the huge
> backlog of JIRA issues.
> I know it doesn't sound like the most exciting thing but it would
> actually help a lot to do some triage, try to reproduce bugs, close
> stale tickets, respond to user questions etc.
>
> I remember seeing a few other people answering Qian's call for
> contributors a couple of months ago [3]; it'd be great if they could
> reach out if they're still interested - if not, that's fine, I know we're
> all busy with our lives :).
>
> Cheers,
>
> Charles
>
>
> [1] https://github.com/apache/mesos/tree/master/3rdparty/libprocess
> [2] https://github.com/apache/mesos/tree/master/3rdparty/stout
> [3]
> https://mail-archives.apache.org/mod_mbox/mesos-dev/202103.mbox/%3CCABY6VOb%3DT8VxehVaS1YBrC5_odEwKhZzj3R4o3b-ykCytDw3JA%40mail.gmail.com%3E
>
>
> >
> > Please let me know for any comments / suggestions, thanks!
> >
> >
> > Regards,
> > Qian Zhang
> >
> >
> > On Sun, May 30, 2021 at 7:59 PM Zahoor  wrote:
> >
> > > Hi
> > >
> > > Better to rewrite/redesign Mesos in a more popular language (like
> > > golang) to attract more developers.
> > > Just thinking out loud.
> > >
> > > ../Zahoor
> > >
> > >
> > > On Sun, May 30, 2021 at 4:08 PM Javi Roman 
> > > wrote:
> > >
> > >> Totally agree that the main problem with this project is trying to
> > >> increase the developer community.
> > >>
> > >> From my point of view, attracting new developers to a project of this
> > >> complexity is difficult (it needs low-level C++ developers, and
> > >> creating Java and Python bindings is not easy). However, if we try to
> > >> broaden the objectives of the project we may be able to attract other
> > >> developers (not only C++ developers) who can help.
> > >>
> > >> One idea I have always had is to incorporate the concept and
> > >> technology of D2IQ DC/OS [1]; in this way we would continue the
> > >> abandoned work of D2IQ by extending Apache Mesos to a more
> > >> user-friendly technology and broaden the base of developers with
> > >> interest in other areas (ReactJS, Go, Scala, databases).
> > >>
> > >> I would be interested in contributing along this line, applying my
> > >> knowledge in areas beyond C++ (which unfortunately I am not
> > >> proficient in).
> > >>
> > >> [1] https://github.com/dcos
> > >> --
> > >> Javi Roman
> > >>
> > >> Twitter: @javiromanrh
> > >> GitHub: github.com/javiroman
> > >> Linkedin: es.linkedin.com/in/javiroman
> > >> Big Data Blog: dataintensive.info
> > >>
> > >>
> > >> On Sat, May 29, 2021 at 12:47 P

Re: State of the Project

2021-05-30 Thread Qian Zhang
Sorry for the late response; I was just back from a business trip and have
started reviewing the PRs.

I think the current state is: from the PMC side, Andrei S and I can do the
reviews (and maybe Andrei B and Ben Mahler as well?), and Charles
and Andreas are doing code contributions (thank you!). So given the
active committers and contributors that we have in the community, I do not
think we can do anything big in the short term; instead we should do small
things to gradually activate the community. Here is what I have in mind:
1. Review and merge the outstanding PRs.
2. Review the tickets in JIRA and select some high-priority ones to work on.
3. Add at least one new committer.

Please let me know if you have any comments/suggestions, thanks!


Regards,
Qian Zhang


On Sun, May 30, 2021 at 7:59 PM Zahoor  wrote:

> Hi
>
> Better to rewrite/redesign Mesos in a more popular language (like
> golang) to attract more developers.
> Just thinking out loud.
>
> ../Zahoor
>
>
> On Sun, May 30, 2021 at 4:08 PM Javi Roman 
> wrote:
>
>> Totally agree that the main problem with this project is trying to
>> increase the developer community.
>>
>> From my point of view, attracting new developers to a project of this
>> complexity is difficult (it needs low-level C++ developers, and creating
>> Java and Python bindings is not easy). However, if we try to broaden the
>> objectives of the project we may be able to attract other developers (not
>> only C++ developers) who can help.
>>
>> One idea I have always had is to incorporate the concept and technology
>> of D2IQ DC/OS [1]; in this way we would continue the abandoned work of
>> D2IQ by extending Apache Mesos to a more user-friendly technology and
>> broaden the base of developers with interest in other areas (ReactJS, Go,
>> Scala, databases).
>>
>> I would be interested in contributing along this line, applying my
>> knowledge in areas beyond C++ (which unfortunately I am not proficient
>> in).
>>
>> [1] https://github.com/dcos
>> --
>> Javi Roman
>>
>> Twitter: @javiromanrh
>> GitHub: github.com/javiroman
>> Linkedin: es.linkedin.com/in/javiroman
>> Big Data Blog: dataintensive.info
>>
>>
>> On Sat, May 29, 2021 at 12:47 PM Charles-François Natali <
>> cf.nat...@gmail.com> wrote:
>>
>>> Hi Renan,
>>>
>>> > Renaming the topic because apparently we need to have this discussion
>>> again.
>>>
>>> Thanks for bringing this up again, because it is indeed still a problem.
>>>
>>> > Therefore, the PMC *must* add members or the project  *will* fizzle out
>>> > and die.
>>> >
>>> > I'd also be curious to see if we even have enough PMC members to form a
>>> > quorum at the moment as I only see Andrei Sekretenko reviewing pull
>>> > requests on Github and the new chair Qian Zhang on emails. The project
>>> > needs three PMC members for the project to be considered in a good
>>> state
>>> > according to the Apache guidelines [0].
>>> >
>>>
>>> I must say I'm also a bit confused.
>>> The new project chair was elected exactly a month ago [1].
>>> Since then, the only thing I have seen - there might be more going on
>>> behind the scenes - is a single thread calling for input on new
>>> technical direction [2], which as several people mentioned before is
>>> not the most important issue the project is facing right now.
>>> As far as I can tell, nothing has been done by the PMC/project chair to
>>> address the more fundamental issue of the health of the community.
>>> Now, Andrei has been doing a great job at reviewing MRs, but as
>>> mentioned before he only has so much time available, and the project
>>> can't have only one active committer.
>>> So it would be good to hear from the project chair what they are
>>> planning to do, if anything, to address this situation.
>>> From some private conversations I know that they have been busy with
>>> other obligations in the past month, so maybe it's only bad timing
>>> and just a transient state; however, I don't think it's viable to
>>> continue if even the project chair doesn't have any time to dedicate
>>> to the project - not even replying to this thread.
>>>
>>> > At this point I suggest the PMC does a roll call and gets Apache board
>>> > members involved so that they can be aware of the situation.
>>>
>>> I'm not familiar with the ASF, but yes, it does sound like a possible
>>> course of action?
>>>
>>> Cheers,

Re: [BULK]Discuss the possible technical directions of Mesos

2021-05-24 Thread Qian Zhang
Hi Charles,

I agree that we should re-activate the community first. Both Andrei and I
can help review patches; we have complementary Mesos knowledge: he can help
with patches for the Mesos master/allocator and I can do so for the Mesos
agent/containerizer.

> several fixing bugs which basically make Mesos unusable on a recent Linux
> distro

Can you please elaborate a bit on this? Do you mean Mesos does not work on a
recent Linux distro? If so, I think we can start to fix the issues and
maybe do a patch release for that.


Regards,
Qian Zhang


On Fri, May 21, 2021 at 2:57 AM Charles-François Natali 
wrote:

> Hey,
>
> Sorry for being a killjoy and repeating myself, but as mentioned in
> the past, I don't think that technical direction is the most important
> problem right now - community is.
> Coming up with a medium/long-term technical roadmap doesn't do much if
> there are no contributors to implement it, and no users to use it.
>
> The following issues which have been brought up are still not resolved:
> - very few committers willing to review and merge MRs - currently only
> Andrei Sekretenko is doing that, and I'm sure he's busy with his day
> job so only has so much bandwidth
> - very few people contribute MRs and triage/address JIRA issues -
> AFAICT it's pretty much Andreas and me
>
> So I think the first thing to do would be to address those problems.
> Some suggestions which come to mind:
> - to the remaining committers who'd still like to salvage the project,
> please take some time to review and merge MRs -
> https://github.com/apache/mesos/pulls has a few open, several fixing
> bugs which basically make Mesos unusable on a recent Linux distro
> - to the various users who've said they were interested in keeping the
> project alive: start contributing. It doesn't have to be anything big,
> just get familiar with the code base:
>   * start going through JIRA and triage bugs, closing invalid/stale
> ones, tackling small issues
>   * submit MRs so that the test suite passes on your OS
>   * submit MRs to merge various commits you have in your private repos
> if applicable
>
> Then in a few months, once the project is back to having a small
> active contributor base, they can together decide how to take the
> project forward, and start addressing larger projects.
>
> Cheers,
>
> Charles
>
>
>
>
>
>
> Le jeu. 20 mai 2021 à 18:16, Gregoire Seux  a écrit :
> >
> > Hi,
> >
> > Interesting set of suggestions! Here are a few comments:
> >
> >   *   Mesos feels simple to deploy (only very few components: zookeeper,
> masters and agents), customization is done mostly through configuration
> files. I don't think there is a strong need to make it easier (even though
> I've used Mesos for years, so I'm pretty used to the difficulty if any)
> >   *   Having to manage Zookeeper adds some complexity but since
> > the Zookeeper piece is required to operate Marathon (which is our main
> framework), I don't see much value in the investment required to get rid of
> this dependency.
> >   *   Taking advantage of NUMA topology by default would be a good
> addition although I don't see it as strategic (at least we have solved this
> on our clusters with custom modules)
> >   *   I would love to see improvement on masters scalability for large
> clusters (our largest cluster is 3500 nodes and may start to suffer from
> the actor model)
> >
> > Something that I see as a very significant drawback to the ecosystem at
> > large is the difficulty of writing frameworks. In addition, most
> > open-source frameworks feel abandoned. Without good frameworks, the value
> > of Mesos really decreases a lot (although it is very technically strong).
> > I think making Mesos thrive would necessarily require a solution to
> > this issue.
> >
> > Something that I'd see as strategic would be the ability to deploy
> complex workloads on Mesos without having to write a new framework. Random
> idea: make Mesos really usable as a backend for Kubernetes (as a virtual
> kubelet). This would remove a lot of barriers to using Mesos as a strong
> engine to operate a fleet of servers while allowing to use the Kubernetes
> API that apparently everybody loves.
> >
> > What do you think?
> >
> > --
> > Grégoire Seux
> >
>


Discuss the possible technical directions of Mesos

2021-05-20 Thread Qian Zhang
Hi folks,

I'd like to restart the discussion around the future of the Mesos project.
In my previous email, I collected some feature requests for Mesos (more are
welcome):

   1. HPC
   2. NUMA support.
   3. IPv6 support.
   4. Complete CSI support.
   5. Decouple Apache ZooKeeper.
   6. Making Mesos easier to deploy.
   7. Making Mesos UI fully functional (it's kind of read-only currently)

The last Mesos release (1.11.0) was done around 6 months ago. If we want to
do a new release, IMO just some small improvements on the existing features
or internal refactors are not enough; we may need to release something new
to indicate that this project is headed in a new direction, so that we can
possibly reactivate it and attract more people to the community.

I feel that we should not position Mesos as a generic container
orchestrator; instead we should find a specific scenario for it. Maybe HPC
is a good direction? But we definitely need more concrete ideas for it.
Please feel free to add any comments/suggestions, thanks!


Regards,
Qian Zhang


Re: New PMC Chair

2021-05-01 Thread Qian Zhang
Hi Marc,

I will take the lead on this project. My plan is to collect the
requirements from the community and then come up with a roadmap, which I
will send here for discussion.

Please feel free to let me know if you have any comments or suggestions,
thanks!

Regards,
Qian Zhang


On Fri, Apr 30, 2021 at 4:36 PM Marc  wrote:

> I do not really understand what this means, and how it affects (the future
> of) the mesos project. Can anyone elaborate on that?
>
>
>
> > -Original Message-
> > From: Vinod Kone
> > Sent: 29 April 2021 16:35
> > To: dev ; user 
> > Subject: New PMC Chair
> >
> > Hi community,
> >
> > Just wanted to let you all know that the board passed the resolution to
> > elect a new PMC chair!
> >
> > Hearty congratulations to Qian Zhang for becoming the new Apache Mesos
> > PMC chair and VP of the project.
> >
> > Thanks,
>


Re: [BULK]Re: New PMC Chair

2021-04-29 Thread Qian Zhang
Thanks Vinod! Really appreciate your contributions and guidance to this
great project.

It's a huge honor to be the new PMC chair, I will try my best to work with
the community and make Mesos better.


Regards,
Qian Zhang


On Fri, Apr 30, 2021 at 1:16 AM Thomas Langé  wrote:

> That's really good news! Well done!
>
> Thomas
> --
> *From:* Charles-François Natali 
> *Sent:* Thursday, 29 April 2021 19:12
> *To:* user 
> *Cc:* dev ; Vinod Kone 
> *Subject:* [BULK]Re: New PMC Chair
>
> Congratulations!
>
>
>
> On Thu, 29 Apr 2021, 23:37 Andreas Peters,  wrote:
>
> Great to hear. :-)
>
> Am 29.04.21 um 16:35 schrieb Vinod Kone:
> > Hi community,
> >
> > Just wanted to let you all know that the board passed the resolution to
> > elect a new PMC chair!
> >
> > Hearty congratulations to *Qian Zhang* for becoming the new Apache Mesos
> > PMC chair and VP of the project.
> >
> > Thanks,
> >
>
>


Call for new committers

2021-03-14 Thread Qian Zhang
Hi folks,

Please reply to this mail if you plan to actively contribute to Mesos and
want to become a new Mesos committer, thanks!


Regards,
Qian Zhang


Call for active contributors

2021-03-04 Thread Qian Zhang
Hi folks,

Please reply to this mail if you plan to actively contribute to Mesos and
want to become a committer and PMC member in future.


Regards,
Qian Zhang


Feature requests for Mesos

2021-02-28 Thread Qian Zhang
Hi Folks,

To reboot this awesome project, I'd like to collect feature requests for
Mesos. Please let us know your requirements for Mesos and whether you or
your organization would like to contribute to the implementation of the
requirements. Thanks!


Regards,
Qian Zhang


Re: Next Steps

2021-02-18 Thread Qian Zhang
Hi Vinod,

I am still interested in the project. As other folks said, we need to have
a direction for the project. I think there are still a lot of Mesos
users/customers on the mailing list; can you please send another mail to
collect their requirements and pain points with Mesos, and then we can try
to set up a roadmap for the project to move forward.


Regards,
Qian Zhang


On Thu, Feb 18, 2021 at 9:16 PM Andrei Sekretenko 
wrote:

> IIUC, Attic is not intended for projects which still have active users
> and thus might be in need of fixing bugs.
>
> Key items about moving project to Attic:
> > It is not intended to:
> > - Rebuild community
> > - Make bugfixes
> > - Make releases
>
> >Projects whose PMC are unable to muster 3 votes for a release, who have
> no active committers or are unable to fulfill their reporting duties to the
> board are all good candidates for the Attic.
>
> As a D2iQ employee, I can say that if we find a bug critical for our
> customers, we will be interested in fixing that. Should the project be
> moved into Attic, the fix will be present only in forks (which might
> mean our internal forks).
>
> I could imagine that other entities and people using Mesos are in a
> similar position with regards to bugfixes.
> If this is true, then moving the project to Attic in the near future
> is not a proper solution to the issue of insufficient bandwidth of the
> active PMC members/chair.
>
> ---
> A long-term future of the project is a different story, which, in my
> personal view, will "end" either in moving the project into Attic or
> in shifting the project direction from what it used to be in the
> recent few years to something substantially different. IMO, this
> requires a  _separate_ discussion.
>
> Damien's questions sound like a good starting point for that
> discussion, I'll try to answer them from my committer/PMC member
> perspective when I have enough time.
>
> On Thu, 18 Feb 2021 at 12:49, Charles-François Natali
>  wrote:
> >
> > Thanks Tomek, that's what I suspected.
> > It would therefore make it much more difficult for anyone to carry on
> since it would effectively have to be a fork, etc.
> > I think it'd be a bit of a shame, but I understand Benjamin's point.
> > I hope it can be avoided.
> >
> >
> > Cheers,
> >
> >
> >
> > On Thu, 18 Feb 2021, 11:02 Tomek Janiszewski,  wrote:
> >>
> >> Moving to the Attic makes a project read-only:
> >> https://attic.apache.org/
> >> https://attic.apache.org/projects/aurora.html
> >>
> >> czw., 18 lut 2021, 11:56 użytkownik Charles-François Natali <
> cf.nat...@gmail.com> napisał:
> >>>
> >>> I'm not familiar with the Attic, but would it still allow actual
> >>> development, making commits to the repository, etc.?
> >>>
> >>>
> >>> On Thu, 18 Feb 2021, 08:27 Benjamin Bannier, 
> wrote:
> >>>
> >>> > Hi Vinod,
> >>> >
> >>> > > I would like to start a discussion around the future of the Mesos
> >>> > project.
> >>> > >
> >>> > > As you are probably aware, the number of active committers and
> >>> > contributors
> >>> > > to the project have declined significantly over time. As of today,
> >>> > there's
> >>> > > no active development of any features or a public release planned.
> On the
> >>> > > flip side, I do know there are a few companies who are still
> actively
> >>> > using
> >>> > > Mesos.
> >>> >
> >>> > Thanks for starting this discussion Vinod. Looking at Slack, mailing
> >>> > lists, JIRA and reviewboard/github the project has wound down a lot
> in
> >>> > the last 12+ months.
> >>> >
> >>> > > Given that, we need to assess if there's interest in the community
> to
> >>> > keep
> >>> > > this project moving forward. Specifically, we need some active
> committers
> >>> > > and PMC members who are going to manage the project. Ideally,
> these would
> >>> > > be people who are using Mesos in some capacity and can make code
> >>> > > contributions.
> >>> >
> >>> > While I have seen a few non-committer folks contribute patches in the
> >>> > last months, I feel it might be too late to bootstrap an active
> >>> > community at this point.
> >>> >
> >>> > Apache Mesos is still mention

Re: Subject: [VOTE] Release Apache Mesos 1.11.0 (rc1)

2020-11-18 Thread Qian Zhang
+1

Regards,
Qian Zhang


On Thu, Nov 19, 2020 at 4:16 AM Till Toenshoff  wrote:

> +1
>
> Build using Apple clang version 12.0.0 (clang-1200.0.32.27) and ran on
> macOS 11.0.1 (20B29).
>
> > On 17. Nov 2020, at 15:53, Andrei Sekretenko 
> wrote:
> >
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.11.0.
> >
> > 1.11.0 includes the following:
> >
> 
> >  * CSI external volumes support: now, Mesos Containerizer supports
> >    using pre-provisioned external CSI storage volumes by means of the
> >    new `volume/csi` isolator. Also, the latter significantly extends
> >    the range of compatible 3rd-party CSI plugins compared to the
> >    previous SLRP-based solution (MESOS-10141).
> >
> >  * Constraints-based offer filtering: the Scheduler API adds an
> >    interface allowing frameworks to put constraints on agent
> >    attributes in resource offers, to help "picky" frameworks
> >    significantly reduce scheduling latency when close to being out of
> >    quota (MESOS-10161).
> >
> >  * CMake build becomes usable for deploying in production (MESOS-898).
> >
> > The CHANGELOG for the release is available at:
> >
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.11.0-rc1
> >
> 
> >
> > The candidate for Mesos 1.11.0 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.11.0-rc1/mesos-1.11.0.tar.gz
> >
> > The tag to be voted on is 1.11.0-rc1:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.11.0-rc1
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.11.0-rc1/mesos-1.11.0.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.11.0-rc1/mesos-1.11.0.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1260
> >
> > Please vote on releasing this package as Apache Mesos 1.11.0!
> >
> > The vote is open until 2020 Nov 20th 15:00 UTC at least, and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.11.0
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Andrei Sekretenko
>
>


Re: Assymetric route possible between agent and container?

2020-08-08 Thread Qian Zhang
I think you could try the `--http_executor_domain_sockets` flag, which was
introduced in Mesos 1.10.0.

--http_executor_domain_sockets: If true, the agent will provide a unix
domain socket that the executor can use to connect to the agent, instead of
relying on a TCP connection.
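As a plain illustration of the idea (not the Mesos executor API), the
following sketch shows a request/response exchange over an AF_UNIX socket,
the kind of transport this flag enables in place of TCP. The socket path
and payload here are invented for the example.

```python
import os
import socket
import tempfile
import threading

# Illustration only (not the Mesos executor API): a request/response
# exchange over an AF_UNIX socket. Path and payload are made up.
sock_path = os.path.join(tempfile.mkdtemp(), "agent.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)

def agent_side():
    # The "agent" accepts one executor connection and answers it.
    conn, _ = server.accept()
    conn.sendall(b"SUBSCRIBED")
    conn.close()

t = threading.Thread(target=agent_side)
t.start()

# The "executor" connects via the domain socket instead of a TCP host:port.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(sock_path)
reply = client.recv(64)
client.close()
t.join()
server.close()
print(reply.decode())  # SUBSCRIBED
```

The point of the flag is exactly this swap: the executor addresses the
agent by filesystem path rather than by IP and port, which sidesteps
routing issues between the agent and container networks.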



Regards,
Qian Zhang


On Sat, Aug 8, 2020 at 4:59 PM Marc Roos  wrote:

>
> "it is imperative that the Agent IP is reachable from the container IP
> and vice versa."
>
> Anyone know/tested if this can be an asymmetric route when you are
> having multiple networks?
>
> [1]
> http://mesos.apache.org/documentation/latest/cni/
>
>


Re: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

2020-05-27 Thread Qian Zhang
+1 (binding)

Regards,
Qian Zhang


On Thu, May 28, 2020 at 12:56 AM Benjamin Mahler  wrote:

> +1 (binding)
>
> On Mon, May 18, 2020 at 4:36 PM Andrei Sekretenko 
> wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.10.0.
>>
>> 1.10.0 includes the following major improvements:
>>
>> 
>> * support for resource bursting (setting task resource limits separately
>> from requests) on Linux
>> * ability for an executor to communicate with an agent via Unix domain
>> socket instead of TCP
>> * ability for operators to modify reservations via the RESERVE_RESOURCES
>> master API call
>> * performance improvements of V1 operator API read-only calls bringing
>> them on par with V0 HTTP endpoints
>> * ability for a scheduler to expect that effects of calls sent through
>> the same connection will not be reordered/interleaved by master
>>
>> NOTE: 1.10.0 includes a breaking change for custom authorizer modules.
>> Now, `ObjectApprover`s may be stored by Mesos indefinitely and must be
>> kept up-to-date by an authorizer throughout their lifetime.
>> This allowed for several bugfixes and performance improvements.
>>
>> The CHANGELOG for the release is available at:
>>
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.10.0-rc1
>>
>> 
>>
>> The candidate for Mesos 1.10.0 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz
>>
>> The tag to be voted on is 1.10.0-rc1:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.10.0-rc1
>>
>> The SHA512 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1259
>>
>> Please vote on releasing this package as Apache Mesos 1.10.0!
>>
>> The vote is open until Fri, May 21, 19:00 CEST  and passes if a majority
>> of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.10.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Andrei Sekretenko
>>
>


Re: Number of tasks per executor and resource limits

2020-02-20 Thread Qian Zhang
>
> if we wanted to support multiple tasks per executor, could we leverage the
> mesos containeriser to easily put each task in its own cgroup?


If what your custom executor does is similar with default executor (i.e.,
launch each task in a task group as a nested container by calling Mesos
agent API `LAUNCH_CONTAINER`), then the answer is yes since we are going to
support nested cgroups for nested containers (i.e., each nested container
can have its own cgroups) in Mesos containerizer soon.
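To illustrate the limit-summing behaviour discussed in this thread, here
is a toy sketch with made-up numbers (not Mesos code): with a single
cgroup per executor, the memory limit is the sum of the task allocations,
so one task can exceed its own request without being caught.

```python
# Hypothetical numbers illustrating the problem: with one cgroup per
# executor, the memory limit is the *sum* of the task allocations.
tasks = {"task-a": 1024, "task-b": 2048}  # requested MB per task
executor_overhead = 32                    # MB for the executor itself

executor_limit = sum(tasks.values()) + executor_overhead
print(executor_limit)  # 3104

# task-a using 1.5 GB stays under the shared limit even though it
# exceeds its own 1 GB request; only per-task (nested) cgroups catch it.
usage = {"task-a": 1536, "task-b": 512}  # actual MB used
within_shared_limit = sum(usage.values()) <= executor_limit
overruns_own_request = usage["task-a"] > tasks["task-a"]
print(within_shared_limit, overruns_own_request)  # True True
```

Nested cgroups close this gap by giving each nested container its own
limit instead of only the aggregate.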


Regards,
Qian Zhang


On Fri, Feb 21, 2020 at 4:46 AM Vinod Kone  wrote:

> Andrei, Qian: Can one of you answer the above question?
>
> On Thu, Feb 20, 2020 at 7:15 PM Charles-François Natali <
> cf.nat...@gmail.com> wrote:
>
>> Thanks for the quick reply!
>>
>> I think we're going to go for one executor per task for now, that's much
>> simpler.
>>
>> Otherwise I was wondering - if we wanted to support multiple tasks per
>> executor, could we leverage the mesos containeriser to easily put each task
>> in its own cgroup?
>>
>>
>>
>> Is it just a matter of setting 'execu
>> On Thu, 20 Feb 2020, 17:40 Vinod Kone,  wrote:
>>
>>> Hi Charles,
>>>
>>> We are actually working on a new feature that puts each of the tasks (of
>>> the default executor) in its own cgroup so that they can be individually
>>> limited. See https://issues.apache.org/jira/browse/MESOS-9916 . For
>>> custom executor, you would be on your own to implement the same. Also, your
>>> custom executor / scheduler should be able to limit one task per executor
>>> if you so desire.
>>>
>>> On Thu, Feb 20, 2020 at 6:34 PM Charles-François Natali <
>>> cf.nat...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there a way to force Mesos to use one executor per task?
>>>> The reason I would want to do that is for resources limits: for
>>>> example, when using cgroup to limit CPU and memory, IIUC the containeriser
>>>> sets limits corresponding to the sum of the resources allocated to the
>>>> tasks managed by the underlying executor.
>>>>
> >>>> Which means that for example if a task is allocated 1GB and another
> >>>> 2GB, if they are started by the same executor, the containeriser
> >>>> will limit the total memory for the two tasks (plus the executor) to 3GB.
>>>> But, unless I'm missing something, nothing prevents a process (assuming one
>>>> process per task with e.g. the command executor or a custom executor) from
>>>> using more than its limit.
>>>>
>>>> Obviously it would be possible for the executor to enforce the
>>>> individual limits itself using cgroup - does the command/default executor
>>>> do that? - but in our case where we use a custom executor it would be quite
>>>> painful.
>>>>
>>>> Unless I'm missing something?
>>>>
>>>> Thanks in advance for suggestions,
>>>>
>>>> Charles
>>>>
>>>


Re: cni iptables best practice

2020-01-28 Thread Qian Zhang
I do not think we plan to do it in the short term.

Regards,
Qian Zhang


On Tue, Jan 28, 2020 at 1:54 AM Marc Roos  wrote:

>
>  Hi Qian,
>
> Any idea on when this cni 0.3 is going to be implemented? I saw the
> issue priority is Major, can't remember if it was always like this. But
> looks promising.
>
> Regards,
> Marc
>
>
>
>
> -Original Message-
> Sent: 14 December 2019 09:46
> To: user
> Subject: RE: cni iptables best practice
>
>
> Yes, yes I know, disaster. I wondered how or even if people are using
> iptables with tasks. Even in an internal environment it could be nice to
> use, no?
>
>
>
> -Original Message-
> To: user
> Subject: Re: cni iptables best practice
>
> You are right, we do not support CNI chaining plugins yet, and I think
> there is a ticket to track it:
> https://issues.apache.org/jira/browse/MESOS-7079.
>
>
> Regards,
> Qian Zhang
>
>
> On Sat, Dec 14, 2019 at 7:08 AM Marc Roos 
> wrote:
>
>
> Is anyone applying iptables rules in their cni networking, and how?
> I wrote an iptables chaining plugin but cannot use it because cni
> 0.3.0 is still not supported in mesos 1.9. I wondered how this is
> done currently.
>


Re: cni iptables best practice

2019-12-13 Thread Qian Zhang
You are right, we do not support CNI chaining plugins yet, and I think there
is a ticket to track it: https://issues.apache.org/jira/browse/MESOS-7079.

Regards,
Qian Zhang


On Sat, Dec 14, 2019 at 7:08 AM Marc Roos  wrote:

>
>
> Is anyone applying iptables rules in their cni networking, and how? I
> wrote an iptables chaining plugin but cannot use it because cni 0.3.0
> is still not supported in mesos 1.9. I wondered how this is done
> currently.
>
>
>
>
>
>
>
>


Re: Mesos task example json

2019-10-11 Thread Qian Zhang
Hi Marc,

Here is an example json that I use for testing:

{
  "name": "test",
  "task_id": {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 128}}
  ],
  "command": {
"value": "sleep 10"
  },
  "container": {
"type": "MESOS",
"mesos": {
  "image": {
"type": "DOCKER",
"docker": {
  "name": "busybox"
}
  }
}
  }
}
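Before feeding a task definition like the one above to `mesos-execute`, it
can help to sanity-check it programmatically. A small sketch (the master
host and file path in the comment are placeholders):

```python
import json

# The TaskInfo JSON from above, embedded as a string; in practice you
# would save it to a file and run something like:
#   mesos-execute --master=<master-host>:5050 --task=file:///tmp/task.json
# (host and path are placeholders).
task = json.loads("""
{
  "name": "test",
  "task_id": {"value": "test"},
  "agent_id": {"value": ""},
  "resources": [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 1}},
    {"name": "mem", "type": "SCALAR", "scalar": {"value": 128}}
  ],
  "command": {"value": "sleep 10"},
  "container": {
    "type": "MESOS",
    "mesos": {
      "image": {"type": "DOCKER", "docker": {"name": "busybox"}}
    }
  }
}
""")

# Pull out a couple of fields to confirm the structure parsed as expected.
cpus = next(r for r in task["resources"] if r["name"] == "cpus")
print(task["task_id"]["value"], cpus["scalar"]["value"])  # test 1
```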


Regards,
Qian Zhang


On Sat, Oct 12, 2019 at 6:26 AM Marc Roos  wrote:

>
>
> Is there some example json available with all options for use with
> 'mesos-execute --task='
>
>


[RESULT][VOTE] Release Apache Mesos 1.9.0 (rc3)

2019-09-04 Thread Qian Zhang
Hi all,

The vote for Mesos 1.9.0 (rc3) has passed with the
following votes.

+1 (Binding)
--
Vinod Kone
Gilbert Song
Chun-Hung Hsiao

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.9.0

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0

The mesos-1.9.0.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Qian and Gilbert


Re: W0831 containerizer.cpp:2375] Ignoring update for unknown container

2019-09-02 Thread Qian Zhang
It seems the container had already been destroyed when the Mesos agent
tried to update it.

Regards,
Qian Zhang


On Sun, Sep 1, 2019 at 12:05 AM Marc Roos  wrote:

>
> Why do get this message? How to resolve this?
>
> W0831 18:01:45.403295 2943686 containerizer.cpp:2375] Ignoring update
> for unknown container 48d9b77c-7348-4404-9845-211be74bad1d
>
> mesos-1.8.1-2.0.1.el7.x86_64
>
>
>
>
>


Re: Please some help regression testing a task

2019-09-02 Thread Qian Zhang
Can you check if the task is throttled? You can run `cat /proc/<pid>/cgroup`
to get the cgroups of the task, and then check the `cpu.stat` file under the
task's CPU cgroup, e.g.:

$ cat
> /sys/fs/cgroup/cpuacct/mesos/bd5bc588-7565-4c7e-a5f0-d33850b2ec0a/cpu.stat
> nr_periods 118
> nr_throttled 37
> throttled_time 633829202
>

If `nr_throttled` is greater than 0, then that means the task was throttled
which may affect its performance.
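The check above can be scripted. A minimal sketch that parses `cpu.stat`
content (using the sample numbers quoted above) and flags throttling:

```python
def parse_cpu_stat(text):
    """Parse the key/value lines of a cgroup v1 cpu.stat file."""
    return {key: int(value)
            for key, value in (line.split() for line in text.splitlines() if line)}

# Sample cpu.stat content copied from the message above; in practice read
# it from /sys/fs/cgroup/cpuacct/mesos/<container-id>/cpu.stat.
sample = """nr_periods 118
nr_throttled 37
throttled_time 633829202"""

stat = parse_cpu_stat(sample)
throttled = stat["nr_throttled"] > 0
print(throttled, stat["throttled_time"])  # True 633829202
```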


Regards,
Qian Zhang


On Sat, Aug 31, 2019 at 11:48 PM Marc Roos  wrote:

>
>
> mesos-1.8.1-2.0.1.el7.x86_64
> CentOS Linux release 7.6.1810 (Core)
>
>
>
> -Original Message-
> To: user
> Subject: Please some help regression testing a task
>
>
> I have a task that underperforms. I am unable to discover what is
> causing it. Could this be something Mesos-specific?
> The performance difference is 1k q/s vs 20k q/s.
>
>
> 1. If manually I run the task on the host the performance is ok
> > I think one could rule out network connectivity on/of the host and
> > host issues
>
>
> 2. If I manually run a task in the same netns as the under performing
> task, the performance is ok.
>   ip netns exec bind bash
>   chroot 04a81d99-9b99-410d-bf83-d6d70ef2c7bb/
>   (changed only the config port to 54)
>   named -u named
> > I think we can rule out netns issues
>
>
> 3. If I manually remove or change the cgroups of the mesos/marathon
> task, the performance is still bad
>
> echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks
>
> or
>
> echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
> echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks
>
>
> [@]# cat /proc/2936696/cgroup
> 11:hugetlb:/
> 10:memory:/user.slice
> 9:devices:/user.slice
> 8:cpuacct,cpu:/user.slice
> 7:perf_event:/
> 6:cpuset:/
> 5:pids:/user.slice
> 4:freezer:/
> 3:blkio:/user.slice
> 2:net_prio,net_cls:/
> 1:name=systemd:/user.slice/user-0.slice/session-17385.scope
>
> [@]# cat /proc/2932859/cgroup
> 11:hugetlb:/
> 10:memory:/user.slice
> 9:devices:/user.slice
> 8:cpuacct,cpu:/user.slice
> 7:perf_event:/
> 6:cpuset:/
> 5:pids:/user.slice
> 4:freezer:/
> 3:blkio:/user.slice
> 2:net_prio,net_cls:/
> 1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2
>
>
>
>
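When comparing cgroup membership like this, a small script can diff the
two `/proc/<pid>/cgroup` listings and show exactly which hierarchies
differ; a sketch using the listings from the message above:

```python
def parse_proc_cgroup(text):
    """Map controller name(s) -> cgroup path from /proc/<pid>/cgroup output."""
    result = {}
    for line in text.strip().splitlines():
        _, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result

# First listing from the message above (task moved into user.slice).
manual = """11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-17385.scope"""

# Second listing differs only in the systemd (mesos) hierarchy.
mesos = manual.replace(
    "/user.slice/user-0.slice/session-17385.scope",
    "/mesos/812c481b-c0a4-444a-aafa-de98da9698e2")

a, b = parse_proc_cgroup(manual), parse_proc_cgroup(mesos)
diff = {controller for controller in a if a[controller] != b[controller]}
print(diff)  # {'name=systemd'}
```

Here only the `name=systemd` hierarchy differs, which supports the
conclusion that the resource-controlling cgroups are not what separates
the two processes.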


[VOTE] Release Apache Mesos 1.9.0 (rc3)

2019-09-01 Thread Qian Zhang
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.9.0.


1.9.0 includes the following:

* Agent draining
* Support configurable /dev/shm and IPC namespace.
* Containerizer debug endpoint.
* Add `no-new-privileges` isolator.
* Client side SSL certificate verification in Libprocess.

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc3


The candidate for Mesos 1.9.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc3/mesos-1.9.0.tar.gz

The tag to be voted on is 1.9.0-rc3:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc3

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc3/mesos-1.9.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc3/mesos-1.9.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1257

Please vote on releasing this package as Apache Mesos 1.9.0!

The vote is open until Thursday, Sep 5 and passes if a majority of at least
3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.9.0
[ ] -1 Do not release this package because ...


Thanks,
Qian and Gilbert


Re: Large container image failing to start 'first' time

2019-08-30 Thread Qian Zhang
These are the logs from when the container was being destroyed; we need the
logs from when the container was launched to figure out why it was stuck in
the provisioning state.


Regards,
Qian Zhang


On Sat, Aug 31, 2019 at 6:24 AM Marc Roos  wrote:

>
> I only have these two messages
>
>
> mesos-slave.ERROR:E0828 12:51:46.146246 2663200 slave.cpp:6486]
> Container '680d3849-2b2a-4549-8842-8ef358599478' for executor
> 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework
> d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is
> being destroyed during provisioning
> mesos-slave.INFO:E0828 12:51:46.146246 2663200 slave.cpp:6486] Container
> '680d3849-2b2a-4549-8842-8ef358599478' for executor
> 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework
> d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is
> being destroyed during provisioning
> mesos-slave.INFO:W0828 12:51:46.650323 2663184 containerizer.cpp:2375]
> Ignoring update for unknown container
> 680d3849-2b2a-4549-8842-8ef358599478
> mesos-slave.WARNING:E0828 12:51:46.146246 2663200 slave.cpp:6486]
> Container '680d3849-2b2a-4549-8842-8ef358599478' for executor
> 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework
> d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is
> being destroyed during provisioning
> mesos-slave.WARNING:W0828 12:51:46.650323 2663184
> containerizer.cpp:2375] Ignoring update for unknown container
> 680d3849-2b2a-4549-8842-8ef358599478
>
>
>
>
> -Original Message-
> From: Qian Zhang [mailto:zhq527...@gmail.com]
> Sent: woensdag 28 augustus 2019 15:07
> To: Marc Roos
> Cc: user
> Subject: Re: Large container image failing to start 'first' time
>
> Can you please send the full logs about this container (just grep
> 680d3849-2b2a-4549-8842-8ef358599478 in agent log)? And is there
> anything left in the staging directory (`--docker_store_dir/staging/`)
> when this issue happens?
>
>
> Regards,
> Qian Zhang
>
>
> On Wed, Aug 28, 2019 at 7:07 PM Marc Roos 
> wrote:
>
>
>  I had this again.
>
> E0828 12:51:46.146246 2663200 slave.cpp:6486] Container
> '680d3849-2b2a-4549-8842-8ef358599478' for executor
> 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of
> framework
> d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start:
> Container is
> being destroyed during provisioning
>
>
>
> -Original Message-
> From: Qian Zhang [mailto:zhq527...@gmail.com]
> Sent: dinsdag 20 augustus 2019 1:12
> To: user
> Subject: Re: Large container image failing to start 'first' time
>
> >
>
> Did you see any errors/warnings in agent logs when the container
> failed to start?
>
>
> Regards,
> Qian Zhang
>
>
> On Mon, Aug 19, 2019 at 10:46 PM Marc Roos 
> wrote:
>
> I have a container image of around 800MB. I am not sure if that is a
> lot. But I have noticed it is probably too big for a default setup to
> get it to launch. I think the only reason it launches eventually is
> because data is cached and no timeout expires. The container will
> launch eventually when you constrain it to a host.
>
> How can I trace where this timeout occurs? Are there options to
> specify timeouts?
>


[RESULT] [VOTE] Release Apache Mesos 1.9.0 (rc2)

2019-08-30 Thread Qian Zhang
The vote for this rc is cancelled since we need to fix the blocking issue
MESOS-9956 <https://issues.apache.org/jira/browse/MESOS-9956>.

Regards,
Qian Zhang


On Fri, Aug 30, 2019 at 6:36 AM Chun-Hung Hsiao  wrote:

> -1 for https://issues.apache.org/jira/browse/MESOS-9956. I'm working on a
> fix for it.
>
> On Wed, Aug 28, 2019 at 4:13 AM Qian Zhang  wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.9.0.
>>
>>
>> 1.9.0 includes the following:
>>
>> 
>> * Agent draining
>> * Support configurable /dev/shm and IPC namespace.
>> * Containerizer debug endpoint.
>> * Add `no-new-privileges` isolator.
>> * Client side SSL certificate verification in Libprocess.
>>
>> The CHANGELOG for the release is available at:
>>
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc2
>>
>> 
>>
>> The candidate for Mesos 1.9.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz
>>
>> The tag to be voted on is 1.9.0-rc2:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc2
>>
>> The SHA512 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1256
>>
>> Please vote on releasing this package as Apache Mesos 1.9.0!
>>
>> The vote is open until  and passes if a majority of at least 3 +1 PMC
>> votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.9.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> Thanks,
>> Qian and Gilbert
>>
>


Re: Large container image failing to start 'first' time

2019-08-28 Thread Qian Zhang
Can you please send the full logs about this container (just grep
680d3849-2b2a-4549-8842-8ef358599478 in agent log)? And is there anything
left in the staging directory (`--docker_store_dir/staging/`) when this
issue happens?


Regards,
Qian Zhang


On Wed, Aug 28, 2019 at 7:07 PM Marc Roos  wrote:

>  I had this again.
>
> E0828 12:51:46.146246 2663200 slave.cpp:6486] Container
> '680d3849-2b2a-4549-8842-8ef358599478' for executor
> 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework
> d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is
> being destroyed during provisioning
>
>
>
> -----Original Message-
> From: Qian Zhang [mailto:zhq527...@gmail.com]
> Sent: dinsdag 20 augustus 2019 1:12
> To: user
> Subject: Re: Large container image failing to start 'first' time
>
> >
>
> Did you see any errors/warnings in agent logs when the container failed
> to start?
>
>
> Regards,
> Qian Zhang
>
>
> On Mon, Aug 19, 2019 at 10:46 PM Marc Roos 
> wrote:
>
>
>
> I have a container image of around 800MB. I am not sure if that is a
> lot. But I have noticed it is probably too big for a default setup to
> get it to launch. I think the only reason it launches eventually is
> because data is cached and no timeout expires. The container will
> launch eventually when you constrain it to a host.
>
> How can I trace where this timeout occurs? Are there options to
> specify timeouts?
>
>
>
>
>
>
>
>
>
>
>


[VOTE] Release Apache Mesos 1.9.0 (rc2)

2019-08-28 Thread Qian Zhang
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.9.0.


1.9.0 includes the following:

* Agent draining
* Support configurable /dev/shm and IPC namespace.
* Containerizer debug endpoint.
* Add `no-new-privileges` isolator.
* Client side SSL certificate verification in Libprocess.

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc2


The candidate for Mesos 1.9.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz

The tag to be voted on is 1.9.0-rc2:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc2

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc2/mesos-1.9.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1256

Please vote on releasing this package as Apache Mesos 1.9.0!

The vote is open until  and passes if a majority of at least 3 +1 PMC votes
are cast.

[ ] +1 Release this package as Apache Mesos 1.9.0
[ ] -1 Do not release this package because ...


Thanks,
Qian and Gilbert


Re: [VOTE] Release Apache Mesos 1.9.0 (rc1)

2019-08-27 Thread Qian Zhang
The commits 25070f2
<https://github.com/apache/mesos/commit/25070f232a9bb97d1b78f8a7e5b774bbd50654f9>,
95201cb
<https://github.com/apache/mesos/commit/95201cbe4dc87eae2fde5754d16f5effbb6c1974>
and
7303313
<https://github.com/apache/mesos/commit/73033130de7872c6f240b9b05dced039d7666138>
have been reverted on 1.9.x branch, so I will cut rc2 later today.


Regards,
Qian Zhang


On Wed, Aug 28, 2019 at 4:40 AM Vinod Kone  wrote:

> I see. That reduces the risk considerably compared to what I originally
> thought, but I guess it is still risky to introduce it so late?
>
> On Tue, Aug 27, 2019 at 1:28 PM Benjamin Mahler 
> wrote:
>
>> > We upgraded the version of the bundled boost very late in the release
>> cycle
>>
>> Did we? We still bundle boost 1.65.0, just like we did during 1.8.x. We
>> just adjusted our special stripped bundle to include additional headers.
>>
>> On Tue, Aug 27, 2019 at 1:39 PM Vinod Kone  wrote:
>>
>>> -1
>>>
>>> We upgraded the version of the bundled boost very late in the release
>>> cycle
>>> which doesn't give downstream customers (who also depend on boost) enough
>>> time to vet any compatibility/perf/other issues. I propose we revert the
>>> boost upgrade (and the corresponding code changes depending on the
>>> upgrade)
>>> in 1.9.x branch but keep it in the master branch.
>>>
>>> On Tue, Aug 27, 2019 at 4:18 AM Qian Zhang  wrote:
>>>
>>> > Hi all,
>>> >
>>> > Please vote on releasing the following candidate as Apache Mesos 1.9.0.
>>> >
>>> >
>>> > 1.9.0 includes the following:
>>> >
>>> >
>>> 
>>> > * Agent draining
>>> > * Support configurable /dev/shm and IPC namespace.
>>> > * Containerizer debug endpoint.
>>> > * Add `no-new-privileges` isolator.
>>> > * Client side SSL certificate verification in Libprocess.
>>> >
>>> > The CHANGELOG for the release is available at:
>>> >
>>> >
>>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc1
>>> >
>>> >
>>> 
>>> >
>>> > The candidate for Mesos 1.9.0 release is available at:
>>> >
>>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz
>>> >
>>> > The tag to be voted on is 1.9.0-rc1:
>>> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc1
>>> >
>>> > The SHA512 checksum of the tarball can be found at:
>>> >
>>> >
>>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.sha512
>>> >
>>> > The signature of the tarball can be found at:
>>> >
>>> >
>>> https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.asc
>>> >
>>> > The PGP key used to sign the release is here:
>>> > https://dist.apache.org/repos/dist/release/mesos/KEYS
>>> >
>>> > The JAR is in a staging repository here:
>>> > https://repository.apache.org/content/repositories/orgapachemesos-1255
>>> >
>>> > Please vote on releasing this package as Apache Mesos 1.9.0!
>>> >
>>> > The vote is open until Friday, April 30 and passes if a majority of at
>>> > least 3 +1 PMC votes are cast.
>>> >
>>> > [ ] +1 Release this package as Apache Mesos 1.9.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> >
>>> > Thanks,
>>> > Qian and Gilbert
>>> >
>>>
>>


[VOTE] Release Apache Mesos 1.9.0 (rc1)

2019-08-27 Thread Qian Zhang
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.9.0.


1.9.0 includes the following:

* Agent draining
* Support configurable /dev/shm and IPC namespace.
* Containerizer debug endpoint.
* Add `no-new-privileges` isolator.
* Client side SSL certificate verification in Libprocess.

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.9.0-rc1


The candidate for Mesos 1.9.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz

The tag to be voted on is 1.9.0-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.9.0-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.9.0-rc1/mesos-1.9.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1255

Please vote on releasing this package as Apache Mesos 1.9.0!

The vote is open until Friday, April 30 and passes if a majority of at
least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.9.0
[ ] -1 Do not release this package because ...


Thanks,
Qian and Gilbert


Re: Large container image failing to start 'first' time

2019-08-19 Thread Qian Zhang
> Large container image failing to start 'first' time
Did you see any errors/warnings in agent logs when the container failed to
start?


Regards,
Qian Zhang


On Mon, Aug 19, 2019 at 10:46 PM Marc Roos  wrote:

>
> I have a container image of around 800MB. I am not sure if that is a
> lot. But I have noticed it is probably too big for a default setup to get
> it to launch. I think the only reason it launches eventually is because
> data is cached and no timeout expires. The container will launch
> eventually when you constrain it to a host.
>
> How can I trace where this timeout occurs? Are there options to specify
> timeouts?
>


Re: Mesos 1.9.0 release

2019-08-14 Thread Qian Zhang
Sorry, I forgot to make it shareable, can you please try again?
And we are aiming to cut the 1.9.x branch and rc1 next Wednesday.


Regards,
Qian Zhang


On Wed, Aug 14, 2019 at 11:03 AM Benjamin Mahler  wrote:

> Thanks for taking this on Qian!
>
> I seem to be unable to view the dashboard.
> Also, when are we aiming to make the cut?
>
> On Tue, Aug 13, 2019 at 10:58 PM Qian Zhang  wrote:
>
>> Folks,
>>
>> It is time for Mesos 1.9.0 release and I am the release manager. Here is
>> the dashboard:
>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334125
>>
>> Please start to wrap up patches if you are contributing or shepherding
>> any issues. If you expect any particular JIRA for this new release, please
>> set "Target Version" as "1.9.0" and mark it as "Blocker" priority.
>>
>>
>> Regards,
>> Qian Zhang
>>
>


Mesos 1.9.0 release

2019-08-13 Thread Qian Zhang
Folks,

It is time for Mesos 1.9.0 release and I am the release manager. Here is
the dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12334125

Please start to wrap up patches if you are contributing or shepherding any
issues. If you expect any particular JIRA issue to make this new release,
please set its "Target Version" to "1.9.0" and mark it as "Blocker" priority.


Regards,
Qian Zhang


Re: Is it not time the cni configuration dir loads only specific extensions?

2019-07-28 Thread Qian Zhang
Can you please check the file
`/etc/mesos-cni/91-podman-bridge-not.conflist.bak`? It seems that file does
not have the required field `type`.
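The error above comes from the CNI isolator parsing every file in the CNI config directory, including backup files. A minimal sketch of the `type` check it is effectively performing (hypothetical helper; the real check lives in Mesos' C++ CNI isolator via protobuf parsing):

```python
import json

def validate_cni_conflist(text):
    """Return error strings for plugin entries missing the required
    `type` field, mimicking the parse failure reported by cni.cpp."""
    config = json.loads(text)
    # A .conflist holds a "plugins" array; a plain .conf is one plugin.
    plugins = config.get("plugins", [config])
    errors = []
    for i, plugin in enumerate(plugins):
        if "type" not in plugin:
            errors.append("plugin %d: missing required field: type" % i)
    return errors

good = '{"cniVersion": "0.4.0", "name": "podman", "plugins": [{"type": "bridge"}]}'
bad = '{"cniVersion": "0.4.0", "name": "podman", "plugins": [{"bridge": true}]}'
assert validate_cni_conflist(good) == []
assert validate_cni_conflist(bad) == ["plugin 0: missing required field: type"]
```

Renaming a config file to `.bak` does not help here, since the isolator still tries to load it; moving it out of the config directory would.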

Regards,
Qian Zhang


On Sun, Jul 28, 2019 at 1:04 AM Marc Roos  wrote:

>
>
> mesos-slave.m03.invalid-user.log.WARNING.20190726-124341.9083:E0726
> 12:43:41.716598  9083 cni.cpp:330] Failed to parse CNI network
> configuration file '/etc/mesos-cni/91-podman-bridge-not.conflist.bak':
> Protobuf parse failed: Missing required fields: type
>


Make IPC namespace and `/dev/shm` configurable in Mesos containerizer

2019-05-17 Thread Qian Zhang
Hi Folks,

I am working on MESOS-9788
<https://issues.apache.org/jira/browse/MESOS-9788> to make IPC namespace
and `/dev/shm` configurable in Mesos containerizer. Here is the design doc
<https://docs.google.com/document/d/10t1jf97vrejUWEVSvxGtqw4vhzfPef41JMzb5jw7l1s/edit?usp=sharing>.
Please feel free to let me know if you have any comments/feedback; you can
reply to this mail or comment on the design doc directly.

Thanks!

Regards,
Qian Zhang


Docker Image Manifest Version 2 Schema 2 Support

2019-03-22 Thread Qian Zhang
Hi Folks,

We are working on MESOS-6934
<https://jira.apache.org/jira/browse/MESOS-6934> to make UCR support
Docker image manifest version 2 schema 2; currently UCR only supports
version 2 schema 1, which may be deprecated by some major registries in
the future. Here is the design doc
<https://docs.google.com/document/d/1AU5IXMbR0AGlustNs-_62L4M3UvOuYKgjun8SlnCsS0/edit?usp=sharing>.
Please feel free to let us know if you have any comments/feedback; you can
reply to this mail or comment on the design doc directly. Thanks!


Regards,
Qian Zhang


Re: Propose to run nested container as the same user of its parent container by default

2018-11-08 Thread Qian Zhang
Agree with Gilbert, we should actually run nested containers (rather than
just debug containers) as the same user as their parent container by default.

Please let me know if you have any concerns, thanks!
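The inheritance rule being proposed can be sketched as a tiny selection function (a hypothetical helper for illustration only; the actual logic is in the Mesos agent's C++ code):

```python
def default_container_user(parent_user, executor_user):
    """Pick the default user for a nested container under the proposed
    rule: inherit the parent container's user when it is set, instead
    of always falling back to the executor's user (the old behavior).
    `parent_user`/`executor_user` are hypothetical parameter names."""
    return parent_user if parent_user is not None else executor_user

# A debug container nested under a root task inherits root, even if
# the executor (and framework) run as a normal user.
assert default_container_user("root", "nobody") == "root"
# With no parent user recorded, fall back to the executor's user.
assert default_container_user(None, "nobody") == "nobody"
```

The same rule applies recursively at each nesting level, which is what makes it sensible for third-level containers and beyond.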


Regards,
Qian Zhang


On Sat, Nov 3, 2018 at 2:56 AM Gilbert Song  wrote:

> @James Peach  agree, for debug containers, the default
> user should inherit from the parent, while CLI tooling (e.g., task exec)
> should provide an option `--root` (by setting the CommandInfo user as root).
>
> @Qian Zhang  @Benjamin Mahler  ,
> if we step back, it seems to me we should extend the user inheritance to
> all nested containers (instead of just debug containers). Currently the
> default user for a nested container comes from the executor (see this patch
> <https://github.com/apache/mesos/commit/558613cc72248b633bb5e26ef93708abca8ccbf0#diff-8fd185b932590eb8fa1c53964f7c5a82R1956>),
> which does not make sense for 3rd-level nested containers or further.
>
> I would suggest that for any type of nested container (debug container,
> check container, etc.), the user should just inherit from its parent's
> user. This would not change the behavior of the default executor, but it
> could change behavior for custom executors that support three or more
> levels of nesting.
>
> - Gilbert
>
> On Thu, Oct 25, 2018 at 9:51 AM Vinod Kone  wrote:
>
>> Sounds good to me.
>>
>> If I understand correctly, you want to treat this as a bug and backport
>> the fix to previous release branches? So, you are also asking whether
>> backporting this fix will be considered a breaking change for any existing
>> users?
>>
>> On Thu, Oct 25, 2018 at 11:46 AM James Peach  wrote:
>>
>>>
>>>
>>> On Oct 23, 2018, at 7:47 PM, Qian Zhang  wrote:
>>>
>>> Hi all,
>>>
>>> Currently, when launching a debug container (e.g., via `dcos task exec`
>>> or a command health check) to debug a task, by default the Mesos agent
>>> uses the executor's user as the debug container's user. There are
>>> actually 2 cases:
>>> 1. Command task: Since the command executor's user is the same as the
>>> command task's user, the debug container will be launched as the same
>>> user as the command task.
>>> 2. A task in a task group: The default executor's user is the same as
>>> the framework user, so in this case the debug container will be launched
>>> as the same user as the framework rather than the task.
>>>
>>> Basically I think the behavior of case 1 is correct. For case 2, we may
>>> run into a situation where the task runs as one user (e.g., root) but the
>>> debug container used to debug that task runs as another (e.g., a normal
>>> user, supposing the framework runs as a normal user); this may not be
>>> what the user expects.
>>>
>>> So I created MESOS-9332
>>> <https://issues.apache.org/jira/browse/MESOS-9332> and propose to run
>>> debug containers as the same user as their parent container (i.e., the
>>> task to be debugged) by default. Please let me know if you have any
>>> comments, thanks!
>>>
>>>
>>> This sounds like a sensible default to me. I can imagine for debug use
>>> cases you might want to run the debug container as root or give it elevated
>>> capabilities, but that should not be the default.
>>>
>>> J
>>>
>>


Re: Welcome Meng Zhu as PMC member and committer!

2018-11-01 Thread Qian Zhang
Congrats Meng!

Regards,
Qian Zhang


On Wed, Oct 31, 2018 at 4:50 PM Vinod Kone  wrote:

> Congrats Meng!
>
> Thanks,
> Vinod
>
> On Oct 31, 2018, at 4:26 PM, Gilbert Song  wrote:
>
> Well deserved, Meng!
>
> On Wed, Oct 31, 2018 at 2:36 PM Benjamin Mahler 
> wrote:
>
>> Please join me in welcoming Meng Zhu as a PMC member and committer!
>>
>> Meng has been active in the project for almost a year and has been very
>> productive and collaborative. He is now one of the few people who
>> understand the allocator code well, as well as the roadmap for this area
>> of the project. He has also found and fixed bugs, and helped users in slack.
>>
>> Thanks for all your work so far Meng, I'm looking forward to more of your
>> contributions in the project.
>>
>> Ben
>>
>


Propose to update the minimal supported Docker version in Mesos to Docker 1.8.0

2018-09-28 Thread Qian Zhang
Hi All,

When I worked on MESOS-9231
<https://issues.apache.org/jira/browse/MESOS-9231>, I found that the way we
run the `docker inspect` command seems incorrect; we should specify
`--type=container`, otherwise `docker inspect` may return an object
that is not a container (e.g., a volume).

So I posted a patch <https://reviews.apache.org/r/68872/> to add
`--type=container` to the `docker inspect` command. However, I found that
Docker only started to support `--type=container` in 1.8.0, while our minimum
supported Docker version is 1.0.0 (see here
<https://github.com/apache/mesos/blob/1.7.0/src/docker/docker.cpp#L152> for
details). I think it does not make sense for us to support Docker 1.0.0,
which is too old, so I propose to update the minimum supported Docker
version in Mesos to 1.8.0.
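To make the ambiguity concrete: `docker inspect NAME` without a type looks NAME up across containers, images, volumes, and networks, so a volume named like a container can shadow it. A minimal sketch of the pinned command line (a hypothetical helper; Mesos' real implementation is C++ in src/docker/docker.cpp):

```python
def docker_inspect_argv(name, docker="docker"):
    """Build the argv for `docker inspect`, pinning the object type to
    `container` so a volume or network with the same name is never
    returned by mistake. Note `--type=container` needs Docker >= 1.8.0,
    hence the proposed bump of the minimum supported version."""
    return [docker, "inspect", "--type=container", name]

# With the type pinned, inspecting a name that only matches a volume
# fails loudly instead of silently returning the wrong object.
assert docker_inspect_argv("my-task") == [
    "docker", "inspect", "--type=container", "my-task"]
```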

Please let me know if you have any comments, thanks!


Regards,
Qian Zhang


Re: Volume ownership and permission

2018-08-17 Thread Qian Zhang
Thanks James for your comments! Please see my replies inline.

> I assume that this scheme will only be supported on Linux, due to the
dependencies on the Linux ACLs and supplementary group behaviour?

Yes.

> Rewriting ACLs on volumes at each container launch sounds hugely
expensive. It's an IOP-bound process and there is an effectively unbounded
number of files in the volume. Would this serialize container cleanup?

My original thinking was that we would only set ACLs on the volume root dir
for each container; do you think we need to do it for each file & sub-dir
under the volume root dir? If so, then instead of setting ACLs another
option is: the first time a volume is used by a container, the volume ACL
manager generates a unique GID for the volume rather than the container, and
performs the two steps below for the volume root dir and each file and
sub-dir under it.

   1. change the owner group to the allocated GID
   2. set the `setgid` bit (this applies to dirs but not files)

And when a second container tries to use the same volume, the volume ACL
manager (we may give it another name) will just return the previously
allocated GID to the volume isolator, with no need to do anything further
for the volume. The volume ACL manager also needs to maintain a reference
count of the containers using each volume, and deallocate the GID when no
container is using the volume (i.e., reference count == 0). What do you
think?
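The two steps above can be sketched as a single walk over the volume (a rough illustration only, assuming Linux semantics; `claim_volume_gid` is a hypothetical name, and chowning to an arbitrary GID requires privileges the agent would have but a plain user does not):

```python
import os
import stat

def claim_volume_gid(volume_root, gid):
    """On first use of a volume: chgrp everything under it to the
    allocated GID, and set the setgid bit on directories (not files)
    so entries created later inherit the group automatically."""
    for dirpath, dirnames, filenames in os.walk(volume_root):
        os.chown(dirpath, -1, gid)              # keep owner, set group
        mode = os.stat(dirpath).st_mode
        os.chmod(dirpath, mode | stat.S_ISGID)  # setgid: directories only
        for f in filenames:
            os.chown(os.path.join(dirpath, f), -1, gid)
```

The walk is still IOP-bound on first use, but subsequent containers only need the cached GID plus a reference-count bump, which addresses the per-launch rewrite cost.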

And what did you mean by "serialize container cleanup"?

> It seems like ACL evaluation will mean that this scheme will only mostly
work. For example, if the container process UID matches a user ACE, then
access could be denied independently of the volume policy.

Did you mean the case where the supplementary group of the container process
is allowed to write to the volume (e.g., rwx) but the container process UID
is not (e.g., r-x), so the result is that the container cannot write to the
volume, which is not what we expect?
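That is exactly how POSIX ACL evaluation works: a matching named-user entry decides by itself, and group entries are never consulted for that UID. A simplified sketch (hypothetical function; the ACL mask is ignored for brevity):

```python
def acl_allows_write(uid, gids, user_aces, group_aces, other_perms):
    """Simplified POSIX ACL check: a matching named-user ACE wins
    outright, so a restrictive user entry (r-x) denies writes even when
    a supplementary-group entry grants them (rwx). Permissions are sets
    like {"r", "w", "x"}; `gids` is the process's group set."""
    if uid in user_aces:
        return "w" in user_aces[uid]            # user ACE decides alone
    matching = [p for g, p in group_aces.items() if g in gids]
    if matching:
        return any("w" in p for p in matching)  # any matching group grants
    return "w" in other_perms

# UID 1000 has r-x; its supplementary group 2000 has rwx: write denied.
assert not acl_allows_write(1000, {2000}, {1000: {"r", "x"}},
                            {2000: {"r", "w", "x"}}, set())
# Without a user ACE, the group grant applies.
assert acl_allows_write(1001, {2000}, {1000: {"r", "x"}},
                        {2000: {"r", "w", "x"}}, set())
```

This is why a per-volume group scheme only "mostly works": it cannot override a user ACE that independently denies access.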

> Will the VolumeAclManager apply a default ACL on the root of the volume?
Does this imply that when it updates the ACEs for the container GID, it
also needs to update the default ACLs on all directories?

Currently I do not think we need to set the default ACL on the volume root
dir.



Regards,
Qian Zhang

On Fri, Aug 17, 2018 at 12:38 AM, James Peach  wrote:

>
>
> > On Aug 15, 2018, at 6:22 PM, Qian Zhang  wrote:
> >
> > Hi Folks,
> >
> > We found some issues for the solutions of this project and propose a
> better
> > one, see here
> > <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-
> v4KWwjmnCR0l8V4Tq2U/edit#heading=h.tjuy5xk67tuu>
> > for details. Please let me know if you have any comments, thanks!
>
> Some general comments.
>
> I assume that this scheme will only be supported on Linux, due to the
> dependencies on the Linux ACLs and supplementary group behaviour?
>
> Rewriting ACLs on volumes at each container launch sounds hugely
> expensive. It's IOP-bound process and there are an effectively unbounded
> number of files in the volume. Would this serialize container cleanup?
>
> It seems like ACL evaluation will mean that this scheme will only mostly
> work. For example, if the container process UID matches a user ACE, then
> access could be denied independently of the volume policy.
>
> Will the VolumeAclManager apply a default ACL on the root of the volume?
> Does this imply that when it updates the ACEs for the container GID, it
> also needs to update the default ACLs on all directories?
>
> >
> >
> > Regards,
> > Qian Zhang
> >
> > On Sat, Apr 28, 2018 at 7:57 AM, Qian Zhang  wrote:
> >
> >>> The framework launched tasks in a group with different users? Sounds
> >> like they dug their own hole :)
> >>
> >> So you mean we should actually put a best practice or limitation in doc:
> >> when launching a task group with multiple tasks to share a SANDBOX
> volume
> >> of PARENT type, all the tasks should be run with the same user, and that
> >> user must be same with the user to launch the executor? Otherwise the
> task
> >> will not be able to write to the volume.
> >>
> >>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
> >> mount option. That is, it is mounted writeable, but says nothing about
> >> which credentials can write to it.
> >>
> >> Can you please elaborate a bit on this? What would you suggest for the
> >> "rw` volume mode?
> >>
> >>
> >> Regards,
> >> Qian Zhang
> >>
> >> On Fri, Apr 27, 2018 at 12:07 PM, James Peach  wrote:
> >>
> >>>
> >>>

Re: Volume ownership and permission

2018-08-15 Thread Qian Zhang
Hi Folks,

We found some issues with the solutions for this project and propose a
better one, see here
<https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.tjuy5xk67tuu>
for details. Please let me know if you have any comments, thanks!


Regards,
Qian Zhang

On Sat, Apr 28, 2018 at 7:57 AM, Qian Zhang  wrote:

> > The framework launched tasks in a group with different users? Sounds
> like they dug their own hole :)
>
> So you mean we should actually put a best practice or limitation in doc:
> when launching a task group with multiple tasks to share a SANDBOX volume
> of PARENT type, all the tasks should be run with the same user, and that
> user must be same with the user to launch the executor? Otherwise the task
> will not be able to write to the volume.
>
> > I'd argue that the "rw" on the sandbox path is analogous to the "rw"
> mount option. That is, it is mounted writeable, but says nothing about
> which credentials can write to it.
>
> Can you please elaborate a bit on this? What would you suggest for the
> "rw` volume mode?
>
>
> Regards,
> Qian Zhang
>
> On Fri, Apr 27, 2018 at 12:07 PM, James Peach  wrote:
>
>>
>>
>> > On Apr 26, 2018, at 7:25 PM, Qian Zhang  wrote:
>> >
>> > Hi James,
>> >
>> > Thanks for your comment!
>> >
>> > I think you are talking about the SANDBOX_PATH volume ownership issue
>> > mentioned in the design doc
>> > <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE
>> -v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
>> > IIUC, you prefer to leaving it to framework, i.e., framework itself
>> ought
>> > to be able to handle such issue. But I am curious how framework can
>> handle
>> > it in such situation. If the framework launches a task group with
>> different
>> > users and with a SANDBOX_PATH volume of PARENT type, the tasks in the
>> group
>> > will definitely fail to write to the volume due to the ownership issue
>> > though the volume's mode is set to "rw". So in this case, how should
>> > framework handle it?
>>
>> The framework launched tasks in a group with different users? Sounds like
>> they dug their own hole :)
>>
>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
>> mount option. That is, it is mounted writeable, but says nothing about
>> which credentials can write to it.
>>
>> > And if we want to document it, what is our recommended
>> > solution in the doc?
>> >
>> >
>> >
>> > Regards,
>> > Qian Zhang
>> >
>> > On Fri, Apr 27, 2018 at 1:16 AM, James Peach  wrote:
>> >
>> >> I commented on the doc, but at least some of the issues raised there I
>> >> would not regard as issues. Rather, they are about setting expectations
>> >> correctly and ensuring that we are documenting (and maybe enforcing)
>> >> sensible behavior.
>> >>
>> >> I'm not that keen on Mesos automatically "fixing" filesystem
>> permissions
>> >> and we should proceed down that path with caution, especially in the
>> ACLs
>> >> case.
>> >>
>> >>> On Apr 10, 2018, at 3:15 AM, Qian Zhang  wrote:
>> >>>
>> >>> Hi Folks,
>> >>>
>> >>> I am working on MESOS-8767 to improve Mesos volume support regarding
>> >> volume ownership and permission, here is the design doc. Please feel
>> free
>> >> to let me know if you have any comments/feedbacks, you can reply this
>> mail
>> >> or comment on the design doc directly. Thanks!
>> >>>
>> >>>
>> >>> Regards,
>> >>> Qian Zhang
>> >>
>> >>
>>
>>
>


Behavior change for the cgroups mounts inside container

2018-07-08 Thread Qian Zhang
Hi Folks,

Recently we made a behavior change to the cgroups mounts inside containers
launched by UCR in the ticket MESOS-8327
<https://issues.apache.org/jira/browse/MESOS-8327>: For a container with
its own rootfs, before the change it would see the same cgroups as the agent
host; after the change, we do container-specific cgroups mounts in the
container's mount namespace so it only sees its own cgroups, which is also
the default behavior of Docker. See more details in the design doc
<https://docs.google.com/document/d/1aaPmoKRupTSplFoQoT8op1aktm0vOmnuF5JXN0g_aTY/edit?usp=sharing>
.

Please feel free to let us know if you have any comments, thanks!


Regards,
Qian Zhang


Re: mesos-execute cmd

2018-06-22 Thread Qian Zhang
> In relation to your last comment, it does make sense to launch one task
on multiple nodes: think about fault tolerance, where one needs replicas in
case of failures in the datacenter.

The use case that you mentioned makes sense, but that is not launching one
Mesos task on multiple nodes; instead it is launching an application that
has multiple instances, where each instance (as a Mesos task) is launched on
a different node (you can try Marathon, a popular Mesos framework that
supports this use case). In Mesos, you can never launch a single task on
multiple nodes.


Regards,
Qian Zhang

On Fri, Jun 22, 2018 at 2:30 PM, Abel Souza  wrote:

> Thank you Qian,
>
> Yes, I believe a framework is needed for handling my case. I was just
> wondering if it could be possible to simply execute one specific command in
> multiple agent nodes using current mesos-execute command.
>
> In relation to your last comment, it does make sense to launch one task on
> multiple nodes: think about fault tolerance, where one needs replicas in
> case of failures in the datacenter.
>
> Best,
>
> /Abel Souza
>
>
> On Jun 21, 2018, at 2:44 AM, Qian Zhang  wrote:
>
> > It is possible to use multiple offers from a single agent node to
> launch a task, but I do not think you can use multiple offers from
> different agent nodes to launch a task.
>
> Abel, this is Mesos' general behavior. It does not make sense to launch a
> single task on multiple agent nodes, right? But you can write a custom
> framework to split a big resource request into multiple tasks, and launch
> each task on a separate agent node.
>
>
> Regards,
> Qian Zhang
>
> On Wed, Jun 20, 2018 at 10:15 PM, Tomek Janiszewski 
> wrote:
>
>> IMO it requires a custom framework that will handle splitting the task
>> across multiple nodes. I need more details to help.
>>
>> Wed, 20 Jun 2018 at 15:38 Abel Souza  wrote:
>>
>>> Anyone that could suggest me anything? Is it a problem that could be
>>> fixed by writing a custom framework?
>>>
>>> /Abel
>>>
>>> On 06/13/2018 06:05 AM, Abel Souza wrote:
>>>
>>> Did you mean through ‘mesos-execute’ command or is it a Mesos general
>>> behavior?
>>>
>>> Best,
>>>
>>> /Abel Souza
>>>
>>> On Jun 13, 2018, at 02:04, Qian Zhang  wrote:
>>>
>>> It is possible to use multiple offers from a single agent node to launch
>>> a task, but I do not think you can use multiple offers from different agent
>>> nodes to launch a task.
>>>
>>>
>>> Regards,
>>> Qian Zhang
>>>
>>> On Tue, Jun 12, 2018 at 9:12 PM, Abel Souza  wrote:
>>>
>>>> Hello,
>>>>
>>>> I believe this question relates to the framework used by the
>>>> mesos-execute command (available by default in Mesos installation):
>>>>
>>>> When I request a number of cores greater than what is available in one
>>>> single node, the mesos-execute automatically turn down all offers made
>>>> by Mesos and hangs forever. E.g.: Each agent node in my cluster has 8
>>>> cores, and when I request 9 cores through mesos-execute
>>>> --resources='cpus:9', the command waits forever. But If I execute 
>>>> mesos-execute
>>>> --resources='cpus:8', tasks start execution right away.
>>>>
>>>> So I would like to know if there is a way to enable the mesos-execute
>>>> to handle situations where multiple nodes are needed to satisfy a resource
>>>> request. If so, what would be needed?
>>>>
>>>> Thank you,
>>>>
>>>> /Abel Souza
>>>>
>>>
>>>
>>>
>
>


Re: mesos-execute cmd

2018-06-20 Thread Qian Zhang
> It is possible to use multiple offers from a single agent node to launch
a task, but I do not think you can use multiple offers from different agent
nodes to launch a task.

Abel, this is Mesos' general behavior. It does not make sense to launch a
single task on multiple agent nodes, right? But you can write a custom
framework to split a big resource request into multiple tasks, and launch
each task on a separate agent node.
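The splitting such a framework would do can be sketched in a few lines (a toy greedy scheduler under assumed names; a real framework would also handle offer declines, task status updates, and failures):

```python
def split_request(total_cpus, offers):
    """Greedily carve a resource request that fits no single offer into
    per-agent tasks, one per offer, until the total is covered.
    `offers` maps a hypothetical agent id to its offered cpus.
    Returns None when the cluster cannot cover the request."""
    tasks, remaining = [], total_cpus
    for agent, cpus in offers.items():
        if remaining <= 0:
            break
        use = min(cpus, remaining)
        tasks.append((agent, use))
        remaining -= use
    return tasks if remaining <= 0 else None

# cpus:9 cannot fit on one 8-core agent, but splits across two.
assert split_request(9, {"agent1": 8, "agent2": 8}) == [("agent1", 8), ("agent2", 1)]
# cpus:20 exceeds the whole cluster, so the request cannot be placed.
assert split_request(20, {"agent1": 8, "agent2": 8}) is None
```

This is the piece `mesos-execute` deliberately lacks: it maps one request to one task, so an unsatisfiable request just declines offers forever.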


Regards,
Qian Zhang

On Wed, Jun 20, 2018 at 10:15 PM, Tomek Janiszewski 
wrote:

> IMO it requires a custom framework that will handle splitting the task
> across multiple nodes. I need more details to help.
>
> Wed, 20 Jun 2018 at 15:38 Abel Souza  wrote:
>
>> Anyone that could suggest me anything? Is it a problem that could be
>> fixed by writing a custom framework?
>>
>> /Abel
>>
>> On 06/13/2018 06:05 AM, Abel Souza wrote:
>>
>> Did you mean through ‘mesos-execute’ command or is it a Mesos general
>> behavior?
>>
>> Best,
>>
>> /Abel Souza
>>
>> On Jun 13, 2018, at 02:04, Qian Zhang  wrote:
>>
>> It is possible to use multiple offers from a single agent node to launch
>> a task, but I do not think you can use multiple offers from different agent
>> nodes to launch a task.
>>
>>
>> Regards,
>> Qian Zhang
>>
>> On Tue, Jun 12, 2018 at 9:12 PM, Abel Souza  wrote:
>>
>>> Hello,
>>>
>>> I believe this question relates to the framework used by the
>>> mesos-execute command (available by default in Mesos installation):
>>>
>>> When I request a number of cores greater than what is available in one
>>> single node, the mesos-execute automatically turn down all offers made
>>> by Mesos and hangs forever. E.g.: Each agent node in my cluster has 8
>>> cores, and when I request 9 cores through mesos-execute
>>> --resources='cpus:9', the command waits forever. But If I execute 
>>> mesos-execute
>>> --resources='cpus:8', tasks start execution right away.
>>>
>>> So I would like to know if there is a way to enable the mesos-execute
>>> to handle situations where multiple nodes are needed to satisfy a resource
>>> request. If so, what would be needed?
>>>
>>> Thank you,
>>>
>>> /Abel Souza
>>>
>>
>>
>>


Re: mesos-execute cmd

2018-06-12 Thread Qian Zhang
It is possible to use multiple offers from a single agent node to launch a
task, but I do not think you can use multiple offers from different agent
nodes to launch a task.


Regards,
Qian Zhang

On Tue, Jun 12, 2018 at 9:12 PM, Abel Souza  wrote:

> Hello,
>
> I believe this question relates to the framework used by the mesos-execute
> command (available by default in Mesos installation):
>
> When I request a number of cores greater than what is available in one
> single node, mesos-execute automatically turns down all offers made by
> Mesos and hangs forever. E.g.: Each agent node in my cluster has 8 cores,
> and when I request 9 cores through mesos-execute --resources='cpus:9',
> the command waits forever. But if I execute mesos-execute
> --resources='cpus:8', tasks start execution right away.
>
> So I would like to know if there is a way to enable the mesos-execute to
> handle situations where multiple nodes are needed to satisfy a resource
> request. If so, what would be needed?
>
> Thank you,
>
> /Abel Souza
>


Re: Volume ownership and permission

2018-04-27 Thread Qian Zhang
> The framework launched tasks in a group with different users? Sounds like
they dug their own hole :)

So you mean we should actually put a best practice or limitation in the doc:
when launching a task group with multiple tasks sharing a SANDBOX volume
of PARENT type, all the tasks should run as the same user, and that user
must be the same as the user used to launch the executor? Otherwise the
task will not be able to write to the volume.

> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
mount option. That is, it is mounted writeable, but says nothing about
which credentials can write to it.

Can you please elaborate a bit on this? What would you suggest for the "rw"
volume mode?


Regards,
Qian Zhang

On Fri, Apr 27, 2018 at 12:07 PM, James Peach <jor...@gmail.com> wrote:

>
>
> > On Apr 26, 2018, at 7:25 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> >
> > Hi James,
> >
> > Thanks for your comment!
> >
> > I think you are talking about the SANDBOX_PATH volume ownership issue
> > mentioned in the design doc
> > <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-
> v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
> > IIUC, you prefer to leaving it to framework, i.e., framework itself ought
> > to be able to handle such issue. But I am curious how framework can
> handle
> > it in such situation. If the framework launches a task group with
> different
> > users and with a SANDBOX_PATH volume of PARENT type, the tasks in the
> group
> > will definitely fail to write to the volume due to the ownership issue
> > though the volume's mode is set to "rw". So in this case, how should
> > framework handle it?
>
> The framework launched tasks in a group with different users? Sounds like
> they dug their own hole :)
>
> I'd argue that the "rw" on the sandbox path is analogous to the "rw" mount
> option. That is, it is mounted writeable, but says nothing about which
> credentials can write to it.
>
> > And if we want to document it, what is our recommended
> > solution in the doc?
> >
> >
> >
> > Regards,
> > Qian Zhang
> >
> > On Fri, Apr 27, 2018 at 1:16 AM, James Peach <jpe...@apache.org> wrote:
> >
> >> I commented on the doc, but at least some of the issues raised there I
> >> would not regard as issues. Rather, they are about setting expectations
> >> correctly and ensuring that we are documenting (and maybe enforcing)
> >> sensible behavior.
> >>
> >> I'm not that keen on Mesos automatically "fixing" filesystem permissions
> >> and we should proceed down that path with caution, especially in the
> ACLs
> >> case.
> >>
> >>> On Apr 10, 2018, at 3:15 AM, Qian Zhang <zhq527...@gmail.com> wrote:
> >>>
> >>> Hi Folks,
> >>>
> >>> I am working on MESOS-8767 to improve Mesos volume support regarding
> >> volume ownership and permission, here is the design doc. Please feel
> free
> >> to let me know if you have any comments/feedbacks, you can reply this
> mail
> >> or comment on the design doc directly. Thanks!
> >>>
> >>>
> >>> Regards,
> >>> Qian Zhang
> >>
> >>
>
>


Re: Volume ownership and permission

2018-04-26 Thread Qian Zhang
Hi James,

Thanks for your comment!

I think you are talking about the SANDBOX_PATH volume ownership issue
mentioned in the design doc
<https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
IIUC, you prefer leaving it to the framework, i.e., the framework itself
ought to be able to handle such issues. But I am curious how a framework
can handle it in this situation. If the framework launches a task group
with different users and with a SANDBOX_PATH volume of PARENT type, the
tasks in the group will definitely fail to write to the volume due to the
ownership issue even though the volume's mode is set to "rw". So in this
case, how should the framework handle it? And if we want to document it,
what is our recommended solution in the doc?
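To make the failure concrete, here is a toy model (not Mesos code; the user
names and mode bits are made up for illustration) of the POSIX permission
check that bites here: the shared volume ends up owned by one task's user,
so a task running as a different user fails the write check even though the
volume is mounted "rw".

```python
# Toy model (NOT Mesos code) of the POSIX write-permission check that
# makes a "rw" SANDBOX_PATH volume unwritable across users.

def can_write(dir_owner, dir_mode, user):
    """Return True if `user` may write a directory owned by `dir_owner`
    with permission bits `dir_mode` (e.g. 0o750). Group membership is
    omitted for brevity."""
    if user == dir_owner:
        return bool(dir_mode & 0o200)  # owner write bit
    return bool(dir_mode & 0o002)      # "other" write bit

# Suppose the shared PARENT volume is created as the first task's user
# with a typical 0o750 mode; the volume is mounted "rw" for every task.
volume_owner, volume_mode = "alice", 0o750

print(can_write(volume_owner, volume_mode, "alice"))  # True
print(can_write(volume_owner, volume_mode, "bob"))    # False: ownership, not mount mode
```

The "rw" mount option never enters this check, which is the point of the
mount-option analogy above.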



Regards,
Qian Zhang

On Fri, Apr 27, 2018 at 1:16 AM, James Peach <jpe...@apache.org> wrote:

> I commented on the doc, but at least some of the issues raised there I
> would not regard as issues. Rather, they are about setting expectations
> correctly and ensuring that we are documenting (and maybe enforcing)
> sensible behavior.
>
> I'm not that keen on Mesos automatically "fixing" filesystem permissions
> and we should proceed down that path with caution, especially in the ACLs
> case.
>
> > On Apr 10, 2018, at 3:15 AM, Qian Zhang <zhq527...@gmail.com> wrote:
> >
> > Hi Folks,
> >
> > I am working on MESOS-8767 to improve Mesos volume support regarding
> volume ownership and permission, here is the design doc. Please feel free
> to let me know if you have any comments/feedbacks, you can reply this mail
> or comment on the design doc directly. Thanks!
> >
> >
> > Regards,
> > Qian Zhang
>
>


Re: Questions about secret handling in Mesos

2018-04-21 Thread Qian Zhang
Hi Aditya,

Yeah, you are right. `hostSecretPath` is a sub-directory under the agent's
runtime dir, and the default value of the agent's runtime dir is
`/var/run/mesos`, which is a tmpfs. So the secret is written to tmpfs on
the agent host.
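As a rough illustration, the path derivation from the C++ snippet quoted
below can be sketched in Python. The `SECRET_DIR` value and variable names
here are assumptions for illustration; only `flags.runtime_dir` defaulting
to `/var/run/mesos` comes from this thread.

```python
# Hypothetical Python rendering (not Mesos code) of how the agent derives
# hostSecretPath from its runtime dir, per the C++ snippet quoted below.
import os
import uuid

SECRET_DIR = "secrets"          # illustrative constant value
runtime_dir = "/var/run/mesos"  # agent default; a tmpfs

host_secret_tmp_dir = os.path.join(runtime_dir, SECRET_DIR)
host_secret_path = os.path.join(host_secret_tmp_dir, str(uuid.uuid4()))

# Everything under the runtime dir lives on the tmpfs, so the secret
# never touches persistent storage on the agent host.
print(host_secret_path.startswith("/var/run/mesos/secrets/"))  # True
```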


Regards,
Qian Zhang

On Sat, Apr 21, 2018 at 8:19 AM, Aditya Bhave <adity...@uber.com> wrote:

> Hi Qian,
>
> Secret is written to file at hostSecretPath which is derived like this:
>
> const string hostSecretPath = path::join(flags.runtime_dir, SECRET_DIR,
> stringify(id::UUID::random()));
> Also,
> const string hostSecretTmpDir = path::join(flags.runtime_dir, SECRET_DIR);
> Is the hostSecretTmpDir not located on tmpfs? The dir name alludes to
> this.
>
> Thanks,
> -Aditya
>
> On Fri, Apr 20, 2018 at 5:05 PM, Qian Zhang <zhq527...@gmail.com> wrote:
>
>> > When the secret is first downloaded on the mesos agent, it will be
>> stored as "root" on the tmpfs/ramfs before being mounted in the container
>> ramfs.
>>
>> It seems the secret is not stored on the tmpfs/ramfs on the agent host,
>> we just write it into a file
>> <https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/mesos/isolators/volume/secret.cpp#L281>
>> under the agent's runtime directory, and then move it into the ramfs
>> <https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/mesos/isolators/volume/secret.cpp#L260:L267>
>> in the container when the container is launched.
>>
>>
>> Regards,
>> Qian Zhang
>>
>> On Fri, Apr 20, 2018 at 2:47 PM, Gilbert Song <gilb...@apache.org> wrote:
>>
>>> IIUC, your assumptions are all correct.
>>>
>>> @Kapil, could you please confirm? Maybe we could improve the document at
>>> the next Docathon.
>>>
>>> Gilbert
>>>
>>> On Thu, Apr 19, 2018 at 10:57 AM, Zhitao Li <zhitaoli...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We at Uber plan to use volume/secret isolator to send secrets from Uber
>>>> framework to Mesos agent.
>>>>
>>>> For this purpose, we are referring to these documents:
>>>>
>>>>- File based secrets design doc
>>>><https://docs.google.com/document/d/18raiiUfxTh-JBvjd6RyHe_
>>>> TOScY87G_bMi5zBzMZmpc/edit#>
>>>>and slides
>>>><http://schd.ws/hosted_files/mesosconasia2017/70/Secrets%20
>>>> Management%20in%20Mesos.pdf>
>>>>.
>>>>- Apache Mesos secrets documentation
>>>><http://mesos.apache.org/documentation/latest/secrets/>
>>>>
>>>> Could you please confirm that the following assumptions are correct?
>>>>
>>>>- Mesos agent and master will never log the secret data at any
>>>> logging
>>>>level;
>>>>- Mesos agent and master will never expose the secret data as part of
>>>>any API response;
>>>>- Mesos agent and master will never store the secret in any
>>>> persistent
>>>>storage, but only on tmpfs or ramfs;
>>>>- When the secret is first downloaded on the mesos agent, it will be
>>>>stored as "root" on the tmpfs/ramfs before being mounted in the
>>>> container
>>>>ramfs.
>>>>
>>>> If above assumptions are true, then I would like to see them documented
>>>> in
>>>> this as part of the Apache Mesos secrets documentation
>>>> <http://mesos.apache.org/documentation/latest/secrets/>. Otherwise,
>>>> we'd
>>>> like to have a design discussion with maintainer of the isolator.
>>>>
>>>> We appreciate your help regarding this. Thanks!
>>>>
>>>> Regards,
>>>> Aditya And Zhitao
>>>>
>>>
>>>
>>
>


Re: Questions about secret handling in Mesos

2018-04-20 Thread Qian Zhang
> When the secret is first downloaded on the mesos agent, it will be stored
as "root" on the tmpfs/ramfs before being mounted in the container ramfs.

It seems the secret is not stored on tmpfs/ramfs on the agent host; we
just write it into a file
<https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/mesos/isolators/volume/secret.cpp#L281>
under the agent's runtime directory, and then move it into the ramfs
<https://github.com/apache/mesos/blob/1.5.0/src/slave/containerizer/mesos/isolators/volume/secret.cpp#L260:L267>
in the container when the container is launched.


Regards,
Qian Zhang

On Fri, Apr 20, 2018 at 2:47 PM, Gilbert Song <gilb...@apache.org> wrote:

> IIUC, your assumptions are all correct.
>
> @Kapil, could you please confirm? Maybe we could improve the document at
> the next Docathon.
>
> Gilbert
>
> On Thu, Apr 19, 2018 at 10:57 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> Hello,
>>
>> We at Uber plan to use volume/secret isolator to send secrets from Uber
>> framework to Mesos agent.
>>
>> For this purpose, we are referring to these documents:
>>
>>- File based secrets design doc
>><https://docs.google.com/document/d/18raiiUfxTh-JBvjd6RyHe_
>> TOScY87G_bMi5zBzMZmpc/edit#>
>>and slides
>><http://schd.ws/hosted_files/mesosconasia2017/70/Secrets%20
>> Management%20in%20Mesos.pdf>
>>.
>>- Apache Mesos secrets documentation
>><http://mesos.apache.org/documentation/latest/secrets/>
>>
>> Could you please confirm that the following assumptions are correct?
>>
>>- Mesos agent and master will never log the secret data at any logging
>>level;
>>- Mesos agent and master will never expose the secret data as part of
>>any API response;
>>- Mesos agent and master will never store the secret in any persistent
>>storage, but only on tmpfs or ramfs;
>>- When the secret is first downloaded on the mesos agent, it will be
>>stored as "root" on the tmpfs/ramfs before being mounted in the
>> container
>>ramfs.
>>
>> If above assumptions are true, then I would like to see them documented in
>> this as part of the Apache Mesos secrets documentation
>> <http://mesos.apache.org/documentation/latest/secrets/>. Otherwise, we'd
>> like to have a design discussion with maintainer of the isolator.
>>
>> We appreciate your help regarding this. Thanks!
>>
>> Regards,
>> Aditya And Zhitao
>>
>
>


Re: Volume ownership and permission

2018-04-10 Thread Qian Zhang
Hi Marc,

I have shared the design doc to ensure anyone (no sign-in required) with
the link can comment, can you try again?



Regards,
Qian Zhang

On Tue, Apr 10, 2018 at 1:04 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:

>
> Cannot access it
>
>
>
> -Original Message-
> From: Qian Zhang [mailto:zhq527...@gmail.com]
> Sent: dinsdag 10 april 2018 12:16
> To: mesos; user
> Subject: Volume ownership and permission
>
> Hi Folks,
>
> I am working on MESOS-8767
> <https://issues.apache.org/jira/browse/MESOS-8767>  to improve Mesos
> volume support regarding volume ownership and permission, here is the
> design doc <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-
> v4KWwjmnCR0l8V4Tq2U/edit?usp=sharing>
> . Please feel free to let me know if you have any comments/feedbacks,
> you can reply this mail or comment on the design doc directly. Thanks!
>
>
> Regards,
> Qian Zhang
>
>
>


Volume ownership and permission

2018-04-10 Thread Qian Zhang
Hi Folks,

I am working on MESOS-8767
<https://issues.apache.org/jira/browse/MESOS-8767> to improve Mesos volume
support regarding volume ownership and permission, here is the design doc
<https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit?usp=sharing>.
Please feel free to let me know if you have any comments/feedbacks, you can
reply this mail or comment on the design doc directly. Thanks!


Regards,
Qian Zhang


Re: Welcome Zhitao Li as Mesos Committer and PMC Member

2018-03-12 Thread Qian Zhang
Congrats Zhitao!


Regards,
Qian Zhang

On Tue, Mar 13, 2018 at 6:30 AM, Jason Lai <ja...@jasonlai.net> wrote:

> Huge congrats, Zhitao!
>
> It is super awesome to have you represent Uber for our Mesos open source
> efforts! Well deserved!
>
> Jason
>
> On Mon, Mar 12, 2018 at 3:28 PM Chun-Hung Hsiao <chhs...@mesosphere.io>
> wrote:
>
> > Congrats Zhitao!
> >
> > On Mon, Mar 12, 2018 at 2:51 PM, Benjamin Mahler <bmah...@apache.org>
> > wrote:
> >
> > > Welcome Zhitao! Thanks for your contributions so far
> > >
> > > On Mon, Mar 12, 2018 at 2:02 PM, Gilbert Song <gilb...@apache.org>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am excited to announce that the PMC has voted Zhitao Li as a new
> > > > committer and member of PMC for the Apache Mesos project. Please join
> > me
> > > to
> > > > congratulate Zhitao!
> > > >
> > > > Zhitao has been an active contributor to Mesos for one and a half
> > years.
> > > > His main contributions include:
> > > >
> > > >- Designed and implemented Container Image Garbage Collection (
> > > >MESOS-4945 <https://issues.apache.org/jira/browse/MESOS-4945>);
> > > >- Designed and implemented part of the HTTP Operator API
> (MESOS-6007
> > > ><https://issues.apache.org/jira/browse/MESOS-6007>);
> > > >- Reported and fixed a lot of bugs
> > > ><https://issues.apache.org/jira/issues/?jql=type%20%3D%
> > > 20Bug%20AND%20(assignee%20%3D%20zhitao%20OR%20reporter%20%
> > > 3D%20zhitao%20)%20ORDER%20BY%20priority%20>
> > > >.
> > > >
> > > > Zhitao spares no effort to improve the project quality and to propose
> > > > ideas. Thank you Zhitao for all contributions!
> > > >
> > > > Here is his committer candidate checklist for your perusal:
> > > > https://docs.google.com/document/d/1HGz7iBdo1Q9z9c8fNRgNNLnj0XQ_
> > > > PhDhjXLAfOx139s/
> > > >
> > > > Congrats Zhitao!
> > > >
> > > > Cheers,
> > > > Gilbert
> > > >
> > >
> >
>


Re: Welcome Chun-Hung Hsiao as Mesos Committer and PMC Member

2018-03-11 Thread Qian Zhang
Congratulations Chun!


Regards,
Qian Zhang

On Sun, Mar 11, 2018 at 2:08 PM, Sam <usultra...@gmail.com> wrote:

> Congrats Chun
>
>
> Regards,
> Sam Chen | APJ Country Director | DC/OS Evangelist
>
>
>
> On Mar 11, 2018, at 1:14 PM, Jie Yu <yujie@gmail.com> wrote:
>
> Chun
>
>


Re: Welcome Andrew Schwartzmeyer as a new committer and PMC member!

2017-11-27 Thread Qian Zhang
Congratulations!


Regards,
Qian Zhang

On Tue, Nov 28, 2017 at 7:10 AM, Jie Yu <yujie@gmail.com> wrote:

> Congrats, Andy!
>
> - Jie
>
> On Mon, Nov 27, 2017 at 3:00 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>
> > Hi devs & users,
> >
> > I'm happy to announce that Andrew Schwartzmeyer has become a new
> committer
> > and member of the PMC for the Apache Mesos project.  Please join me in
> > congratulating him!
> >
> > Andrew has been an active contributor to Mesos for about a year.  He has
> > been the primary contributor behind our efforts to change our default
> build
> > system to CMake and to port Mesos onto Windows.
> >
> > Here is his committer candidate checklist for your perusal:
> > https://docs.google.com/document/d/1MfJRYbxxoX2-A-g8NEeryUdU
> > i7FvIoNcdUbDbGguH1c/
> >
> > Congrats Andy!
> > ~Joseph
> >
>


Re: 1.4.1 release

2017-11-03 Thread Qian Zhang
And I will backport MESOS-8051 to 1.2.x, 1.3.x and 1.4.x.


Regards,
Qian Zhang

On Fri, Nov 3, 2017 at 9:01 AM, Qian Zhang <zhq527...@gmail.com> wrote:

> We want to backport https://reviews.apache.org/r/62518/ to 1.2.x, 1.3.x
> and 1.4.x, James will work on it.
>
>
> Regards,
> Qian Zhang
>
> On Fri, Nov 3, 2017 at 12:11 AM, Kapil Arya <ka...@mesosphere.io> wrote:
>
>> Please reply to this email if you have pending patches to be backported
>> to 1.4.x as we are aiming to cut a release candidate for 1.4.1 early next
>> week.
>>
>> Thanks,
>> Anand and Kapil
>>
>
>


Re: 1.4.1 release

2017-11-02 Thread Qian Zhang
We want to backport https://reviews.apache.org/r/62518/ to 1.2.x, 1.3.x and
1.4.x, James will work on it.


Regards,
Qian Zhang

On Fri, Nov 3, 2017 at 12:11 AM, Kapil Arya <ka...@mesosphere.io> wrote:

> Please reply to this email if you have pending patches to be backported to
> 1.4.x as we are aiming to cut a release candidate for 1.4.1 early next week.
>
> Thanks,
> Anand and Kapil
>


Re: Collect feedbacks on TASK_FINISHED

2017-09-24 Thread Qian Zhang
Thanks Vinod and James!

So I think the task state transition TASK_KILLING -> TASK_FINISHED is a
bug; we should change it to TASK_KILLING -> TASK_KILLED.
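Under the semantics Vinod and James describe, the executor's choice of
terminal state can be modeled as follows (a hypothetical sketch, not the
actual executor code): a scheduler-initiated kill always ends in
TASK_KILLED regardless of exit status, and TASK_FINISHED is reserved for a
zero exit with no kill in flight.

```python
# Sketch (not Mesos executor code) of the proposed terminal-state rule:
# a kill initiated by the scheduler always yields TASK_KILLED,
# irrespective of the task's exit status.

def terminal_state(kill_requested, exit_code):
    if kill_requested:
        return "TASK_KILLED"  # external interference wins
    return "TASK_FINISHED" if exit_code == 0 else "TASK_FAILED"

print(terminal_state(kill_requested=False, exit_code=0))   # TASK_FINISHED
print(terminal_state(kill_requested=True, exit_code=0))    # TASK_KILLED (the fix)
print(terminal_state(kill_requested=True, exit_code=137))  # TASK_KILLED
```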


Regards,
Qian Zhang

On Fri, Sep 22, 2017 at 3:27 PM, James Peach <jor...@gmail.com> wrote:

>
> > On Sep 21, 2017, at 10:12 PM, Vinod Kone <vi...@mesosphere.io> wrote:
> >
> > I think it makes sense for `TASK_KILLED` to be sent in response to a KILL
> > call irrespective of the exit status. IIRC, that was the original
> intention.
>
> Those are the semantics we implement and expect in our scheduler and
> executor. The only time we emit TASK_KILLED is in response to a scheduler
> kill, and a scheduler kill always ends in a TASK_KILLED.
>
> The rationale for this is
> 1. We want to distinguish whether the task finished for its own reasons
> (ie. not due to a scheduler kill)
> 2. The scheduler told us to kill the task and we did, so it was
> TASK_KILLED (irrespective of any exit status)
>
> > On Thu, Sep 21, 2017 at 8:20 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> >
> >> Hi Folks,
> >>
> >> I'd like to collect the feedbacks on the task state TASK_FINISHED.
> >> Currently the default and command executor will always send
> TASK_FINISHED
> >> as long as the exit code of task is 0, this cause an issue: when
> scheduler
> >> initiates a kill task, the executor will send SIGTERM to the task first,
> >> and if the task handles SIGTERM gracefully and exit with 0, the executor
> >> will send TASK_FINISHED for that task, so we will see the task state
> >> transition: TASK_KILLING -> TASK_FINISHED.
> >>
> >> This seems incorrect because we thought it should be TASK_KILLING ->
> >> TASK_KILLED, that's why we filed a ticket MESOS-7975
> >> <https://issues.apache.org/jira/browse/MESOS-7975> for it. However, I
> am
> >> not very sure if it is really a bug, because I think it depends on how
> we
> >> define the meaning of TASK_FINISHED, if it means the task is terminated
> >> successfully on its own without external interference, then I think it
> does
> >> not make sense for scheduler to receive a TASK_KILLING followed by a
> >> TASK_FINISHED since there is indeed an external interference (killing
> task
> >> is initiated by scheduler). However, if TASK_FINISHED means the task is
> >> terminated successfully for whatever reason (no matter it is killed or
> >> terminated on its own), then I think it is OK to receive a TASK_KILLING
> >> followed by a TASK_FINISHED.
> >>
> >> Please let us know your thoughts on this issue, thanks!
> >>
> >>
> >> Regards,
> >> Qian Zhang
> >>
>
>


Collect feedbacks on TASK_FINISHED

2017-09-21 Thread Qian Zhang
Hi Folks,

I'd like to collect feedback on the task state TASK_FINISHED.
Currently the default and command executors will always send TASK_FINISHED
as long as the exit code of the task is 0. This causes an issue: when the
scheduler initiates a task kill, the executor will send SIGTERM to the
task first, and if the task handles SIGTERM gracefully and exits with 0,
the executor will send TASK_FINISHED for that task, so we will see the
task state transition TASK_KILLING -> TASK_FINISHED.

This seems incorrect because we thought it should be TASK_KILLING ->
TASK_KILLED; that's why we filed the ticket MESOS-7975
<https://issues.apache.org/jira/browse/MESOS-7975> for it. However, I am
not sure whether it is really a bug, because I think it depends on how we
define the meaning of TASK_FINISHED. If it means the task terminated
successfully on its own without external interference, then I think it
does not make sense for the scheduler to receive a TASK_KILLING followed
by a TASK_FINISHED, since there is indeed external interference (the kill
is initiated by the scheduler). However, if TASK_FINISHED means the task
terminated successfully for whatever reason (whether it was killed or
terminated on its own), then I think it is OK to receive a TASK_KILLING
followed by a TASK_FINISHED.

Please let us know your thoughts on this issue, thanks!


Regards,
Qian Zhang


Re: Welcome James Peach as a new committer and PMC member!

2017-09-07 Thread Qian Zhang
Congratulations James!


Regards,
Qian Zhang

On Fri, Sep 8, 2017 at 12:02 AM, Jay Guo <guojiannan1...@gmail.com> wrote:

> Congrats! Well deserved!
>
> - J
>
> On Fri, Sep 8, 2017 at 12:00 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> Congratulations James! Very well deserved! Looking forward for more great
>> work!
>>
>> On Thu, Sep 7, 2017 at 6:19 AM, Klaus Ma <klaus1982...@gmail.com> wrote:
>>
>>> Congrats !!
>>>
>>> 
>>> Da (Klaus), Ma (马达) | PMP® | R of IBM Cloud private
>>> IBM Spectrum Computing, IBM System
>>> +86-10-8245 4084 | mad...@cn.ibm.com | @k82cn
>>> <http://github.com/k82cn>
>>>
>>> On Thu, Sep 7, 2017 at 3:08 PM, tommy xiao <xia...@gmail.com> wrote:
>>>
>>>> Congrats James! Well deserved!
>>>>
>>>> 2017-09-07 14:54 GMT+08:00 Ben Lin <ben@mesosphere.io>:
>>>>
>>>>> Congrats!!
>>>>>
>>>>> --
>>>>> *From:* Oucema Bellagha <oucema.bella...@hotmail.com>
>>>>> *Sent:* Thursday, September 7, 2017 2:51:44 PM
>>>>> *To:* user@mesos.apache.org
>>>>> *Subject:* Re: Welcome James Peach as a new committer and PMC member!
>>>>>
>>>>> Congrats my friend !
>>>>>
>>>>> --
>>>>> *From:* xuj...@apple.com <xuj...@apple.com> on behalf of Yan Xu <
>>>>> xuj...@apple.com>
>>>>> *Sent:* Wednesday, September 6, 2017 9:08:42 PM
>>>>> *To:* dev; user
>>>>> *Subject:* Welcome James Peach as a new committer and PMC member!
>>>>>
>>>>> Hi Mesos devs and users,
>>>>>
>>>>> Please welcome James Peach as a new Apache Mesos committer and PMC
>>>>> member.
>>>>>
>>>>> James has been an active contributor to Mesos for over two years now.
>>>>> He has made many great contributions to the project which include XFS disk
>>>>> isolator, improvement to Linux capabilities support and IPC namespace
>>>>> isolator. He's super active on the mailing lists and slack channels, 
>>>>> always
>>>>> eager to help folks in the community and he has been helping with a lot of
>>>>> Mesos reviews as well.
>>>>>
>>>>> Here is his formal committer candidate checklist:
>>>>>
>>>>> https://docs.google.com/document/d/19G5zSxhrRBdS6GXn9KjCznjX
>>>>> 3cp0mUbck6Jy1Hgn3RY/edit?usp=sharing
>>>>> <https://docs.google.com/document/d/19G5zSxhrRBdS6GXn9KjCznjX3cp0mUbck6Jy1Hgn3RY/edit?usp=sharing>
>>>>>
>>>>> Congrats James!
>>>>>
>>>>> Yan
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Deshi Xiao
>>>> Twitter: xds2000
>>>> E-mail: xiaods(AT)gmail.com
>>>>
>>>
>>>
>>
>>
>> --
>> Cheers,
>>
>> Zhitao Li
>>
>
>


Re: Welcome Greg Mann as a new committer and PMC member!

2017-06-13 Thread Qian Zhang
Congrats Greg!


Regards,
Qian Zhang

On Wed, Jun 14, 2017 at 7:24 AM, Guang Ya Liu <gyliu...@gmail.com> wrote:

> Congrats Greg!!! Very well deserved
>
> On Wed, Jun 14, 2017 at 7:04 AM, Klaus Ma <klaus1982...@gmail.com> wrote:
>
> > Congrats!
> >
> >
> > On 14 Jun 2017, at 06:29, Ben Lin <ben@mesosphere.io> wrote:
> >
> > Congrats Greg, well deserved!
> >
> > --
> > *From:* Jie Yu <yujie@gmail.com>
> > *Sent:* Wednesday, June 14, 2017 5:54:48 AM
> > *To:* user
> > *Cc:* dev
> > *Subject:* Re: Welcome Greg Mann as a new committer and PMC member!
> >
> > Congrats Greg!
> >
> > On Tue, Jun 13, 2017 at 2:42 PM, Vinod Kone <vinodk...@apache.org>
> wrote:
> >
> >> Hi folks,
> >>
> >> Please welcome Greg Mann as the newest committer and PMC member of the
> >> Apache Mesos project.
> >>
> >> Greg has been an active contributor to the Mesos project for close to 2
> >> years now and has made many solid contributions. His biggest source code
> >> contribution to the project has been around adding authentication
> support
> >> for default executor. This was a major new feature that involved quite a
> >> few moving parts. Additionally, he also worked on improving the
> >> scheduler and executor APIs.
> >>
> >> Here is his more formal checklist for your perusal.
> >>
> >> https://docs.google.com/document/d/1S6U5OFVrl7ySmpJsfD4fJ3_R
> >> 8JYRRc5spV0yKrpsGBw/edit
> >>
> >> Thanks,
> >> Vinod
> >>
> >>
> >
> >
>


Re: Plan for upgrading protobuf==3.2.0 in Mesos

2017-05-26 Thread Qian Zhang
Thanks Anand and Zhitao!

So I think we can remove code like the following and switch to the native
maps supported by proto3, right?
https://github.com/apache/mesos/blob/master/include/mesos/docker/v1.proto#L25:L28

And just curious: if we keep the proto2 syntax in each .proto file in
Mesos, is it possible for us to use the new features (like maps) supported
in proto3? And what about newly introduced .proto files? I think we do not
need to have "syntax = "proto2";" in them, right?
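For reference, the proto2-era workaround linked above models a map as a
repeated message with key/value fields, while a native proto3 map carries
the same data directly. A rough pure-Python illustration of that
equivalence (no protobuf dependency; the field names are illustrative):

```python
# Pure-Python illustration (no protobuf runtime) of the equivalence between
# the repeated key/value workaround and a native proto3 map field.

# The proto2-style workaround: repeated entries, each with key and value.
labels = [
    {"key": "architecture", "value": "amd64"},
    {"key": "os", "value": "linux"},
]

# What a proto3 `map<string, string>` field would expose directly.
as_map = {entry["key"]: entry["value"] for entry in labels}

print(as_map == {"architecture": "amd64", "os": "linux"})  # True
```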


Regards,
Qian Zhang

On Sat, May 27, 2017 at 6:58 AM, Michael Park <mp...@apache.org> wrote:

> Thanks Anand and Zhitao!
>
> On Fri, May 26, 2017 at 3:40 PM Anand Mazumdar <an...@apache.org> wrote:
>
>> We recently committed this [1] and it would be part of the *next major
>> release* (1.4.0). Also, we upgraded to the newer protobuf release 3.3.0.
>>
>>
>> For Mesos developers, this means that we can use proto3 features like
>> arena
>> allocation [2], maps [3] etc. Note that we still need to use the proto2
>> syntax version for backward compatibility.
>>
>> Thanks Zhitao for the contributions!
>>
>> [1] https://issues.apache.org/jira/browse/MESOS-7228
>> [2] https://issues.apache.org/jira/browse/MESOS-5783
>> [3] https://developers.google.com/protocol-buffers/docs/proto#maps
>>
>> -anand
>>
>>
>> On Thu, Apr 27, 2017 at 10:28 AM, Anand Mazumdar <an...@apache.org>
>> wrote:
>>
>> > + dev
>> >
>> > Bumping up the thread to ensure it's not missed.
>> >
>> > -anand
>> >
>> > On Tue, Apr 25, 2017 at 11:01 AM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>> > > Dear framework owners and users,
>> > >
>> > > We are working on upgrading the protobuf library in Mesos to 3.2.0 in
>> > > https://issues.apache.org/jira/browse/MESOS-7228, to overcome some
>> > protobuf
>> > > limitation on message size as well as preparing for further
>> improvement.
>> > We
>> > > aim to release this with the upcoming Mesos 1.3.0.
>> > >
>> > > Because we upgraded the protoc compiler in this process, all generated
>> > java
>> > > and python code may not be compatible with protobuf 2.6.1 (the
>> previous
>> > > dependency), and we ask you to upgrade the protobuf dependency to
>> 3.2.0
>> > when
>> > > you upgrade your framework dependency to 1.3.0.
>> > >
>> > > For java, a snapshot maven artifact has been prepared (by Anand
>> > Mazumdar's
>> > > courtesy) at
>> > > https://repository.apache.org/content/repositories/
>> > snapshots/org/apache/mesos/mesos/1.3.0-SNAPSHOT/
>> > > . Please feel free to play out with it and let us know if you run into
>> > any
>> > > issues.
>> > >
>> > > Note that the binary upgrade process should still be compatible: any
>> > java or
>> > > based framework (scheduler or executor) should still work out of box
>> with
>> > > Mesos 1.3.0 once released. It is suggested to get your cluster
>> upgraded
>> > to
>> > > 1.3.0 first, then come back and upgrade your executors and schedulers.
>> > >
>> > > We understand this may expose inconvenience around updating the
>> protobuf
>> > > dependency, so please let us know if you have any concern or further
>> > > questions.
>> > >
>> > > --
>> > >
>> > > Cheers,
>> > >
>> > > Zhitao Li and Anand Mazumdar,
>> >
>>
>


Re: Welcome Gilbert Song as a new committer and PMC member!

2017-05-24 Thread Qian Zhang
Congratulations Gilbert! Well deserved!!!


Regards,
Qian Zhang

On Thu, May 25, 2017 at 6:54 AM, Klaus Ma <klaus1982...@gmail.com> wrote:

> Congratulations Gilbert!
>
> On Thu, May 25, 2017 at 3:39 AM Greg Mann <g...@mesosphere.io> wrote:
>
>> Congratulations Gilbert!! :D
>>
>> On Wed, May 24, 2017 at 12:01 PM, Avinash Sridharan <
>> avin...@mesosphere.io> wrote:
>>
>>> Congrats Gilbert !! Very well deserved !!
>>>
>>> On Wed, May 24, 2017 at 11:56 AM, Timothy Chen <tnac...@gmail.com>
>>> wrote:
>>>
>>> > Congrats! Rocking the containerizer world!
>>> >
>>> > Tim
>>> >
>>> > On Wed, May 24, 2017 at 11:23 AM, Zhitao Li <zhitaoli...@gmail.com>
>>> wrote:
>>> > > Congrats Gilbert!
>>> > >
>>> > > On Wed, May 24, 2017 at 11:08 AM, Yan Xu <y...@jxu.me> wrote:
>>> > >
>>> > >> Congrats! Well deserved!
>>> > >>
>>> > >> ---
>>> > >> Jiang Yan Xu <y...@jxu.me> | @xujyan <https://twitter.com/xujyan>
>>> > >>
>>> > >> On Wed, May 24, 2017 at 10:54 AM, Vinod Kone <vinodk...@apache.org>
>>> > wrote:
>>> > >>
>>> > >>> Congrats Gilbert!
>>> > >>>
>>> > >>> On Wed, May 24, 2017 at 1:32 PM, Neil Conway <
>>> neil.con...@gmail.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>> > Congratulations Gilbert! Well-deserved!
>>> > >>> >
>>> > >>> > Neil
>>> > >>> >
>>> > >>> > On Wed, May 24, 2017 at 10:32 AM, Jie Yu <yujie@gmail.com>
>>> > wrote:
>>> > >>> > > Hi folks,
>>> > >>> > >
>>> > >>> > > I' happy to announce that the PMC has voted Gilbert Song as a
>>> new
>>> > >>> > committer
>>> > >>> > > and member of PMC for the Apache Mesos project. Please join me
>>> to
>>> > >>> > > congratulate him!
>>> > >>> > >
>>> > >>> > > Gilbert has been working on Mesos project for 1.5 years now.
>>> His
>>> > main
>>> > >>> > > contribution is his work on unified containerizer, nested
>>> container
>>> > >>> (aka
>>> > >>> > > Pod) support. He also helped a lot of folks in the community
>>> > regarding
>>> > >>> > their
>>> > >>> > > patches, questions and etc. He also played an important role
>>> > >>> organizing
>>> > >>> > > MesosCon Asia last year and this year!
>>> > >>> > >
>>> > >>> > > His formal committer checklist can be found here:
>>> > >>> > > https://docs.google.com/document/d/1iSiqmtdX_0CU-YgpViA6r6PU_
>>> > >>> > aMCVuxuNUZ458FR7Qw/edit?usp=sharing
>>> > >>> > >
>>> > >>> > > Welcome, Gilbert!
>>> > >>> > >
>>> > >>> > > - Jie
>>> > >>> >
>>> > >>>
>>> > >>
>>> > >>
>>> > >
>>> > >
>>> > > --
>>> > > Cheers,
>>> > >
>>> > > Zhitao Li
>>> >
>>>
>>>
>>>
>>> --
>>> Avinash Sridharan, Mesosphere
>>> +1 (323) 702 5245
>>>
>>
>> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com |
> http://k82.me
>


Re: resourceOffer

2017-03-10 Thread Qian Zhang
Yes, an offer can only be used in a single accept call, not multiple.
When the Mesos master processes your accept call, it removes the offer
from its cache:
https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L3690
So if you use the same offer in another accept call, the Mesos master will
not be able to find the offer since it has already been removed, and it
will complain that it is an invalid offer:
https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L3696:L3697
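The master-side behavior can be sketched as simple bookkeeping (a toy
model, not the actual C++): accepting an offer removes it from the cache,
so a second accept with the same offer ID is rejected as invalid, which is
why packing both tasks into one accept call works.

```python
# Toy model (not the Mesos master) of why reusing an offer ID fails:
# accept() removes the offer from the master's cache.

class ToyMaster:
    def __init__(self):
        self.offers = {}  # offer_id -> resources

    def add_offer(self, offer_id, resources):
        self.offers[offer_id] = resources

    def accept(self, offer_id, operations):
        if offer_id not in self.offers:
            return "ERROR: invalid offer"  # already consumed or rescinded
        del self.offers[offer_id]          # an offer is single-use
        return "OK: launched %d operations" % len(operations)

master = ToyMaster()
master.add_offer("offer-1", {"cpus": 4})

# Packing both tasks into one accept call succeeds...
print(master.accept("offer-1", ["launch-task-A", "launch-task-B"]))
# ...but a second accept against the same offer is invalid.
print(master.accept("offer-1", ["launch-task-C"]))
```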


Thanks,
Qian Zhang

On Fri, Mar 10, 2017 at 6:26 PM, Oeg Bizz <oegb...@yahoo.com> wrote:

> Jay,
>Thanks for the tip about the slack chat, I just joined.  Also, I just
> fixed the problem.  I think it was related to me sending two separate
> accept calls rather than packing both tasks and send just one.  I guess
> once you send an accept(operations, oferIds) you cannot send it again for
> the same Offer.
>
> Oscar
>
>
> On Thursday, March 9, 2017 10:38 AM, Jay Guo <guojiannan1...@gmail.com>
> wrote:
>
>
> It looks quite weird to me... could you share more details about your
> scheduler implementation? A code snippet could help a lot. Also, if
> you want more interactive and prompt communication, please join our
> slack chat https://mesos-slackin.herokuapp.com/
>
> /J
>
> On Thu, Mar 9, 2017 at 5:20 PM, Oeg Bizz <oegb...@yahoo.com> wrote:
> > Qian,
> >Added the offer_timeout flag and set it to 2 seconds and the result is
> > the same.  I do not get any offerRescinded() call or anything like that.
> >
> > To fix the receiving offers I used the driver.acceptOffers instead of
> > driver.launchTasks()  Looking around the source code I found a
> TestFramework
> > source code within Mesos and there is a comment about the launchTask to
> be
> > deprecated and the use of acceptOffers was preffered so I changed it to
> > that.
> >
> > Thanks,
> >
> > Oscar
> >
> >
> > On Wednesday, March 8, 2017 8:49 PM, Qian Zhang <zhq527...@gmail.com>
> wrote:
> >
> >
> > It seems the offer has already been removed from Mesos master when you
> tried
> > to use it to launch a subsequent task, I think you did not specify
> > "--offer-timeout" flag when starting Mesos master, right? Did your
> framework
> > receive "RESCIND" event from Mesos master for the offer that you want to
> use
> > to launch task?
> >
> > BTW, how did you resolve the stopping receiving offers issue?
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Wed, Mar 8, 2017 at 7:57 PM, Oeg Bizz <oegb...@yahoo.com> wrote:
> >
> > Sorry about the continuous rant, but I really would love to get this
> solved.
> > I passed the stopping receiving offers, but now the second time I send
> the
> > same request I get an error message stating that the offer is no longer
> > valid even though I sent just one task to the only slave I have
> running.  Ii
> > am running Mesos 1.1.0 and here are the log files from my last run
> >
> > Thanks in advance for all your help.  BTW, is there a better way of
> > submitting questions like a chat, threads, bulletin board?
> >
> > Oscar
> >
> >
> > On Wednesday, March 8, 2017 6:31 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
> >
> >
> > Vinod,
> >I think the previous set is incomplete.  I have attached a more
> compete
> > set of files.  Thanks for the help
> >
> >
> > On Tuesday, March 7, 2017 12:34 PM, Vinod Kone <vinodk...@gmail.com>
> wrote:
> >
> >
> > Can you share master log?
> >
> > @vinodkone
> >
> > On Mar 7, 2017, at 2:54 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
> >
> > Hi,
> >I am new to Mesos and started exploring its usability for a new
> > project I will be involved in. I wrote a scheduler and an executor, and I
> > am able to send one task, which is executed properly. After the first
> > task is finished, I no longer get resourceOffer() invocations to my
> > Scheduler. What am I missing? If I do not send a task, I get the
> > resourceOffer calls consistently every 5 seconds or so. Also, does Mesos
> > send all of the resources every time, or just a partial list? Thanks in
> > advance for any help,
> >
> > Oscar
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>


Re: resourceOffer

2017-03-08 Thread Qian Zhang
It seems the offer had already been removed from the Mesos master when you
tried to use it to launch a subsequent task. I suspect you did not specify
the "--offer-timeout" flag when starting the Mesos master, right? Did your
framework receive a "RESCIND" event from the Mesos master for the offer
that you wanted to use to launch the task?

BTW, how did you resolve the stopping receiving offers issue?


Thanks,
Qian Zhang
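The lifecycle described above — an offer is rescinded or consumed and must not be reused — can be sketched as scheduler-side bookkeeping. This is a hypothetical, minimal model, not actual Mesos framework code:

```python
# Minimal sketch (hypothetical, not Mesos code): a scheduler should treat
# offers as perishable -- drop them on rescind, and never reuse an offer ID
# after launching with it.
class OfferTracker:
    def __init__(self):
        self.valid_offers = {}          # offer_id -> resources

    def on_resource_offers(self, offers):
        for offer_id, resources in offers:
            self.valid_offers[offer_id] = resources

    def on_offer_rescinded(self, offer_id):
        # The master invalidated this offer (e.g. --offer-timeout expired).
        self.valid_offers.pop(offer_id, None)

    def launch(self, offer_id):
        if offer_id not in self.valid_offers:
            raise ValueError("offer %s is no longer valid" % offer_id)
        # An offer is consumed by the launch; it must not be used again.
        return self.valid_offers.pop(offer_id)

tracker = OfferTracker()
tracker.on_resource_offers([("O1", {"cpus": 2}), ("O2", {"cpus": 4})])
tracker.on_offer_rescinded("O2")
resources = tracker.launch("O1")   # ok; a second launch("O1") would raise
```

A real framework would do this in its resourceOffers/offerRescinded callbacks; the point is that "offer no longer valid" errors usually mean the scheduler held an offer past its timeout or reused one.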

On Wed, Mar 8, 2017 at 7:57 PM, Oeg Bizz <oegb...@yahoo.com> wrote:

> Sorry about the continuous rant, but I really would love to get this
> solved. I got past the issue of no longer receiving offers, but now the
> second time I send the same request I get an error message stating that
> the offer is no longer valid, even though I sent just one task to the only
> slave I have running. I am running Mesos 1.1.0, and here are the log files
> from my last run.
>
> Thanks in advance for all your help. BTW, is there a better way of
> submitting questions, such as a chat, threads, or a bulletin board?
>
> Oscar
>
>
> On Wednesday, March 8, 2017 6:31 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
>
>
> Vinod,
>I think the previous set is incomplete. I have attached a more complete
> set of files. Thanks for the help.
>
>
> On Tuesday, March 7, 2017 12:34 PM, Vinod Kone <vinodk...@gmail.com>
> wrote:
>
>
> Can you share master log?
>
> @vinodkone
>
> On Mar 7, 2017, at 2:54 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
>
> Hi,
>I am new to Mesos and started exploring its usability for a new project
> I will be involved in. I wrote a scheduler and an executor, and I am able
> to send one task, which is executed properly. After the first task is
> finished, I no longer get resourceOffer() invocations to my Scheduler.
> What am I missing? If I do not send a task, I get the resourceOffer calls
> consistently every 5 seconds or so. Also, does Mesos send all of the
> resources every time, or just a partial list? Thanks in advance for any
> help,
>
> Oscar
>
>
>
>
>
>


Re: Welcome Kevin Klues as a Mesos Committer and PMC member!

2017-03-01 Thread Qian Zhang
Congratulations!


Thanks,
Qian Zhang

On Thu, Mar 2, 2017 at 6:43 AM, Greg Mann <g...@mesosphere.io> wrote:

> Woowoo! Congrats Kevin!!
>
> On Wed, Mar 1, 2017 at 2:26 PM, Avinash Sridharan <avin...@mesosphere.io>
> wrote:
>
>> Awesome !! Congrats Kevin !!
>>
>> On Wed, Mar 1, 2017 at 2:07 PM, Jie Yu <yujie@gmail.com> wrote:
>>
>>> Congrats! Kevin! Well deserved!
>>>
>>> On Wed, Mar 1, 2017 at 2:05 PM, Benjamin Mahler <bmah...@apache.org>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > Please welcome Kevin Klues as the newest committer and PMC member of
>>> the
>>> > Apache Mesos project.
>>> >
>>> > Kevin has been an active contributor in the project for over a year,
>>> and in
>>> > this time he made a number of contributions to the project: Nvidia GPU
>>> > support [1], the containerization side of POD support (new container
>>> init
>>> > process), and support for "attach" and "exec" of commands within
>>> running
>>> > containers [2].
>>> >
>>> > Also, Kevin took on an effort with Haris Choudhary to revive the CLI
>>> [3]
>>> > via a better structured python implementation (to be more accessible to
>>> > contributors) and a more extensible architecture to better support
>>> adding
>>> > new or custom subcommands. The work also adds a unit test framework
>>> for the
>>> > CLI functionality (we had no tests previously!). I think it's great
>>> that
>>> > Kevin took on this much needed improvement with Haris, and I'm very
>>> much
>>> > looking forward to seeing this land in the project.
>>> >
>>> > Here is his committer eligibility document for perusal:
>>> > https://docs.google.com/document/d/1mlO1yyLCoCSd85XeDKIxTYyboK_
>>> > uiOJ4Uwr6ruKTlFM/edit
>>> >
>>> > Thanks!
>>> > Ben
>>> >
>>> > [1] http://mesos.apache.org/documentation/latest/gpu-support/
>>> > [2]
>>> > https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482N
>>> > SVk8V0D56sFMzU
>>> > [3]
>>> > https://docs.google.com/document/d/1r6Iv4Efu8v8IBrcUTjgYkvZ32WVsc
>>> > gYqrD07OyIglsA/
>>> >
>>>
>>
>>
>>
>> --
>> Avinash Sridharan, Mesosphere
>> +1 (323) 702 5245 <(323)%20702-5245>
>>
>
>


Re: Welcome Haosdent Huang as Mesos Committer and PMC member!

2016-12-16 Thread Qian Zhang
Congratulations Haosdent! Well deserved! :-)


Thanks,
Qian Zhang

On Sat, Dec 17, 2016 at 8:07 AM, Sam <usultra...@gmail.com> wrote:

> Congratulations Haosdent :)
>
> Sent from my iPhone
>
> > On 17 Dec 2016, at 2:59 AM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> > Haosdent
>


Re: [CNI] Marathon task with IP from host-local plugin fails

2016-11-09 Thread Qian Zhang
Good to know that, you are welcome :-)


Thanks,
Qian Zhang

On Wed, Nov 9, 2016 at 7:58 PM, Frank Scholten <fr...@frankscholten.nl>
wrote:

> Hi Qian,
>
> Indeed, when I changed the "type" to "bridge", the task got an IP
> address. Thanks!
>
> Cheers,
>
> Frank
>
> On Wed, Nov 9, 2016 at 1:48 AM, Qian Zhang <zhq527...@gmail.com> wrote:
> > Your CNI network configuration seems a bit odd to me:
> > {
> > "cniVersion": "0.3.0",
> > "name": "ipv4",
> > "type": "host-local", <--- main plugin
> > "isGateway": "true",
> > "ipMasq": "true",
> > "ipam": {
> > "type": "host-local", <--- IPAM plugin
> > "subnet": "10.22.0.0/16",
> > "routes": [
> > { "dst": "0.0.0.0/0" }
> > ]
> > }
> > }
> >
> > So you set the type of both the main plugin and the IPAM plugin to
> > "host-local". But "host-local" is just an IPAM plugin; for the main
> > plugin you need to use another one, e.g., "bridge". You can take a look
> > at the link below for more details about the bridge plugin:
> > https://github.com/containernetworking/cni/blob/
> master/Documentation/bridge.md
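For reference, a corrected version of the configuration discussed above, with "bridge" as the main plugin and "host-local" kept only for IPAM (the subnet is taken from the thread; note that the CNI spec expects `isGateway`/`ipMasq` as JSON booleans, whereas the original used quoted strings):

```json
{
  "cniVersion": "0.3.0",
  "name": "ipv4",
  "type": "bridge",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
```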
> >
> >
> >
> >
> >
> > Thanks,
> > Qian Zhang
> >
> > On Wed, Nov 9, 2016 at 3:09 AM, Jie Yu <yujie@gmail.com> wrote:
> >>
> >> Do you know if the executor is registered with the agent or not? (you
> can
> >> check the agent log)
> >>
> >>
> >> On Tue, Nov 8, 2016 at 10:57 AM, Frank Scholten <fr...@frankscholten.nl
> >
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I have a Mesos 1.0.1 cluster with the following host-local CNI
> >>> configuration:
> >>>
> >>> {
> >>> "cniVersion": "0.3.0",
> >>> "name": "ipv4",
> >>> "type": "host-local",
> >>> "isGateway": "true",
> >>> "ipMasq": "true",
> >>> "ipam": {
> >>> "type": "host-local",
> >>> "subnet": "10.22.0.0/16",
> >>> "routes": [
> >>> { "dst": "0.0.0.0/0" }
> >>> ]
> >>> }
> >>> }
> >>>
> >>> Now I try to deploy the following app in Marathon:
> >>>
> >>> {
> >>>   "id": "netcat",
> >>>   "cpus": 1,
> >>>   "mem": 128,
> >>>   "instances": 1,
> >>>   "ipAddress": {
> >>> "networkName": "ipv4"
> >>>   },
> >>>   "cmd": "nc -l 8080"
> >>> }
> >>>
> >>> In the agent logs I can see the task gets an IP address from the
> >>> host-local plugin
> >>>
> >>> I1108 18:46:51.919564 24354 containerizer.cpp:1319] Checkpointing
> >>> executor's forked pid 31587 to
> >>>
> >>> '/tmp/mesos/meta/slaves/32f8badc-9771-4d3c-b2f5-
> 083e63619946-S0/frameworks/32f8badc-9771-4d3c-b2f5-
> 083e63619946-/executors/netcat.b678ad45-a5e3-11e6-
> 8a46-0242ac110006/runs/324b867e-9464-4d32-bccb-
> f7f3940335dc/pids/forked.pid'
> >>> I1108 18:46:51.920114 24355 cni.cpp:716] Bind mounted
> >>> '/proc/31587/ns/net' to
> >>>
> >>> '/run/mesos/isolators/network/cni/324b867e-9464-4d32-bccb-
> f7f3940335dc/ns'
> >>> for container 324b867e-9464-4d32-bccb-f7f3940335dc
> >>> I1108 18:46:51.920394 24355 cni.cpp:1030] Invoking CNI plugin
> >>> 'host-local' with network configuration
> >>>
> >>> '{"args":{"org.apache.mesos":{"network_info":{"ip_addresses"
> :[{}],"labels":{},"name":"ipv4"}}},"cniVersion":"0.3.0",
> "ipMasq":"true","ipam":{"routes":[{"dst":"0.0.0.0\/0"}]
> ,"subnet":"10.22.0.0\/16","type":"host-local"},"isGateway":"true","name":"
> ipv4","type":"host-local"}'
> >>> to attach container 324b867e-9464-4d32-bccb-f7f3940335dc to network
> >>> 'ipv4'
> >>> I1108 18:46:51.941431 24354 cni.cpp:1109] Got assigned IPv4 address
> >>> '10.22.0.2/16' from CNI network 'ipv4' for container
> >>> 324b867e-9464-4d32-bccb-f7f3940335dc
> >>> I1108 18:46:51.941766 24352 cni.cpp:838] Unable to find DNS
> >>> nameservers for container 324b867e-9464-4d32-bccb-f7f3940335dc. Using
> >>> host '/etc/resolv.conf'
> >>>
> >>> However the task status is FAILED and it gets restarted over and over
> >>> again.
> >>>
> >>> The sandbox stdout logs are empty. This is the contents of the stderr
> >>> logs:
> >>>
> >>> I1108 18:46:32.717352 30482 exec.cpp:161] Version: 1.0.0
> >>> E1108 18:46:32.717542 30482 process.cpp:2105] Failed to shutdown
> >>> socket with fd 7: Transport endpoint is not connected
> >>> E1108 18:46:32.717600 30482 process.cpp:2105] Failed to shutdown
> >>> socket with fd 7: Transport endpoint is not connected
> >>> I1108 18:46:32.717619 30482 exec.cpp:495] Agent exited ... shutting
> down
> >>>
> >>> Any ideas?
> >>>
> >>> Cheers,
> >>>
> >>> Frank
> >>
> >>
> >
>


Re: Non-checkpointing frameworks

2016-10-17 Thread Qian Zhang
Got it, thanks Zameer!


Thanks,
Qian Zhang

On Tue, Oct 18, 2016 at 2:25 AM, Zameer Manji <zma...@apache.org> wrote:

> Qian,
>
> Turns out the --checkpoint flag was made default and removed in Mesos 0.22.
>
> On Sun, Oct 16, 2016 at 4:38 PM, Qian Zhang <zhq527...@gmail.com> wrote:
>
>> and requires operators to enable checkpointing on the slaves.
>>
>>
>> Just curious why an operator needs to enable checkpointing on the slaves
>> (I do not see an agent flag for that); I think checkpointing should be
>> enabled at the framework level rather than on the slave.
>>
>>
>> Thanks,
>> Qian Zhang
>>
>> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zma...@apache.org> wrote:
>>
>>> +1 to A and B
>>>
>>> Aurora has enabled checkpointing for years and requires operators to
>>> enable
>>> checkpointing on the slaves.
>>>
>>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>>> jo...@mesosphere.io>
>>> wrote:
>>>
>>> > I'm in favor of A & B. I find it provides a better "first experience"
>>> to
>>> > users.
>>> > From my experience you usually have to have an explicit reason to not
>>> want
>>> > to checkpoint. Most people assume the semantics provided by the
>>> checkpoint
>>> > behavior is default and it can be a frustrating experience for them to
>>> find
>>> > out that is not the case.
>>> >
>>> > —
>>> > *Joris Van Remoortere*
>>>
>>> > Mesosphere
>>> >
>>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <neil.con...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi folks,
>>> >>
>>> >> I'd like input from individuals who currently use frameworks but do
>>> >> not enable checkpointing.
>>> >>
>>> >> Background: "checkpointing" is a parameter that can be enabled in
>>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>>> >> executor PIDs, and status updates to disk for any tasks started by
>>> >> that framework. This checkpointed information means that these tasks
>>> >> can survive an agent crash: if the agent exits (whether due to
>>> >> crashing or as part of an upgrade procedure), a restarted agent can
>>> >> use this information to reconnect to executors started by the previous
>>> >> instance of the agent. The downside is that checkpointing requires
>>> >> some additional disk I/O at the agent.
>>> >>
>>> >> Checkpointing is not currently the default, but in my experience it is
>>> >> often enabled for production frameworks. As part of the work on
>>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>>> >> considering:
>>> >>
>>> >> (a) requiring that partition-aware frameworks must also enable
>>> >> checkpointing, and/or
>>> >> (b) enabling checkpointing by default
>>> >>
>>> >> If you have intentionally decided to disable checkpointing for your
>>> >> Mesos framework, I'd be curious to hear more about your use-case and
>>> >> why you haven't enabled it.
>>> >>
>>> >> Thanks!
>>> >>
>>> >> Neil
>>> >>
>>> >> --
>>> >> Zameer Manji
>>> >>
>>> >
>>>
>>> --
>>> Zameer Manji
>>>
>>


Re: Non-checkpointing frameworks

2016-10-16 Thread Qian Zhang
>
> and requires operators to enable checkpointing on the slaves.


Just curious why an operator needs to enable checkpointing on the slaves (I
do not see an agent flag for that); I think checkpointing should be enabled
at the framework level rather than on the slave.


Thanks,
Qian Zhang

On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zma...@apache.org> wrote:

> +1 to A and B
>
> Aurora has enabled checkpointing for years and requires operators to enable
> checkpointing on the slaves.
>
> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
> jo...@mesosphere.io>
> wrote:
>
> > I'm in favor of A & B. I find it provides a better "first experience" to
> > users.
> > From my experience you usually have to have an explicit reason to not
> want
> > to checkpoint. Most people assume the semantics provided by the
> checkpoint
> > behavior is default and it can be a frustrating experience for them to
> find
> > out that is not the case.
> >
> > —
> > *Joris Van Remoortere*
> > Mesosphere
> >
> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <neil.con...@gmail.com>
> > wrote:
> >
> >> Hi folks,
> >>
> >> I'd like input from individuals who currently use frameworks but do
> >> not enable checkpointing.
> >>
> >> Background: "checkpointing" is a parameter that can be enabled in
> >> FrameworkInfo; if enabled, the agent will write the framework pid,
> >> executor PIDs, and status updates to disk for any tasks started by
> >> that framework. This checkpointed information means that these tasks
> >> can survive an agent crash: if the agent exits (whether due to
> >> crashing or as part of an upgrade procedure), a restarted agent can
> >> use this information to reconnect to executors started by the previous
> >> instance of the agent. The downside is that checkpointing requires
> >> some additional disk I/O at the agent.
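The recovery mechanism described above can be modeled in a few lines. This is a toy illustration (not Mesos code) of why checkpointing lets tasks survive an agent restart: state is persisted to disk, so a new agent instance can reattach instead of losing the executors.

```python
# Toy model (not Mesos code): the agent persists framework/executor state
# to its work dir, and a restarted agent recovers it from disk.
import json
import os
import tempfile

def checkpoint(work_dir, framework_id, executor_pids):
    # Agent writes the framework id and executor pids to disk.
    path = os.path.join(work_dir, "checkpoint.json")
    with open(path, "w") as f:
        json.dump({"framework": framework_id, "executors": executor_pids}, f)

def recover(work_dir):
    # A restarted agent reads the checkpoint and can reattach to executors;
    # without a checkpoint there is nothing to recover and tasks are lost.
    path = os.path.join(work_dir, "checkpoint.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)

work_dir = tempfile.mkdtemp()
checkpoint(work_dir, "fw-1", [31587])
state = recover(work_dir)      # simulates the agent restarting
print(state["executors"])      # -> [31587]
```

The extra disk I/O mentioned above is exactly these writes, which happen on every status update in the real agent.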
> >>
> >> Checkpointing is not currently the default, but in my experience it is
> >> often enabled for production frameworks. As part of the work on
> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
> >> considering:
> >>
> >> (a) requiring that partition-aware frameworks must also enable
> >> checkpointing, and/or
> >> (b) enabling checkpointing by default
> >>
> >> If you have intentionally decided to disable checkpointing for your
> >> Mesos framework, I'd be curious to hear more about your use-case and
> >> why you haven't enabled it.
> >>
> >> Thanks!
> >>
> >> Neil
> >>
> >> --
> >> Zameer Manji
> >>
> >
>


Design doc of OCI image spec support in Mesos

2016-10-10 Thread Qian Zhang
Hi Folks,

I am currently working on OCI image spec support in Mesos (MESOS-5011).
We'd like to add OCI image spec support to the universal containerizer so
that users can launch containers with OCI images.

I have drafted a design doc; please feel free to review and comment on it,
thanks!
https://docs.google.com/document/d/1Pus7D-inIBoLSIPyu3rl_apxvUhtp3rp0_b0Ttr2Xww/edit?usp=sharing


Thanks,
Qian Zhang


Re: Unified cgroups isolator

2016-09-13 Thread Qian Zhang
Thanks to @haosdent for the awesome work and to @Jie for the great
shepherding and guidance on this project!


Thanks,
Qian Zhang

On Wed, Sep 14, 2016 at 7:56 AM, Gilbert Song <gilb...@mesosphere.io> wrote:

> Awesome!
>
> Kudos to @haosdent and @qianzhang!
>
> On Tue, Sep 13, 2016 at 11:22 AM, haosdent <haosd...@gmail.com> wrote:
>
>> Really appreciate @qian and @jie's great help on this! It makes it
>> easier for us to add cgroups isolation for the rest of the subsystems.
>>
>> Additionally, if you find that any changes from the unified cgroups
>> isolator break your environment, please let us know. I will try to fix
>> them asap.
>>
>> On Wed, Sep 14, 2016 at 1:59 AM, Jie Yu <yujie@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We just merged the unified cgroups isolator. Huge shout out to @haosdent
>>> and @qianzhang to make this happen!
>>> https://issues.apache.org/jira/browse/MESOS-4697
>>>
>>> Just to give you some context: previously, it was a huge pain to add a
>>> new cgroups subsystem to Mesos because it required creating a new
>>> isolator (a lot of code duplication). Now we have merged all the
>>> subsystems into one single isolator, which makes adding a new subsystem
>>> very easy.
>>>
>>> More importantly, the new cgroups isolator supports cgroups v2!
>>>
>>> - Jie
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


Re: Mesos HA does not work (Failed to recover registrar)

2016-06-06 Thread Qian Zhang
I deleted everything in the work dir (/var/lib/mesos/master) and tried
again, but the same error still happened :-(


Thanks,
Qian Zhang

On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
jch.mar...@gmail.com> wrote:

> Qian,
>
> Zookeeper should be able to reach a quorum with 2, no need to start 3
> simultaneously, but there is an issue with Zookeeper related to connection
> timeouts.
> https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> In some circumstances, the timeout is higher than the sync timeout, which
> causes the leader election to fail.
> Try setting the parameter cnxtimeout in zookeeper (by default it’s 5000ms)
> to the value 500 (500ms). After doing this, leader election in ZK will be
> super fast even if a node is disconnected.
>
> JC
>
> > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> >
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> > following config:
> >$ cat conf/zoo.cfg
> >server.1=192.168.122.132:2888:3888
> >server.2=192.168.122.225:2888:3888
> >server.3=192.168.122.171:2888:3888
> >autopurge.purgeInterval=6
> >autopurge.snapRetainCount=5
> >initLimit=10
> >syncLimit=5
> >maxClientCnxns=0
> >clientPort=2181
> >tickTime=2000
> >quorumListenOnAllIPs=true
> >dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> >dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and here
> > is what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> >I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> >I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/24' in ZooKeeper
> >I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> >I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> >I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> >I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> >I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> >I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> >I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
> > is master@192.168.122.171:5050 with id
> cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> (UPID=
> > master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
> > with the error mentioned in my first mail, and then 192.168.122.225 was
> > elected as leader, but it failed with the same error too after another
> 60s,
> > and the same thing happened to the last one (192.168.122.132). So after
> > about 180s, all my 3 masters were down.
> >
> > I tried both:
> >sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> > and
> >sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
> > 192.168.122.171:2181,192.1

Re: Mesos HA does not work (Failed to recover registrar)

2016-06-05 Thread Qian Zhang
>
> You need the 2nd command line (i.e. you have to specify all the zk
> nodes on each master, it's
> not like e.g. Cassandra where you can discover other nodes from the
> first one you talk to).


I have an Open DC/OS environment with master HA enabled (there are 3
master nodes) that works very well, and I see that each Mesos master is
started to connect only to the local zk:
$ cat /opt/mesosphere/etc/mesos-master | grep ZK
MESOS_ZK=zk://127.0.0.1:2181/mesos

So I think I do not have to specify all the zk nodes on each master.







Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 4:25 PM, Dick Davies <d...@hellooperator.net> wrote:

> OK, good - that part looks as expected, you've had a successful
> election for a leader
> (and yes that sounds like your zookeeper layer is ok).
>
> You need the 2nd command line (i.e. you have to specify all the zk
> nodes on each master, it's
> not like e.g. Cassandra where you can discover other nodes from the
> first one you talk to).
>
> The error you were getting was about the internal registry /
> replicated log, which is a mesos master level thing.
> You could try when Sivaram suggested - stopping the mesos master
> processes, wiping their
> work_dirs and starting them back up.
> Perhaps some wonky state got in there while you were trying various
> options?
>
>
> On 5 June 2016 at 00:34, Qian Zhang <zhq527...@gmail.com> wrote:
> > Thanks Vinod and Dick.
> >
> > I think my 3 ZK servers have formed a quorum, each of them has the
> following
> > config:
> > $ cat conf/zoo.cfg
> > server.1=192.168.122.132:2888:3888
> > server.2=192.168.122.225:2888:3888
> > server.3=192.168.122.171:2888:3888
> > autopurge.purgeInterval=6
> > autopurge.snapRetainCount=5
> > initLimit=10
> > syncLimit=5
> > maxClientCnxns=0
> > clientPort=2181
> > tickTime=2000
> > quorumListenOnAllIPs=true
> > dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> > dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> >
> > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
> > leader" for one, and "Mode: follower" for the other two.
> >
> > I have already tried to manually start 3 masters simultaneously, and
> here is
> > what I see in their log:
> > In 192.168.122.171(this is the first master I started):
> > I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
> > '/mesos/log_replicas/24' in ZooKeeper
> > I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> > I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> > I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> > I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected
> leader is
> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> > I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
> > master!
> >
> > In 192.168.122.225 (second master I started):
> > I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> > I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
> > log-replica(1)@192.168.122.171:5050 }
> > I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
> > received a broadcasted recover request from (6)@192.168.122.225:5050
> > I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> > I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected
> leader is
> > master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> >
> > In 192.168.122.132 (last master I started):
> > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
> > (id='25')
> > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
> > '/mesos/json.info_25' in ZooKeeper
> > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.171:5050) is detected
> >
> > So right after I started these 3 masters, the first one (192.168.122.171)
> > was successfully elected as leader, but after 60s, 192.168.122.171 failed
&

Re: Mesos HA does not work (Failed to recover registrar)

2016-06-04 Thread Qian Zhang
Thanks Vinod and Dick.

I think my 3 ZK servers have formed a quorum, each of them has the
following config:
$ cat conf/zoo.cfg
server.1=192.168.122.132:2888:3888
server.2=192.168.122.225:2888:3888
server.3=192.168.122.171:2888:3888
autopurge.purgeInterval=6
autopurge.snapRetainCount=5
initLimit=10
syncLimit=5
maxClientCnxns=0
clientPort=2181
tickTime=2000
quorumListenOnAllIPs=true
dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions

And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
leader" for one, and "Mode: follower" for the other two.

I have already tried to manually start 3 masters simultaneously, and here
is what I see in their log:
In 192.168.122.171(this is the first master I started):
I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
(id='25')
I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
'/mesos/log_replicas/24' in ZooKeeper
I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
'/mesos/json.info_25' in ZooKeeper
I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
(UPID=master@192.168.122.171:5050) is detected
I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.122.171:5050 }
I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader
is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
master!

In 192.168.122.225 (second master I started):
I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
(id='25')
I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
'/mesos/json.info_25' in ZooKeeper
I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
log-replica(1)@192.168.122.171:5050 }
I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status
received a broadcasted recover request from (6)@192.168.122.225:5050
I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
(UPID=master@192.168.122.171:5050) is detected
I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader
is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b

In 192.168.122.132 (last master I started):
I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
(id='25')
I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
'/mesos/json.info_25' in ZooKeeper
I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=
master@192.168.122.171:5050) is detected

So right after I started these 3 masters, the first one (192.168.122.171)
was successfully elected as leader, but after 60s, 192.168.122.171 failed
with the error mentioned in my first mail, and then 192.168.122.225 was
elected as leader, but it failed with the same error too after another 60s,
and the same thing happened to the last one (192.168.122.132). So after
about 180s, all my 3 masters were down.

I tried both:
sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master
and
sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
--work_dir=/var/lib/mesos/master
And I see the same error for both.

192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
running on a KVM hypervisor host.




Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote:

> You told the master it needed a quorum of 2 and it's the only one
> online, so it's bombing out.
> That's the expected behaviour.
>
> You need to start at least 2 zookeepers before it will be a functional
> group, same for the masters.
>
> You haven't mentioned how you setup your zookeeper cluster, so i'm
> assuming that's working
> correctly (3 nodes, all aware of the other 2 in their config). If not,
> you need to sort that out first.
>
>
> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> nodes like this:
>
> sudo ./bin/mesos-master.sh
> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> when you've run that command on 2 hosts things should start working,
> you'll want all 3 up for
> redundancy.
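The quorum arithmetic behind the advice above can be made explicit: a replicated log with `--quorum=2` needs at least two live masters before the registrar can recover, and in general a group of N replicas needs a strict majority, i.e. floor(N/2) + 1:

```python
# General majority-quorum rule (not Mesos-specific code): a group of
# n replicas requires a strict majority to make progress.
def quorum(n_replicas):
    return n_replicas // 2 + 1

# With 3 masters the quorum is 2, so a single running master (1 < 2)
# cannot recover the registrar -- which matches the error in this thread.
print(quorum(3))   # -> 2
print(quorum(5))   # -> 3
```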
>
> On 4 June 2016 at 16:42, Qian Zhang <zhq527...@gmail.com> wrote:
> > Hi Folks,
> >
> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> > Zookeeper running, so they form a Zookeeper cluster. And then when I
> started
> > the first Mesos master in one node with:
> > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> >
> > I fo

Re: Mesos Calico CNI

2016-05-20 Thread Qian Zhang
haosdent is right about the flags "--network_cni_config_dir" and
"--network_cni_plugins_dir", thanks haosdent :-)

> Is there a separate CLI for testing or checking the network
isolation config and plugin files without starting up an entire
cluster?

I think currently the quickest way to try CNI network support in Mesos end
to end is: start a Mesos master and a Mesos agent with the CNI isolator,
some CNI network configs, and some CNI plugins loaded, and run
"mesos-executor" to launch a container that joins one CNI network.



Thanks,
Qian Zhang

On Wed, May 18, 2016 at 5:57 PM, haosdent <haosd...@gmail.com> wrote:

> >It's not yet clear to me what exactly I have to put in the directories
> pointed to by --network_cni_config_dir and --network_cni_plugins_dir in
> order to create a network?
>
> According to my understanding from the code, Mesos tries to parse any
> files under --network_cni_config_dir except directories.
>
> --network_cni_plugins_dir is used to find executable files. Suppose you
> define your network in --network_cni_config_dir like
>
> ```
> {
>   "type": "foo",
>   ...
>   "ipam": {
> "type: "bar",
>   }
>   ...
> }
> ```
>
> After Mesos finishes parsing the network definition file under
> --network_cni_config_dir, it will try to find the executable files "foo"
> and "bar" under --network_cni_plugins_dir, because the type of the network
> you defined above is "foo" and the type of "ipam" is "bar".
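The lookup described above can be sketched as follows. This is a hypothetical Python model that mirrors the description in this thread, not the actual Mesos C++ implementation:

```python
# Sketch (hypothetical): scan a CNI config dir, read each network's "type"
# and "ipam.type", and resolve both to executables in the plugins dir.
import json
import os

def load_cni_networks(config_dir, plugins_dir):
    networks = {}
    for entry in os.listdir(config_dir):
        path = os.path.join(config_dir, entry)
        if os.path.isdir(path):        # directories are skipped
            continue
        with open(path) as f:
            conf = json.load(f)
        plugins = [conf["type"]]
        if "ipam" in conf:
            plugins.append(conf["ipam"]["type"])
        # Each named plugin must exist as a file in plugins_dir.
        missing = [p for p in plugins
                   if not os.path.isfile(os.path.join(plugins_dir, p))]
        if missing:
            raise RuntimeError("missing CNI plugins: %s" % missing)
        networks[conf["name"]] = conf
    return networks
```

So for the "foo"/"bar" example above, both `foo` and `bar` would have to exist as executables under --network_cni_plugins_dir for the network to be usable.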
>
> Just my quick reply, you could wait for Qian Zhang or Avinash's detail
> reply later.
>
>
> On Wed, May 18, 2016 at 4:40 PM, Frank Scholten <fr...@frankscholten.nl>
> wrote:
>
> > Hi Avinash,
> >
> > Thanks for your response. I am following the steps at
> > https://github.com/asridharan/mesos/blob/MESOS-4771/docs/cni.md and
> > when I run the mesos-execute command on the cluster I started at
> > https://github.com/ContainerSolutions/mesos-calico-cni-sandbox I get a
> > message saying the network does not exist. This is ok because I have
> > not created the network yet.
> >
> > It's not yet clear to me what exactly I have to put in the directories
> > pointed by --network_cni_config_dir and --network_cni_plugins_dir in
> > order to create a network?
> >
> >
> > On Tue, May 17, 2016 at 5:16 PM, Avinash Sridharan
> > <avin...@mesosphere.io> wrote:
> > > Hi Frank,
> > >  I am in the process of putting up the documentation for CNI support in
> > > Mesos. You can find the RB patch for the documentation here:
> > > https://reviews.apache.org/r/47463/
> > >
> > > You can find a rendering of the markdown on my github over here:
> > > https://github.com/asridharan/mesos/blob/MESOS-4771/docs/cni.md
> > >
> > >
> > > I have put up one example of using the `network/cni` isolator with a
> > > "bridge" plugin. Working on adding some more examples, but given that
> > > people have already started showing some interest thought would be a
> good
> > > dry run for the documentation if someone could test out the
> instructions.
> > >
> > > Would be great if you can try following the instructions and leave any
> > > feedback on the review board.
> > >
> > >
> > > Thanks,
> > > Avinash
> > >
> > > On Tue, May 17, 2016 at 6:51 AM, Frank Scholten <
> fr...@frankscholten.nl>
> > > wrote:
> > >
> > >> In the meantime I am looking at an alternative route trying to figure
> > >> out how an ipAddress value on a Marathon app get propagated into Mesos
> > >> CNI.
> > >>
> > >> Marathon reads the ipAddress value from the AppDefinition and then
> > >> publishes on the eventbus. I don't see what happens to it from that
> > >> point.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Tue, May 17, 2016 at 2:58 PM, Jay JN Guo <guojian...@cn.ibm.com>
> > wrote:
> > >> > - net::links() -> stout/net.hpp
> > >> > - Personally, I'm not very familiar with CLion build. Maybe somebody
> > else
> > >> > could answer that.
> > >> > - I think this is very much related to dev mailing list, so +dev
> > >> >
> > >> > /J
> > >> >
> > >> > Frank Scholten <fr...@frankscholten.nl> wrote on 05/17/2016
> 20:47:12:
> > >> >
> > >> >> From: Frank Scholte

Delete the /observe HTTP endpoint

2016-05-18 Thread Qian Zhang
Hi Folks,

We are going to delete the master "/observe" HTTP endpoint in the JIRA
ticket MESOS-5408 since this endpoint was introduced a long time ago
for supporting functionality that was never implemented.

Please let us know if you have any comments or concerns, thanks!


Thanks,
Qian Zhang


Re: Linking mesos tasks to Docker containers

2016-01-11 Thread Qian Zhang
I think slave state endpoint has the info you need:
...
"executors": [
{
"container": "b824a7ad-35c6-48ab-9378-b180ed431fb5",
  <- this is the container ID
...
"tasks": [
{
"executor_id": "",
"framework_id":
"83ced7f5-69b3-409b-abe5-a582a5d278cd-",
"id":
"app-docker1.f94cba69-b85e-11e5-bdf1-0242497320ff",
"name": "app-docker1",
...

So you can use the task ID to locate the task; each task must be under an
executor, and you can get the container ID from "executors.container", which
then links to the Docker container name.
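The lookup described above can be sketched as follows. This is a minimal illustration assuming the frameworks -> executors -> tasks nesting shown in the state output; `container_for_task` is a hypothetical helper, not actual Mesos tooling.

```python
def container_for_task(state: dict, task_id: str):
    """Walk a slave /state payload (frameworks -> executors -> tasks) and
    return the container ID of the executor running the given task."""
    for framework in state.get("frameworks", []):
        for executor in framework.get("executors", []):
            for task in executor.get("tasks", []):
                if task.get("id") == task_id:
                    return executor["container"]
    return None

# A trimmed-down payload shaped like the state output quoted above:
state = {
    "frameworks": [{
        "executors": [{
            "container": "b824a7ad-35c6-48ab-9378-b180ed431fb5",
            "tasks": [{"id": "app-docker1.f94cba69-b85e-11e5-bdf1-0242497320ff"}],
        }],
    }],
}
print(container_for_task(state, "app-docker1.f94cba69-b85e-11e5-bdf1-0242497320ff"))
```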


2016-01-12 10:44 GMT+08:00 haosdent :

> could you found it in slave state http endpoint? I remember the
> hierarchy should be framework->executors->container.
>
> On Tue, Jan 12, 2016 at 10:41 AM, haosdent  wrote:
>
>> The "container" field is the container ID here.
>>
>> On Tue, Jan 12, 2016 at 10:36 AM, Shuai Lin 
>> wrote:
>>
>>> Take a look at
>>> https://github.com/scrapinghub/marathon-apps-collectd-plugin, basically
>>> it works like this:
>>>
>>> - there is a custom collectd (https://github.com/collectd/collectd)
>>> plugin, written in python
>>> - the plugin uses docker inspect api to query the MESOS_TASK_ID and
>>> MARATHON_APP_ID environment variable for each container
>>> - the plugin queries cpu/mem/network stats from docker stats api
>>> - collectd sends the (marathon app, task id, stats) data to a storage
>>> backend, e.g. graphite, elasticsearch, influxdb, etc.
>>>
>>> Hope that helps.
>>>
>>> Shuai
>>>
>>> On Tue, Jan 12, 2016 at 4:12 AM, Scott Rankin  wrote:
>>>
 Hi all,

 I’m trying to put together a monitoring system for our
 Marathon/Mesos/Docker app infrastructure, and I’m having some trouble
 linking between Mesos and Docker.  Mesos has a task ID
 (like app_auth.6fb24412-af26-11e5-9900-02d62c7c9807), but the only place
 that shows up in the Docker container is in the MESOS_TASK_ID environment
 variable passed to the Docker containers.  I see that the Docker containers
 are created using the format mesos-<slave-id>-<container-id>, but I cannot find
 the <container-id> value anywhere in the metadata from the master or the
 slave.

 Am I missing something, or is there any way that I can configure Mesos
 to add a label or something to the Docker container so that our monitoring
 scripts can tie the Docker container to the Mesos task?

 Thanks!
 Scott

 SCOTT *RANKIN*
 VP, Technology

 *Motus, LLC*
 Two Financial Center, 60 South Street, Boston, MA 02111
 617.467.1931 (W) | sran...@motus.com 



 Follow us on LinkedIn  |
 Visit us at motus.com 




>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: Can Framework accept partial offers

2015-07-24 Thread Qian Zhang
I think it is the Mesos allocator that offers resources, and it is up to the
framework scheduler to accept or decline the resources in an offer.

2015-07-07 8:18 GMT+08:00 Vinod Kone vinodk...@gmail.com:

 Mesos doesn't currently support the notion of requesting resources.
 Resources are offered by Mesos based on a fair sharing algorithm (DRF) and
 it is up to the allocator to accept (partial) resources or decline.

 On Mon, Jul 6, 2015 at 5:00 PM, Ying Ji jiyin...@gmail.com wrote:

 Thanks for quick response.  That is very helpful.

 So, if I understand correctly, the framework should keep the entire
 offer even if only part of the offer satisfies its requirement? For
 example, the framework asks for a total of 4GB of memory as the role prod, and
 the master gives an offer such as 2GB on host1 and 2GB on host2. For some
 reason (probably data locality, etc.), the framework decides that the 2GB on
 host1 is acceptable. But the framework has to keep the entire offer and
 send another resource request to ask for another 2GB of memory. When the
 framework gets all the resources and launches the tasks, the unused resources
 will be released? So, although the framework asks for a total of 4GB of memory,
 it actually holds 6GB until it launches the tasks?

 Is this true ?

 Thanks

 Ying

 On Mon, Jul 6, 2015 at 4:25 PM, Connor Doyle con...@mesosphere.io
 wrote:

 Hi Ying,

 When launching tasks, the scheduler includes the resources to consume.
 The remainder is implicitly declined.
 Also, the scheduler can accept and merge multiple offers from the same
 slave.

 --
 Connor


  On Jul 6, 2015, at 16:19, Ying Ji jiyin...@gmail.com wrote:
 
  Hey, mesos experts:
 
  I have a question about mesos resource allocation. If the
 framework sends the resource request, the master will give the current best
 offer to the framework (probably not the one which can satisfy the
 framework completely). In this case, the framework can either accept the
 offer or decline the offer. My question is: can the framework accept the
 partial offer, and decline the other part ?
 
 
  Thanks
 
  Ying






Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-11 Thread Qian Zhang
Thanks Adam! It is clear to me now :-)

2015-06-12 7:49 GMT+08:00 Adam Bordelon a...@mesosphere.io:

 4. By default, Mesos will not revoke (rescind) an *un*used offer being
 held by a framework, but you can enable such a timeout by specifying the
 `--offer_timeout` flag on the master.

 On Thu, Jun 11, 2015 at 4:48 PM, Adam Bordelon a...@mesosphere.io wrote:

 1. The modularized allocator will still be a C++ interface, but you could
 just create a C++ wrapper around whatever Python/Go/Java/etc.
 implementation that you prefer.

 Your assessment of 2 and 3 sounds correct.

 4. By default, Mesos will not revoke (rescind) an unused offer being held
 by a framework, but you can enable such a timeout by specifying the
 `--offer_timeout` flag on the master.

 On Thu, Jun 11, 2015 at 1:41 AM, baotiao baot...@gmail.com wrote:

 Hi Qian Zhang

 I can answer the fourth question.

 if a framework has not responded to an offer for a sufficiently long
 time, Mesos rescinds the offer and re-offers the resources to other
 frameworks.
 You can't get it

 I am not clear on how Mesos divides all resources into multiple subsets.

 
 陈宗志

 Blog: baotiao.github.io




 On Jun 11, 2015, at 08:35, Qian Zhang zhq527...@gmail.com wrote:

 Thanks Alex.

 For 1. I understand currently the only choice is C++. However, as Adam
 mentioned, true pluggable allocator modules (MESOS-2160
 https://issues.apache.org/jira/browse/MESOS-2160) are landing in
 Mesos 0.23, so at that time, I assume we will have more choices, right?

 For 2 and 3, my understanding is Mesos allocator will partition all the
 available resources into multiple subsets, and there is no overlap between
 these subsets (i.e., a single resource can only be in one subset), and then
 offer these subsets to multiple frameworks (e.g., offer subset1 to
 framework1, offer subset2 to framework2, and so on), and it is up to each
 framework's scheduler to determine whether it accepts the resources to launch
 tasks or rejects them. In this way, each framework's scheduler can actually
 make scheduling decisions independently since they will never compete for the
 same resource.

 If my understanding is correct, then I have one more question:
 4. What if it takes a very long time (e.g., minutes or hours) for a
 framework's scheduler to make the scheduling decision? Does that mean
 during this long period, the resources offered to this framework will not
 be used by any other frameworks? Is there a timeout for the
 framework's scheduler to make the scheduling decision? So when the timeout
 is reached, the resources offered to it will be revoked by Mesos allocator
 and can be offered to another framework.
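The `--offer_timeout` behavior Adam describes earlier in this thread can be sketched as a toy model. This is not the real allocator code: `rescindable`, the offer IDs, and the time units are all invented for illustration.

```python
def rescindable(held_offers: dict, now: float, offer_timeout: float) -> list:
    """Offers a framework has held longer than --offer_timeout are
    rescinded by the master and re-offered to other frameworks."""
    return [offer_id for offer_id, offered_at in held_offers.items()
            if now - offered_at > offer_timeout]

# Offer O1 was made at t=0, O2 at t=90; at t=120 with a 60-unit timeout,
# only O1 has been held too long.
print(rescindable({"O1": 0.0, "O2": 90.0}, now=120.0, offer_timeout=60.0))  # ['O1']
```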







Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-10 Thread Qian Zhang
Thanks Alex.

For 1. I understand currently the only choice is C++. However, as Adam
mentioned, true pluggable allocator modules (MESOS-2160
https://issues.apache.org/jira/browse/MESOS-2160) are landing in Mesos
0.23, so at that time, I assume we will have more choices, right?

For 2 and 3, my understanding is Mesos allocator will partition all the
available resources into multiple subsets, and there is no overlap between
these subsets (i.e., a single resource can only be in one subset), and then
offer these subsets to multiple frameworks (e.g., offer subset1 to
framework1, offer subset2 to framework2, and so on), and it is up to each
framework's scheduler to determine whether it accepts the resources to launch
tasks or rejects them. In this way, each framework's scheduler can actually
make scheduling decisions independently since they will never compete for the
same resource.

If my understanding is correct, then I have one more question:
4. What if it takes a very long time (e.g., minutes or hours) for a framework's
scheduler to make the scheduling decision? Does that mean during this long
period, the resources offered to this framework will not be used by any
other frameworks? Is there a timeout for the framework's scheduler to make
the scheduling decision? So when the timeout is reached, the resources
offered to it will be revoked by Mesos allocator and can be offered to
another framework.


Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-09 Thread Qian Zhang
Thanks Adam, this is very helpful!

I have a few more questions:
1. For the pluggable allocator modules, can I write my own allocator in any
programming language (e.g., Python, Go, etc)?
2. For the default DRF allocator, when it offers resources to a framework,
will it offer all the available resources (resources not being used by any
framework), or just part of the available resources?
3. If there are multiple frameworks and the default DRF allocator only offers
resources to a single framework at a time, does that mean framework 2 has to
wait until framework 1 makes its placement decision?
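The DRF policy mentioned in this thread offers resources next to the framework with the lowest dominant share. A toy sketch of that choice, under stated assumptions (plain dicts for usage, no roles or weights — not the Mesos allocator's actual code):

```python
def dominant_share(used: dict, total: dict) -> float:
    """A framework's dominant share: the largest fraction of any single
    resource type it is currently using."""
    return max(used.get(r, 0.0) / total[r] for r in total)

def next_framework(usage: dict, total: dict) -> str:
    """DRF offers resources next to the framework with the lowest
    dominant share."""
    return min(usage, key=lambda f: dominant_share(usage[f], total))

total = {"cpus": 10.0, "mem": 100.0}
usage = {
    "f1": {"cpus": 4.0, "mem": 10.0},   # dominant share 0.4 (cpus)
    "f2": {"cpus": 1.0, "mem": 30.0},   # dominant share 0.3 (mem)
}
print(next_framework(usage, total))  # f2
```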


Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-08 Thread Qian Zhang
Hi Michael,

I think it may not be the latter because I see this in the comments of the
function resourceOffers():

 * Note that resources may be concurrently offered to more than one
   * framework at a time (depending on the allocator being used). In
   * that case, the first framework to launch tasks using those
   * resources will be able to use them while the other frameworks
   * will have those resources rescinded (or if a framework has
   * already launched tasks with those resources then those tasks will
   * fail with a TASK_LOST status and a message saying as much).
   */

So Mesos can support both 1 and 2, which actually depends on the allocator
being used, right?


2015-06-06 19:06 GMT+08:00 Michael Hausenblas michael.hausenb...@gmail.com
:


  1. Mesos master offers all the resources to all the frameworks
 simultaneously.
  2. Mesos master offers resources to one framework at a time, e.g., it
 offers r1, r2, r3 to f1, and f1 accepts r1, and then it offers r2 and r3 to
 f2, ...

 The latter, yes.

 For a quick overview,  I suggest you have a look at
 http://mesos.apache.org/documentation/latest/mesos-architecture/ which
 covers the resource offer cycle.

 If you want to dive deeper, you might want to read:

 1. http://mesos.berkeley.edu/mesos_tech_report.pdf
 2. https://www.cs.berkeley.edu/~alig/papers/drf.pdf


 Note that there's a feature in the works that would be closer to your 1.,
 see https://issues.apache.org/jira/browse/MESOS-1607

 Cheers,
 Michael

 --
 Michael Hausenblas
 Ireland, Europe
 http://mhausenblas.info/

  On 6 Jun 2015, at 12:51, Qian Zhang zhq527...@gmail.com wrote:
 
  Hi,
 
  I am new to Mesos, and I'd like to know: if there are a lot of resources in
 the Mesos cluster, how will the Mesos master offer these resources to
 multiple frameworks? I guess there can be two ways:
  1. Mesos master offers all the resources to all the frameworks
 simultaneously.
  2. Mesos master offers resources to one framework at a time, e.g., it
 offers r1, r2, r3 to f1, and f1 accepts r1, and then it offers r2 and r3 to
 f2, ...
 
  If it is 1, then I'd like to know how the Mesos master resolves conflicts,
 e.g., multiple frameworks accepting the same resource.
  If it is 2, then I see it is actually a serial process since the Mesos
 master handles the frameworks one by one; then what is the advantage of Mesos
 over a traditional monolithic resource scheduler?
 
 
  Thanks,
  Qian




Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-06 Thread Qian Zhang
Thanks Michael, I will take a close look at those paper.

2015-06-06 19:06 GMT+08:00 Michael Hausenblas michael.hausenb...@gmail.com
:


  1. Mesos master offers all the resources to all the frameworks
 simultaneously.
  2. Mesos master offers resources to one framework at a time, e.g., it
 offers r1, r2, r3 to f1, and f1 accepts r1, and then it offers r2 and r3 to
 f2, ...

 The latter, yes.

 For a quick overview,  I suggest you have a look at
 http://mesos.apache.org/documentation/latest/mesos-architecture/ which
 covers the resource offer cycle.

 If you want to dive deeper, you might want to read:

 1. http://mesos.berkeley.edu/mesos_tech_report.pdf
 2. https://www.cs.berkeley.edu/~alig/papers/drf.pdf


 Note that there's a feature in the works that would be closer to your 1.,
 see https://issues.apache.org/jira/browse/MESOS-1607

 Cheers,
 Michael

 --
 Michael Hausenblas
 Ireland, Europe
 http://mhausenblas.info/

  On 6 Jun 2015, at 12:51, Qian Zhang zhq527...@gmail.com wrote:
 
  Hi,
 
  I am new to Mesos, and I'd like to know: if there are a lot of resources in
 the Mesos cluster, how will the Mesos master offer these resources to
 multiple frameworks? I guess there can be two ways:
  1. Mesos master offers all the resources to all the frameworks
 simultaneously.
  2. Mesos master offers resources to one framework at a time, e.g., it
 offers r1, r2, r3 to f1, and f1 accepts r1, and then it offers r2 and r3 to
 f2, ...
 
  If it is 1, then I'd like to know how the Mesos master resolves conflicts,
 e.g., multiple frameworks accepting the same resource.
  If it is 2, then I see it is actually a serial process since the Mesos
 master handles the frameworks one by one; then what is the advantage of Mesos
 over a traditional monolithic resource scheduler?
 
 
  Thanks,
  Qian




Failed to make check and run example framework

2015-06-01 Thread Qian Zhang
Hi,

I followed the exact steps in http://mesos.apache.org/gettingstarted/ to
try Mesos, what I am using is a RHEL 6.5 x86_64 virtual machine. But make
check failed:
[--] 1 test from PerfEventIsolatorTest
[ RUN  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
F0601 22:45:27.851017 12655 isolator_tests.cpp:710] CHECK_SOME(isolator):
Perf is not supported
*** Check failure stack trace: ***
@ 0x7ffe4fa7dd60  google::LogMessage::Fail()
@ 0x7ffe4fa7dcb9  google::LogMessage::SendToLog()
@ 0x7ffe4fa7d697  google::LogMessage::Flush()
@ 0x7ffe4fa8061f  google::LogMessageFatal::~LogMessageFatal()
@   0x97fe94  _CheckFatal::~_CheckFatal()
@   0xbf7829
 
mesos::internal::tests::PerfEventIsolatorTest_ROOT_CGROUPS_Sample_Test::TestBody()
@  0x10e5fa5
 testing::internal::HandleSehExceptionsInMethodIfSupported()
@  0x10e0abe
 testing::internal::HandleExceptionsInMethodIfSupported()
@  0x10c783e  testing::Test::Run()
@  0x10c8078  testing::TestInfo::Run()
@  0x10c86ac  testing::TestCase::Run()
@  0x10cda39  testing::internal::UnitTestImpl::RunAllTests()
@  0x10e71f9
 testing::internal::HandleSehExceptionsInMethodIfSupported()
@  0x10e1899
 testing::internal::HandleExceptionsInMethodIfSupported()
@  0x10cc5c1  testing::UnitTest::Run()
@   0xc9fe96  main
@ 0x7ffe4bb21d1d  __libc_start_main
@   0x85ba69  (unknown)
I0601 22:45:29.976155 14327 exec.cpp:450] Slave exited, but framework has
checkpointing enabled. Waiting 15mins to reconnect with slave
20150601-224223-2574952640-39385-12655-S0
make[3]: *** [check-local] Aborted (core dumped)
make[3]: Leaving directory `/root/mesos-0.22.1/build/src'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/root/mesos-0.22.1/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/root/mesos-0.22.1/build/src'
make: *** [check-recursive] Error 1
[root@mesos build]# echo $?
2


And the example C++ framework also failed:
# ./src/test-framework --master=127.0.0.1:5050
I0601 23:02:44.646636 14828 sched.cpp:157] Version: 0.22.1
I0601 23:02:44.662256 14849 sched.cpp:254] New master detected at
master@127.0.0.1:5050
I0601 23:02:44.664237 14849 sched.cpp:264] No credentials provided.
Attempting to register without authentication
I0601 23:02:44.670964 14853 sched.cpp:448] Framework registered with
20150601-225015-16777343-5050-14668-
Registered!
Received offer 20150601-225015-16777343-5050-14668-O0 with cpus(*):4;
mem(*):2806; disk(*):40810; ports(*):[31000-32000]
Launching task 0 using offer 20150601-225015-16777343-5050-14668-O0
Launching task 1 using offer 20150601-225015-16777343-5050-14668-O0
Launching task 2 using offer 20150601-225015-16777343-5050-14668-O0
Launching task 3 using offer 20150601-225015-16777343-5050-14668-O0
Task 0 is in state TASK_LOST
Aborting because task 0 is in unexpected state TASK_LOST with reason 1 from
source 1 with message 'Executor terminated'
I0601 23:02:44.880982 14848 sched.cpp:1623] Asked to abort the driver
I0601 23:02:44.881239 14848 sched.cpp:856] Aborting framework
'20150601-225015-16777343-5050-14668-'
I0601 23:02:44.881921 14828 sched.cpp:1589] Asked to stop the driver



Any help will be appreciated, thanks!


Re: Failed to make check and run example framework

2015-06-01 Thread Qian Zhang
Thanks Haosdent and Ian.

But in my machine, I already have perf installed, so I am not sure why
those test cases still failed.
# rpm -qa | grep perf
perf-2.6.32-431.el6.x86_64
# which perf
/usr/bin/perf

And can you please let me know why my example C++ framework does not work
normally? I see the following message:
Task 0 is in state TASK_LOST
Aborting because task 0 is in unexpected state TASK_LOST with reason 1
from source 1 with message 'Executor terminated'
It seems the task is in an unexpected state, right? And after
"./src/test-framework --master=127.0.0.1:5050" is executed, I ran "echo
$?" and its output is 1, which means something went wrong; otherwise "echo $?"
should output 0, right?


Re: Failed to make check and run example framework

2015-06-01 Thread Qian Zhang
I reran the check with "GTEST_FILTER=-Perf* make check", but it failed
again in another place:

[ RUN  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup
-bash: /sys/fs/cgroup/cpu/mesos/container/cgroup.procs: No such file or
directory
mkdir: cannot create directory `/sys/fs/cgroup/cpu/mesos/container/user':
No such file or directory
../../src/tests/isolator_tests.cpp:1127: Failure
Value of: os::system("su - " + UNPRIVILEGED_USERNAME + " -c 'mkdir " +
path::join(flags.cgroups_hierarchy, userCgroup) + "'")
  Actual: 256
Expected: 0
-bash: /sys/fs/cgroup/cpu/mesos/container/user/cgroup.procs: No such file
or directory
../../src/tests/isolator_tests.cpp:1136: Failure
Value of: os::system("su - " + UNPRIVILEGED_USERNAME + " -c 'echo $$ > " +
path::join(flags.cgroups_hierarchy, userCgroup, "cgroup.procs") + "'")
  Actual: 256
Expected: 0
-bash: /sys/fs/cgroup/cpuacct/mesos/container/cgroup.procs: No such file or
directory
mkdir: cannot create directory
`/sys/fs/cgroup/cpuacct/mesos/container/user': No such file or directory
../../src/tests/isolator_tests.cpp:1127: Failure
Value of: os::system("su - " + UNPRIVILEGED_USERNAME + " -c 'mkdir " +
path::join(flags.cgroups_hierarchy, userCgroup) + "'")
  Actual: 256
Expected: 0
-bash: /sys/fs/cgroup/cpuacct/mesos/container/user/cgroup.procs: No such
file or directory
../../src/tests/isolator_tests.cpp:1136: Failure
Value of: os::system("su - " + UNPRIVILEGED_USERNAME + " -c 'echo $$ > " +
path::join(flags.cgroups_hierarchy, userCgroup, "cgroup.procs") + "'")
  Actual: 256
Expected: 0
[  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where
TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (443 ms)
[--] 1 test from UserCgroupIsolatorTest/1 (443 ms total)

[--] 1 test from UserCgroupIsolatorTest/2, where TypeParam =
mesos::internal::slave::CgroupsPerfEventIsolatorProcess
userdel: user 'mesos.test.unprivileged.user' does not exist
[ RUN  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup
F0602 10:13:49.849755  4279 isolator_tests.cpp:1054] CHECK_SOME(isolator):
Perf is not supported
*** Check failure stack trace: ***
2015-06-02
10:13:49,863:4279(0x7fd4cf5fe700):ZOO_ERROR@handle_socket_error_msg@1697:
Socket [127.0.0.1:45737] zk retcode=-4, errno=111(Connection refused):
server refused to accept the client
@ 0x7fd569fc2d60  google::LogMessage::Fail()
@ 0x7fd569fc2cb9  google::LogMessage::SendToLog()
@ 0x7fd569fc2697  google::LogMessage::Flush()
@ 0x7fd569fc561f  google::LogMessageFatal::~LogMessageFatal()
@   0x97fe94  _CheckFatal::~_CheckFatal()
@   0xc10e55
 
mesos::internal::tests::UserCgroupIsolatorTest_ROOT_CGROUPS_UserCgroup_Test::TestBody()
@  0x10e5fa5
 testing::internal::HandleSehExceptionsInMethodIfSupported()
@  0x10e0abe
 testing::internal::HandleExceptionsInMethodIfSupported()
@  0x10c783e  testing::Test::Run()
@  0x10c8078  testing::TestInfo::Run()
@  0x10c86ac  testing::TestCase::Run()
@  0x10cda39  testing::internal::UnitTestImpl::RunAllTests()
@  0x10e71f9
 testing::internal::HandleSehExceptionsInMethodIfSupported()
@  0x10e1899
 testing::internal::HandleExceptionsInMethodIfSupported()
@  0x10cc5c1  testing::UnitTest::Run()
@   0xc9fe96  main
@   0x38d141ed1d  (unknown)
@   0x85ba69  (unknown)
I0602 10:13:52.680408  5947 exec.cpp:450] Slave exited, but framework has
checkpointing enabled. Waiting 15mins to reconnect with slave
20150602-101045-2574952640-43322-4279-S0
make[3]: *** [check-local] Aborted (core dumped)
make[3]: Leaving directory `/root/mesos-0.22.1/build/src'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/root/mesos-0.22.1/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/root/mesos-0.22.1/build/src'
make: *** [check-recursive] Error 1


make check failed on RHEL 7

2014-08-20 Thread Qian Zhang
Hi All,

I am trying mesos-0.19.1 on RHEL 7, however, when I ran make check, it
failed:

make[3]: Entering directory `/root/mesos-0.19.1/build/src'
./mesos-tests
Source directory: /root/mesos-0.19.1
Build directory: /root/mesos-0.19.1/build
Note: Google Test filter =
*-SlaveCount/Registrar_BENCHMARK_Test.performance/0:SlaveCount/Registrar_BENCHMARK_Test.performance/1:SlaveCount/Registrar_BENCHMARK_Test.performance/2:SlaveCount/Registrar_BENCHMARK_Test.performance/3:
[==] Running 339 tests from 59 test cases.
[--] Global test environment set-up.
[--] 1 test from DRFAllocatorTest
[ RUN  ] DRFAllocatorTest.DRFAllocatorProcess
../../src/tests/allocator_tests.cpp:117: Failure
Failed to wait 10secs for offers1
../../src/tests/allocator_tests.cpp:112: Failure
Actual function call count doesn't match EXPECT_CALL(sched1,
resourceOffers(_, _))...
 Expected: to be called once
   Actual: never called - unsatisfied and active
../../src/tests/allocator_tests.cpp:109: Failure
Actual function call count doesn't match EXPECT_CALL(sched1, registered(_,
_, _))...
 Expected: to be called once
   Actual: never called - unsatisfied and active
../../src/tests/allocator_tests.cpp:92: Failure
Actual function call count doesn't match EXPECT_CALL(allocator,
slaveAdded(_, _, _))...
 Expected: to be called once
   Actual: never called - unsatisfied and active
../../src/tests/allocator_tests.cpp:107: Failure
Actual function call count doesn't match EXPECT_CALL(allocator,
frameworkAdded(_, _, _))...
 Expected: to be called once
   Actual: never called - unsatisfied and active
make[3]: *** [check-local] Segmentation fault
make[3]: Leaving directory `/root/mesos-0.19.1/build/src'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/root/mesos-0.19.1/build/src'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/root/mesos-0.19.1/build/src'
make: *** [check-recursive] Error 1

Any ideas about what happened? Any help will be appreciated.


Thanks,
Qian