Re: GLOG settings

2015-12-10 Thread Zameer Manji
That's a really handy tip, thanks for sharing!

As for GLOG --> SLF4J (or any other Java logging framework), I don't think
there is an option available because the native library logs to stderr
directly.

On Thu, Dec 10, 2015 at 9:06 AM, Charles Allen <
charles.al...@metamarkets.com> wrote:

> Just a general FYI: you can tune logging settings by setting environment
> variables with the GLOG_ prefix. For example, GLOG_v=9 turns up the
> verbosity level. This works if you're running a Java framework with the
> native Java library. Unfortunately, I haven't found a way to plug GLOG -->
> SLF4J.
>
> --
> Zameer Manji
>
>
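
A minimal sketch of the GLOG_ tip in this thread, assuming a launcher script starts the framework as a child process. The helper names `with_glog_flags` and `launch` are hypothetical; only the `GLOG_` prefix and `GLOG_v=9` come from the messages here.

```python
import os
import subprocess


def with_glog_flags(base_env, **flags):
    """Return a copy of base_env with each flag exported as GLOG_<name>."""
    env = dict(base_env)
    for name, value in flags.items():
        env["GLOG_" + name] = str(value)
    return env


def launch(command, verbosity):
    """Start a child process whose native library sees GLOG_v=<verbosity>."""
    env = with_glog_flags(os.environ, v=verbosity)
    # The native library logs to stderr directly, so stderr stays attached.
    return subprocess.Popen(command, env=env)
```

Calling something like `launch(["java", "-jar", "my-framework.jar"], verbosity=9)` (the command is a placeholder) would start the framework with `GLOG_v=9` in its environment.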


Re: Safe update of agent attributes

2016-02-22 Thread Zameer Manji
Zhitao,

In my experience the best way to manage these attributes is to ensure
attribute changes are minimal (i.e., one attribute at a time) and roll them
out slowly across a cluster. This way you can catch unsafe mutations
quickly and roll back if needed.

I don't think there is a whitelist/blacklist of attributes to reference so
I think this is the safest way to go.
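
A rough sketch of what "roll them out slowly" could look like in tooling. This is an illustration only: `apply_change` and `is_healthy` are hypothetical callbacks that you would supply (for example, a config push and a check that the agent re-registered), not anything provided by Mesos.

```python
def staged_rollout(hosts, apply_change, is_healthy, batch_size=1):
    """Apply a change batch by batch; stop at the first unhealthy host.

    Returns (updated_hosts, failed_host), where failed_host is None when
    every batch came back healthy.
    """
    updated = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            apply_change(host)
            updated.append(host)
        for host in batch:
            if not is_healthy(host):
                # Halt here so the remaining hosts keep the old attributes
                # and the caller can roll back the hosts in `updated`.
                return updated, host
    return updated, None
```

With `batch_size=1` a bad attribute change touches at most one agent before the rollout stops.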

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li  wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop if
> not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This combined with --recovery_timeout
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that a "bad"
> configuration could be left on the machine and cause big trouble on the next
> agent upgrade if the USR1 signal is not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to old metadata if it detects
> incompatibility, so agents would keep running with old attributes instead
> of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>
> --
> Zameer Manji
>
>


Re: Safe update of agent attributes

2016-02-23 Thread Zameer Manji
Is incompatible slave info signaled by a certain exit code?

On Tue, Feb 23, 2016 at 11:15 AM, Vinod Kone  wrote:

>
> On Tue, Feb 23, 2016 at 8:44 AM, Zhitao Li  wrote:
>
>> Can we consider adding a new option like "--auto_recovery_cleanup" which
>> would automatically perform the cleanup if incompatible slave info is
>> detected, or changing the default behavior for "--recover"?
>>
>
> Wouldn't you want to know that an incompatible slave info was introduced?
> What if it was unintentional? Note that whether the slave automatically
> deletes the working directory (what you are proposing) or not, the tasks
> will be gone.
>
> One option would be for your startup script (the one that wraps
> mesos-slave binary) to contain this logic.
>
> --
> Zameer Manji
>
>
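
A sketch of the wrapper idea from the quoted reply, under stated assumptions. Since this thread does not establish a dedicated exit code, the sketch falls back to matching the error text quoted earlier ("Incompatible slave info detected") and assumes it appears on the agent's stderr. `run_agent` and `on_incompatible` are hypothetical names; what the callback does (alerting, cleanup) is left to the operator.

```python
import subprocess

# Error text quoted earlier in this thread; assumed to appear in agent output.
INCOMPATIBLE_MARKER = "Incompatible slave info detected"


def saw_incompatible_info(output):
    """Return True if the agent output contains the incompatibility error."""
    return INCOMPATIBLE_MARKER in output


def run_agent(command, on_incompatible):
    """Run the agent; call on_incompatible(output) if recovery was refused."""
    # Buffers stderr until exit, which is fine for a sketch; a real wrapper
    # for a long-running agent would stream the output instead.
    result = subprocess.run(command, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0 and saw_incompatible_info(result.stderr):
        on_incompatible(result.stderr)
    return result.returncode
```

Keeping the decision in a callback means the wrapper can page someone rather than silently wiping state, which addresses the "what if it was unintentional" concern.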


Re: Reserved memory and disk for mesos

2016-03-15 Thread Zameer Manji
Arkal,

There are many services that might need to run on a host adjacent to the
slave. For example, you might need a dns resolver, ntp, postfix, sshd, etc.
All of these services may need disk space for logging or other uses.

On Tue, Mar 15, 2016 at 11:12 AM, Arkal Arjun Rao  wrote:

> Hi All,
>
> I read this line in the book "Building Applications on Mesos" by David
> Greenberg:
>
> "The slave will reserve 1 GB or 50% of detected memory, whichever
> is smaller, in order to run itself and other operating system services.
> Likewise, it will reserve 5 GB or 50% of detected disk, whichever is
> smaller."
>
> I haven't been able to figure out what "operating system services" refers
> to, and whether it is really necessary to block off the entire 5 GB of disk
> for that purpose. Could anyone shed some light on what the disk is used
> for, in more detail?
>
> Ideally, I'd like as much of the disk as possible for my framework.
>
> Thanks in advance,
> Arjun
> --
> Arjun Arkal Rao
>
> PhD Student,
> Haussler Lab,
> UC Santa Cruz,
> USA
>
> aa...@ucsc.edu
>
> --
> Zameer Manji
>
> 
>
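
To make the quoted rule concrete, here is the arithmetic as a small sketch. It only encodes the sentence from the book as quoted in this thread (1 GB or 50% of memory, 5 GB or 50% of disk, whichever is smaller); the function names are made up for illustration.

```python
def reserved(detected, cap):
    """Amount held back: the cap or 50% of what was detected, whichever is smaller."""
    return min(cap, 0.5 * detected)


def offered(detected_mem_gb, detected_disk_gb):
    """Return (memory, disk) in GB left over after the quoted reservation rule."""
    mem = detected_mem_gb - reserved(detected_mem_gb, cap=1.0)
    disk = detected_disk_gb - reserved(detected_disk_gb, cap=5.0)
    return mem, disk
```

Under this rule a host with 16 GB of memory and 100 GB of disk would advertise 15 GB and 95 GB, while a small host with 8 GB of disk would hold back only 4 GB.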


Re: HTTP API

2016-03-19 Thread Zameer Manji
+1

I am also interested in knowing the state of the HTTP API. I have heard
that stabilizing the API might be tied to Mesos 1.0, but I don't have a
source for that. Can a PMC member comment on what the plan is?

On Mon, Mar 14, 2016 at 2:30 PM, Dario Rexin  wrote:

> Hi all,
>
> Around 7.5 months have passed since the introduction of the HTTP API in
> 0.24. What are the plans to make this API stable? There are already
> features (inverse offers) that are exclusively available through this API,
> so it would be great to have a timeline, as I think for most people it’s
> impossible to use experimental features in production.
>
> Thanks,
> Dario
>
> --
> Zameer Manji
>
>


Re: HTTP API

2016-03-19 Thread Zameer Manji
On Thu, Mar 17, 2016 at 10:03 AM, Vinod Kone  wrote:

> Other than the issues listed above, we would like frameworks to start
> testing this API in their staging/testing clusters. This would give us the
> most confidence to call it production ready. Can you help?
>

As a committer of Apache Aurora, I am interested in removing the dependency
on libmesos and creating a Java Scheduler Driver that communicates with the
HTTP API. However, it only seems worthwhile to do once the API has
stabilized. I'll wait for the API to be finalized and then assess what work
needs to be done for the framework.

-- 
Zameer Manji


Re: Mesos 0.28 SSL in official packages

2016-04-11 Thread Zameer Manji
I have suggested this before and I will suggest it again here.

I think the Apache Mesos project should build and distribute packages
instead of relying on the generosity of a commercial vendor. The Apache
Aurora project does this already with good success. As a user of Apache
Mesos I don't care about Mesosphere Inc and I feel uncomfortable that the
project is so dependent on its employees.

Doing this would allow users to contribute packaging fixes directly to the
project, such as enabling SSL.

On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon  wrote:

> Hi Kamil,
>
> Technically, there are no "official" Apache-built packages for Apache
> Mesos.
>
> At least one company (Mesosphere) chooses to build and distribute
> Mesos packages, but does not currently offer SSL builds. It wouldn't
> be hard to add an SSL build to our regular builds, but it hasn't been
> requested enough to prioritize it.
>
> cc: Joris, Kapil
>
> On Thu, Apr 7, 2016 at 7:42 AM, haosdent  wrote:
> > Hi, SSL isn't enabled by default. You need to compile it by following this doc
> > http://mesos.apache.org/documentation/latest/ssl/
> >
> > On Thu, Apr 7, 2016 at 10:04 PM, Kamil Wokitajtis 
> > wrote:
> >>
> >> This is my first post, so Hi everyone!
> >>
> >> Is SSL enabled in official packages (CentOS in my case)?
> >> I can see libssl in ldd output, but I cannot see libevent.
> >> I had to compile mesos from sources to run it over ssl.
> >> I would prefer to install it from packages.
> >>
> >> Regards,
> >> Kamil
> >
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
>
> --
> Zameer Manji
>
>


Re: Mesos 0.28 SSL in official packages

2016-04-12 Thread Zameer Manji
For the record, I am not a committer on the Apache Mesos project and I do
not have the time to contribute packaging tools for the project. I think
existing committers who are Mesosphere employees can kick start this effort
by asking their employer to contribute the existing tools to the project.
Then any committer can incorporate creating packages as a step in the
release process.

On Tue, Apr 12, 2016 at 10:10 AM, Kapil Arya  wrote:

> At Mesosphere, we are planning to enable SSL into the nightlies starting
> sometime later this week. The goal is to have both SSL and non-SSL Mesos
> packages for Mesos 0.29.0 onwards in the Mesosphere deb/rpm repos. I will
> send out another email as soon as the stuff is ready for the community.
>
> Best,
> Kapil
>
> On Tue, Apr 12, 2016 at 12:22 PM, Steven Borrelli 
> wrote:
>
>> I’d be willing to assist in the effort to have standard packages (and
>> additional packages for modules like net-modules).
>>
>> Steven Borrelli
>> st...@borrelli.org
>>
>>
>>
>> > On Apr 12, 2016, at 11:10 AM, Adam Bordelon  wrote:
>> >
>> > We've discussed Apache-built/distributed packages before, and nobody
>> > has any objections, but we need somebody to take on the work to get
>> > the package builds setup. I believe Vinod had some thoughts on how to
>> > get started, but any Apache committer (Zameer?) should have access to
>> > builds.apache.org
>> > I think we're all also in favor of improved documentation/website for
>> > the Apache Mesos project, but we need your help for that too. File
>> > JIRAs, submit patches!
>> >
>> > On Tue, Apr 12, 2016 at 5:42 AM, June Taylor  wrote:
>> >> I heartily agree on both points. While I've found Mesosphere's
>> documentation
>> >> very helpful, it is often mixed up with the DCOS commercial offering.
>> That
>> >> may be something we're interested in down the road, but right now we
>> are
>> >> trying to stand up a relatively small cluster using straight
>> >> Mesos/Marathon/Chronos, etc., and finding good documentation is very
>> >> challenging due to the overlap with DCOS.
>> >>
>> >> The Apache documentation also, I think, is suffering precisely because
>> >> Mesosphere has been serving up better materials on their own. It's
>> certainly
>> >> useful, but I'm a bit uncomfortable with that arrangement.
>> >>
>> >>
>> >> Thanks,
>> >> June Taylor
>> >> System Administrator, Minnesota Population Center
>> >> University of Minnesota
>> >>
>> >> On Tue, Apr 12, 2016 at 7:04 AM, Paul Bell  wrote:
>> >>>
>> >>> FWIW, I quite agree with Zameer's point.
>> >>>
>> >>> That said, I want to make abundantly clear that in my experience the
>> folks
>> >>> at Mesosphere are wonderfully helpful.
>> >>>
>> >>> But what happens if down the road Mesosphere is acquired or there
>> occurs
>> >>> some other event that could represent, if not a conflict of interest,
>> then
>> >>> simply a different strategic direction?
>> >>>
>> >>> My 2 cents.
>> >>>
>> >>> -Paul
>> >>>
>> >>> On Mon, Apr 11, 2016 at 5:19 PM, Zameer Manji 
>> wrote:
>> >>>>
>> >>>> I have suggested this before and I will suggest it again here.
>> >>>>
>> >>>> I think the Apache Mesos project should build and distribute packages
>> >>>> instead of relying on the generosity of a commercial vendor. The
>> Apache
>> >>>> Aurora project does this already with good success. As a user of
>> Apache
>> >>>> Mesos I don't care about Mesosphere Inc and I feel uncomfortable
>> that the
>> >>>> project is so dependent on its employees.
>> >>>>
>> >>>> Doing this would allow users to contribute packaging fixes directly
>> to
>> >>>> the project, such as enabling SSL.
>> >>>>
>> >>>> On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon 
>> >>>> wrote:
>> >>>>>
>> >>>>> Hi Kamil,
>> >>>>>
>> >>>>> Technically, there are no "official" Apache-built packages for
>> Apache
>> >>>>> Mesos.
>> 

Re: [Proposal] Remove the default value for agent work_dir

2016-04-12 Thread Zameer Manji
+1

I have seen this confuse users of Apache Aurora many times.
Eliminating the default will cause operators to select a location with the
appropriate persistence properties.

On Tue, Apr 12, 2016 at 3:58 PM, Greg Mann  wrote:

> Hey folks!
> A number of situations have arisen in which the default value of the Mesos
> agent `--work_dir` flag (/tmp/mesos) has caused problems on systems in
> which the automatic cleanup of '/tmp' deletes agent metadata. To resolve
> this, we would like to eliminate the default value of the agent
> `--work_dir` flag. You can find the relevant JIRA here
> <https://issues.apache.org/jira/browse/MESOS-5064>.
>
> We considered simply changing the default value to a more appropriate
> location, but decided against this because the expected filesystem
> structure varies from platform to platform, and because it isn't guaranteed
> that the Mesos agent would have access to the default path on a particular
> platform.
>
> Eliminating the default `--work_dir` value means that the agent would exit
> immediately if the flag is not provided, whereas currently it launches
> successfully in this case. This will break existing infrastructure which
> relies on launching the Mesos agent without specifying the work directory.
> I believe this is an acceptable change because '/tmp/mesos' is not a
> suitable location for the agent work directory except for short-term local
> testing, and any production scenario that is currently using this location
> should be altered immediately.
>
> If you have any thoughts/opinions/concerns regarding this change, please
> let us know!
>
> Cheers,
> Greg
>
> --
> Zameer Manji
>
>
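
For operators who want to catch this before the agent starts, a wrapper-side check might look like the sketch below. The function name and the exact policy (explicit, absolute, not under /tmp) are choices made for this illustration, not Mesos behavior; `/var/lib/mesos` in the usage note is just an example path.

```python
from pathlib import PurePosixPath


def check_work_dir(work_dir):
    """Raise ValueError unless work_dir is an explicit path outside /tmp."""
    if not work_dir:
        raise ValueError("--work_dir must be set explicitly")
    path = PurePosixPath(work_dir)
    if not path.is_absolute():
        raise ValueError("--work_dir must be an absolute path")
    if path.parts[:2] == ("/", "tmp"):
        raise ValueError("--work_dir under /tmp may be removed by tmp cleanup")
    return str(path)
```

`check_work_dir("/var/lib/mesos")` returns the path unchanged, while `check_work_dir("/tmp/mesos")` raises, which mirrors the failure mode described in the proposal.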


Re: IP address as resource

2016-05-06 Thread Zameer Manji
Bharath,

Aurora is currently adding support for arbitrary resources with this exact
use case in mind. The code isn't complete yet and it hasn't been tried out
in production. I suggest reaching out to the Aurora user@ mailing list
<http://aurora.apache.org/community/> to get the latest update.

On Fri, May 6, 2016 at 6:36 AM, Bharath Ravi Kumar 
wrote:

> Hi,
>
> I'm aware of mesos' IP-per-container capability and the authors' reasons
> for not modeling an IP address as a resource on a host. However, for
> operational simplicity, I prefer an implementation that does not interact
> with multiple other services (e.g. an IPAM). I'm hence considering the
> following approach:
>
> a) Model the IP addresses available on a host as resources.
> b) Using the IP address (from the set) accepted by a framework, launch a
> task using the docker containerizer, with the IP address selected by the
> framework.
> c) For tasks that are not resource intensive, fall back on port range
> reservation and docker host mode networking.
>
> It appears that Marathon doesn't support arbitrary resources, but Apache
> Aurora might(?) . I'd like to know if anyone else has attempted this
> approach with either framework, any potential downsides to this approach,
> and any alternatives that are similar to this.
>
> Thanks,
> Bharath
>
> --
> Zameer Manji
>
>
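
A small sketch of steps (a) and (b) from the quoted proposal. It assumes the agent accepts set-valued resources written as `name:{a,b,c}` and that dotted addresses are valid set items; neither is verified here against a specific Mesos version. The resource name `ips` and both function names are hypothetical.

```python
import ipaddress


def ip_set_resource(addresses, name="ips"):
    """Format addresses as a set-valued resource string: name:{a,b,c}."""
    # Validate and normalize each address before formatting.
    items = [str(ipaddress.ip_address(a)) for a in addresses]
    return "%s:{%s}" % (name, ",".join(items))


def pick_address(offered, in_use):
    """Return the first offered address that is not already in use."""
    for address in offered:
        if address not in in_use:
            return address
    return None
```

The first helper produces the string an operator would advertise for a host; the second is the framework-side choice of one address from an offer.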


Re: 1.0 Release Candidate

2016-05-25 Thread Zameer Manji
I might be in the minority here, but I think cutting an RC for 1.0 right
now is very aggressive. Does there exist even a single framework that uses
the Scheduler HTTP API or the Executor HTTP API? Does anyone even use these
APIs in production? Is there a single entity that uses the Operator API to
manage agents?

I think cutting an RC right now is 100% premature until the community can
provide clear answers to these questions.

I think the Mesos project has been historically successful because its features
were developed in a slow, methodical manner and then battle-tested by at
least one user before the feature was declared 'stable' and ready for use by
everyone. I think not following those steps here for the HTTP APIs is a
huge error.

On Wed, May 25, 2016 at 12:51 PM, Vinod Kone  wrote:

> Post 1.0. Jie might be able to shed more light regarding the plans for
> Docker Containerizer.
>
> On Wed, May 25, 2016 at 12:10 PM, Jeff Schroeder <
> jeffschroe...@computer.org> wrote:
>
>> Does this mean the work to deprecate the docker containerizer will be
>> post-1.0, or have those plans changed?
>>
>>
>> On Wednesday, May 25, 2016, Vinod Kone  wrote:
>>
>>> Hi folks,
>>>
>>> As discussed in the previous community sync, we plan to cut a release
>>> candidate for our next release (1.0) early next week.
>>>
>>> 1.0 is mainly centered around new APIs for Mesos. Please take a look at
>>> MESOS-338 for blocking issues. We got some great design and testing
>>> feedback for the v1 scheduler and executor APIs. Please do the same for
>>> the in-progress v1 operator API.
>>>
>>> Since this is a 1.0, we would like to do the release a little
>>> differently.
>>>
>>> First, the voting period for vetting the release candidate would be a
>>> few weeks (2-3 weeks) instead of the typical 3 days.
>>>
>>> Second, we are willing to make major changes (scalability fixes, API
>>> fixes) if there are any issues reported by the community.
>>>
>>> We are doing these because we really want the community to thoroughly
>>> test the 1.0 release and give feedback.
>>>
>>> Thanks,
>>>
>>
>>
>> --
>> Text by Jeff, typos by iPhone
>>
>
>


Debugging Scheduler HTTP API Failures

2016-08-14 Thread Zameer Manji
Hey,

I'm using the Mesos HTTP API for the first time. I am currently
encountering an issue where after a successful SUBSCRIBE call and receiving
a SUBSCRIBED and HEARTBEAT event, a subsequent TEARDOWN call fails with
HTTP 400 with a message of "The stream ID included in this request didn't
match the stream ID currently associated with framework ID".

Here is a detailed breakdown of what happens with logs:

A new framework sends a SUBSCRIBE call with the following body:


framework_id {
  value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
}
type: SUBSCRIBE
subscribe {
  framework_info {
    user: "user"
    name: "name"
    id {
      value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
    }
  }
}


It then receives a 200 OK response with the following headers:
`{content-type=[application/x-protobuf], date=[Sat, 13 Aug 2016 02:42:48
GMT], transfer-encoding=[chunked],
mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`

Over this connection it receives a successful subscribed event:

type: SUBSCRIBED
subscribed {
  framework_id {
    value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
  }
  heartbeat_interval_seconds: 15.0
}


It also receives a single heartbeat event.

Then it tries to send the following request:

Sending: framework_id {
  value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
}
type: TEARDOWN

with the following headers:
`{accept=[application/x-protobuf], accept-encoding=[gzip],
mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`

The response is a 400 with the body: `The stream ID included in this
request didn't match the stream ID currently associated with framework ID
'0dffbee9-a514-4ffa-87e1-2850dd4dcf00'`.


The master log contains:

I0813 02:42:48.376819 13934 http.cpp:381] HTTP POST for
/master/api/v1/scheduler from 192.168.33.1:60780 with
User-Agent='Google-HTTP-Java-Client/1.20.0 (gzip)'
I0813 02:42:48.376998 13934 master.cpp:2146] Received subscription request
for HTTP framework 'name'
I0813 02:42:48.377104 13934 master.cpp:2244] Subscribing framework 'name'
with checkpointing disabled and capabilities [  ]
I0813 02:42:48.377378 13934 hierarchical.cpp:271] Added framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00
I0813 02:42:49.475163 13929 http.cpp:381] HTTP POST for
/master/api/v1/scheduler from 192.168.33.1:60782 with
User-Agent='Google-HTTP-Java-Client/1.20.0 (gzip)'
I0813 02:42:51.133513 13930 master.cpp:1284] Framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name) disconnected
I0813 02:42:51.133597 13930 master.cpp:2725] Disconnecting framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
I0813 02:42:51.133618 13930 master.cpp:2749] Deactivating framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
I0813 02:42:51.133644 13930 master.cpp:1297] Giving framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name) 0ns to failover
I0813 02:42:51.133692 13932 hierarchical.cpp:382] Deactivated framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00
I0813 02:42:51.137265 13931 master.cpp:5561] Framework failover timeout,
removing framework 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
I0813 02:42:51.137339 13931 master.cpp:6296] Removing framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
I0813 02:42:51.137464 13931 hierarchical.cpp:333] Removed framework
0dffbee9-a514-4ffa-87e1-2850dd4dcf00

Note the immediate disconnection after the second POST is intentional.

This is with Mesos 1.0.0 on Ubuntu Trusty.

What can I do to debug this issue? The logs do not provide a lot of
information to act on. The stream id generated by Mesos is not in the logs,
nor is there anything indicating that an HTTP 400 was sent.
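
One thing that can help when reading a master log excerpt like the one above is to turn it into structured entries, so the two POSTs and the later disconnect can be lined up by timestamp. The sketch below assumes the line layout visible in the excerpt (level letter, MMDD, time, thread id, file:line, message) and re-joins lines that email wrapping split; the regular expression is inferred from these samples, not taken from any logging specification.

```python
import re

# Assumed layout, inferred from the log excerpt in this message:
# <level><MMDD> <HH:MM:SS.micros> <thread> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})"
    r"\s+(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_log(text):
    """Yield one dict per log entry, re-joining wrapped continuation lines."""
    entry = None
    for raw in text.splitlines():
        match = LOG_LINE.match(raw)
        if match:
            if entry is not None:
                yield entry
            entry = match.groupdict()
        elif entry is not None and raw.strip():
            # Email wrapping split the message; glue the remainder back on.
            entry["message"] += " " + raw.strip()
    if entry is not None:
        yield entry
```

Filtering the result on `entry["file"] == "http.cpp"` would list just the incoming HTTP requests with their timestamps.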

--
Zameer Manji


Re: Debugging Scheduler HTTP API Failures

2016-08-14 Thread Zameer Manji
Dario,

I do not think the case sensitivity matters here. If the master was
expecting a header that was exactly 'Mesos-Stream-Id' and did not see it, I
would expect to get the error response: `All non-subscribe calls should
include the 'Mesos-Stream-Id' header`. That is the error response that you
get when you do not set the header.

Possibly related, I expected to see the stream id in the Mesos logs. I see this
log message
<https://github.com/apache/mesos/blob/c9b70582e9fccab8f6863b0bd3a812b5969a8c24/src/master/master.cpp#L7473-L7474>
in the code, but I do not see it in the logs.
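
For what it's worth, HTTP header field names are case-insensitive, so a client reading the stream id back from a response should not depend on the casing it receives. A minimal sketch of such a lookup, with `get_header` being a hypothetical helper rather than part of any HTTP library:

```python
def get_header(headers, name):
    """Look up a header by name, ignoring case; return None if absent."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None
```

With this, `get_header(response_headers, "Mesos-Stream-Id")` finds the value even when the client library reports the key as `mesos-stream-id`, as in the header dump earlier in the thread.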


On Sun, Aug 14, 2016 at 6:12 PM, Dario Rexin  wrote:

> Oh, sorry, I didn't see you actually set the header (wall of text ;) ).
> That's an interesting issue, do you set the header case sensitive? I know
> headers shouldn't be case sensitive, but maybe there's a bug in the Mesos
> code. I have not seen this issue before.
>
> --
> Zameer Manji
>
>


Re: Debugging Scheduler HTTP API Failures

2016-08-14 Thread Zameer Manji
Dario,

The logs show that no disconnections occur until after the second POST
request. I would expect a log entry indicating a disconnect between the two
POST requests if the stream id changed.

On Sun, Aug 14, 2016 at 8:28 PM, Dario Rexin  wrote:

> You’re absolutely right. I just tried the exact same steps and it worked
> fine for me. I also don’t see the log message. Do you have any reconnection
> logic in place? Is it possible, that your framework reconnected before you
> send the call? The Stream Id would change in that case.
>
>
> On Aug 14, 2016, at 6:48 PM, Zameer Manji  wrote:
>
> Dario,
>
> I do not think the case sensitivity matters here. If the master was
> expecting a header that was exactly 'Mesos-Stream-Id' and did not see it, I
> would expect to get the error response: `All non-subscribe calls should
> include the 'Mesos-Stream-Id' header`. That is the error response that you
> get when you do not set the header.
>
> Possibly related, I expected to see the stream id in the mesos logs. I see 
> this
> log message
> <https://github.com/apache/mesos/blob/c9b70582e9fccab8f6863b0bd3a812b5969a8c24/src/master/master.cpp#L7473-L7474>
>  in
> the code, but I do not see it in the logs.
>
>
> On Sun, Aug 14, 2016 at 6:12 PM, Dario Rexin  wrote:
>
>> Oh, sorry, I didn't see you actually set the header (wall of text ;) ).
>> That's an interesting issue, do you set the header case sensitive? I know
>> headers shouldn't be case sensitive, but maybe there's a bug in the Mesos
>> code. I have not seen this issue before.
>>

Re: Debugging Scheduler HTTP API Failures

2016-08-14 Thread Zameer Manji
Here is a MWE: https://github.com/zmanji/mesos-mwe

Follow the instructions in the README to reproduce.

On Sun, Aug 14, 2016 at 9:04 PM, Dario Rexin  wrote:

> Can you post the code somewhere?
>
>
> On Aug 14, 2016, at 8:54 PM, Zameer Manji  wrote:
>
> Dario,
>
> The logs show that no disconnections occur until after the second POST
> request. I would expect a log entry indicating a disconnect between the two
> POST requests if the stream id changed.
>
> On Sun, Aug 14, 2016 at 8:28 PM, Dario Rexin  wrote:
>
>> You’re absolutely right. I just tried the exact same steps and it worked
>> fine for me. I also don’t see the log message. Do you have any reconnection
>> logic in place? Is it possible, that your framework reconnected before you
>> send the call? The Stream Id would change in that case.
>>
>>
>> On Aug 14, 2016, at 6:48 PM, Zameer Manji  wrote:
>>
>> Dario,
>>
>> I do not think the case sensitivity matters here. If the master was
>> expecting a header that was exactly 'Mesos-Stream-Id' and did not see it, I
>> would expect to get the error response: `All non-subscribe calls should
>> include the 'Mesos-Stream-Id' header`. That is the error response that you
>> get when you do not set the header.
>>
>> Possibly related, I expected to see the stream id in the mesos logs. I
>> see this log message
>> <https://github.com/apache/mesos/blob/c9b70582e9fccab8f6863b0bd3a812b5969a8c24/src/master/master.cpp#L7473-L7474>
>>  in
>> the code, but I do not see it in the logs.
>>
>>
>> On Sun, Aug 14, 2016 at 6:12 PM, Dario Rexin  wrote:
>>
>>> Oh, sorry, I didn't see you actually set the header (wall of text ;) ).
>>> That's an interesting issue, do you set the header case sensitive? I know
>>> headers shouldn't be case sensitive, but maybe there's a bug in the Mesos
>>> code. I have not seen this issue before.
>>>
>>> On Aug 14, 2016, at 5:58 PM, Zameer Manji  wrote:
>>>
>>> Hey,
>>>
>>> I'm using the Mesos HTTP API for the first time. I am currently
>>> encountering an issue where after a successful SUBSCRIBE call and receiving
>>> a SUBSCRIBED and HEARTBEAT event, a subsequent TEARDOWN call fails with
>>> HTTP 400 with a message of "The stream ID included in this request didn't
>>> match the stream ID currently associated with framework ID".
>>>
>>> Here is a detailed breakdown of what happens with logs:
>>>
>>> A new framework sends an SUBSCRIBE call with the following body:
>>>
>>> 
>>> framework_id {
>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>> }
>>> type: SUBSCRIBE
>>> subscribe {
>>>   framework_info {
>>> user: "user"
>>> name: "name"
>>> id {
>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>> }
>>>   }
>>> }
>>> 
>>>
>>> It then receives a 200 OK response with the following headers:
>>> `{content-type=[application/x-protobuf], date=[Sat, 13 Aug 2016
>>> 02:42:48 GMT], transfer-encoding=[chunked],
>>> mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`
>>>
>>> Over this connection it receives a successful subscribed event:
>>> 
>>> type: SUBSCRIBED
>>> subscribed {
>>>   framework_id {
>>>     value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>   }
>>>   heartbeat_interval_seconds: 15.0
>>> }
>>> 
>>>
>>> It also receives a single heartbeat event.
>>>
>>> Then it tries to send the following request:
>>> 
>>> Sending: framework_id {
>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>> }
>>> type: TEARDOWN
>>> 
>>> with the following headers:
>>> `{accept=[application/x-protobuf], accept-encoding=[gzip],
>>> mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`
>>>
>>> The response is a 400 with the body: `The stream ID included in this
>>> request didn't match the stream ID currently associated with framework ID
>>> '0dffbee9-a514-4ffa-87e1-2850dd4dcf00'`.
>>>
>>>
>>> The master logs contains:
>>> 
>>> I0813 02:42:48.376819 13934 http.cpp:381] HTTP POST for
>>> /master/api/v1/scheduler from 192.168.33.1:60780 with
>>> User-Agent='Google-HTTP-Java-

Re: Debugging Scheduler HTTP API Failures

2016-08-15 Thread Zameer Manji
Dario was right.

I filed MESOS-6041 <https://issues.apache.org/jira/browse/MESOS-6041> to
enhance the error message.
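
For readers hitting the same 400 response, here is a minimal sketch of the
fix Dario describes in the quoted message below. It is written in Python
rather than the Java client used in the thread, and the helper name
`first_header_value` and the sample dictionaries are illustrative
assumptions; only the header values are copied from the logs above. The idea
is that when an HTTP library exposes headers as a map from name to a list of
values, the first element should be forwarded, not the stringified list.

```python
def first_header_value(headers, name):
    """Return the first value of a header, matching the name case-insensitively.

    `headers` is assumed to map header names to lists of values, the way
    some HTTP client libraries expose them.
    """
    wanted = name.lower()
    for key, values in headers.items():
        if key.lower() == wanted:
            if isinstance(values, (list, tuple)):
                return values[0] if values else None
            return values  # already a plain string
    return None


# Response headers from the SUBSCRIBE call, in the shape the thread shows.
subscribe_headers = {
    "content-type": ["application/x-protobuf"],
    "transfer-encoding": ["chunked"],
    "mesos-stream-id": ["71a0294f-e9c4-4efe-b237-fb120836aaf8"],
}

stream_id = first_header_value(subscribe_headers, "Mesos-Stream-Id")

# Headers for the follow-up TEARDOWN call: a bare string, no brackets.
teardown_headers = {
    "Content-Type": "application/x-protobuf",
    "Accept": "application/x-protobuf",
    "Mesos-Stream-Id": stream_id,
}

# The bug in the thread is equivalent to sending the whole list as text.
buggy_value = str(subscribe_headers["mesos-stream-id"])
```

Comparing `stream_id` with `buggy_value` shows why the master rejected the
call: the bracketed form is a different string, so it cannot match the stream
ID the master has on record.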

On Sun, Aug 14, 2016 at 10:35 PM, Dario Rexin  wrote:

> Zameer,
>
> the header value is enclosed in []. This is because headers can have
> multiple values and the library you use puts them into a list. You have to
> take the first item from that list and then it should work.
>
> On Aug 14, 2016, at 10:19 PM, Zameer Manji  wrote:
>
> Here is a MWE: https://github.com/zmanji/mesos-mwe
>
> Follow the instructions in the README to reproduce.
>
> On Sun, Aug 14, 2016 at 9:04 PM, Dario Rexin  wrote:
>
>> Can you post the code somewhere?
>>
>>
>> On Aug 14, 2016, at 8:54 PM, Zameer Manji  wrote:
>>
>> Dario,
>>
>> The logs show that no disconnections occur until after the second POST
>> request. I would expect a log entry indicating a disconnect between the two
>> POST requests if the stream id changed.
>>
>> On Sun, Aug 14, 2016 at 8:28 PM, Dario Rexin  wrote:
>>
>>> You’re absolutely right. I just tried the exact same steps and it worked
>>> fine for me. I also don’t see the log message. Do you have any reconnection
>>> logic in place? Is it possible that your framework reconnected before you
>>> sent the call? The stream ID would change in that case.
>>>
>>>
>>> On Aug 14, 2016, at 6:48 PM, Zameer Manji  wrote:
>>>
>>> Dario,
>>>
>>> I do not think the case sensitivity matters here. If the master was
>>> expecting a header that was exactly 'Mesos-Stream-Id' and did not see it, I
>>> would expect to get the error response: `All non-subscribe calls should
>>> include the 'Mesos-Stream-Id' header`. That is the error response that you
>>> get when you do not set the header.
>>>
>>> Possibly related, I expected to see the stream id in the mesos logs. I
>>> see this log message
>>> <https://github.com/apache/mesos/blob/c9b70582e9fccab8f6863b0bd3a812b5969a8c24/src/master/master.cpp#L7473-L7474>
>>>  in
>>> the code, but I do not see it in the logs.
>>>
>>>
>>> On Sun, Aug 14, 2016 at 6:12 PM, Dario Rexin  wrote:
>>>
>>>> Oh, sorry, I didn't see you actually set the header (wall of text ;) ).
>>>> That's an interesting issue, do you set the header case sensitive? I know
>>>> headers shouldn't be case sensitive, but maybe there's a bug in the Mesos
>>>> code. I have not seen this issue before.
>>>>
>>>> On Aug 14, 2016, at 5:58 PM, Zameer Manji  wrote:
>>>>
>>>> Hey,
>>>>
>>>> I'm using the Mesos HTTP API for the first time. I am currently
>>>> encountering an issue where after a successful SUBSCRIBE call and receiving
>>>> a SUBSCRIBED and HEARTBEAT event, a subsequent TEARDOWN call fails with
>>>> HTTP 400 with a message of "The stream ID included in this request didn't
>>>> match the stream ID currently associated with framework ID".
>>>>
>>>> Here is a detailed breakdown of what happens with logs:
>>>>
>>>> A new framework sends a SUBSCRIBE call with the following body:
>>>>
>>>> 
>>>> framework_id {
>>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>> }
>>>> type: SUBSCRIBE
>>>> subscribe {
>>>>   framework_info {
>>>>     user: "user"
>>>>     name: "name"
>>>>     id {
>>>>       value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>>     }
>>>>   }
>>>> }
>>>> 
>>>>
>>>> It then receives a 200 OK response with the following headers:
>>>> `{content-type=[application/x-protobuf], date=[Sat, 13 Aug 2016
>>>> 02:42:48 GMT], transfer-encoding=[chunked],
>>>> mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`
>>>>
>>>> Over this connection it receives a successful subscribed event:
>>>> 
>>>> type: SUBSCRIBED
>>>> subscribed {
>>>>   framework_id {
>>>>     value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>>   }
>>>>   heartbeat_interval_seconds: 15.0
>>>> }
>>>> 
>>>>
>>>> It also receives a single heartbeat event.
>>>>
>>>> Then it tries to send the following request:
>>

Re: what is the status on this?

2016-09-06 Thread Zameer Manji
If we use the replicated log for leader election, how will frameworks
detect the leading master? Right now the scheduler driver uses the
MasterInfo in ZK to discover the leader and detect leadership changes.

On Mon, Sep 5, 2016 at 10:18 AM, Dario Rexin  wrote:

> If we go and change this, why not remove any dependencies on external
> systems and simply use the replicated log for leader election?
>
> On Sep 5, 2016, at 9:02 AM, Alex Rukletsov  wrote:
>
> Kant—
>
> thanks a lot for the feedback! Are you interested in helping out with the
> Consul module once Jay and Joseph are done with the modularizing patches?
>
> On Mon, Sep 5, 2016 at 8:50 AM, Jay JN Guo  wrote:
>
>> Patches are currently under review by @Joseph and can be found at the
>> links provided by @haosdent.
>>
>> I took a quick look at Consul key/value HTTP APIs and they look very
>> similar to Etcd APIs. You could actually reuse our Etcd module
>> implementation once we manage to push the module into Mesos community.
>>
>> The only technical problem I could see for now is that Consul does not
>> support `POST` with incremental key index. We may need to leverage
>> `?cas=` operation in Consul to emulate the behaviour of joining a
>> key group.
>>
>> We could have a discussion on how to implement Consul HA module.
>>
>> cheers,
>> /J
>>
>>
>> - Original message -
>> From: haosdent 
>> To: user 
>> Cc: Jay JN Guo/China/IBM@IBMCN
>> Subject: Re: what is the status on this?
>> Date: Sun, Sep 4, 2016 6:10 PM
>>
>> Jay has some patches to decouple Mesos from ZooKeeper:
>>
>> https://issues.apache.org/jira/browse/MESOS-5828
>> https://issues.apache.org/jira/browse/MESOS-5829
>>
>> I think it should be possible to support Consul through custom modules
>> once Jay's work is done.
>>
>> On Sun, Sep 4, 2016 at 6:02 PM, kant kodali  wrote:
>>
>> Hi Alex,
>>
>> We have some experienced devops people here and they all have one thing in
>> common: they find ZooKeeper a pain to maintain. In fact, we have refused to
>> bring in new tech stacks that require ZooKeeper, such as Kafka. So we are
>> desperately searching for an alternative, preferably one using Consul. I
>> hear a lot of positive responses when it comes to Consul. It would be great
>> to see Mesos and Consul working together, in which case we would be ready
>> to jump at it and make the switch from YARN to Mesos.
>>
>> Thanks,
>> Kant
>>
>>
>>
>>
>> On Wed, Aug 31, 2016 1:03 AM, Alex Rukletsov a...@mesosphere.com wrote:
>>
>> Kant—
>>
>> mind telling us what your use case is and why this ticket is important
>> for you? It will help us prioritize work.
>>
>> On Fri, Aug 26, 2016 at 2:46 AM, tommy xiao  wrote:
>>
>> Hi guys, I have been following this issue closely. The good news is that
>> etcd already has patches, so adding Consul should be easy; it just needs
>> some time for coding. If you are interested, let us collaborate on it.
>>
>> 2016-08-26 8:11 GMT+08:00 Joseph Wu :
>>
>> There is no timeline as no one has done any work on the issue.
>>
>>
>> On Thu, Aug 25, 2016 at 4:54 PM, kant kodali  wrote:
>>
>> Hi Guys,
>>
>> I see that this ticket and other related tickets were supposed to be part
>> of sprints in 2015, and it is still not resolved. Can we have a timeline on
>> this? That would be really helpful.
>>
>> https://issues.apache.org/jira/browse/MESOS-3797
>>
>> Thanks!
>>
>>
>>
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>>
>>
>>
>


Re: Marathon API and support for pods

2016-09-21 Thread Zameer Manji
Hey,

Have you considered sending this to your framework's mailing list? As a
Mesos user, I don't think framework specific documents like this need to be
shared with the entire community.


On Wed, Sep 21, 2016 at 2:59 AM, James DeFelice 
wrote:

> Hi folks,
>
> First of all the Marathon team would like to thank those who provided
> feedback on the v3 API proposal (linked below) that was circulated last
> month. Developing a new API for Marathon is a big undertaking and getting
> your feedback early in the process has been helpful.
>
> "A vision for pods in Marathon"
> [https://docs.google.com/document/d/1uPH58NWN_OuynptsqTOq8v5qlkivq2mUb2M9J5jZTMU/edit#heading=h.sqydeepp9s4m]
>
> We've done some additional discovery over the past several weeks and have made
> some changes to our API roadmap. It takes time to get an API right and the
> decisions we make now will have lasting impact for months/years to come;
> let's ensure that we spend enough time now thinking through the long-term
> implications of particular API design choices. We'd also like to facilitate
> a deeper discussion with the community about what problems v3 should solve
> before committing to API decisions. The current plan is to resume work on
> the v3 API later this fall. Stay tuned for additional announcements
> regarding v3 API proposals.
>
> Furthermore, the v3 API is about more than just pods. We, the team and
> community at large, should ensure that a new API is conceptually consistent
> across API types and satisfies not only our short-term goals, but is
> forward-compatible with our long term roadmap.
>
> That said, we have customer demand for pods now! In order to deliver pods
> functionality without committing to a v3 API the team has decided to
> introduce pods via a new /v2/pods API endpoint. What does this mean for v2
> API users?
>
> First, if your organization doesn't have an interest in pods then nothing
> forces you to change. Additional support for pods in the Marathon v2 API
> should not cause breakage for existing v2 API users.
>
> Second, by integrating with the existing v2 API and backend we'll be
> minimizing overall architectural changes. Needless to say there will be
> changes to backend components that had been previously optimized for single
> tasks vs. task groups. But the overall architecture of the system will not
> change. This is important in order to preserve the performance,
> scalability, and stability gains that Marathon has recently made.
>
> In addition, introducing pods in v2 allows the Marathon team to gather
> early feedback from the community and our customers about how the API does
> and does not meet their needs. This is very valuable input that will help
> to shape the future v3 API.
>
> Below is a link to a proposal for pods in the Marathon v2 API. This
> initial implementation for pods support should be viewed as an MVP that
> will be enhanced in coming releases. Your feedback is most welcome and
> strongly encouraged. Please comment directly in the document with any
> questions or concerns.
>
> "Marathon: Pods in v2"
> [https://docs.google.com/document/d/1Zno6QK2yGF4koB8BYT88EtB2-T7t3aAYRQ27pUD76gw/edit#heading=h.ywxj299mstr7]
>
> Many thanks,
>
> the Marathon team.
>


Re: Updating ExecutorInfo after framework failover or best practice

2016-09-29 Thread Zameer Manji
I would not consider it a "workaround" to make the executor URI stable
between failures. I think that's a requirement for a HA system. If you are
serving the resource from the scheduler itself then yes you need to set up
DNS or some sort of proxy that can direct the fetch request to the current
scheduler.

Alternatively you could put it in a well known location (ie HDFS or S3) and
pass that URI instead. The current scheduler can mutate that storage system
on startup if it is serving a new jar. If you do this you can also then
decouple the resource serving from the scheduler itself which I think is a
nice to have.
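
To make the point concrete, here is a small sketch of why a stable URI
matters. It uses plain Python dictionaries shaped like an ExecutorInfo (an
executor ID plus a command with URIs to fetch); the host names, port, command
line, and the `STABLE_URI` location are all hypothetical. It only models the
comparison, it does not talk to Mesos.

```python
def executor_info(executor_id, jar_uri):
    """Build a dict shaped like an ExecutorInfo with one URI to fetch."""
    return {
        "executor_id": {"value": executor_id},
        "command": {
            "value": "java -jar executor.jar",
            "uris": [{"value": jar_uri}],
        },
    }


def scheduler_hosted_uri(scheduler_host):
    # URI derived from whichever host the scheduler currently runs on.
    return "http://%s:8080/executor.jar" % scheduler_host


# Hypothetical stable location, e.g. behind DNS, a proxy, HDFS, or S3.
STABLE_URI = "http://executor-artifacts.example.com/executor.jar"

# Same executor ID, scheduler moved from host-a to host-b after a failover.
before = executor_info("my-executor", scheduler_hosted_uri("host-a"))
after = executor_info("my-executor", scheduler_hosted_uri("host-b"))

# Same executor ID, jar served from the stable location both times.
stable_before = executor_info("my-executor", STABLE_URI)
stable_after = executor_info("my-executor", STABLE_URI)

changed_across_failover = before != after
unchanged_with_stable_uri = stable_before == stable_after
```

With the scheduler-hosted URI the two descriptions differ for the same
executor ID, which is the situation that trips validation; with the stable
URI they compare equal across the failover.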


On Thu, Sep 29, 2016 at 10:25 AM, Vinod Kone  wrote:

> We cannot easily make ExecutorInfo mutable because there might be existing
> tasks with executors with the old ExecutorInfo. If there are two different
> ExecutorInfos for the same ExecutorID it gets confusing for Mesos (e.g.,
> SHUTDOWN executor id 'foo' kills which executor?).
>
> One possible solution is to not re-use ExecutorID, but that depends on
> what semantics you want for your executor.
>
> On Thu, Sep 29, 2016 at 3:01 AM, Kota UENISHI wrote:
>
>> Hi there,
>>
>> I'm going to implement scheduler failover into my framework, and hit
>> an issue - while I know it's how Mesos works for now:
>>
>> My framework lets Mesos agents fetch my custom executor jar file from
>> the scheduler process's HTTP endpoint. Suppose the framework process is
>> restarted by Marathon (or similar) on a different machine after a
>> failure: the URL of the HTTP endpoint serving the executor jar file
>> changes to that of the new scheduler process. This causes an
>> ExecutorInfo validation failure, like [1]. I think this is why Spark's
>> MesosClusterDispatcher is not ready for HA yet.
>>
>> As a (major?) workaround, [1] avoids this by assuming URL identity by
>> DNS or load balancer-ish stuff. Another short-sighted kludge
>> workaround would be relaxing the ExecutorInfo validation for the
>> failover case - which I believe solves many framework developers'
>> headache.
>>
>> Also, the best workaround in Mesos code would be just clearing
>> ExecutorInfo after the master detects scheduler failover. I think
>> ExecutorInfo must be 1:1 with FrameworkInfo, but it does not have to be
>> immutable. Under partition, it may diverge across masters, but an LWW
>> merge after the partition heals would be enough to keep it unique.
>>
>> Thoughts?
>>
>> [1] https://github.com/mesosphere/kubernetes-mesos/issues/15
>>
>> Kota UENISHI
>>
>
>


Mesos 1.1.0 release date

2016-10-03 Thread Zameer Manji
Hey,

Does anyone know when Mesos 1.1.0 will be released? I noticed that master
provides
<https://github.com/apache/mesos/tree/2e013890e47c30053b7b83cd205b432376589216/src/java/src/org/apache/mesos/v1/scheduler>
JNI bindings to the mesos V1 HTTP API and I would like to use them soon.

-- 
Zameer Manji


Re: Non-checkpointing frameworks

2016-10-15 Thread Zameer Manji
+1 to A and B

Aurora has enabled checkpointing for years and requires operators to enable
checkpointing on the slaves.
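
As a concrete illustration of the framework-level setting discussed in the
quoted background below, here is a minimal sketch of a SUBSCRIBE call body
for the v1 scheduler HTTP API with checkpointing requested in FrameworkInfo.
The function name `subscribe_call` and the user and name values are
placeholders; the body is only built and serialized here, not sent anywhere.

```python
import json


def subscribe_call(user, name, checkpoint=True):
    """Build a JSON-style SUBSCRIBE call with FrameworkInfo.checkpoint set."""
    return {
        "type": "SUBSCRIBE",
        "subscribe": {
            "framework_info": {
                "user": user,
                "name": name,
                # Ask agents to checkpoint this framework's executors and
                # status updates so tasks can survive an agent restart.
                "checkpoint": checkpoint,
            }
        },
    }


body = json.dumps(subscribe_call("user", "name"), sort_keys=True)
```

Leaving `checkpoint` out (or passing `False`) corresponds to the
non-checkpointing behavior the thread is asking about.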

On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere 
wrote:

> I'm in favor of A & B. I find it provides a better "first experience" to
> users.
> From my experience you usually have to have an explicit reason to not want
> to checkpoint. Most people assume the semantics provided by the checkpoint
> behavior is default and it can be a frustrating experience for them to find
> out that is not the case.
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway 
> wrote:
>
>> Hi folks,
>>
>> I'd like input from individuals who currently use frameworks but do
>> not enable checkpointing.
>>
>> Background: "checkpointing" is a parameter that can be enabled in
>> FrameworkInfo; if enabled, the agent will write the framework pid,
>> executor PIDs, and status updates to disk for any tasks started by
>> that framework. This checkpointed information means that these tasks
>> can survive an agent crash: if the agent exits (whether due to
>> crashing or as part of an upgrade procedure), a restarted agent can
>> use this information to reconnect to executors started by the previous
>> instance of the agent. The downside is that checkpointing requires
>> some additional disk I/O at the agent.
>>
>> Checkpointing is not currently the default, but in my experience it is
>> often enabled for production frameworks. As part of the work on
>> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> considering:
>>
>> (a) requiring that partition-aware frameworks must also enable
>> checkpointing, and/or
>> (b) enabling checkpointing by default
>>
>> If you have intentionally decided to disable checkpointing for your
>> Mesos framework, I'd be curious to hear more about your use-case and
>> why you haven't enabled it.
>>
>> Thanks!
>>
>> Neil
>>
>> --
>> Zameer Manji
>>
>


Re: Non-checkpointing frameworks

2016-10-17 Thread Zameer Manji
Qian,

Turns out the --checkpoint flag was made default and removed in Mesos 0.22.

On Sun, Oct 16, 2016 at 4:38 PM, Qian Zhang  wrote:

> and requires operators to enable checkpointing on the slaves.
>
>
> Just curious why operators need to enable checkpointing on the slaves (I
> do not see an agent flag for that). I think checkpointing should be enabled
> at the framework level rather than on the slave.
>
>
> Thanks,
> Qian Zhang
>
> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji  wrote:
>
>> +1 to A and B
>>
>> Aurora has enabled checkpointing for years and requires operators to
>> enable checkpointing on the slaves.
>>
>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere <
>> jo...@mesosphere.io>
>> wrote:
>>
>> > I'm in favor of A & B. I find it provides a better "first experience" to
>> > users.
>> > From my experience you usually have to have an explicit reason to not
>> > want to checkpoint. Most people assume the semantics provided by the
>> > checkpoint behavior is default and it can be a frustrating experience
>> > for them to find out that is not the case.
>> >
>> > —
>> > *Joris Van Remoortere*
>>
>> > Mesosphere
>> >
>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway 
>> > wrote:
>> >
>> >> Hi folks,
>> >>
>> >> I'd like input from individuals who currently use frameworks but do
>> >> not enable checkpointing.
>> >>
>> >> Background: "checkpointing" is a parameter that can be enabled in
>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>> >> executor PIDs, and status updates to disk for any tasks started by
>> >> that framework. This checkpointed information means that these tasks
>> >> can survive an agent crash: if the agent exits (whether due to
>> >> crashing or as part of an upgrade procedure), a restarted agent can
>> >> use this information to reconnect to executors started by the previous
>> >> instance of the agent. The downside is that checkpointing requires
>> >> some additional disk I/O at the agent.
>> >>
>> >> Checkpointing is not currently the default, but in my experience it is
>> >> often enabled for production frameworks. As part of the work on
>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> >> considering:
>> >>
>> >> (a) requiring that partition-aware frameworks must also enable
>> >> checkpointing, and/or
>> >> (b) enabling checkpointing by default
>> >>
>> >> If you have intentionally decided to disable checkpointing for your
>> >> Mesos framework, I'd be curious to hear more about your use-case and
>> >> why you haven't enabled it.
>> >>
>> >> Thanks!
>> >>
>> >> Neil
>> >>
>> >> --
>> >> Zameer Manji
>> >>
>> >
>>
>> --
>> Zameer Manji
>>
>


Re: [VOTE] Release Apache Mesos 1.1.0 (rc1)

2016-10-24 Thread Zameer Manji
 - **Experimental** A new default executor is introduced
>> which frameworks can use to launch task groups as nested containers. All
>> the nested containers share resources like cpu, memory, network and
>> volumes.
>>
>>   * [MESOS-6014] - **Experimental** A new port-mapper CNI plugin, the
>> `mesos-cni-port-mapper` has been introduced. For Mesos containers, with
>> the CNI port-mapper plugin, users can now expose container ports through
>> host ports using DNAT. This is especially useful when Mesos containers
>> are attached to isolated CNI networks such as private bridge networks,
>> and the services running in the container need to be exposed outside
>> these isolated networks.
>>
>>
>> The CHANGELOG for the release is available at:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0-rc1
>> 
>> 
>>
>> The candidate for Mesos 1.1.0 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz
>>
>> The tag to be voted on is 1.1.0-rc1:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.0-rc1
>>
>> The MD5 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc1/mesos-1.1.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1158
>>
>> Please vote on releasing this package as Apache Mesos 1.1.0!
>>
>> The vote is open until Fri Oct 21 21:57:02 CEST 2016 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Alex & Till
>>
>>
>
>
> --
> David Robinson
> SRE - Mesos
> @daverobinson
>
> --
> Zameer Manji
>


Re: Mesos V1 Operator HTTP API - Java Proto Classes

2016-11-16 Thread Zameer Manji
I think this is a bug, I feel the jar should include all v1 protobuf files.

Vijay, I encourage you to file a ticket.

On Tue, Nov 15, 2016 at 8:04 PM, Vijay Srinivasaraghavan <
vijikar...@yahoo.com.invalid> wrote:

> I believe the HTTP API will use the same underlying message format (proto
> def) and hence the request/response value objects (Java) need to be
> auto-generated from the proto files to be used in a Jersey-based Java
> REST client?
>
> On Tuesday, November 15, 2016 12:37 PM, Tomek Janiszewski <
> jani...@gmail.com> wrote:
>
>
> I suspect the jar is deprecated and includes only the old API used by
> mesoslib. The goal is to create an HTTP API and stop supporting native libs
> (jars, .so files, etc.). I think you shouldn't use that jar in your project.
>
> wt., 15.11.2016, 20:38 użytkownik Vijay Srinivasaraghavan <
> vijikar...@yahoo.com> napisał:
>
> > Hello,
> >
> > I am writing a REST client for the "operator APIs" and found that some of
> > the protobuf Java classes (like "include/mesos/v1/quota/quota.proto" and
> > "include/mesos/v1/master/master.proto") are not included in the Mesos jar
> > file. While investigating, I found that the "Make" file does not
> > include these proto definition files.
> >
> > I have updated the Make file and added the protos that I am interested in
> > and built a new jar file. Is there any reason why these proto definitions
> > are not included in the original build, apart from the APIs still
> > evolving?
> >
> > Regards
> > Vijay
> >
>
> --
> Zameer Manji
>


Re: Mesos on AWS

2016-12-16 Thread Zameer Manji
Hey,

Could you elaborate on what you mean by "delays and health check problems"?
Are you using your own framework or an existing one? How are you launching
the tasks?

Could you share logs from Mesos that show timeouts to ZK?

For reference, I operate a large Mesos cluster and I have never encountered
problems when running 1k tasks concurrently so I think sharing data would
help everyone debug this problem.

On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov
wrote:

> Hi,
>
> Has anybody tried running Mesos on AWS instances? Can you give me any
> recommendations?
>
> I am developing an elastic Mesos cluster (scaling AWS instances on demand).
> Currently I have 3 master instances. I run about 1000 tasks simultaneously.
> I see delays and health check problems.
>
> ~400 tasks fit in one m4.10xlarge instance (160 GB RAM, 40 CPUs).
>
> At the moment I have increased the timeout in the ZooKeeper cluster. What
> can I do to decrease timeouts?
>
> Also, how can I increase performance? The main bottleneck is that I have
> a large number of tasks running simultaneously for an hour, after which I
> shut them down or restart them (depending on how well they perform).
>
> -Kiril
>
> --
> Zameer Manji
>


Understanding Mesos Maintenance

2017-03-01 Thread Zameer Manji
Hey,

I'm trying to understand some nuances of the maintenance API. Here are my
questions:

1. The documentation mentions that accepting or declining an inverse offer
is a "hint" to the operator. How do operators view if a framework has
declined, accepted or ignored an inverse offer?

2. Should a framework accept an inverse offer and then start removing tasks
from an agent or should the framework only accept the inverse offer after
the removal of tasks is complete? I think the former makes sense, but it
implies that operators need to poll the state of the agent to ensure there
are no active tasks whereas the latter implies operators only need to check
if all inverse offers were accepted.

3. After accepting the inverse offer, will a framework get another inverse
offer for the same agent? Currently I'm trying to determine if inverse
offer information needs to be persisted so a framework can continue its
draining work between failovers or if it can just wait for an inverse offer
after starting up.

4. Is it possible for the agent to automatically transition from DRAIN to
DOWN if at the start of the unavailability period the agent is free of
tasks or is that still the operator's responsibility?

-- 
Zameer Manji


Re: isolation and task binary distributives

2017-03-03 Thread Zameer Manji
Another approach would be to create a bind mount from the host to each
container (say `/var/cache/data`). The first executor can copy data there
and the subsequent executors can check for its presence and re-use that
data.
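
A minimal sketch of that copy-once, reuse-afterwards logic follows. It
assumes the bind-mounted directory is writable by every executor; the
function name `ensure_cached`, the `populate` callback, and the demo file
names are hypothetical. The data is staged in a temporary directory and
renamed into place so that a concurrent executor does not see a half-copied
distribution.

```python
import os
import shutil
import tempfile


def ensure_cached(cache_dir, name, populate):
    """Return the path of `name` under `cache_dir`, populating it once.

    `populate(path)` is a caller-supplied function that writes the data
    into the directory it is given.
    """
    target = os.path.join(cache_dir, name)
    if os.path.isdir(target):
        return target  # an earlier executor already copied the data

    staging = tempfile.mkdtemp(prefix=name + ".tmp-", dir=cache_dir)
    try:
        populate(staging)
        try:
            os.rename(staging, target)
        except OSError:
            # Another executor may have won the race; that is fine as
            # long as the target now exists.
            if not os.path.isdir(target):
                raise
    finally:
        # No-op after a successful rename; removes leftovers otherwise.
        shutil.rmtree(staging, ignore_errors=True)
    return target


# Demo with a temporary directory standing in for /var/cache/data.
calls = []


def fake_populate(path):
    calls.append(path)
    with open(os.path.join(path, "tool.bin"), "w") as handle:
        handle.write("payload")


cache_root = tempfile.mkdtemp()
first = ensure_cached(cache_root, "dist-1.0", fake_populate)
second = ensure_cached(cache_root, "dist-1.0", fake_populate)
```

In the demo the second call returns the same path without invoking
`fake_populate` again, which is the behavior a later executor would rely on.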

On Fri, Mar 3, 2017 at 8:32 AM, tommy xiao  wrote:

> create a persistent disk to store the distribution and run executor
> isolation.
>
> 2017-03-03 21:40 GMT+08:00 Egor Ryashin :
>
>> Hi All,
>>
>> I'm writing a custom scheduler which will be sending short-running tasks.
>> I need those tasks to be properly isolated, which means I shouldn't run
>> them in the same executor. Those tasks require large binary distributions,
>> and I suppose each run in a separate executor will spawn large sandboxes
>> with copies of those distributions on disk. Is there an easy way to
>> maintain isolation for those tasks while sharing a distribution between
>> them?
>>
>> Thanks,
>> Egor
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
> --
> Zameer Manji
>


Re: Understanding Mesos Maintenance

2017-03-03 Thread Zameer Manji
Ben,

Thanks for responding to my questions. I have a follow up on #3.

I have a framework which accepts inverse offers but does not do anything to
the associated tasks. I noticed that the framework **does not** receive
another inverse offer within the allocation period. At what interval will
an inverse offer be resent to the framework if it was accepted? I took a
glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
tests testing for this.

Are you sure that inverse offers are resent after they have been accepted
but before the tasks are removed from the host?


On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler  wrote:

> Hey Zameer, great questions. Let us know if there's anything you think
> could be improved or documented better.
>
> Re 1:
>
> The 'Viewing maintenance status' section of the documentation should
> clarify this:
> http://mesos.apache.org/documentation/latest/maintenance/
>
> Re 2:
>
> Both of these sound reasonable but the scheduler should not accept the
> maintenance if it's not yet safe for the machine to be downed. Otherwise a
> task failure may be mistakenly interpreted as a go ahead to down the
> machine, despite the scheduler needing to get the task back running. If
> expensive or long running work needs to finish (e.g. migrate data, replace
> instances in a manner that doesn't violate SLA, etc.) then I would suggest
> waiting until the work completes safely before accepting.
>
> We likely need a third state like TENTATIVELY_ACCEPT to signal to
> operators / mesos that the framework intends to comply, but hasn't finished
> whatever it needs to do yet for it to be safe to down the machine.
>
> Also, one of the challenges here is when to take the action. Should the
> scheduler prepare itself for maintenance as soon as it safely can? Or as
> late (but not too late!) as it safely can? If the scheduler runs
> long-running services, as soon as safely possible makes sense. If the
> scheduler runs short running batch jobs, as late as safely possible
> provides work-conservation.
>
> Re 3:
>
> The framework will receive another inverse offer if the framework still
> has resources allocated on that agent. If receiving a regular offer for
> available resources on the agent, an 'Unavailability' [1] will be included
> if the machine is scheduled for maintenance, so that the scheduler can be
> aware of the maintenance when placing new work.
>
> Re 4:
>
> It's not possible currently, and it's the operator's responsibility (the
> intention was for "operator" to be maintenance tooling). Ideally we can add
> automation of this decision into mesos, if decision criteria that are widely
> applicable can be established (e.g. if nothing is running and all relevant
> frameworks have accepted). Feel free to file a ticket for this or any other
> improvements!
>
> Ben
>
> [1] https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
>
> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji  wrote:
>
>> Hey,
>>
>> I'm trying to understand some nuances of the maintenance API. Here are my
>> questions:
>>
>> 1. The documentation mentions that accepting or declining an inverse
>> offer is a "hint" to the operator. How do operators view if a framework has
>> declined, accepted or ignored an inverse offer?
>>
>> 2. Should a framework accept an inverse offer and then start removing
>> tasks from an agent or should the framework only accept the inverse offer
>> after the removal of tasks is complete? I think the former makes sense, but
>> it implies that operators need to poll the state of the agent to ensure
>> there are no active tasks whereas the latter implies operators only need to
>> check if all inverse offers were accepted.
>>
>> 3. After accepting the inverse offer, will a framework get another
>> inverse offer for the same agent? Currently I'm trying to determine if
>> inverse offer information needs to be persisted so a framework can continue
>> its draining work between failovers or if it can just wait for an inverse
>> offer after starting up.
>>
>> 4. Is it possible for the agent to automatically transition from DRAIN to
>> DOWN if at the start of the unavailability period the agent is free of
>> tasks or is that still the operator's responsibility?
>>
>> --
>> Zameer Manji
>>
>> --
>> Zameer Manji
>>
>


Re: Understanding Mesos Maintenance

2017-03-03 Thread Zameer Manji
Thanks for clearing that up.

I was accidentally setting a long refuse time.
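
For anyone who runs into the same thing, here is a minimal sketch of where
the refuse time sits in an ACCEPT_INVERSE_OFFERS call, written as the JSON
form of the v1 scheduler API. The function name, the framework ID, and the
inverse offer ID are placeholders, and the exact field layout should be
checked against the scheduler protobuf definitions for your Mesos version;
the 5 second default is the one Joseph mentions in the quoted reply below.

```python
import json

# Default filter timeout mentioned in the thread.
DEFAULT_REFUSE_SECONDS = 5.0


def accept_inverse_offers(framework_id, inverse_offer_ids,
                          refuse_seconds=DEFAULT_REFUSE_SECONDS):
    """Build a JSON-style ACCEPT_INVERSE_OFFERS call with an explicit filter."""
    return {
        "framework_id": {"value": framework_id},
        "type": "ACCEPT_INVERSE_OFFERS",
        "accept_inverse_offers": {
            "inverse_offer_ids": [{"value": i} for i in inverse_offer_ids],
            # A large value here delays the next inverse offer for the
            # same resources, which is what happened in this thread.
            "filters": {"refuse_seconds": refuse_seconds},
        },
    }


call = accept_inverse_offers("example-framework-id", ["example-inverse-offer-id"])
payload = json.dumps(call, sort_keys=True)
```

Keeping the value small, or leaving the filter at its default, means the
framework keeps hearing about the pending maintenance while it still holds
resources on the agent.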

On Fri, Mar 3, 2017 at 6:08 PM, Joseph Wu  wrote:

> Inverse offers have the same offer cycle as normal offers.  They can
> be Accepted/Declined with a timeout (default 5 seconds).
>
> On Fri, Mar 3, 2017 at 5:29 PM, Zameer Manji  wrote:
> > Ben,
> >
> > Thanks for responding to my questions. I have a follow up on #3.
> >
> > I have a framework which accepts inverse offers but does not do anything
> > to the associated tasks. I noticed that the framework **does not** receive
> > another inverse offer within the allocation period. At what interval
> > will an inverse offer be resent to the framework if it was accepted? I
> > took a glance at `src/tests/master_maintenance_tests.cpp` and did not
> > notice any tests testing for this.
> >
> > Are you sure that inverse offers are resent after they have been accepted
> > but before the tasks are removed from the host?
> >
> >
> > On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler wrote:
> >>
> >> Hey Zameer, great questions. Let us know if there's anything you think
> >> could be improved or documented better.
> >>
> >> Re 1:
> >>
> >> The 'Viewing maintenance status' section of the documentation should
> >> clarify this:
> >> http://mesos.apache.org/documentation/latest/maintenance/
> >>
> >> Re 2:
> >>
> >> Both of these sound reasonable but the scheduler should not accept the
> >> maintenance if it's not yet safe for the machine to be downed. Otherwise a
> >> task failure may be mistakenly interpreted as a go ahead to down the
> >> machine, despite the scheduler needing to get the task back running. If
> >> expensive or long running work needs to finish (e.g. migrate data, replace
> >> instances in a manner that doesn't violate SLA, etc.) then I would suggest
> >> waiting until the work completes safely before accepting.
> >>
> >> We likely need a third state like, TENTATIVELY_ACCEPT to signal to
> >> operators / mesos that the framework intends to comply, but hasn't finished
> >> whatever it needs to do yet for it to be safe to down the machine.
> >>
> >> Also, one of the challenges here is when to take the action. Should the
> >> scheduler prepare itself for maintenance as soon as it safely can? Or as
> >> late (but not too late!) as it safely can? If the scheduler runs
> >> long-running services, as soon as safely possible makes sense. If the
> >> scheduler runs short running batch jobs, as late as safely possible provides
> >> work-conservation.
> >>
> >> Re 3:
> >>
> >> The framework will receive another inverse offer if the framework still
> >> has resources allocated on that agent. If receiving a regular offer for
> >> available resources on the agent, an 'Unavailability' [1] will be included
> >> if the machine is scheduled for maintenance, so that the scheduler can be
> >> aware of the maintenance when placing new work.
> >>
> >> Re 4:
> >>
> >> It's not possible currently, and it's the operator's responsibility (the
> >> intention was for "operator" to be maintenance tooling). Ideally we can add
> >> automation of this decision into mesos, if decision criteria that are widely
> >> applicable can be established (e.g. if nothing is running and all relevant
> >> frameworks have accepted). Feel free to file a ticket for this or any other
> >> improvements!
> >>
> >> Ben
> >>
> >> [1]
> >> https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
> >>
> >> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji  wrote:
> >>>
> >>> Hey,
> >>>
> >>> I'm trying to understand some nuances of the maintenance API. Here are my
> >>> questions:
> >>>
> >>> 1. The documentation mentions that accepting or declining an inverse
> >>> offer is a "hint" to the operator. How do operators view if a framework has
> >>> declined, accepted or ignored an inverse offer?
> >>>
> >>> 2. Should a framework accept an inverse offer and then start removing
> >>> tasks from an agent or should the framework only accept the inverse
> offer

Re: Updating FrameworkInfo settings

2015-02-24 Thread Zameer Manji
I would like to point out that using a new FrameworkID is not a solution to
this problem. This means that a cluster operator has to drain the entire
cluster to enable checkpointing, or lose all previous tasks. Both scenarios
are not desirable.

Fortunately it is possible to do this without changing the FrameworkID. I
have cced Steve from TellApart who has enabled checkpointing without
changing the FrameworkID on a production cluster. I hope he can share his
process here.

On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen  wrote:

> Mesos checkpoints the FrameworkInfo into disk, and recovers it on relaunch.
>
> I don't think we expose any API to remove the framework manually though if
> you really want to keep the FrameworkID. If you hit the failover timeout
> the framework will get removed from the master and slave.
>
> I think for now the best way is just use a new FrameworkID when you want
> to change the FrameworkInfo.
>
> Tim
>
>
>
> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr  wrote:
>
>> Hey folks,
>>
>> Is there a best practice for rolling out FrameworkInfo changes? We need
>> to set checkpoint to true, so I redeployed our framework with the new
>> settings (with tasks still running), but when I hit a slave's stats.json
>> endpoint, it appears that the old FrameworkInfo data is still there (which
>> makes sense since there's active executors running). I then tried draining
>> the tasks and completely restarting a Mesos slave, but still no luck.
>>
>> Is there anything additional / special I need to do here? Is some part of
>> Mesos caching FrameworkInfo based on the framework ID?
>>
>> Another wrinkle with our setup is we have a rather large failover_timeout
>> set for the framework -- maybe that's affecting things too?
>>
>> Thanks,
>> Tom
>>
>
>


-- 
Zameer Manji


Re: Updating FrameworkInfo settings

2015-02-24 Thread Zameer Manji
For anyone who is going to read this information in the future, this works
because the information in the replicated log can be recovered by the
master. In future releases of Mesos the master might store information
which cannot be recovered so please take extra care if you are going to do
this.
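
The "ensure the framework is now registered with checkpoint = true" step in the procedure quoted below could be checked with something like the following Python sketch. It assumes you have already fetched and parsed the master's state JSON, and that each framework entry carries `id` and `checkpoint` keys; treat those key names and the sample payload as assumptions to verify against your Mesos version.

```python
import json


def framework_checkpoint(state, framework_id):
    """Return the checkpoint flag for `framework_id`, or None if absent.

    `state` is the parsed JSON from the master's state endpoint. The
    "frameworks", "id" and "checkpoint" keys are assumptions about that
    payload, not something confirmed in this thread.
    """
    for framework in state.get("frameworks", []):
        if framework.get("id") == framework_id:
            return framework.get("checkpoint")
    return None


# Hypothetical sample payload, trimmed to the fields used above.
sample = json.loads("""
{"frameworks": [{"id": "fw-0001", "name": "example", "checkpoint": true}]}
""")
print(framework_checkpoint(sample, "fw-0001"))
print(framework_checkpoint(sample, "fw-9999"))
```

Running the same check before and after each rolling restart makes it easier to see at which step the new setting takes effect.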

On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz  wrote:

> Definitely don't change the frameworkID, we did that once and it was a
> disaster, for reasons described already.
>
> Here's what we did to force it on (as I can recall)
> - Change the startup flags for all masters to use the in memory DB instead
> of the replicated log (--registry=in_memory)
> - Restart all masters (not all at once, let them fail over)
> - Delete the replicated log on all masters
> - Ensure the framework is now registered with checkpoint = true (the
> slaves won't be yet however)
> - Remove the --registry flag from the masters and do a rolling restart
> again
> - Do another rolling restart of the masters
> *- At this point the framework will be persisted as checkpoint = true*
> - Now, restart your slaves.  Restarting them should cause them to pick up
> the new framework.  I'm not 100% sure if I deleted their state or not when
> I did this part, if it doesn't seem to take, try deleting their slave info
> on each one.
>
> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji 
> wrote:
>
>> I would like to point out that using a new FrameworkID is not a solution
>> to this problem. This means that a cluster operator has to drain the entire
>> cluster to enable checkpointing, or lose all previous tasks. Both scenarios
>> are not desirable.
>>
>> Fortunately it is possible to do this without changing the FrameworkID. I
>> have cced Steve from TellApart who has enabled checkpointing without
>> changing the FrameworkID on a production cluster. I hope he can share his
>> process here.
>>
>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen  wrote:
>>
>>> Mesos checkpoints the FrameworkInfo into disk, and recovers it on
>>> relaunch.
>>>
>>> I don't think we expose any API to remove the framework manually though
>>> if you really want to keep the FrameworkID. If you hit the failover timeout
>>> the framework will get removed from the master and slave.
>>>
>>> I think for now the best way is just use a new FrameworkID when you want
>>> to change the FrameworkInfo.
>>>
>>> Tim
>>>
>>>
>>>
>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr  wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Is there a best practice for rolling out FrameworkInfo changes? We need
>>>> to set checkpoint to true, so I redeployed our framework with the new
>>>> settings (with tasks still running), but when I hit a slave's
>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>> still there (which makes sense since there's active executors running). I
>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>> still no luck.
>>>>
>>>> Is there anything additional / special I need to do here? Is some part
>>>> of Mesos caching FrameworkInfo based on the framework ID?
>>>>
>>>> Another wrinkle with our setup is we have a rather large
>>>> failover_timeout set for the framework -- maybe that's affecting
>>>> things too?
>>>>
>>>> Thanks,
>>>> Tom
>>>>
>>>
>>>
>>
>>
>> --
>> Zameer Manji
>>
>
>


-- 
Zameer Manji


Re: Mesos Slave Attributes

2015-03-12 Thread Zameer Manji
The mesos master does pass attributes on the slave to the scheduler
framework. They are available on the Offer struct
<https://github.com/apache/mesos/blob/acaee563a66e5528ae5c5e417f2a811f8ee466b2/include/mesos/mesos.proto#L618>
.

If you want to check the slave attributes you can check the state.json on
the slave by looking at this endpoint: http://
:/slave(1)/state.json
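
As a rough illustration, the Python sketch below pulls the attributes out of that state.json document. The `host` and `port` arguments are placeholders for your own slave, the top-level `attributes` key is an assumption about the payload, and the sample data is made up.

```python
import json
from urllib.request import urlopen


def slave_attributes(state):
    """Extract the attributes mapping from a parsed slave state document."""
    return dict(state.get("attributes", {}))


def fetch_slave_state(host, port):
    """Fetch and parse state.json from a slave (host and port are placeholders)."""
    url = "http://{}:{}/slave(1)/state.json".format(host, port)
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline example with a hypothetical, trimmed payload.
sample = {"id": "slave-1", "attributes": {"rack": "r1", "dedicated": "db"}}
print(slave_attributes(sample))
```

Against a live slave you would call `slave_attributes(fetch_slave_state(host, port))` with your own host and port.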

On Thu, Mar 12, 2015 at 11:00 AM, James Vanns  wrote:

> I think I have a simple question here -- that doesn't seem too apparent in
> any documentation I've come across; does the mesos master pass-through
> slave attributes to the scheduler framework like it does with resource
> offers?
>
> Where can I see slave attributes -- the UI doesn't appear to display them.
> Is there a REST endpoint I should be querying?
>
> Cheers,
>
> Jim
>
> --
> Zameer Manji
>
>


Official RPMs

2015-09-18 Thread Zameer Manji
Hey,

Does the Apache Mesos project provide OS packages for installation? I
haven't been able to find any for the 0.24 release and I think having them
would make installing Mesos a lot easier.

-- 
Zameer Manji


Re: Official RPMs

2015-09-22 Thread Zameer Manji
/radgruchalski/
>>> >
>>> > Confidentiality:
>>> > This communication is intended for the above-named person and may be
>>> > confidential and/or legally privileged.
>>> > If it has come to you in error you must take no action based on it,
>>> nor must
>>> > you copy or show it to anyone; please delete/destroy and inform the
>>> sender
>>> > immediately.
>>> >
>>> > On Saturday, 19 September 2015 at 04:09, craig w wrote:
>>> >
>>> > Mesosphere provides packages, you can find more information here:
>>> > https://mesosphere.com/downloads/
>>> >
>>> > As of right now, they don't seem to have a 0.24.0 package.
>>> >
>>> > On Fri, Sep 18, 2015 at 8:51 PM, Brian Hicks 
>>> wrote:
>>> >
>>> > We've got some experimental packages at bintray.com/asteris/mantl-rpm,
>>> > source is at github.com/asteris-llc/mesos-packaging. They can really
>>> use
>>> > some testing if you wanted to give them a try. Configuration is a bit
>>> > different than the Mesosphere packages, see the repo for details.
>>> >
>>> > On Sep 18, 2015 7:01 PM, "Zameer Manji"  wrote:
>>> >
>>> > Hey,
>>> >
>>> > Does the Apache Mesos project provide OS packages for installation? I
>>> > haven't been able to find any for the 0.24 release and I think having
>>> them
>>> > would make installing Mesos a lot easier.
>>> >
>>> > --
>>> > Zameer Manji
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > https://github.com/mindscratch
>>> > https://www.google.com/+CraigWickesser
>>> > https://twitter.com/mind_scratch
>>> > https://twitter.com/craig_links
>>> >
>>> >
>>>
>>> --
>>> Zameer Manji
>>>
>>>


Re: Official RPMs

2015-09-22 Thread Zameer Manji
e gentoo community will no doubt win
>>
>> ;-)
>>
>> James
>>
>>
>>
>>
>>
>>> On Sat, Sep 19, 2015 at 3:39 AM, Carlos Sanchez >> <mailto:car...@apache.org>> wrote:
>>>
>>> I'm using the same repo with some changes to build SSL enabled
>>> packages
>>>
>>>
>>> https://github.com/carlossg/mesos-deb-packaging/compare/master...carlossg:ssl
>>>
>>>
>>> On Sat, Sep 19, 2015 at 4:22 AM, Rad Gruchalski
>>> mailto:ra...@gruchalski.com>> wrote:
>>>  > Should be rather easy to package it with this little tool from
>>> Mesosphere:
>>>  > https://github.com/mesosphere/mesos-deb-packaging. I’ve done it
>>> myself for
>>>  > ubuntu 12.04 and 14.04.
>>>  > The only thing that needs to be changed are the dependencies, for
>>> ubuntu
>>>  > this was:
>>>  >
>>>  > diff --git a/build_mesos b/build_mesos
>>>  > index 81561bc..f756ef0 100755
>>>  > --- a/build_mesos
>>>  > +++ b/build_mesos
>>>  > @@ -313,9 +313,10 @@ function deb_ {
>>>  > --deb-recommends zookeeperd
>>>  > --deb-recommends zookeeper-bin
>>>  > -d 'java-runtime-headless'
>>>  > -   -d libcurl3
>>>  > -   -d libsvn1
>>>  > -   -d libsasl2-modules
>>>  > +   -d libcurl4-nss-dev
>>>  > +   -d libsasl2-dev
>>>  > +   -d libapr1-dev
>>>  > +   -d libsvn-dev
>>>  >
>>>  > It does look like the tool can build RPMs.
>>>  >
>>>  > Kind regards,
>>>  > Radek Gruchalski
>>>  > ra...@gruchalski.com <mailto:ra...@gruchalski.com>
>>>  > de.linkedin.com/in/radgruchalski/
>>> <http://de.linkedin.com/in/radgruchalski/>
>>>  >
>>>  > Confidentiality:
>>>  > This communication is intended for the above-named person and may
>>> be
>>>  > confidential and/or legally privileged.
>>>  > If it has come to you in error you must take no action based on
>>> it, nor must
>>>  > you copy or show it to anyone; please delete/destroy and inform
>>> the sender
>>>  > immediately.
>>>  >
>>>  > On Saturday, 19 September 2015 at 04:09, craig w wrote:
>>>  >
>>>  > Mesosphere provides packages, you can find more information here:
>>>  > https://mesosphere.com/downloads/
>>>  >
>>>  > As of right now, they don't seem to have a 0.24.0 package.
>>>  >
>>>  > On Fri, Sep 18, 2015 at 8:51 PM, Brian Hicks
>>> mailto:br...@brianthicks.com>> wrote:
>>>  >
>>>  > We've got some experimental packages at
>>> bintray.com/asteris/mantl-rpm <http://bintray.com/asteris/mantl-rpm
>>> >,
>>>  > source is at github.com/asteris-llc/mesos-packaging
>>> <http://github.com/asteris-llc/mesos-packaging>. They can really use
>>>  > some testing if you wanted to give them a try. Configuration is a
>>> bit
>>>  > different than the Mesosphere packages, see the repo for details.
>>>  >
>>>  > On Sep 18, 2015 7:01 PM, "Zameer Manji" >> <mailto:zma...@apache.org>> wrote:
>>>  >
>>>  > Hey,
>>>  >
>>>  > Does the Apache Mesos project provide OS packages for
>>> installation? I
>>>  > haven't been able to find any for the 0.24 release and I think
>>> having them
>>>  > would make installing Mesos a lot easier.
>>>  >
>>>  > --
>>>  > Zameer Manji
>>>  >
>>>  >
>>>  >
>>>  >
>>>  > --
>>>  >
>>>  > https://github.com/mindscratch
>>>  > https://www.google.com/+CraigWickesser
>>>  > https://twitter.com/mind_scratch
>>>  > https://twitter.com/craig_links
>>>  >
>>>  >
>>>
>>> --
>>> Zameer Manji
>>>
>>>


Re: Official RPMs

2015-09-25 Thread Zameer Manji
 team only has to support  a single document to clarify what any distro
> needs
> >> to robustly support mesos for their user community. This will
> facilitate a
> >> wider variety of experimentation, at the companion repos too. This
> Option
> >> #1 approach will further accelerate adoption of Mesos on a very wide
> variety
> >> of platforms and architectures, imho. It sets the stage for valid
> benchmark
> >> performance comparison between distros; something that the gentoo
> community
> >> will no doubt win
> >>
> >> ;-)
> >>
> >> James
> >>
> >>
> >>
> >>
> >>>
> >>> On Sat, Sep 19, 2015 at 3:39 AM, Carlos Sanchez  >>> <mailto:car...@apache.org>> wrote:
> >>>
> >>> I'm using the same repo with some changes to build SSL enabled
> >>> packages
> >>>
> >>>
> >>>
> https://github.com/carlossg/mesos-deb-packaging/compare/master...carlossg:ssl
> >>>
> >>>
> >>> On Sat, Sep 19, 2015 at 4:22 AM, Rad Gruchalski
> >>> mailto:ra...@gruchalski.com>> wrote:
> >>>  > Should be rather easy to package it with this little tool from
> >>> Mesosphere:
> >>>  > https://github.com/mesosphere/mesos-deb-packaging. I’ve done it
> >>> myself for
> >>>  > ubuntu 12.04 and 14.04.
> >>>  > The only thing that needs to be changed are the dependencies,
> for
> >>> ubuntu
> >>>  > this was:
> >>>  >
> >>>  > diff --git a/build_mesos b/build_mesos
> >>>  > index 81561bc..f756ef0 100755
> >>>  > --- a/build_mesos
> >>>  > +++ b/build_mesos
> >>>  > @@ -313,9 +313,10 @@ function deb_ {
> >>>  > --deb-recommends zookeeperd
> >>>  > --deb-recommends zookeeper-bin
> >>>  > -d 'java-runtime-headless'
> >>>  > -   -d libcurl3
> >>>  > -   -d libsvn1
> >>>  > -   -d libsasl2-modules
> >>>  > +   -d libcurl4-nss-dev
> >>>  > +   -d libsasl2-dev
> >>>  > +   -d libapr1-dev
> >>>  > +   -d libsvn-dev
> >>>  >
> >>>  > It does look like the tool can build RPMs.
> >>>  >
> >>>  > Kind regards,
> >>>  > Radek Gruchalski
> >>>  > ra...@gruchalski.com <mailto:ra...@gruchalski.com>
> >>>  > de.linkedin.com/in/radgruchalski/
> >>> <http://de.linkedin.com/in/radgruchalski/>
> >>>  >
> >>>  > Confidentiality:
> >>>  > This communication is intended for the above-named person and
> may
> >>> be
> >>>  > confidential and/or legally privileged.
> >>>  > If it has come to you in error you must take no action based on
> >>> it, nor must
> >>>  > you copy or show it to anyone; please delete/destroy and inform
> >>> the sender
> >>>  > immediately.
> >>>  >
> >>>  > On Saturday, 19 September 2015 at 04:09, craig w wrote:
> >>>  >
> >>>  > Mesosphere provides packages, you can find more information
> here:
> >>>  > https://mesosphere.com/downloads/
> >>>  >
> >>>  > As of right now, they don't seem to have a 0.24.0 package.
> >>>  >
> >>>  > On Fri, Sep 18, 2015 at 8:51 PM, Brian Hicks
> >>> mailto:br...@brianthicks.com>> wrote:
> >>>  >
> >>>  > We've got some experimental packages at
> >>> bintray.com/asteris/mantl-rpm <
> http://bintray.com/asteris/mantl-rpm>,
> >>>  > source is at github.com/asteris-llc/mesos-packaging
> >>> <http://github.com/asteris-llc/mesos-packaging>. They can really
> use
> >>>  > some testing if you wanted to give them a try. Configuration is
> a
> >>> bit
> >>>  > different than the Mesosphere packages, see the repo for
> details.
> >>>  >
> >>>  > On Sep 18, 2015 7:01 PM, "Zameer Manji"  >>> <mailto:zma...@apache.org>> wrote:
> >>>  >
> >>>  > Hey,
> >>>  >
> >>>  > Does the Apache Mesos project provide OS packages for
> >>> installation? I
> >>>  > haven't been able to find any for the 0.24 release and I think
> >>> having them
> >>>  > would make installing Mesos a lot easier.
> >>>  >
> >>>  > --
> >>>  > Zameer Manji
> >>>  >
> >>>  >
> >>>  >
> >>>  >
> >>>  > --
> >>>  >
> >>>  > https://github.com/mindscratch
> >>>  > https://www.google.com/+CraigWickesser
> >>>  > https://twitter.com/mind_scratch
> >>>  > https://twitter.com/craig_links
> >>>  >
> >>>  >
> >>>
> >>>
> >>
> >
>
> --
> Zameer Manji
>
>