Re: [DISCUSS] State of the Community

2018-05-22 Thread David McLaughlin
I feel like not getting code reviews is often a symptom of some other
fundamental issue with how change is introduced to a community.

When I joined the Aurora team at Twitter there were some principles in
place for getting your changes accepted by the community, and I still feel
that when you follow them, getting code reviews rarely requires more than a
gentle ping. Maybe none of these have been formally communicated or shared
externally, but some of the principles I've picked up include:

* Introduce problems before solutions.
* Get buy-in that this is a problem worth solving.
* Work towards abstractions that work for the community and not just for
your specific use case.
* Solicit early feedback on potential solutions.
* Get explicit buy-in for the solution (these +1s would be the people you
add to the reviewers list later). This usually means writing a Design Doc.
* Plan your work carefully to avoid the dreaded code dumps (where
possible). For large efforts, work towards multiple small patches that are
easy to review.
* Follow up on review feedback quickly to avoid demanding expensive paging
and context switches from your reviewers later.
* Build trust by thinking through production rollout and rollback
scenarios.

There is obviously more than just this list, but a lot of the patches that
struggle to get reviews (or get hard -1s after a bunch of work is done)
fail on one (or more) of these fundamental ideas.

It's also worth calling out that informal discussions on Slack are fine,
but they should also be brought to the dev lists, ideally in the form of
a written document. This is the best way to include those of us who feel
that Slack is a massive productivity drain :)

I guess this is my long-winded way of saying that I'm a -1 on moving to
lazy consensus.

I wonder if a lot of the concerns can be solved by just improving
communication? Maybe we can revive the weekly developer meeting that we
used to run in IRC.


On Tue, May 22, 2018 at 8:58 AM, Renan DelValle  wrote:

> Thanks for your input, Stephan - very much appreciated! Replies inline:
>
> On Tue, May 22, 2018 at 12:12 AM, Stephan Erb wrote:
>
> > Hey Renan,
> >
> > thanks again for bringing this up. In my experience, the pain comes from
> > building, testing & voting rather than from the packaging scripts themselves. I
> > therefore think we should discontinue building, but continue to maintain
> > the scripts so that users can build them on their own when necessary.
> >
>
> Fully agree on this. I will even go as far as making unofficial builds
> available for the time being if no one is opposed and if it's not against
> Apache policy to do so.
>
> >
> > We must be careful though with linking the ‘nightly jenkins builds’ on
> > the website. We got called out for this once in the past and had to take
> > the link down.
> >
>
> Noted, thanks for bringing this up!
>
> >
> > We also see a lack of involvement in code reviews. I think we should
> > consider setting up a more formal lazy consensus policy
> > https://www.apache.org/foundation/voting.html#LazyConsensus : For
> > example, patches may be merged even with a single ‘ship it’ from a
> > committer, if there is neither a ship-it nor a veto from other committers
> > within 7 days.
> >
>
> I think this is a very valid way forward at this point. How does everyone
> else feel about this?
>
> >
> > Best regards,
> > Stephan
> >
>
>
> -Renan
>
> >
> > From: Santhosh Kumar Shanmugham 
> > Reply-To: "user@aurora.apache.org" 
> > Date: Thursday, 17. May 2018 at 22:13
> > To: "d...@aurora.apache.org" 
> > Cc: "user@aurora.apache.org" 
> > Subject: Re: [DISCUSS] State of the Community
> >
> > Hello Renan,
> >
> > I understand your frustration.
> >
> > I am a strong +1 for automating the release and voting process. I
> > performed a release a while back and the process definitely needs
> > improved documentation at the very least. If one of the members who is
> > more familiar with this process can create a backlog, I will be happy to
> > chip in.
> >
> > -Santhosh
> >
> > On Thu, May 17, 2018 at 12:56 PM, Renan DelValle <re...@apache.org> wrote:
> > All,
> >
> > Discussion has been open for 13 days and only one user has chimed in.
> > Unfortunately it looks like talking point number one will be a serious
> > concern going forward. I will give until tomorrow 12 PM San Francisco
> > time for folks to voice their opinion on these issues.
> >
> > After tomorrow I will call a vote to cease distribution of official
> > binary packages from version 0.21.0 onwards until the process is
> > automated and voting for the binary packages can be combined with the
> > tar.gz release.
> >
> > Since no feedback was received regarding talking point three, the idea
> > will be dropped.
> >
> > -Renan
> >
> >
> > On Fri, May 4, 2018 at 

Re: Slack IRC Gateway support ending

2018-03-14 Thread David McLaughlin
I don't have a strong opinion here; the whole chat space is very
flavor-of-the-month. Does Apache have a policy?



On Tue, Mar 13, 2018 at 3:04 PM, Renan DelValle  wrote:

> Hi all,
>
> Slack has announced that their gateway for IRC will no longer be available
> after May 15th, 2018. [1]
>
> mslackbot was last seen in our IRC channel on February 9th, 2018. [2]
>
> I would like to hear some feedback from the community as to how we should
> proceed.
>
> My personal experience has been that folks find it difficult to find their
> way into our Slack channel, but once they do, the experience is better than
> IRC.
>
> It should also be noted that I haven't seen participation from our IRC
> channel in at least six months.
>
> Still, Slack is a walled garden and we're an open source community, so I'd
> like some input on whether:
>
> A. We should continue discussions in Slack and advertise the Slack channel
> on our website. If yes, in what capacity (official or unofficial).
> B. We should continue to link to our IRC channel from our website,
> considering it has been inactive for so long.
>
> - Renan
>
> [1] https://get.slack.help/hc/en-us/articles/201727913-Connect-to-Slack-over-IRC-and-XMPP
> [2] https://wilderness.apache.org/channels/?f=aurora/2018-02-09
>


Re: Vagrant dashboard issue

2018-02-13 Thread David McLaughlin
Hi Eyal,

This broke our cluster at Twitter too, so I think master is currently
broken. I think something went wrong with this commit:
https://github.com/apache/aurora/commit/cb0faf831747baa93ba9cff2423862eccb9b99bb

I believe Jordan Ly and Stephan are discussing a remediation (i.e. whether
to revert or roll forward). Maybe they can update.

Cheers,
David

On Tue, Feb 13, 2018 at 11:38 AM, Eyal Cidon  wrote:

> Hi all,
>
> I am trying to follow the instructions in these two guides:
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/
> http://aurora.apache.org/documentation/latest/getting-started/tutorial/
>
> I am having issues with the Vagrant VM: the Aurora scheduler dashboard
> gets stuck on loading and I can't figure out why. Also, the Mesos agent is
> unreachable. I attached screenshots below.
>
> I am running this from my Mac; I cloned Aurora from the GitHub mirror and
> I made no modifications. My Vagrant version is 2.0.2 and VirtualBox is
> 5.2.6 with the guest additions.
>
> Would appreciate any help.
>
> Thank you,
>
> Eyal
>
>


Re: Staggered deployments

2018-01-31 Thread David McLaughlin
It's not currently possible with a single API call; you'd have to submit
multiple calls to startJobUpdate, changing the JobUpdateSettings each time.
When you say the size of the batch could change at each step - could it change
dynamically (e.g. after you've submitted the call to the Scheduler), or is
all the information known upfront?
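
As a rough sketch of what that looks like from a client, something like the
following (the `client` handle and `wait_for_update` helper are hypothetical,
and the Thrift struct/field names are my recollection of api.thrift, so treat
this as illustrative only):

    # Illustrative only: roll out 1 / 50 / 49 instances as three separate
    # job updates. `client` stands in for whatever generated Thrift client
    # you already use; `wait_for_update` would poll getJobUpdateDetails
    # until the current update finishes before starting the next one.
    batches = [range(0, 1), range(1, 51), range(51, 100)]
    for batch in batches:
        settings = JobUpdateSettings(
            updateGroupSize=len(batch),
            # Restrict this update to the instances in the current batch.
            updateOnlyTheseInstances={Range(first=batch[0], last=batch[-1])},
        )
        request = JobUpdateRequest(taskConfig=new_task_config,
                                   instanceCount=100,
                                   settings=settings)
        response = client.startJobUpdate(request, 'staggered rollout step')
        wait_for_update(response)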

On Wed, Jan 31, 2018 at 6:15 PM, Renan DelValle 
wrote:

> Hi all,
>
> We have a use case where we want to deploy X instances in N steps. The
> size of the batch could potentially change every step. For example, we
> might start by deploying 1 instance, followed by a batch of 50, followed
> by a batch of 49, to deploy 100 instances in total.
>
> Is there any way of achieving this behavior through existing Aurora thrift
> calls/primitives?
>
> Any insights would be greatly appreciated.
>
> -Renan
>


Re: shutdown vs kill API in Mesos

2018-01-15 Thread David McLaughlin
We are working with Mesos folks to resolve it. There is a Mesos performance
working group that folks can join if they'd like to contribute:
http://mesos.apache.org/blog/performance-working-group-progress-report/

I'm not sure what you mean by branch. Everything we used to scale test is
on master.

On Mon, Jan 15, 2018 at 10:08 AM, Meghdoot bhattacharya <
meghdoo...@yahoo.com> wrote:

> David, should Twitter try against Mesos 1.5 to see if things are better
> with the new API instead of libmesos? This is going to be a drift over time
> that will stop us from adopting new features.
>
> If it was some time back, it would be good to rerun the tests and open a
> ticket in Mesos if issues exist. All Aurora users can then push for
> resolution.
>
> Also, any details on the branch etc. that has the API integration?
>
> Thx
>
> On Jan 12, 2018, at 11:39 AM, David McLaughlin <dmclaugh...@apache.org>
> wrote:
>
> I'm not sure I agree with the summary. Bill's proposal was using shutdown
> only when using the new API. I would also support this if it's possible.
>
> On Fri, Jan 12, 2018 at 11:14 AM, Mohit Jaggi <mohit.ja...@uber.com>
> wrote:
>
>> Summary so far:
>> - Bill supports making this change
>> - This change cannot be made in a backward compatible manner
>> - David (Twitter) does not want to use HTTP APIs due to performance
>> concerns. I conclude that folks from Twitter don't support this change
>>
>> Question:
>> - Are there other users that want this change?
>>
>>
>>
>


Re: shutdown vs kill API in Mesos

2018-01-12 Thread David McLaughlin
I'm not sure I agree with the summary. Bill's proposal was using shutdown
only when using the new API. I would also support this if it's possible.

On Fri, Jan 12, 2018 at 11:14 AM, Mohit Jaggi  wrote:

> Summary so far:
> - Bill supports making this change
> - This change cannot be made in a backward compatible manner
> - David (Twitter) does not want to use HTTP APIs due to performance
> concerns. I conclude that folks from Twitter don't support this change
>
> Question:
> - Are there other users that want this change?
>
>
>


Re: shutdown vs kill API in Mesos

2018-01-11 Thread David McLaughlin
Sorry, the other approach outlined by Bill would in theory work too, but it
sounds like in practice it also needs more changes on the Mesos side.

On Thu, Jan 11, 2018 at 1:55 PM, David McLaughlin <dmclaugh...@apache.org>
wrote:

> Right. In order to keep the current abstraction in Aurora (both APIs), we
> obviously have to bind to the lowest common denominator API methods. So the
> only way to integrate with shutdown will be to fix the performance issues
> so we can switch to the new API.
>
> The performance issue we ran into at Twitter was that with status updates
> that were similar to our production volume, they started to get dropped and
> tasks ended up being LOST and unnecessarily killed. So it's a definite
> blocker for us to adopt in its current state. We have someone who has put
> fixing this on the Mesos side in their backlog, but it's currently not the
> highest priority for us.
>
> On Thu, Jan 11, 2018 at 1:45 PM, Renan DelValle <renanidelva...@gmail.com>
> wrote:
>
>> The HTTP API is what is used under the hood for V0 and V1 (instead of
>> libmesos); I believe that's what David was referencing when he mentioned
>> the HTTP performance issues. Here's a better explanation from the original
>> patch submitted by Zameer:
>> https://github.com/apache/aurora/commit/705dbc7cd7c3ff477bcf766cdafe49a68ab47dee#diff-75bd5a98db87502a2332e9110d2eafc6
>>
>> I'm not sure about the Shutdown call; as you mentioned, the versioned
>> driver seems to have the method but the driver interface does not. This
>> might get tricky from here on in, since Mesos has V1-only compatible calls.
>>
>> On Thu, Jan 11, 2018 at 1:24 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Thanks Renan. I saw that code. "Driver" interface does not have
>>> SHUTDOWN...so it is not "compatible". I was trying to change to
>>> VersionedSchedulerDriverService all over the code (that wreaks havoc
>>> across the tests!) but Mesos's Java wrapper doesn't seem to have that
>>> call either. Perhaps that is why David referred to the HTTP API.
>>>
>>> On Thu, Jan 11, 2018 at 1:14 PM, Renan DelValle <
>>> renanidelva...@gmail.com> wrote:
>>>
>>>> https://github.com/apache/aurora/blob/aae2b0dc73b7534c66982ed07b1f029150e245de/src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverModule.java
>>>>
>>>> https://github.com/apache/aurora/blob/aae2b0dc73b7534c66982ed07b1f029150e245de/src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java#L50
>>>>
>>>> On Tue, Jan 9, 2018 at 1:21 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> David,
>>>>> Where can I find this code?
>>>>>
>>>>> Mohit.
>>>>>
>>>>> On Sat, Dec 9, 2017 at 4:27 PM, David McLaughlin <
>>>>> dmclaugh...@apache.org> wrote:
>>>>>
>>>>>> The new API is present in Aurora in a compatibility layer, but the
>>>>>> HTTP performance issues still exist so we can't make it the default.
>>>>>>
>>>>>> On Sat, Dec 9, 2017 at 4:24 PM, Bill Farner <wfar...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Aurora pre-dates SHUTDOWN by several years, so the option was not
>>>>>>> present.  Additionally, the SHUTDOWN call is not available in the API
>>>>>>> used by Aurora.  Last I knew, Aurora could not use the "new" API because
>>>>>>> of performance issues in the implementation, but I do not know where
>>>>>>> that stands today.
>>>>>>>
>>>>>>> https://mesos.apache.org/documentation/latest/scheduler-http-api/#shutdown
>>>>>>>
>>>>>>>> NOTE: This is a new call that was not present in the old API
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 9, 2017 at 4:11 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>> Our Mesos team is wondering why Aurora chose KILL over SHUTDOWN for
>>>>>>>> killing tasks. As Aurora has an executor per task, won't SHUTDOWN work
>>>>>>>> better? It will avoid zombie executors.
>>>>>>>>
>>>>>>>> Mohit.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: shutdown vs kill API in Mesos

2018-01-11 Thread David McLaughlin
Right. In order to keep the current abstraction in Aurora (both APIs), we
obviously have to bind to the lowest common denominator API methods. So the
only way to integrate with shutdown will be to fix the performance issues
so we can switch to the new API.

The performance issue we ran into at Twitter was that with status updates
that were similar to our production volume, they started to get dropped and
tasks ended up being LOST and unnecessarily killed. So it's a definite
blocker for us to adopt in its current state. We have someone who has put
fixing this on the Mesos side in their backlog, but it's currently not the
highest priority for us.

On Thu, Jan 11, 2018 at 1:45 PM, Renan DelValle <renanidelva...@gmail.com>
wrote:

> The HTTP API is what is used under the hood for V0 and V1 (instead of
> libmesos); I believe that's what David was referencing when he mentioned
> the HTTP performance issues. Here's a better explanation from the original
> patch submitted by Zameer:
> https://github.com/apache/aurora/commit/705dbc7cd7c3ff477bcf766cdafe49a68ab47dee#diff-75bd5a98db87502a2332e9110d2eafc6
>
> I'm not sure about the Shutdown call; as you mentioned, the versioned
> driver seems to have the method but the driver interface does not. This
> might get tricky from here on in, since Mesos has V1-only compatible calls.
>
> On Thu, Jan 11, 2018 at 1:24 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Thanks Renan. I saw that code. "Driver" interface does not have
>> SHUTDOWN...so it is not "compatible". I was trying to change to
>> VersionedSchedulerDriverService all over the code (that wreaks havoc
>> across the tests!) but Mesos's Java wrapper doesn't seem to have that
>> call either. Perhaps that is why David referred to the HTTP API.
>>
>> On Thu, Jan 11, 2018 at 1:14 PM, Renan DelValle <renanidelva...@gmail.com
>> > wrote:
>>
>>> https://github.com/apache/aurora/blob/aae2b0dc73b7534c66982ed07b1f029150e245de/src/main/java/org/apache/aurora/scheduler/mesos/SchedulerDriverModule.java
>>>
>>> https://github.com/apache/aurora/blob/aae2b0dc73b7534c66982ed07b1f029150e245de/src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java#L50
>>>
>>> On Tue, Jan 9, 2018 at 1:21 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> David,
>>>> Where can I find this code?
>>>>
>>>> Mohit.
>>>>
>>>> On Sat, Dec 9, 2017 at 4:27 PM, David McLaughlin <
>>>> dmclaugh...@apache.org> wrote:
>>>>
>>>>> The new API is present in Aurora in a compatibility layer, but the
>>>>> HTTP performance issues still exist so we can't make it the default.
>>>>>
>>>>> On Sat, Dec 9, 2017 at 4:24 PM, Bill Farner <wfar...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Aurora pre-dates SHUTDOWN by several years, so the option was not
>>>>>> present.  Additionally, the SHUTDOWN call is not available in the API
>>>>>> used by Aurora.  Last I knew, Aurora could not use the "new" API because
>>>>>> of performance issues in the implementation, but I do not know where
>>>>>> that stands today.
>>>>>>
>>>>>> https://mesos.apache.org/documentation/latest/scheduler-http-api/#shutdown
>>>>>>
>>>>>>> NOTE: This is a new call that was not present in the old API
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 9, 2017 at 4:11 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Folks,
>>>>>>> Our Mesos team is wondering why Aurora chose KILL over SHUTDOWN for
>>>>>>> killing tasks. As Aurora has an executor per task, won't SHUTDOWN work
>>>>>>> better? It will avoid zombie executors.
>>>>>>>
>>>>>>> Mohit.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Aurora taking really long to reschedule a full cluster

2017-11-29 Thread David McLaughlin
You should not need to adjust max_schedule_attempts_per_sec, as it defaults
to 40, which should give you pretty close to 2400 schedule attempts per
minute. (Our max is set to 30 and in our scale tests we hit 1800 tasks
scheduled per minute pretty consistently.)
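
For reference, both knobs are scheduler command-line flags; roughly (flag
names as I remember them, and the second value is purely illustrative):

    # 40 attempts/sec * 60s ~= 2400 schedule attempts per minute.
    -max_schedule_attempts_per_sec=40
    # How many pending tasks get grouped into a single scheduling attempt;
    # 5 is just an illustrative value here, not a recommendation.
    -max_tasks_per_schedule_attempt=5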

Can you provide more info on how you are scheduling? Are you scheduling
from scratch (using job create or update start)? How many job keys?

On Wed, Nov 29, 2017 at 4:42 PM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> This was on 0.17. No logs, sorry; I'll run the same test again in a week or
> so. I can share the new ones and even kill the leader in the middle of the
> process.
>
> Tasks continued to run; I remember I dug through the logs to see how long
> it took for a particular task to show up again as assigned. I'll adjust
> max_tasks_per_schedule_attempt and test it again.
>
> Thanks!
>
> On Wed, Nov 29, 2017 at 12:03 PM, Bill Farner  wrote:
>
>> That works out to scheduling about 1 task/sec, which is at least one
>> order of magnitude lower than I would expect.  Are you sure tasks were
>> scheduling and continuing to run, rather than exiting/failing and
>> triggering more scheduling work?
>>
>> What build is this from?  Can you share (scrubbed) scheduler logs from
>> this period?
>>
>> On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <
>> mauriciogaravag...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> Recently, running some reliability tests, we restarted all the nodes in
>>> a cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to reschedule
>>> everything; we had a change of leader in the middle of the scheduling and
>>> that slowed it down even more. So we started looking at which Aurora
>>> parameters needed more tuning.
>>>
>>> The value of max_tasks_per_schedule_attempt is set to the default now;
>>> that probably needs to be increased. Is there a rule of thumb to tune it
>>> based on cluster size, # of jobs, # of frameworks, etc.?
>>>
>>> Regarding the JVM, we are running it with Xmx=24G; so far we haven't
>>> seen pressure there.
>>>
>>> Any input on where to look at would be really appreciated :)
>>>
>>> Mauricio
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Aurora pauses adding offers

2017-11-27 Thread David McLaughlin
Any log write latency will be reflected in the overall latency of the
request. Increased request latency is one of the main ways any server has
of telling a client that it's under load. It's then up to the client to
react to this.

If you want to throw error codes, you can put a proxy in front of Aurora
that has request timeouts - which would send 503s to clients. But the issue
with that is that the requests are mostly non-idempotent, so you'll need to
build reconciliation logic into it.
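
To make that client-side reaction concrete, a minimal sketch (the URL,
timeout, and retry counts are placeholders, and the requests module just
stands in for whatever client you use):

    import time
    import requests

    def call_scheduler(url, payload, max_retries=5):
        # Treat slow or timed-out responses as a signal that the scheduler
        # is under load, and back off exponentially before retrying. Most
        # scheduler calls are not idempotent, so a real client would need
        # reconciliation logic before blindly retrying (see above).
        delay = 1.0
        for _ in range(max_retries):
            try:
                return requests.post(url, json=payload, timeout=5)
            except requests.exceptions.Timeout:
                time.sleep(delay)
                delay *= 2
        raise RuntimeError("scheduler still slow after %d attempts" % max_retries)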

On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi  wrote:

> Imagine something like Spinnaker using Aurora underneath to schedule
> services. That layer often "amplifies" human effort and may result in a lot
> of load on Aurora. Usually that is fine, but if Aurora slows down due to
> transient problems, it can signal that to upstream software in the same way
> that busy web servers do during Cyber Monday sales :-)
>
> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner  wrote:
>
>> I want to let upstream software "know" that Aurora is slowing down and
>>> that it should back off
>>
>>
>> Can you offer more detail about how Aurora is being used in this regard?
>> I haven't seen use cases in the past that would be amenable to this
>> behavior, so I would like to understand better.
>>
>>
>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi 
>> wrote:
>>
>>> Thanks Bill. We haven't been able to track down a specific root
>>> cause (the ZK node is known to have issues now and then, but we don't
>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>> soon. In addition, I want to let upstream software "know" that Aurora is
>>> slowing down and that it should back off. To achieve this I want to send
>>> 5xx error codes back when update/rollback/kill etc. are called and certain
>>> metrics (like log write lock wait time) indicate heavy load. Perhaps this
>>> "defense" already exists?
>>>
>>>
>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner  wrote:
>>>
 The next level is to determine why the storage lock is being held.
 Common causes include:

 1. storage snapshot slowness, when scheduler state is very large, O(gb)
 1a. long GC pauses in the scheduler, often induced by (1)
 2. scheduler replicated log on slow disks
 3. network issues between schedulers, schedulers to zookeeper, or
 between zookeepers

 As an immediate (partial) remedy, I suggest you upgrade to eliminate
 the use of SQL/mybatis in the scheduler.  This helped Twitter improve (1)
 and (1a).

 commit f2755e1
 Author: Bill Farner 
 Date:   Tue Oct 24 23:34:09 2017 -0700

 Exclusively use Map-based in-memory stores for primary storage


 On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi 
 wrote:

> and in log_storage_write_lock_wait_ns_per_event
>
> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi 
> wrote:
>
>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>> that cause or effect? :-)
>>
>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi 
>> wrote:
>>
>>> Thanks Bill. Please see inline:
>>>
>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner 
>>> wrote:
>>>
 I suspect they are getting enqueued


 Just to be sure - the offers do eventually get through though?


>>> In one instance the offers did get through but it took several
>>> minutes. In other instances we reloaded the scheduler to let another one
>>> become the leader.
>>>
>>>
 The most likely culprit is contention for the storage write lock,  
 observable
 via spikes in stat log_storage_write_lock_wait_ns_total.

>>>
>>> Thanks. I will check that one.
>>>
>>>

 I see that a lot of getJobUpdateDetails() and
> getTasksWithoutConfigs() calls are being made at that time


 This sounds like API activity.  This shouldn't interfere with offer
 processing directly, but could potentially slow down the scheduler as a
 whole.


>>> So these won't contend for locks with offer processing and task
>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>> machine. I also noticed a spike in mybatis active and bad connections.
>>> Can't say if the spike in active is due to many bad connections or vice
>>> versa or there was a 3rd source causing both of these. Are there any
>>> metrics or logs that might help here?
>>>
>>>
 I also notice a lot of "Timeout reached for task..." around the
> same time. Can this happen if task is in PENDING state and does not 
> reach
> ASSIGNED due to lack of offers?

Re: How to fetch historical data for Aurora SLA metrics like MTTA, MTTS and MTTR?

2017-08-22 Thread David McLaughlin
We do the same thing at Twitter: we have a local agent that polls the
metrics endpoint and sends the results to our internal timeseries database.
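
A stripped-down sketch of what that agent does, just to show the shape of it
(the scheduler URL and polling interval are placeholders, and shipping the
sample to the timeseries store is left out):

    import time
    import requests

    VARS_URL = "http://scheduler.example.com:8081/vars.json"  # placeholder host

    def poll_once():
        # /vars.json returns a flat JSON map of counters and gauges,
        # including the SLA stats discussed further down this thread.
        metrics = requests.get(VARS_URL, timeout=10).json()
        return {k: v for k, v in metrics.items()
                if isinstance(v, (int, float))}

    while True:
        sample = poll_once()
        # ship `sample` to your timeseries database here (not shown)
        time.sleep(60)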

Cheers,
David

On Tue, Aug 22, 2017 at 4:05 PM, Derek Slager  wrote:

> Our approach is similar. We poll /vars.json on an interval and send a
> selection of those metrics to Riemann. We configure alerts there, and also
> pass these metrics through to InfluxDB for historical reporting (mostly via
> Grafana dashboards). This has worked well for us.
>
> --
> Derek Slager
> CTO
> Amperity
>
> On Tue, Aug 22, 2017 at 3:23 PM, De, Bipra  wrote:
>
>> Hello Friends,
>>
>> Greetings!!
>>
>> We are currently using *Aurora 0.17.0* and have a use-case wherein we
>> want to continuously monitor the below SLA metrics for our clusters to
>> detect any anomalies:
>>
>>    - Median Time To Assigned (MTTA)
>>    - Median Time To Starting (MTTS)
>>    - Median Time To Running (MTTR)
>>
>> Currently, the sla_stat_refresh_interval for us is set to the default of
>> 1 min.
>>
>> Now, while using the /vars API endpoint to fetch the SLA metrics, Aurora
>> samples the data for the above metrics only for the last minute, at every
>> one-minute interval. It won't give us the historical data for these
>> metrics.
>>
>> Does Aurora expose any API endpoint to provide the historical data for
>> these metrics over some configurable period of time? Is there any metric in
>> the /graphview endpoint for this?
>>
>> Also, it would be great if anyone could suggest some ideas for monitoring
>> around these metrics. I am, at present, planning to keep polling the
>> /vars endpoint regularly for data collection and use the ELK stack for
>> graphing and alerting.
>>
>> Thanks for your time in advance !!
>>
>> Regards,
>>
>> Bipra.
>>
>
>


Re: Aurora HTTP BA Issues

2016-10-16 Thread David McLaughlin
I *just* saw that in the e2e test. Have we documented that anywhere?
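
For anyone else who hits this, the ~/.netrc entry that requests falls back to
(as Stephan describes below) is just three lines; the host and credentials
here are placeholders:

    machine scheduler.example.com
    login my-user
    password my-secret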

On Sun, Oct 16, 2016 at 10:24 AM, Stephan Erb <s...@apache.org> wrote:

> By default we should be using the fallback implementation of the
> requests Python module:
> http://docs.python-requests.org/en/master/user/authentication/#netrc-authentication
>
> So just adding a ~/.netrc file should therefore be sufficient to pass
> credentials via basic auth.
>
>
>
> > On Sun, 2016-10-16 at 10:07 -0700, David McLaughlin wrote:
> > I noticed this while I was reviewing the patch to add cookies - I
> > couldn't find any HTTP basic auth support in the client. We'd need a
> > patch to wire that up (with some method of getting the user
> > credentials).
> >
> > I think this confirms my concern about adding auth support on only
> > one side of the CLI/Scheduler :)
> >
> > On Sun, Oct 16, 2016 at 9:46 AM, Stephan Erb <s...@apache.org> wrote:
> > > Hi,
> > >
> > > would it be possible to show us your relevant scheduler
> > > configuration and your ini file?  This will make it easier to
> > > reproduce the issue.
> > >
> > > Thanks,
> > > Stephan
> > >
> > > On Thu, 2016-10-13 at 17:58 +, Ajmera, Jatan wrote:
> > > > Hi,
> > > > I was previously communicating about this issue on the Slack
> > > > channel some time back. I am having trouble when I am trying to
> > > > configure HTTP Basic Auth for my Aurora scheduler on my Mesos
> > > > cluster. I have configured the necessary flags at deploy time
> > > > along with the other flags and have checked the ini file and made
> > > > sure it had the right permissions as well. The logs I see on the
> > > > master where the scheduler is running are as follows:
> > > >
> > > > 80006, negotiated timeout = 4000
> > > > I1013 01:51:09.751 [RedirectMonitor STARTING,
> > > > ServerSetImpl$ServerSetWatcher:317] received initial membership
> > > > [ServiceInstance(serviceEndpoint:Endpoint(host:ec2-54-244-159-
> > > > 201.us-west-2.compute.amazonaws.com, port:8081),
> > > > additionalEndpoints:{http=Endpoint(host:ec2-54-244-159-201.us-
> > > > west-2.compute.amazonaws.com, port:8081)}, status:ALIVE)]
> > > > I1013 01:51:09.765 [HttpServerLauncher STARTING, Server:345]
> > > > jetty-9.3.11.v20160721
> > > > W1013 01:51:10.278 [HttpServerLauncher STARTING, Stats:181] Re-
> > > > using already registered variable for key
> > > > shiro_authorization_failures
> > > > W1013 01:51:10.302 [HttpServerLauncher STARTING, IniRealm:139]
> > > > Users or Roles are already populated.  Configured Ini instance
> > > > will be ignored.
> > > > W1013 01:51:10.313 [HttpServerLauncher STARTING,
> > > > DefaultWebSecurityManager:173] The
> > > > org.apache.shiro.web.mgt.DefaultWebSecurityManager implementation
> > > > expects SessionManager instances that implement the
> > > > org.apache.shiro.web.session.mgt.WebSessionManager interface.
> > > > The configured instance is of type
> > > > [org.apache.shiro.session.mgt.DefaultSessionManager] which does
> > > > not implement this interface..  This may cause unexpected
> > > > behavior.
> > > > W1013 01:51:10.333 [HttpServerLauncher STARTING, IniRealm:139]
> > > > Users or Roles are already populated.  Configured Ini instance
> > > > will be ignored.
> > > > W1013 01:51:10.347 [HttpServerLauncher STARTING, IniRealm:139]
> > > > Users or Roles are already populated.  Configured Ini instance
> > > > will be ignored.
> > > > W1013 01:51:10.351 [HttpServerLauncher STARTING,
> > > > DefaultWebSecurityManager:173] The
> > > > org.apache.shiro.web.mgt.DefaultWebSecurityManager implementation
> > > > expects SessionManager instances that implement the
> > > > org.apache.shiro.web.session.mgt.WebSessionManager interface.
> > > > The configured instance is of type
> > > > [org.apache.shiro.session.mgt.DefaultSessionManager] which does
> > > > not implement this interface..  This may cause unexpected
> > > > behavior.
> > > > W1013 01:51:10.362 [HttpServerLauncher STARTING, IniRealm:139]
> > > > Users or Roles are already populated.  Configured Ini instance
> > > > will be ignored.
> > > >
> > > > And when I try to connect through the client I get a 401 Client
> > > > Error: Unauthorized. It would be really appreciated if you could
> > > > help me figure this out. Also, what would be a good approach to
> > > > secure communication from the client to the scheduler?
> > > >
> > > > Thanks,
> > > >
> > >
> >
>