
[jira] [Commented] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)



[ 
https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514181#comment-16514181
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1990:
---

https://reviews.apache.org/r/67613/

> SlaManager's AsyncHttpClient can keep scheduler from shutting down
> --
>
> Key: AURORA-1990
> URL: https://issues.apache.org/jira/browse/AURORA-1990
> Project: Aurora
>  Issue Type: Bug
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> We observed a situation where the scheduler was unable to fully shut down 
> after a single non-daemon thread from the AsyncHttpClient's thread pool 
> stayed alive.
> We should convert the SlaManager to be an AbstractIdleService to ensure that 
> the thread pool is closed on scheduler shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Assigned] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)



 [ 
https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1990:
-

Assignee: Santhosh Kumar Shanmugham

> SlaManager's AsyncHttpClient can keep scheduler from shutting down
> --
>
> Key: AURORA-1990
> URL: https://issues.apache.org/jira/browse/AURORA-1990
> Project: Aurora
>  Issue Type: Bug
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> We observed a situation where the scheduler was unable to fully shut down 
> after a single non-daemon thread from the AsyncHttpClient's thread pool 
> stayed alive.
> We should convert the SlaManager to be an AbstractIdleService to ensure that 
> the thread pool is closed on scheduler shutdown.




[jira] [Created] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1990:
-

 Summary: SlaManager's AsyncHttpClient can keep scheduler from 
shutting down
 Key: AURORA-1990
 URL: https://issues.apache.org/jira/browse/AURORA-1990
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


We observed a situation where the scheduler was unable to fully shut down after 
a single non-daemon thread from the AsyncHttpClient's thread pool stayed alive.

We should convert the SlaManager to be an AbstractIdleService to ensure that 
the thread pool is closed on scheduler shutdown.
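
The idea of tying the thread pool's lifetime to a service can be sketched as follows. This is only a Python analogue under stated assumptions: the real change would be in Java using Guava's AbstractIdleService, and the class name `SlaManagerSketch`, its methods, and the `sla-http` thread prefix are hypothetical names chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SlaManagerSketch:
    """Hypothetical stand-in for a service that owns a worker pool."""

    def __init__(self):
        self._pool = None

    def start_up(self):
        # The pool is created when the service starts, not at construction time.
        self._pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="sla-http")

    def check(self, value):
        # Placeholder for an asynchronous HTTP call made on the pool.
        return self._pool.submit(lambda: value * 2)

    def shut_down(self):
        # Joining the workers here is what lets the process exit cleanly.
        self._pool.shutdown(wait=True)


def live_pool_threads():
    # Worker threads still alive that belong to the hypothetical pool.
    return [t for t in threading.enumerate() if t.name.startswith("sla-http")]


manager = SlaManagerSketch()
manager.start_up()
result = manager.check(21).result()
threads_while_running = len(live_pool_threads())
manager.shut_down()
threads_after_shutdown = len(live_pool_threads())
```

If `shut_down` is never called, the worker threads outlive the work they were created for, which is the situation the description reports.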




[jira] [Resolved] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)



 [ 
https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1979.
---
Resolution: Implemented

https://reviews.apache.org/r/66716/

> Introduce new SLA-aware maintenance APIs
> 
>
> Key: AURORA-1979
> URL: https://issues.apache.org/jira/browse/AURORA-1979
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Introduce the new SLA-aware thrift endpoints on the Scheduler. More 
> information in the proposal document.




[jira] [Resolved] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)



 [ 
https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1976.
---
Resolution: Implemented

> SLA-based Maintenance in Scheduler
> --
>
> Key: AURORA-1976
> URL: https://issues.apache.org/jira/browse/AURORA-1976
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Proposal to create SLA-aware maintenance support in the Scheduler.
> https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j




[jira] [Resolved] (AURORA-1978) Implement SLA computation utilities

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)



 [ 
https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1978.
---
Resolution: Implemented

https://reviews.apache.org/r/66716/

> Implement SLA computation utilities
> ---
>
> Key: AURORA-1978
> URL: https://issues.apache.org/jira/browse/AURORA-1978
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Create utilities and helper methods required for computing SLA in the 
> Scheduler.




[jira] [Created] (AURORA-1986) Deprecate client-driven host maintenance.

2018-05-25 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1986:
-

 Summary: Deprecate client-driven host maintenance.
 Key: AURORA-1986
 URL: https://issues.apache.org/jira/browse/AURORA-1986
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


After AURORA-1978 the client-driven host maintenance is no longer required.

`sla_host_drain` is the newly introduced admin command. Merge this under the 
`host_drain` command and remove `sla_host_drain`.




[jira] [Created] (AURORA-1985) Offers are not updated after host_activate immediately

2018-05-25 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1985:
-

 Summary: Offers are not updated after host_activate immediately
 Key: AURORA-1985
 URL: https://issues.apache.org/jira/browse/AURORA-1985
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


After a host is activated, the {{HostAttributes}} on its {{HostOffers}} are not 
updated immediately, so the offer is not considered for scheduling tasks. It 
takes until the next offer cycle for the offer to be updated and used for scheduling.




[jira] [Resolved] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1977.
---
Resolution: Implemented

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>





[jira] [Commented] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476463#comment-16476463
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1977:
---

[~jingc] I have been working on this as part of 
[https://reviews.apache.org/r/66716]. I forgot to assign this to myself. 
Apologies.

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Jing Chen
>Priority: Major
>





[jira] [Assigned] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1979:
-

Assignee: Santhosh Kumar Shanmugham

> Introduce new SLA-aware maintenance APIs
> 
>
> Key: AURORA-1979
> URL: https://issues.apache.org/jira/browse/AURORA-1979
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Introduce the new SLA-aware thrift endpoints on the Scheduler. More 
> information in the proposal document.




[jira] [Assigned] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1977:
-

Assignee: Santhosh Kumar Shanmugham  (was: Jing Chen)

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>





[jira] [Assigned] (AURORA-1978) Implement SLA computation utilities

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1978:
-

Assignee: Santhosh Kumar Shanmugham

> Implement SLA computation utilities
> ---
>
> Key: AURORA-1978
> URL: https://issues.apache.org/jira/browse/AURORA-1978
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Create utilities and helper methods required for computing SLA in the 
> Scheduler.




[jira] [Created] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1979:
-

 Summary: Introduce new SLA-aware maintenance APIs
 Key: AURORA-1979
 URL: https://issues.apache.org/jira/browse/AURORA-1979
 Project: Aurora
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Santhosh Kumar Shanmugham


Introduce the new SLA-aware thrift endpoints on the Scheduler. More information 
in the proposal document.




[jira] [Created] (AURORA-1978) Implement SLA computation utilities

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1978:
-

 Summary: Implement SLA computation utilities
 Key: AURORA-1978
 URL: https://issues.apache.org/jira/browse/AURORA-1978
 Project: Aurora
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Santhosh Kumar Shanmugham


Create utilities and helper methods required for computing SLA in the Scheduler.




[jira] [Updated] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1977:
--
Component/s: Scheduler

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Priority: Major
>





[jira] [Updated] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1976:
--
Component/s: Scheduler

> SLA-based Maintenance in Scheduler
> --
>
> Key: AURORA-1976
> URL: https://issues.apache.org/jira/browse/AURORA-1976
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Proposal to create SLA-aware maintenance support in the Scheduler.
> https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j




[jira] [Created] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1977:
-

 Summary: Introduce thrift changes for describing SLAPolicies
 Key: AURORA-1977
 URL: https://issues.apache.org/jira/browse/AURORA-1977
 Project: Aurora
  Issue Type: Sub-task
Reporter: Santhosh Kumar Shanmugham







[jira] [Assigned] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1976:
-

Assignee: Santhosh Kumar Shanmugham

> SLA-based Maintenance in Scheduler
> --
>
> Key: AURORA-1976
> URL: https://issues.apache.org/jira/browse/AURORA-1976
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Proposal to create SLA-aware maintenance support in the Scheduler.
> https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j




[jira] [Created] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1976:
-

 Summary: SLA-based Maintenance in Scheduler
 Key: AURORA-1976
 URL: https://issues.apache.org/jira/browse/AURORA-1976
 Project: Aurora
  Issue Type: Story
Reporter: Santhosh Kumar Shanmugham


Proposal to create SLA-aware maintenance support in the Scheduler.

https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j




[jira] [Created] (AURORA-1966) TASK_UNKNOWN to PARTITIONED mapping puts Scheduler to kill non-exist Task indefinitely

2018-01-24 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1966:
-

 Summary: TASK_UNKNOWN to PARTITIONED mapping puts Scheduler to 
kill non-exist Task indefinitely
 Key: AURORA-1966
 URL: https://issues.apache.org/jira/browse/AURORA-1966
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


When a Task launch fails, the task is moved from ASSIGNED to LOST, which 
triggers a RESCHEDULE and a KILL. Unfortunately, sending a KILL for a 
non-existent task to the Mesos master results in a TASK_UNKNOWN status update, 
which gets mapped to PARTITIONED. Although the transition from LOST to 
PARTITIONED is not allowed, some callbacks are executed anyway, resulting in 
another KILL and RESCHEDULE action. This new KILL triggers another TASK_UNKNOWN, 
and hence PARTITIONED, status update for the same task, so the Scheduler keeps 
attempting to KILL the non-existent task indefinitely. Attempting a client job 
killall leaves the scheduler in the same state.

Since the scheduler uses the LOST state for black-holing tasks, the 
{{TaskStateMachine}} needs to take those into account.
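
The feedback loop described above can be illustrated with a toy model. This is a sketch under stated assumptions and not Aurora's {{TaskStateMachine}}: the `ALLOWED` table, the `TaskSketch` class, and the `guard_side_effects` switch are hypothetical and exist only to show why running side effects on a rejected transition keeps the loop alive.

```python
ALLOWED = {("ASSIGNED", "LOST")}  # hypothetical subset of the transition table


class TaskSketch:
    def __init__(self, guard_side_effects):
        self.state = "ASSIGNED"
        self.kills_sent = 0
        self.guard_side_effects = guard_side_effects

    def update(self, new_state):
        """Apply a status update; return True if a KILL was sent as a side effect."""
        allowed = (self.state, new_state) in ALLOWED
        if allowed:
            self.state = new_state
        if allowed or not self.guard_side_effects:
            # Side effect: ask the master to kill the task.
            self.kills_sent += 1
            return True
        return False


def simulate(guard_side_effects, max_rounds=10):
    task = TaskSketch(guard_side_effects)
    kill_sent = task.update("LOST")  # launch failure: ASSIGNED -> LOST, KILL sent
    rounds = 0
    while kill_sent and rounds < max_rounds:
        # Killing an unknown task yields TASK_UNKNOWN, mapped to PARTITIONED.
        kill_sent = task.update("PARTITIONED")
        rounds += 1
    return task.kills_sent
```

Without the guard, the simulation only stops because of `max_rounds`; with the guard, the rejected LOST to PARTITIONED update sends no further KILL and the loop ends after one round.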

I was able to reproduce this in the Vagrant image by faking a launch failure.
{code:java}
I0124 05:48:23.198 [qtp1791010542-40, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition INIT -> PENDING
I0124 05:48:23.213508 9748 log.cpp:560] Attempting to append 1679 bytes to the log
I0124 05:48:23.214570 9748 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 24778
I0124 05:48:23.214834 9748 replica.cpp:540] Replica received write request for position 24778 from __req_res__(4)@192.168.33.7:8083
I0124 05:48:23.221982 9748 leveldb.cpp:341] Persisting action (1700 bytes) to leveldb took 6.772102ms
I0124 05:48:23.222174 9748 replica.cpp:711] Persisted action APPEND at position 24778
I0124 05:48:23.222901 9748 replica.cpp:694] Replica received learned notice for position 24778 from log-network(1)@192.168.33.7:8083
I0124 05:48:23.226833 9748 leveldb.cpp:341] Persisting action (1702 bytes) to leveldb took 3.227779ms
I0124 05:48:23.227008 9748 replica.cpp:711] Persisted action APPEND at position 24778
I0124 05:48:23.262 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +] "POST //aurora.local/api HTTP/1.1" 200 78
I0124 05:48:23.267 [qtp1791010542-40, LoggingInterceptor] getTasksWithoutConfigs(TaskQuery(role:null, environment:null, jobName:null, taskIds:null, statuses:null, instanceIds:null, slaveHosts:null, jobKeys:[JobKey(role:vagrant, environment:test, name:fail-partition_aware_disabled)], offset:0, limit:0))
I0124 05:48:23.285 [qtp1791010542-40, RequestLog] 127.0.0.1 - - [24/Jan/2018:05:48:23 +] "POST //aurora.local/api HTTP/1.1" 200 794
I0124 05:48:23.349 [TaskGroupBatchWorker, StateMachine] Callback transition PENDING to ASSIGNED, allow: true
I0124 05:48:23.353 [TaskGroupBatchWorker, StateMachine] vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606 state machine transition PENDING -> ASSIGNED
I0124 05:48:23.356 [TaskGroupBatchWorker, TaskAssignerImpl] Offer on agent 192.168.33.7 (id fe8bc641-aa02-4363-a990-318d20de1bac-S0) is being assigned task for vagrant-test-fail-partition_aware_disabled-0-07bec0cb-d6a3-4caa-9b6e-60e6d0934606.
W0124 05:48:23.445 [TaskGroupBatchWorker, TaskAssignerImpl] Failed to launch task.
org.apache.aurora.scheduler.offers.OfferManager$LaunchException: Failed to launch task.
    at org.apache.aurora.scheduler.offers.OfferManagerImpl.launchTask(OfferManagerImpl.java:212)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.launchUsingOffer(TaskAssignerImpl.java:126)
    at org.apache.aurora.scheduler.scheduling.TaskAssignerImpl.maybeAssign(TaskAssignerImpl.java:262)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.scheduleTasks(TaskSchedulerImpl.java:154)
    at org.apache.aurora.scheduler.scheduling.TaskSchedulerImpl.schedule(TaskSchedulerImpl.java:108)
    at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
    at org.apache.aurora.scheduler.scheduling.TaskGroups$1.lambda$run$0(TaskGroups.java:174)
    at org.apache.aurora.scheduler.BatchWorker$Work.apply(BatchWorker.java:117)
    at org.apache.aurora.scheduler.BatchWorker.lambda$processBatch$3(BatchWorker.java:210)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
    at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
    at org.apache.aurora.scheduler.storage.durability.DurableStorage.lambda$doInTransaction$0(DurableStorage.java:199)
    at org.apache.aurora.scheduler.storage.mem.MemStorage.write(MemSto

[jira] [Created] (AURORA-1965) LaunchException needs transient task timeout for reschedule

2018-01-24 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1965:
-

 Summary: LaunchException needs transient task timeout for 
reschedule
 Key: AURORA-1965
 URL: https://issues.apache.org/jira/browse/AURORA-1965
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


In {{TaskAssignerImpl}}, a {{LaunchException}}, caused either by a race 
condition while grabbing an {{Offer}} or by a Scheduler Driver {{acceptOffers}} 
failure, results in the task being held up in the {{ASSIGNED}} state until the 
transient task timeout kicks in and moves the task to LOST, resulting in a reschedule.

[https://github.com/apache/aurora/blob/dbe71374399d86d77baa35efbeca4fffa34f380e/src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java#L135]

The impact of this is minimal because multiple conditions need to align for 
this situation to happen.

This happens because the call to {{changeState}} fails: the current state 
supplied in the method call (PENDING) is different from the actual state of the 
task (ASSIGNED).
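
The failure mode reads like a compare-and-set on the task state. A minimal sketch of that mechanism, assuming a hypothetical `TaskStoreSketch` with a `change_state` helper (neither is Aurora's actual API):

```python
class TaskStoreSketch:
    """Hypothetical in-memory store keyed by task id."""

    def __init__(self):
        self.states = {}

    def change_state(self, task_id, expected, new_state):
        """Apply the transition only if the task is currently in `expected`."""
        if self.states.get(task_id) != expected:
            return False
        self.states[task_id] = new_state
        return True


store = TaskStoreSketch()
store.states["task-0"] = "ASSIGNED"  # the assignment was already recorded

# The caller still believes the task is PENDING, so the change is rejected.
stale = store.change_state("task-0", "PENDING", "LOST")

# Supplying the state the task is actually in lets the change go through.
fresh = store.change_state("task-0", "ASSIGNED", "LOST")
```

In this sketch the rejected call leaves the task in ASSIGNED, which matches the description of the task waiting for the transient task timeout.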

 




[jira] [Commented] (AURORA-1946) Make STARTING a transient state

2017-09-19 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172383#comment-16172383
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1946:
---

The underlying issue in the above case was that the Task's status updates to 
{{RUNNING}} and then to {{FAILED}} were never communicated to the Master and 
eventually to the Scheduler. So the issue lies with the Agent.

> Make STARTING a transient state
> ---
>
> Key: AURORA-1946
> URL: https://issues.apache.org/jira/browse/AURORA-1946
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a 
> task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
> scenario the {{Task}} should have been transitioned into {{LOST}} due to a 
> transient state. But {{STARTING}} is not a transient state.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Closed] (AURORA-1946) Make STARTING a transient state

2017-09-19 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham closed AURORA-1946.
-
Resolution: Invalid

> Make STARTING a transient state
> ---
>
> Key: AURORA-1946
> URL: https://issues.apache.org/jira/browse/AURORA-1946
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a 
> task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
> scenario the {{Task}} should have been transitioned into {{LOST}} due to a 
> transient state. But {{STARTING}} is not a transient state.




[jira] [Commented] (AURORA-1946) Make STARTING a transient state

2017-09-19 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172381#comment-16172381
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1946:
---

I just realized that although the {{STARTING}} state can be treated as a 
transient state, the timeout depends on the {{HealthCheckConfig}}, which 
dictates how long the {{Task}} can stay in {{STARTING}}. Furthermore, 
{{HealthCheckConfig}} is an {{Executor}} concept that the Scheduler does not 
care about. So it does not make sense to convert {{STARTING}} into a transient 
state that degrades into {{LOST}} based on a common timeout value.

> Make STARTING a transient state
> ---
>
> Key: AURORA-1946
> URL: https://issues.apache.org/jira/browse/AURORA-1946
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a 
> task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
> scenario the {{Task}} should have been transitioned into {{LOST}} due to a 
> transient state. But {{STARTING}} is not a transient state.




[jira] [Created] (AURORA-1946) Make STARTING a transient state

2017-09-18 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1946:
-

 Summary: Make STARTING a transient state
 Key: AURORA-1946
 URL: https://issues.apache.org/jira/browse/AURORA-1946
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham


We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a 
task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
scenario the {{Task}} should have been transitioned into {{LOST}} due to a 
transient state. But {{STARTING}} is not a transient state.
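
What "transient state" means here can be sketched as a timeout rule. The state set and the timeout value below are hypothetical placeholders, not Aurora's actual configuration; the point is only that a state outside the set never degrades into LOST, however long the task stays in it.

```python
TRANSIENT_STATES = {"ASSIGNED", "KILLING"}  # hypothetical; STARTING is deliberately absent
TIMEOUT_SECS = 300  # hypothetical common timeout


def resolve(state, secs_in_state):
    """Return LOST when a transient state has outlived the timeout."""
    if state in TRANSIENT_STATES and secs_in_state > TIMEOUT_SECS:
        return "LOST"
    return state
```

For example, `resolve("ASSIGNED", 301)` yields `"LOST"`, while `resolve("STARTING", 10000)` stays `"STARTING"`.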




[jira] [Assigned] (AURORA-1946) Make STARTING a transient state

2017-09-18 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1946:
-

Assignee: Santhosh Kumar Shanmugham

> Make STARTING a transient state
> ---
>
> Key: AURORA-1946
> URL: https://issues.apache.org/jira/browse/AURORA-1946
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a 
> task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
> scenario the {{Task}} should have been transitioned into {{LOST}} due to a 
> transient state. But {{STARTING}} is not a transient state.




[jira] [Updated] (AURORA-1912) DbSnapShot may remove enum values

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1912:
--
Fix Version/s: 0.18.0

> DbSnapShot may remove enum values
> -
>
> Key: AURORA-1912
> URL: https://issues.apache.org/jira/browse/AURORA-1912
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> The dbsnapshot restore may truncate enum tables and cause referential 
> integrity issues. From the code, it restores from the SQL dump by first 
> dropping all tables:
> {noformat}
> try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
>   LOG.info("Dropping all tables");
>   try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) {
>     drop.executeUpdate();
>   }
> {noformat}
> However a freshly started leader will have some data in there from preparing 
> the storage:
> {noformat}
>   @Override
>   @Transactional
>   protected void startUp() throws IOException {
>     Configuration configuration = sessionFactory.getConfiguration();
>     String createStatementName = "create_tables";
>     configuration.setMapUnderscoreToCamelCase(true);
>     // The ReuseExecutor will cache jdbc Statements with equivalent SQL, improving performance
>     // slightly when redundant queries are made.
>     configuration.setDefaultExecutorType(ExecutorType.REUSE);
>     addMappedStatement(
>         configuration,
>         createStatementName,
>         CharStreams.toString(new InputStreamReader(
>             DbStorage.class.getResourceAsStream("schema.sql"),
>             StandardCharsets.UTF_8)));
>     try (SqlSession session = sessionFactory.openSession()) {
>       session.update(createStatementName);
>     }
>     for (CronCollisionPolicy policy : CronCollisionPolicy.values()) {
>       enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name());
>     }
>     for (MaintenanceMode mode : MaintenanceMode.values()) {
>       enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name());
>     }
>     for (JobUpdateStatus status : JobUpdateStatus.values()) {
>       enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), status.name());
>     }
>     for (JobUpdateAction action : JobUpdateAction.values()) {
>       enumValueMapper.addEnumValue("job_instance_update_actions", action.getValue(), action.name());
>     }
>     for (ScheduleStatus status : ScheduleStatus.values()) {
>       enumValueMapper.addEnumValue("task_states", status.getValue(), status.name());
>     }
>     for (ResourceType resourceType : ResourceType.values()) {
>       enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), resourceType.name());
>     }
>     for (Mode mode : Mode.values()) {
>       enumValueMapper.addEnumValue("volume_modes", mode.getValue(), mode.name());
>     }
>     createPoolMetrics();
>   }
> {noformat}
> Consider the case where we add a new value to an existing enum. This means 
> restoring from a snapshot will not allow us to have that value in the enum 
> table. 
> To fix this we should have a migration for every enum value we add. However 
> to me it seems that the better idea would be to update the enum tables after 
> we restore from a snapshot.
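
The second option (re-populating the enum tables after the restore) can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the Aurora code: SQLite stands in for the scheduler's database, a plain DROP TABLE stands in for "DROP ALL OBJECTS", and the `task_states` contents and helper names are hypothetical.

```python
import sqlite3

# Enum values known to the current build; PARTITIONED plays the newly added value.
ENUM_VALUES = {"task_states": [(0, "INIT"), (1, "PENDING"), (2, "PARTITIONED")]}


def create_schema(conn):
    conn.execute("CREATE TABLE task_states (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")


def sync_enums(conn):
    # Idempotent: rows that already exist are left alone.
    for table, values in ENUM_VALUES.items():
        conn.executemany(f"INSERT OR IGNORE INTO {table} (id, name) VALUES (?, ?)", values)


def names(conn):
    return [row[0] for row in conn.execute("SELECT name FROM task_states ORDER BY id")]


# A snapshot taken by an older build that did not know about PARTITIONED.
old = sqlite3.connect(":memory:")
create_schema(old)
old.executemany("INSERT INTO task_states VALUES (?, ?)", [(0, "INIT"), (1, "PENDING")])
old.commit()
dump = "\n".join(old.iterdump())

# A freshly started leader prepares storage with the current enum values...
db = sqlite3.connect(":memory:")
create_schema(db)
sync_enums(db)

# ...then the restore drops everything and replays the dump.
db.execute("DROP TABLE task_states")
db.executescript(dump)
after_restore = names(db)

# Re-populating the enum tables after the restore brings the new value back.
sync_enums(db)
after_sync = names(db)
```

After the restore alone the new value is missing; running the same idempotent sync again restores it without needing a migration per enum value.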



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (AURORA-1680) Aurora 0.16.0 deprecations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1680:
--
Fix Version/s: 0.18.0

> Aurora 0.16.0 deprecations
> --
>
> Key: AURORA-1680
> URL: https://issues.apache.org/jira/browse/AURORA-1680
> Project: Aurora
>  Issue Type: Epic
>Reporter: Maxim Khutornenko
> Fix For: 0.18.0
>
>





[jira] [Updated] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1909:
--
Fix Version/s: 0.18.0

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
> Fix For: 0.18.0
>
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is a bug in how the code determines the user under which to run 
> the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the Python "os" module does not provide a 
> "getusername()" function.
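
A minimal sketch of the selection logic using a function that does exist in the standard library. Using `getpass.getuser()` is an assumption about a suitable replacement, not a statement of the committed fix, and `health_check_user` with its parameters is a hypothetical helper written for illustration.

```python
import getpass
import os

# The `os` module has no `getusername` attribute, so the quoted call fails at runtime.
has_getusername = hasattr(os, "getusername")


def health_check_user(nosetuid_health_checks, role):
    """Pick the user under which to run the health check process."""
    # getpass.getuser() consults LOGNAME, USER, LNAME and USERNAME before
    # falling back to the password database.
    return getpass.getuser() if nosetuid_health_checks else role
```

With the flag set, the current user is returned; without it, the job's role is used, as in the quoted snippet.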




[jira] [Updated] (AURORA-1860) Fix bug in scheduler driver disconnect stats

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1860:
--
Fix Version/s: 0.18.0

> Fix bug in scheduler driver disconnect stats
> 
>
> Key: AURORA-1860
> URL: https://issues.apache.org/jira/browse/AURORA-1860
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
>
> Correct the refactoring mistake introduced in 
> [https://reviews.apache.org/r/31550/] that has disabled 
> {{scheduler_framework_disconnects}} stats:
> {code:title=MesosSchedulerImpl.disconnected()}
> counters.get("scheduler_framework_disconnects").get();
> {code}
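
The snippet above only reads the counter. Presumably the intent was to increment it (in Java, AtomicLong.incrementAndGet() rather than get()). A Python analogue of the difference, with a hypothetical AtomicCounter class standing in for the Java type:

```python
import threading


class AtomicCounter:
    """Minimal thread-safe counter (hypothetical stand-in for AtomicLong)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        # Read-only: never changes the stat.
        with self._lock:
            return self._value

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


counters = {"scheduler_framework_disconnects": AtomicCounter()}


def disconnected_buggy():
    # Mirrors the snippet in the ticket: the value is read and discarded.
    counters["scheduler_framework_disconnects"].get()


def disconnected_fixed():
    # Records the disconnect.
    counters["scheduler_framework_disconnects"].increment_and_get()
```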




[jira] [Updated] (AURORA-1922) Expose stats on the number of jobs stored in MemCronJobStore

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1922:
--
Fix Version/s: 0.18.0

> Expose stats on the number of jobs stored in MemCronJobStore
> 
>
> Key: AURORA-1922
> URL: https://issues.apache.org/jira/browse/AURORA-1922
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
>Priority: Trivial
>  Labels: newbie
> Fix For: 0.18.0
>
>
> Expose stats on the size of {{jobs}} map in {{MemCronJobStore}}.




[jira] [Updated] (AURORA-1923) Aurora client should not automatically retry non-idempotent operations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1923:
--
Fix Version/s: 0.18.0

> Aurora client should not automatically retry non-idempotent operations
> --
>
> Key: AURORA-1923
> URL: https://issues.apache.org/jira/browse/AURORA-1923
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
> Fix For: 0.18.0
>
>
> Aurora client has a built-in mechanism to automatically retry thrift API 
> operations if the connection with the scheduler times out, experiences a 
> transport exception, or encounters a transient exception on the scheduler side.
> Retrying thrift calls due to scheduler connection timeout and transient 
> exceptions (see [AURORA-187]) is safe. However, as Aurora has no concept of 
> idempotency, its client can retry non-idempotent operations upon encountering 
> transport exceptions which can lead to nondeterministic situations.
> For example, if client requests go through a proxy to reach scheduler, client 
> might consider a non-idempotent request failed and automatically retry it 
> while the original request has been received and processed by the scheduler.
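
A minimal sketch of the retry policy described above, assuming each client operation can be tagged as idempotent or not. The exception classes and the call_with_retries helper are hypothetical names and are not part of the Aurora client.

```python
class TransportError(Exception):
    """Hypothetical: the connection failed and the request's fate is unknown."""


class TransientError(Exception):
    """Hypothetical: the scheduler reported that it did not process the request."""


def call_with_retries(operation, idempotent, max_attempts=3):
    """Run `operation`, retrying only when a retry cannot duplicate side effects."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # The scheduler did not act on the call, so retrying is safe.
            if attempt == max_attempts:
                raise
        except TransportError:
            # The request may already have been processed; only idempotent
            # operations can be repeated without risk.
            if not idempotent or attempt == max_attempts:
                raise
```

With this policy, a non-idempotent call that hits a transport error surfaces the failure to the user instead of being silently re-sent.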




[jira] [Updated] (AURORA-1911) HTTP Scheduler Driver does not reliably re subscribe

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1911:
--
Fix Version/s: 0.18.0

> HTTP Scheduler Driver does not reliably re subscribe
> 
>
> Key: AURORA-1911
> URL: https://issues.apache.org/jira/browse/AURORA-1911
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> I observed this issue in a large production cluster during a period of Mesos 
> Master instability:
> 1. Mesos master crashes or restarts.
> 2. {{V1Mesos}} driver detects this and reconnects.
> 3. Aurora does the {{SUBSCRIBE}} call again.
> 4. The {{SUBSCRIBE}} Call fails silently in the driver.
> 5. All future calls are silently dropped by the driver.
> 6. Aurora has no offers because it is not subscribed.
> Logs:
> {noformat}
> I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at 
> http://10.162.14.30:5050/master/api/v1/scheduler
> W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service 
> Unavailable' () for SUBSCRIBE
> 
> W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> ...
> {noformat}
> To fix this, the {{VersionedSchedulerDriver}} needs to do two things:
> 1. Block calls when unsubscribed, not just when disconnected.
> 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff.
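
The two steps above can be sketched as follows. This is an illustration under assumptions, not the VersionedSchedulerDriver code: the Driver class, the send_subscribe callback (returning True on success) and the backoff parameters are all hypothetical.

```python
import threading
import time


def backoff_delays(initial=1.0, factor=2.0, maximum=60.0):
    """Yield an unbounded sequence of exponentially growing delays, capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= factor


class Driver:
    """Hypothetical driver wrapper; `send_subscribe` returns True on success."""

    def __init__(self, send_subscribe, sleep=time.sleep):
        self._send_subscribe = send_subscribe
        self._sleep = sleep
        self._subscribed = threading.Event()

    def subscribe(self):
        # Keep retrying SUBSCRIBE until the master accepts it.
        for delay in backoff_delays():
            if self._send_subscribe():
                self._subscribed.set()
                return
            self._sleep(delay)

    def disconnected(self):
        # A lost connection also means the subscription is gone.
        self._subscribed.clear()

    def call(self, name, timeout=None):
        # Block until subscribed, not merely connected.
        if not self._subscribed.wait(timeout):
            raise TimeoutError("still unsubscribed; not sending " + name)
        return name
```

Gating calls on a "subscribed" event rather than a "connected" one means a failed SUBSCRIBE delays later calls instead of letting them be dropped silently.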




[jira] [Updated] (AURORA-1904) Support Mesos Maintenance

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1904:
--
Fix Version/s: 0.18.0

> Support Mesos Maintenance
> -
>
> Key: AURORA-1904
> URL: https://issues.apache.org/jira/browse/AURORA-1904
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
> Fix For: 0.18.0
>
>
> Support Mesos Maintenance primitives in Aurora per the design 
> [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx].




[jira] [Updated] (AURORA-1920) Add pluggable scheduling logic to Aurora

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1920:
--
Fix Version/s: 0.18.0

> Add pluggable scheduling logic to Aurora
> 
>
> Key: AURORA-1920
> URL: https://issues.apache.org/jira/browse/AURORA-1920
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.18.0
>
>
> On the mailing list recently there was a desire to have custom scheduling 
> logic (e.g. to implement scheduling spread). This ticket tracks the proposal 
> and implementation of custom scheduling. 




[jira] [Updated] (AURORA-1924) Aurora client should reconcile idempotent job creations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1924:
--
Fix Version/s: 0.18.0

> Aurora client should reconcile idempotent job creations
> ---
>
> Key: AURORA-1924
> URL: https://issues.apache.org/jira/browse/AURORA-1924
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
> Fix For: 0.18.0
>
>
> Aurora scheduler rejects a request to create a job if a job with the same key 
> already exists (see {{SchedulerThriftInterface.createJob()}}). Aurora client 
> exits with an error once it receives a response with 
> {{ResponseCode.INVALID_REQUEST}} from scheduler in this case.
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. Scheduler can detect this 
> situation, ignore it, and signal client to treat operation as successful; 
> client warns user about existing job but does not fail the operation.
> This helps Aurora client and scheduler reconcile state when creating jobs in 
> the presence of transport layer exceptions; the {{aurora job create}} 
> command can then be marked as idempotent after [AURORA-1923] is fixed.
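
A rough sketch of the proposed scheduler-side check, using an in-memory dictionary in place of real storage. The Scheduler class, the string response codes and the message texts are assumptions made for illustration.

```python
OK = "OK"
INVALID_REQUEST = "INVALID_REQUEST"


class Scheduler:
    """Hypothetical in-memory stand-in for the scheduler's job store."""

    def __init__(self):
        self._jobs = {}

    def create_job(self, key, config, instances):
        existing = self._jobs.get(key)
        if existing is None:
            self._jobs[key] = (config, instances)
            return OK, "job created"
        if existing == (config, instances):
            # Same key, same configuration, same instance count: treat the
            # request as a repeat and report success with a warning message.
            return OK, "job already exists with identical configuration"
        return INVALID_REQUEST, "job already exists with a different configuration"
```

The client would print the returned message as a warning in the second case but still exit successfully.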




[jira] [Updated] (AURORA-1914) Unable to specify multiple volumes per task.

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1914:
--
Fix Version/s: 0.18.0

> Unable to specify multiple volumes per task.
> 
>
> Key: AURORA-1914
> URL: https://issues.apache.org/jira/browse/AURORA-1914
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> There is an artificial constraint in the schema which prevents multiple 
> volumes per task. This was not caught before in testing. Removing the 
> constraint should solve the problem.




[jira] [Updated] (AURORA-1921) Create proposal for which components should be pluggable

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1921:
--
Fix Version/s: 0.18.0

> Create proposal for which components should be pluggable
> 
>
> Key: AURORA-1921
> URL: https://issues.apache.org/jira/browse/AURORA-1921
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.18.0
>
>
> Create a proposal on the dev list for how pluggable scheduling should work. 
> Is TaskAssigner the class we want to customize? Should preemption also factor 
> into the customization, since it is very closely tied to the scheduling logic? 
> We also need to confirm the plugin mechanism. Most likely we should base this 
> on the current auth support, which is pluggable by passing in a Guice module 
> via the command line. 




[jira] [Updated] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1925:
--
Fix Version/s: 0.18.0

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Minor
>  Labels: features, newbie
> Fix For: 0.18.0
>
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write"




[jira] [Updated] (AURORA-1915) Add automatic browser tab open feature for aurora update start

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1915:
--
Fix Version/s: 0.18.0

> Add automatic browser tab open feature for aurora update start
> --
>
> Key: AURORA-1915
> URL: https://issues.apache.org/jira/browse/AURORA-1915
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Usability
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
>
> Aurora client automatically opens a browser tab following {{aurora job 
> create}} and {{aurora cron schedule}} commands. Provide a similar ability for 
> the {{aurora update start}} command as well.




[jira] [Updated] (AURORA-1907) Thermos unresponsive on hosts with many active task

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1907:
--
Fix Version/s: 0.18.0

> Thermos unresponsive on hosts with many active task
> ---
>
> Key: AURORA-1907
> URL: https://issues.apache.org/jira/browse/AURORA-1907
> Project: Aurora
>  Issue Type: Story
>  Components: Observer
>Reporter: Stephan Erb
>Assignee: Stephan Erb
> Fix For: 0.18.0
>
>
> We have noticed that on hosts with lots of active tasks (~100) and many 
> terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
> CPU but does not render any HTTP requests.
> Dumping {{/threads}} indicates we might be blocked by the hundred 
> {{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
> {code}
> # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
> [TID=45241], 140682825963264)
>   File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
> self.__bootstrap_inner()
>   File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
> self.run()
>   File: 
> "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
>  line 115, in identified
> return instancemethod(self, *args, **kwargs)
>   File: 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File: "apache/thermos/monitoring/resource.py", line 204, in run
> collector.sample()
>   File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
> sample
> for child in parent.children(recursive=True)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 326, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 861, in children
> table[p.ppid()].append(p)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 545, in ppid
> return self._proc.ppid()
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 962, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1459, in ppid
> return int(self._parse_stat_file()[2])
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1001, in _parse_stat_file
> return [name] + fields_after_name
> {code}




[jira] [Updated] (AURORA-1910) framework_registered metric isn't reset when scheduler disconnects

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1910:
--
Fix Version/s: 0.18.0

> framework_registered metric isn't reset when scheduler disconnects
> --
>
> Key: AURORA-1910
> URL: https://issues.apache.org/jira/browse/AURORA-1910
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Right now the {{framework_registered}} metric transitions from 0 -> 1 when 
> the scheduler registers successfully the first time. It never transitions 
> from 1 -> 0 when it loses a connection.
> This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the 
> gauge as the scheduler loses registration and re-registers.
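
A small sketch of the intended behaviour, with a hypothetical RegistrationGauge class standing in for the AtomicBoolean-backed gauge: the flag is set on (re-)registration and cleared on disconnect, so the sampled metric moves in both directions.

```python
class RegistrationGauge:
    """Hypothetical gauge backing the framework_registered metric."""

    def __init__(self):
        self._registered = False

    def registered(self):
        # Called on first registration and on every re-registration.
        self._registered = True

    def disconnected(self):
        # Called when the scheduler loses its connection.
        self._registered = False

    def sample(self):
        # Exported value: 1 while registered, 0 otherwise.
        return 1 if self._registered else 0
```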




[jira] [Updated] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1850:
--
Fix Version/s: 0.18.0

> Raw StatusResult passed to the scheduler when tasks are healthy
> ---
>
> Key: AURORA-1850
> URL: https://issues.apache.org/jira/browse/AURORA-1850
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: 
> 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png
>
>
> As part of the recent health check changes, we now pass a message to the 
> scheduler along with the RUNNING transition when the task is healthy. 
> Unfortunately it looks like this message is a stringified `StatusResult`, 
> rather than the message from the `StatusResult` (see attached screenshot for 
> details).
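
To illustrate the difference, here is a sketch that assumes a StatusResult shaped like a named tuple with a human-readable reason field. The field names and values are assumptions; the real class may differ.

```python
from collections import namedtuple

# Hypothetical shape; the real StatusResult may carry different fields.
StatusResult = namedtuple("StatusResult", ["reason", "status"])

result = StatusResult(reason="Task is healthy.", status="TASK_RUNNING")

# Stringifying the whole object leaks the type name and every field.
raw_message = str(result)

# Passing only the message field gives the scheduler readable text.
clean_message = result.reason
```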




[jira] [Updated] (AURORA-1928) Aurora should prioritize adding instances over updating instances during an update

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1928:
--
Fix Version/s: 0.18.0

> Aurora should prioritize adding instances over updating instances during an 
> update
> --
>
> Key: AURORA-1928
> URL: https://issues.apache.org/jira/browse/AURORA-1928
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Minor
> Fix For: 0.18.0
>
>
> Often times, when adding capacity to a service, I increase the number of 
> instances and the batch_size that the updates roll out at.
> For the first deploy, this updates the old instances with the more aggressive 
> batch size before the new instances are started.
> It might be better for Aurora to prioritize the "adding instances" step of 
> upgrades.
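
One way to picture the proposed ordering, as a sketch only: instance IDs that do not exist yet are placed ahead of existing ones before the list is cut into batches. The function names are hypothetical and this is not the scheduler's update logic.

```python
def order_update(desired_instance_ids, existing_instance_ids):
    """Return instance IDs with newly added instances first, then updated ones."""
    existing = set(existing_instance_ids)
    added = sorted(i for i in desired_instance_ids if i not in existing)
    updated = sorted(i for i in desired_instance_ids if i in existing)
    return added + updated


def batches(instance_ids, batch_size):
    """Split the ordered IDs into consecutive batches of at most `batch_size`."""
    return [instance_ids[i:i + batch_size]
            for i in range(0, len(instance_ids), batch_size)]
```

Growing a job from 4 to 6 instances with a batch size of 2 would then start instances 4 and 5 before touching instances 0 through 3.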




[jira] [Updated] (AURORA-1933) Scheduler can process rescind before offer

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1933:
--
Fix Version/s: 0.18.0

> Scheduler can process rescind before offer
> --
>
> Key: AURORA-1933
> URL: https://issues.apache.org/jira/browse/AURORA-1933
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> I observed the following in production:
> {noformat}
> Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 
> [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer 
> rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 
> [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received 
> offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> Jun  6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 
> [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 
> 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> {noformat}
> Notice the rescind was processed before the offer was given. This means the 
> offer is in the offer storage, but using it is invalid. It will cause 
> whatever task launched with it to fail with {{Task launched with invalid 
> offers: Offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer 
> valid}}
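
A possible mitigation, sketched under assumptions: remember rescinds for offers that have not been seen yet, and drop the offer when it finally arrives. The OfferStore class and its method names are hypothetical and do not reflect the scheduler's actual offer storage.

```python
class OfferStore:
    """Hypothetical offer store that tolerates a rescind arriving before its offer."""

    def __init__(self):
        self._offers = {}
        self._rescinded_early = set()

    def rescind(self, offer_id):
        if offer_id in self._offers:
            del self._offers[offer_id]
        else:
            # The offer has not been seen yet; remember the rescind so the
            # late-arriving offer can be discarded.
            self._rescinded_early.add(offer_id)

    def add(self, offer_id, offer):
        """Store the offer unless it was already rescinded; return True if stored."""
        if offer_id in self._rescinded_early:
            self._rescinded_early.discard(offer_id)
            return False
        self._offers[offer_id] = offer
        return True

    def available(self):
        return sorted(self._offers)
```

In this sketch the set of early rescinds is never expired, so a real implementation would need to bound it, for example with a time limit.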




[jira] [Updated] (AURORA-1916) Incompatibility with mesos 1.2

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1916:
--
Fix Version/s: 0.18.0

> Incompatibility with mesos 1.2
> --
>
> Key: AURORA-1916
> URL: https://issues.apache.org/jira/browse/AURORA-1916
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.17.0
> Environment: Ubuntu 16.04, Mesos 1.2
>Reporter: Kostiantyn Bokhan
> Fix For: 0.18.0
>
>
> The list of mesos-containerizer arguments changed in Mesos 1.2:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                   Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE  The target 'pid' of the process whose mount
>                                 namespace we'd like to enter before
>                                 executing the command.
>   --pipe_read=VALUE             The read end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --runtime_directory=VALUE     The runtime directory for the container
>                                 (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount
>                                 namespace. (default: false)
> {code}
> Mesos 1.1.0:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --capabilities=VALUE          Capabilities the command can use.
>   --command=VALUE               The command to execute.
>   --environment=VALUE           The environment variables for the command.
>   --[no-]help                   Prints this help message (default: false)
>   --pipe_read=VALUE             The read end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --pre_exec_commands=VALUE     The additional preparation commands to
>                                 execute before executing the command.
>   --rootfs=VALUE                Absolute path to the container root
>                                 filesystem. The command will be interpreted
>                                 relative to this path
>   --runtime_directory=VALUE     The runtime directory for the container
>                                 (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount
>                                 namespace. (default: false)
>   --user=VALUE                  The user to change to.
>   --working_directory=VALUE     The working directory for the command. It
>                                 has to be an absolute path w.r.t. the root
>                                 filesystem used for the command.
> {code}
> This causes the following error:
> {code}
> Failed to parse the flags: Failed to load unknown flag 'command'
> {code}




[jira] [Updated] (AURORA-1887) Create Driver implementation around V0Mesos.

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1887:
--
Fix Version/s: 0.18.0

> Create Driver implementation around V0Mesos.
> 
>
> Key: AURORA-1887
> URL: https://issues.apache.org/jira/browse/AURORA-1887
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Create an implementation of the {{org.apache.aurora.scheduler.mesos.Driver}} 
> interface which uses the {{V0Mesos}} shim under the hood. Provide a flag to 
> switch between the two to show there is no regression.




[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1929:
--
Fix Version/s: 0.18.0

> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
> Fix For: 0.18.0
>
>
> There are currently two types of task history pruning run by Aurora:
> # The implicit task history pruning run by TaskHistoryPruner in the 
> background, which registers all inactive tasks for pruning upon terminal 
> state change.
> # The explicit task history pruning initiated by the `aurora_admin 
> prune_tasks` command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops can be added to the 
> transaction and then later committed as an atomic unit. However, the 
> StateManager removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and each RemoveTasks operation is coalesced with its previous operation, 
> which seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning since it has a similar underlying implementation.
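
The batching idea can be pictured with a toy model. LogTransaction and the two helper functions below are hypothetical and only count operations; they do not model the real StreamManager or its coalescing cost.

```python
class LogTransaction:
    """Hypothetical stand-in for a log stream transaction."""

    def __init__(self):
        self.ops = []

    def add(self, op):
        self.ops.append(op)


def remove_one_by_one(txn, task_ids):
    # One RemoveTasks op per task, as in the current for-loop.
    for task_id in task_ids:
        txn.add(("RemoveTasks", frozenset([task_id])))


def remove_batched(txn, task_ids):
    # A single RemoveTasks op that carries every task ID.
    txn.add(("RemoveTasks", frozenset(task_ids)))
```

With 135k tasks the first approach adds 135k operations to the transaction, each a candidate for coalescing with its predecessor, while the second adds one.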




[jira] [Updated] (AURORA-1888) Create Driver implementation around V1Mesos

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1888:
--
Fix Version/s: 0.18.0

> Create Driver implementation around V1Mesos
> ---
>
> Key: AURORA-1888
> URL: https://issues.apache.org/jira/browse/AURORA-1888
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Create an implementation of {{org.apache.aurora.scheduler.mesos.Driver}} that 
> uses {{V1Mesos}} under the hood.




[jira] [Updated] (AURORA-1905) Set "webui_url" field of FrameworkInfo

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1905:
--
Fix Version/s: 0.18.0

> Set "webui_url" field of FrameworkInfo
> --
>
> Key: AURORA-1905
> URL: https://issues.apache.org/jira/browse/AURORA-1905
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI 
> can link to the current leader.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-05-10 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1925:
-

Assignee: (was: Santhosh Kumar Shanmugham)

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Priority: Minor
>  Labels: features, newbie
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write
> For example, I can use this to use agents or tools not present in the default 
> image, or are something I don't want to clutter packer with.
> For example, I tried to use byteman to narrow down what's spawning threads.
> Just getting the zip to aurora is more difficult and mysterious than writing 
> a script to trace thread lifecycle. This should be easy... agreed?"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-05-10 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1925:
-

Assignee: Santhosh Kumar Shanmugham

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
>  Labels: features, newbie
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write
> For example, I can use this to use agents or tools not present in the default 
> image, or are something I don't want to clutter packer with.
> For example, I tried to use byteman to narrow down what's spawning threads.
> Just getting the zip to aurora is more difficult and mysterious than writing 
> a script to trace thread lifecycle. This should be easy... agreed?"




[jira] [Resolved] (AURORA-1913) Aurora Client Job Update Diff output is erroneous

2017-03-29 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1913.
---
   Resolution: Fixed
Fix Version/s: 0.18.0

> Aurora Client Job Update Diff output is erroneous
> -
>
> Key: AURORA-1913
> URL: https://issues.apache.org/jira/browse/AURORA-1913
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
> Fix For: 0.18.0
>
>
> The {{TaskConfig}} diff shown by the Aurora Client does not sort the individual 
> entities (which are sets), which results in confusing output. We need to 
> sort the individual fields inside the config to return more meaningful 
> output.
> Although the {{host}} and {{rack}} limit constraints have not changed, the 
> diff still outputs the following:
> {code}
> <   'constraints': set([ Constraint(name='rack', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> ---
> >   'constraints': set([ Constraint(name='host', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> 4,5c4,5
>  constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None))]),
> ---
> >Constraint(name='rack', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> {code}




[jira] [Assigned] (AURORA-1913) Aurora Client Job Update Diff output is erroneous

2017-03-29 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1913:
-

Assignee: Santhosh Kumar Shanmugham

> Aurora Client Job Update Diff output is erroneous
> -
>
> Key: AURORA-1913
> URL: https://issues.apache.org/jira/browse/AURORA-1913
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
>
> The {{TaskConfig}} diff shown by the Aurora Client does not sort the individual 
> entities (which are sets), which results in confusing output. We need to 
> sort the individual fields inside the config to return more meaningful 
> output.
> Although the {{host}} and {{rack}} limit constraints have not changed, the 
> diff still outputs the following:
> {code}
> <   'constraints': set([ Constraint(name='rack', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> ---
> >   'constraints': set([ Constraint(name='host', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> 4,5c4,5
>  constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None))]),
> ---
> >Constraint(name='rack', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> {code}




[jira] [Created] (AURORA-1913) Aurora Client Job Update Diff output is erroneous

2017-03-29 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1913:
-

 Summary: Aurora Client Job Update Diff output is erroneous
 Key: AURORA-1913
 URL: https://issues.apache.org/jira/browse/AURORA-1913
 Project: Aurora
  Issue Type: Bug
  Components: Client
Reporter: Santhosh Kumar Shanmugham
Priority: Minor


The {{TaskConfig}} diff shown by the Aurora Client does not sort the individual 
entities (which are sets), which results in confusing output. We need to sort 
the individual fields inside the config to return more meaningful output.

Although the {{host}} and {{rack}} limit constraints have not changed, the diff 
still outputs the following:
{code}
<   'constraints': set([ Constraint(name='rack', 
constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
---
>   'constraints': set([ Constraint(name='host', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
4,5c4,5
Constraint(name='rack', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
{code}
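
The sorting idea described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the actual client code: normalize, config_diff and the Constraint namedtuple are hypothetical names, and Constraint is a simplified stand-in for the real Thrift struct.

```python
import difflib
import pprint
from collections import namedtuple

# Hypothetical, simplified stand-in for the Thrift-generated Constraint struct.
Constraint = namedtuple("Constraint", ["name", "limit"])


def normalize(value):
    """Recursively replace sets with sorted lists so the printed order is stable."""
    if isinstance(value, (set, frozenset)):
        return sorted((normalize(v) for v in value), key=repr)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    # Plain lists/tuples are walked; namedtuples (which have _fields) are kept as-is.
    if isinstance(value, (list, tuple)) and not hasattr(value, "_fields"):
        return [normalize(v) for v in value]
    return value


def config_diff(local, remote):
    """Return unified-diff lines between two normalized config dicts."""
    left = pprint.pformat(normalize(local)).splitlines()
    right = pprint.pformat(normalize(remote)).splitlines()
    return list(difflib.unified_diff(left, right, lineterm=""))
```

With both sides normalized before printing, two configs whose constraint sets contain the same elements produce an empty diff regardless of set iteration order, and only real changes show up.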




[jira] [Updated] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1908:
--
Description: When matching a {{ResourceRequest}} against a 
{{UnusedResource}} in {{PreemptionVictimFilter.filterPreemptionVictims}} there 
are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply 
to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
{{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
short-circuit and return early and move on to the next host to consider.  (was: 
When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
{{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
{{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or 
{{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early 
and move on to the next host to consider.)

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of 
> {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the 
> entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
> {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
> short-circuit and return early and move on to the next host to consider.




[jira] [Updated] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1908:
--
Description: When matching a {{ResourceRequest}} against a 
{{UnusedResource}} in {{PreemptionVictimFilter.filterPreemptionVictims}} there 
are 4 kinds of {{Veto}} es that can be returned. 3 out of the 4 {{Veto}} es 
apply to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, 
{{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this 
case we can short-circuit and return early and move on to the next host to 
consider.  (was: When matching a {{ResourceRequest}} against a 
{{UnusedResource}} in {{PreemptionVictimFilter.filterPreemptionVictims}} there 
are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply 
to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
{{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
short-circuit and return early and move on to the next host to consider.)

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of 
> {{Veto}} es that can be returned. 3 out of the 4 {{Veto}} es apply to the 
> entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
> {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
> short-circuit and return early and move on to the next host to consider.




[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935435#comment-15935435
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1908:
---

In addition, we can improve the way the resource increment currently happens so 
that it explicitly checks that either there was no Veto or the Veto was due to 
insufficient resources.

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
> returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
> {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} 
> or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return 
> early and move on to the next host to consider.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935431#comment-15935431
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1908:
---

This code here - 
https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java#L214-L227

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
> returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
> {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} 
> or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return 
> early and move on to the next host to consider.




[jira] [Created] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1908:
-

 Summary: Short-circuit preemption filtering when a Veto applies to 
entire host
 Key: AURORA-1908
 URL: https://issues.apache.org/jira/browse/AURORA-1908
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham
Priority: Minor


When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
{{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
{{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or 
{{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early 
and move on to the next host to consider.
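
A rough Python sketch of the short-circuit (the scheduler itself is Java, so this only illustrates the control flow). The host-wide veto names are taken from the description above; INSUFFICIENT_RESOURCES, pick_victims and the check callback are assumptions made for the example.

```python
from enum import Enum


class VetoType(Enum):
    INSUFFICIENT_RESOURCES = "insufficient_resources"  # assumed name
    DEDICATED_CONSTRAINT_MISMATCH = "dedicated_constraint_mismatch"
    MAINTENANCE = "maintenance"
    LIMIT_NOT_SATISFIED = "limit_not_satisfied"
    CONSTRAINT_MISMATCH = "constraint_mismatch"


# Vetoes that no amount of freed resources on this host can clear.
HOST_WIDE = frozenset({
    VetoType.DEDICATED_CONSTRAINT_MISMATCH,
    VetoType.MAINTENANCE,
    VetoType.LIMIT_NOT_SATISFIED,
    VetoType.CONSTRAINT_MISMATCH,
})


def pick_victims(candidates, check):
    """Choose victims on one host, or return None if the host cannot work.

    candidates: list of (task_id, cpus_freed) pairs, in preemption order.
    check: hypothetical callback mapping total freed cpus to a set of VetoType.
    """
    freed = 0.0
    victims = []
    for task_id, cpus in candidates:
        freed += cpus
        victims.append(task_id)
        vetoes = check(freed)
        if not vetoes:
            return victims  # the pending task now fits
        if vetoes & HOST_WIDE:
            return None  # short-circuit: move on to the next host
        # Only resource shortfalls remain, so keep accumulating victims.
    return None
```

The early return means a host in maintenance is rejected after a single filter call instead of one call per candidate victim.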




[jira] [Resolved] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-03-13 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1882.
---
   Resolution: Fixed
 Assignee: Santhosh Kumar Shanmugham
Fix Version/s: 0.18.0

> Add support for Mesos ContainerLaunchInfo
> -
>
> Key: AURORA-1882
> URL: https://issues.apache.org/jira/browse/AURORA-1882
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Critical
> Fix For: 0.18.0
>
>
> Mesos-1.2.0 changes the interface for the MesosContainerizer binary and drops 
> support for multiple switches. Without support for this change, the Thermos 
> Executor will not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                    Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE   The target 'pid' of the process whose mount namespace we'd like
>                                  to enter before executing the command.
>   --pipe_read=VALUE              The read end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE             The write end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE      The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt   Whether to launch the command in a new mount namespace. (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648




[jira] [Reopened] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reopened AURORA-1824:
---

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper for DockerContainerizer 
> (introduced in https://reviews.apache.org/r/52479/), we need another 
> binding-helper that will resolve the image tag in the MesosContainerizer's 
> docker config to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}
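
The translation shown above could be sketched in Python along these lines. This is an illustration only and not the actual binding-helper: bind_image and lookup_digest are hypothetical names, DockerImage is a namedtuple stand-in for the real config struct, and the registry lookup is left to the caller.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the DockerImage struct used in the job config.
DockerImage = namedtuple("DockerImage", ["name", "digest"])

# Matches a whole value of the form {{docker.resolve["<image>"]["<tag>"]}}.
_RESOLVE = re.compile(r'^\{\{docker\.resolve\["([^"]+)"\]\["([^"]+)"\]\}\}$')


def bind_image(image, lookup_digest):
    """Replace a docker.resolve reference with a DockerImage pinned to a digest.

    lookup_digest(name, tag) is supplied by the caller (for example, a registry
    query); values that are not docker.resolve references pass through unchanged.
    """
    match = _RESOLVE.match(image)
    if match is None:
        return image
    name, tag = match.groups()
    return DockerImage(name=name, digest=lookup_digest(name, tag))
```

For example, bind_image('{{docker.resolve["my-image"]["my-tag"]}}', lookup) returns a DockerImage named "my-image" whose digest is whatever lookup("my-image", "my-tag") returns.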




[jira] [Resolved] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1824.
---
Resolution: Fixed

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper for DockerContainerizer 
> (introduced in https://reviews.apache.org/r/52479/), we need another 
> binding-helper that will resolve the image tag in the MesosContainerizer's 
> docker config to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}




[jira] [Assigned] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1824:
-

Assignee: Santhosh Kumar Shanmugham

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper for DockerContainerizer 
> (introduced in https://reviews.apache.org/r/52479/), we need another 
> binding-helper that will resolve the image tag in the MesosContainerizer's 
> docker config to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}




[jira] [Created] (AURORA-1892) TaskQuery `limit` and `offset` must be applied at TaskStore

2017-02-13 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1892:
-

 Summary: TaskQuery `limit` and `offset` must be applied at 
TaskStore
 Key: AURORA-1892
 URL: https://issues.apache.org/jira/browse/AURORA-1892
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham


{{TaskQuery}}'s {{limit}} and {{offset}} are currently applied after the 
results have been fetched from the {{TaskStore}}, which is inefficient. Apply 
the {{limit}} and {{offset}} conditions at the {{TaskStore}} level in both 
{{MemTaskStore}} and {{DBTaskStore}}.
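
A small Python sketch of what applying the window inside the store could look like for an in-memory store. ToyTaskStore and its fetch signature are hypothetical and only illustrate the idea; a DB-backed store would presumably push the same window into its query with LIMIT and OFFSET clauses instead.

```python
from itertools import islice


class ToyTaskStore:
    """Hypothetical in-memory store that applies offset/limit while scanning."""

    def __init__(self, tasks):
        self._tasks = list(tasks)
        self.examined = 0  # how many tasks the last scans actually touched

    def fetch(self, predicate, offset=0, limit=None):
        def matches():
            for task in self._tasks:
                self.examined += 1
                if predicate(task):
                    yield task

        # islice stops pulling from the generator once the window is filled,
        # so tasks past the requested page are never examined.
        stop = None if limit is None else offset + limit
        return list(islice(matches(), offset, stop))
```

Compared with fetching every match and slicing afterwards, the scan ends as soon as offset + limit matches have been seen.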




[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:01 PM:
-

It looks like {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; the same cannot be said for 
housekeeping work.

Since history pruning ({{TaskHistoryPruner}}), job update pruning 
({{JobUpdateHistoryPruner}}) and database garbage collection 
({{RowGarbageCollector}}) can be characterized as housekeeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities so that the scheduler is protected from bursts of 
non-critical work (for example, job updates with a large number of instances, 
network partitions, or cleanup after a scale test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into 
the new {{RateLimitedBatchWorker}}, which in turn will release the work into the 
underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]
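
To make the proposed design concrete, here is a minimal Python sketch (the real implementation would be Java, using PriorityQueue and Guava's RateLimiter). Every class here is a hypothetical stand-in: TokenBucket approximates a rate limiter, PriorityWorkQueue approximates the BatchWorker queue, and RateLimitedFeeder plays the role of the proposed RateLimitedBatchWorker. The priority values are assumptions.

```python
import heapq
import itertools


class TokenBucket:
    """Minimal rate limiter: permits accrue at `rate` per second, up to `burst`."""

    def __init__(self, rate, burst, clock):
        self._rate = rate
        self._burst = burst
        self._clock = clock  # injected so the example is deterministic
        self._tokens = float(burst)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class PriorityWorkQueue:
    """Stand-in for the BatchWorker queue: a lower number means higher priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def put(self, priority, work):
        heapq.heappush(self._heap, (priority, next(self._seq), work))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


class RateLimitedFeeder:
    """Buffers housekeeping work and releases it downstream at a bounded rate."""

    HOUSEKEEPING = 10  # assumed priority, below scheduling work at priority 0

    def __init__(self, downstream, limiter):
        self._downstream = downstream
        self._limiter = limiter
        self._pending = []

    def submit(self, work):
        self._pending.append(work)

    def drain(self):
        """Release as many pending items as the limiter currently allows."""
        released = 0
        while self._pending and self._limiter.try_acquire():
            self._downstream.put(self.HOUSEKEEPING, self._pending.pop(0))
            released += 1
        return released
```

In this sketch a burst of pruning requests stays in the feeder's buffer, and scheduling work placed directly on the priority queue is still served ahead of any housekeeping items already released.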



was (Author: santhk):
Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 {{TaskStateChange}} event for every known task on startup. Note: how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become ZERO. Due to the inefficiency in the 
implemenation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock-contention it does not provide any 
rate-limiting and suffers from bursty workloads. Responsiveness to bursty 
workload  makes sense for scheduling work, however the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (like - job updates with large number of instances, 
network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which in-turn will be release the work into the 
underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPrunner}} registers all inactive tasks 
> upon terminal _

[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:00 PM:
-

It looks like {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; the same cannot be said for 
housekeeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (like - job updates with large number of instances, 
network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which in-turn will be release the work into the 
underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]



was (Author: santhk):
Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 {{TaskStateChange}} event for every known task on startup. Note: how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become ZERO. Due to the inefficiency in the 
implemenation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock-contention it does not provide any 
rate-limiting and suffers from bursty workloads. Responsiveness to bursty 
workload  makes sense for scheduling work, however the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (like - job updates with large number of instances, 
network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which will be released into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPrunner}} registers all inactive tasks 
> upon terminal _state_ chang

[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 10:58 PM:
-

It looks like {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; the same cannot be said for 
housekeeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (like - job updates with large of instances, 
network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which will be released into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]



was (Author: santhk):
Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 {{TaskStateChange}} event for every known task on startup. Note: how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become ZERO. Due to the inefficiency in the 
implemenation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock-contention it does not provide any 
rate-limiting and suffers from bursty workloads. Responsiveness to bursty 
workload  makes sense for scheduling work, however the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can be 
characterized as house-keeping work that is not in the critical scheduling 
path, it would make sense to rate-limit these ambient activities, so that the 
scheduler is protected from bursts of non-critical work (like - job updates 
with large of instances, network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which will be released into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPruner}} registers all inactive tasks 
> upon terminal _state_ change for pruning. 
> {{TaskHistoryPruner::registerInact

[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 10:59 PM:
-

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate-limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; however, the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and database garbage collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (such as job updates with a large number of instances, 
network partitions, or cleanup after a scale test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into 
the new {{RateLimitedBatchWorker}}, which will release it into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]
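
A rough sketch of the design described above, written in Python purely for illustration (the scheduler itself is Java, and this is not Aurora code). The names {{TokenBucket}}, {{PriorityWorkQueue}} and {{RateLimitedFeeder}} and the two priority levels are assumptions made up for this example; the token bucket only stands in for a rate limiter such as Guava's, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
import heapq
import itertools
from collections import deque

# Hypothetical priority levels: the lower value is served first.
CRITICAL, HOUSEKEEPING = 0, 1


class TokenBucket:
    """Minimal token bucket; a stand-in for a real rate limiter."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def try_acquire(self, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PriorityWorkQueue:
    """Priority queue that preserves FIFO order within one priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal priorities

    def put(self, priority, work):
        heapq.heappush(self._heap, (priority, next(self._seq), work))

    def pop(self):
        return heapq.heappop(self._heap)[2]


class RateLimitedFeeder:
    """Buffers house-keeping work and releases it at a bounded rate."""

    def __init__(self, main_queue, bucket):
        self.main_queue = main_queue
        self.bucket = bucket
        self.pending = deque()  # work that has not been released yet

    def submit(self, work):
        self.pending.append(work)

    def release(self, now):
        # Move as many items as the bucket currently allows into the main queue.
        released = 0
        while self.pending and self.bucket.try_acquire(now):
            self.main_queue.put(HOUSEKEEPING, self.pending.popleft())
            released += 1
        return released
```

With this shape, critical work enqueued directly with {{CRITICAL}} is popped before any house-keeping item already released, and a burst of pruning requests only reaches the main queue as fast as the bucket refills.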



was (Author: santhk):
Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 {{TaskStateChange}} event for every known task on startup. Note: how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become ZERO. Due to the inefficiency in the 
implemenation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock-contention it does not provide any 
rate-limiting and suffers from bursty workloads. Responsiveness to bursty 
workload  makes sense for scheduling work, however the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection 
({{RowGarbageCollector}}) can be characterized as house-keeping work that is 
not in the critical scheduling path, it would make sense to rate-limit these 
ambient activities, so that the scheduler is protected from bursts of 
non-critical work (like - job updates with large of instances, 
network-partition, cleaning up after scale-test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the 
new {{RateLimitedBatchWorker}} which will be released into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPruner}} registers all inactive tasks 
> upon terminal _state_ change for pruning. 
> {{Tas

[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1837:
---

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate-limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; however, the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can be 
characterized as house-keeping work that is not in the critical scheduling 
path, it would make sense to rate-limit these ambient activities, so that the 
scheduler is protected from bursts of non-critical work (such as job updates 
with a large number of instances, network partitions, or cleanup after a scale 
test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into 
the new {{RateLimitedBatchWorker}}, which will release it into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPruner}} registers all inactive tasks 
> upon terminal _state_ change for pruning. 
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to 
> schedule the process of pruning _tasks_. However, we have noticed that most 
> pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to 
> {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state 
> transitions, have it wake up at preconfigured intervals, find all terminal 
> state tasks that meet pruning criteria and delete them.
> # Make the initial task history pruning delay configurable so that it does 
> not hamper the scheduler upon start.
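
The two steps proposed in the description could look roughly like the following. This is a sketch in Python for illustration only, not Aurora code: the task dictionaries, the field names ({{terminal}}, {{finished_at_secs}}) and the single age-based pruning criterion are all assumptions for the example, and the real pruning criteria may well differ.

```python
def select_prunable(tasks, now_secs, retention_secs):
    """Return ids of terminal tasks that finished at least retention_secs ago.

    Each task is a dict with hypothetical keys: id, terminal, finished_at_secs.
    """
    return [
        task["id"]
        for task in tasks
        if task["terminal"] and now_secs - task["finished_at_secs"] >= retention_secs
    ]


def pruning_times(startup_secs, initial_delay_secs, interval_secs, count):
    """Wake-up times for a periodic pruner with a configurable initial delay."""
    return [startup_secs + initial_delay_secs + i * interval_secs for i in range(count)]
```

The point of the initial delay is that no pruning pass runs while the scheduler is still starting up; after that, each pass does one scan instead of one delayed callback per task.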



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-01-25 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837484#comment-15837484
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1882:
---

https://reviews.apache.org/r/55916/

> Add support for Mesos ContainerLaunchInfo
> -
>
> Key: AURORA-1882
> URL: https://issues.apache.org/jira/browse/AURORA-1882
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Thermos
>Reporter: Santhosh Kumar Shanmugham
>Priority: Critical
>
> Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops 
> support for multiple switches. Without support for this Thermos Executor will 
> not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                    Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE   The target 'pid' of the process whose mount namespace we'd like
>                                  to enter before executing the command.
>   --pipe_read=VALUE              The read end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE             The write end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE      The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt   Whether to launch the command in a new mount namespace. (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-01-24 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1882:
--
Description: 
Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops 
support for multiple switches. Without support for this Thermos Executor will 
not be able to launch tasks successfully.

{noformat}
/usr/local/libexec/mesos/mesos-containerizer launch --help
Usage: launch [options]

  --[no-]help                    Prints this help message (default: false)
  --launch_info=VALUE
  --namespace_mnt_target=VALUE   The target 'pid' of the process whose mount namespace we'd like
                                 to enter before executing the command.
  --pipe_read=VALUE              The read end of the control pipe. This is a file descriptor
                                 on Posix, or a handle on Windows. It's caller's responsibility
                                 to make sure the file descriptor or the handle is inherited
                                 properly in the subprocess. It's used to synchronize with the
                                 parent process. If not specified, no synchronization will happen.
  --pipe_write=VALUE             The write end of the control pipe. This is a file descriptor
                                 on Posix, or a handle on Windows. It's caller's responsibility
                                 to make sure the file descriptor or the handle is inherited
                                 properly in the subprocess. It's used to synchronize with the
                                 parent process. If not specified, no synchronization will happen.
  --runtime_directory=VALUE      The runtime directory for the container (used for checkpointing)
  --[no-]unshare_namespace_mnt   Whether to launch the command in a new mount namespace. (default: false)
{noformat}

https://issues.apache.org/jira/browse/MESOS-6648

  was:
Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops 
support for multiple switches. Without support for this Thermos Executor will 
not be able to launch tasks successfully.

https://issues.apache.org/jira/browse/MESOS-6648


> Add support for Mesos ContainerLaunchInfo
> -
>
> Key: AURORA-1882
> URL: https://issues.apache.org/jira/browse/AURORA-1882
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Thermos
>Reporter: Santhosh Kumar Shanmugham
>Priority: Critical
>
> Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops 
> support for multiple switches. Without support for this Thermos Executor will 
> not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                    Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE   The target 'pid' of the process whose mount namespace we'd like
>                                  to enter before executing the command.
>   --pipe_read=VALUE              The read end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE             The write end of the control pipe. This is a file descriptor
>                                  on Posix, or a handle on Windows. It's caller's responsibility
>                                  to make sure the file descriptor or the handle is inherited
>                                  properly in the subprocess. It's used to synchronize with the
>                                  parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE      The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt   Whether to launch the command in a new mount namespace. (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648




[jira] [Created] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-01-24 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1882:
-

 Summary: Add support for Mesos ContainerLaunchInfo
 Key: AURORA-1882
 URL: https://issues.apache.org/jira/browse/AURORA-1882
 Project: Aurora
  Issue Type: Task
  Components: Executor, Thermos
Reporter: Santhosh Kumar Shanmugham
Priority: Critical


Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops 
support for multiple switches. Without support for this Thermos Executor will 
not be able to launch tasks successfully.

https://issues.apache.org/jira/browse/MESOS-6648




[jira] [Commented] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-18 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829186#comment-15829186
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1879:
---

I think the issue arises when we have multiple TaskGroups that have the same 
key (3-tuple) but different TaskConfigs, which can happen during JobUpdates.

https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/metadata/NearestFit.java#L133-L138

> /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending 
> tasks with the same key
> ---
>
> Key: AURORA-1879
> URL: https://issues.apache.org/jira/browse/AURORA-1879
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kai Huang
> Attachments: pending_tasks.png
>
>
> When we have multiple TaskGroups that have the same key but different 
> TaskConfigs, the /pendingTasks endpoint gives a 500 HTTP Error.
> This bug seems to be related to a recent commit (Added the 'reason' to the 
> /pendingTasks 
> endpoint)(https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).
>  
> Attached is a screenshot of the /pendingTasks endpoint.




[jira] [Commented] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy

2016-12-09 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734721#comment-15734721
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1850:
---

https://reviews.apache.org/r/54299/

> Raw StatusResult passed to the scheduler when tasks are healthy
> ---
>
> Key: AURORA-1850
> URL: https://issues.apache.org/jira/browse/AURORA-1850
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
> Attachments: 
> 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png
>
>
> As part of the recent health check changes, we now pass a message to the 
> scheduler along with the RUNNING transition when the task is healthy. 
> Unfortunately it looks like this message is a stringified `StatusResult`, 
> rather than the message from the `StatusResult` (see attached screenshot for 
> details).




[jira] [Assigned] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy

2016-12-07 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1850:
-

Assignee: Santhosh Kumar Shanmugham

> Raw StatusResult passed to the scheduler when tasks are healthy
> ---
>
> Key: AURORA-1850
> URL: https://issues.apache.org/jira/browse/AURORA-1850
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
> Attachments: 
> 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png
>
>
> As part of the recent health check changes, we now pass a message to the 
> scheduler along with the RUNNING transition when the task is healthy. 
> Unfortunately it looks like this message is a stringified `StatusResult`, 
> rather than the message from the `StatusResult` (see attached screenshot for 
> details).




[jira] [Commented] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-06 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15727190#comment-15727190
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1841:
---

https://reviews.apache.org/r/54299/

> Update HealthChecks has a backward incompatibility
> --
>
> Key: AURORA-1841
> URL: https://issues.apache.org/jira/browse/AURORA-1841
> Project: Aurora
>  Issue Type: Bug
>  Components: Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> The current implementation of the HealthCheck based updates has a backward 
> incompatibility issue when the HealthCheckConfig is used in such a way that 
> the initial grace period is extended beyond the `initial_interval_secs`.
> This bug prematurely fails updates by using the new fail-fast feature when 
> the `initial_interval_secs` does not account for the entire task warmup time. 
> In the older version the actual grace period for an update to succeed was 
> `initial_interval_secs + interval_secs * max_consecutive_failures`.




[jira] [Updated] (AURORA-1844) Force a snapshot at the end of Scheduler startup.

2016-12-02 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1844:
--
Summary: Force a snapshot at the end of Scheduler startup.  (was: Force a 
snapshot at the end of startup.)

> Force a snapshot at the end of Scheduler startup.
> -
>
> Key: AURORA-1844
> URL: https://issues.apache.org/jira/browse/AURORA-1844
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When the scheduler starts up, it replays the entries from the replicated log 
> to catch up with the current state, before announcing itself as the leader to 
> the outside world. If, for any reason, the scheduler dies after this replay 
> and after adding more log entries, the next startup will have to redo that 
> work. This becomes a problem when the amount of additional work is not 
> trivial, and it can send the scheduler into a death spiral. One example of 
> this is when the TaskHistoryPruner cleans up the DB but adds to the log 
> entries. In order to avoid the repeated work, the scheduler should force a 
> snapshot after the initial replay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (AURORA-1844) Force a snapshot at the end of startup.

2016-12-02 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1844:
-

 Summary: Force a snapshot at the end of startup.
 Key: AURORA-1844
 URL: https://issues.apache.org/jira/browse/AURORA-1844
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham
Priority: Minor


When the scheduler starts up, it replays the entries from the replicated log to 
catch up with the current state, before announcing itself as the leader to the 
outside world. If, for any reason, the scheduler dies after this replay and 
after adding more log entries, the next startup will have to redo that work. 
This becomes a problem when the amount of additional work is not trivial, and 
it can send the scheduler into a death spiral. One example of this is when the 
TaskHistoryPruner cleans up the DB but adds to the log entries. In order to 
avoid the repeated work, the scheduler should force a snapshot after the 
initial replay.




[jira] [Updated] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-01 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1841:
--
Component/s: Thermos

> Update HealthChecks has a backward incompatibility
> --
>
> Key: AURORA-1841
> URL: https://issues.apache.org/jira/browse/AURORA-1841
> Project: Aurora
>  Issue Type: Bug
>  Components: Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> The current implementation of the HealthCheck based updates has a backward 
> incompatibility issue when the HealthCheckConfig is used in such a way that 
> the initial grace period is extended beyond the `initial_interval_secs`.
> This bug prematurely fails updates by using the new fail-fast feature when 
> the `initial_interval_secs` does not account for the entire task warmup time. 
> In the older version the actual grace period for an update to succeed was 
> `initial_interval_secs + interval_secs * max_consecutive_failures`.




[jira] [Assigned] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-01 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1841:
-

Assignee: Santhosh Kumar Shanmugham

> Update HealthChecks has a backward incompatibility
> --
>
> Key: AURORA-1841
> URL: https://issues.apache.org/jira/browse/AURORA-1841
> Project: Aurora
>  Issue Type: Bug
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> The current implementation of the HealthCheck based updates has a backward 
> incompatibility issue when the HealthCheckConfig is used in such a way that 
> the initial grace period is extended beyond the `initial_interval_secs`.
> This bug prematurely fails updates by using the new fail-fast feature when 
> the `initial_interval_secs` does not account for the entire task warmup time. 
> In the older version the actual grace period for an update to succeed was 
> `initial_interval_secs + interval_secs * max_consecutive_failures`.




[jira] [Created] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-01 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1841:
-

 Summary: Update HealthChecks has a backward incompatibility
 Key: AURORA-1841
 URL: https://issues.apache.org/jira/browse/AURORA-1841
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


The current implementation of the HealthCheck based updates has a backward 
incompatibility issue when the HealthCheckConfig is used in such a way that the 
initial grace period is extended beyond the `initial_interval_secs`.

This bug prematurely fails updates by using the new fail-fast feature when the 
`initial_interval_secs` does not account for the entire task warmup time. 

In the older version the actual grace period for an update to succeed was 
`initial_interval_secs + interval_secs * max_consecutive_failures`.
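
To make the arithmetic concrete, here is a small worked example in Python. The function name and the numbers (15, 10, 3 and a 30 second warmup) are hypothetical values chosen for illustration; only the formula itself comes from the description above.

```python
def legacy_grace_period_secs(initial_interval_secs, interval_secs, max_consecutive_failures):
    """Effective grace period in the older version, per the formula above."""
    return initial_interval_secs + interval_secs * max_consecutive_failures


# Hypothetical HealthCheckConfig values, for illustration only.
initial_interval_secs = 15
interval_secs = 10
max_consecutive_failures = 3
warmup_secs = 30  # assumed time the task needs before it reports healthy

legacy_window = legacy_grace_period_secs(
    initial_interval_secs, interval_secs, max_consecutive_failures
)

# The warmup fits inside the legacy window but not inside initial_interval_secs alone.
fits_legacy_window = warmup_secs <= legacy_window
fits_initial_interval = warmup_secs <= initial_interval_secs
```

With these values the legacy window is 15 + 10 * 3 = 45 seconds, so a task that needs 30 seconds to warm up fit within the old effective grace period but exceeds `initial_interval_secs` on its own, which is the kind of config this bug affects.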




[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(introduced in https://reviews.apache.org/r/52479/), we need another 
binding-helper that will resolve the MesosContainerizer's docker config's image 
tag to an image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}


  was:
Similar to the binding-helper that was introduced for DockerContainerizer 
(introduced in https://reviews.apache.org/r/52479/), we need another 
binding-helper that will resolve the MesosContainerizers docker config's image 
tag to image digest.

MesosContainerizer:

Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (introduced in https://reviews.apache.org/r/52479/), we need another 
> binding-helper that will resolve the MesosContainerizer's docker config's 
> image tag to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> will be translated to,
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}




[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(introduced in https://reviews.apache.org/r/52479/), we need another 
binding-helper that will resolve the MesosContainerizer's docker config's image 
tag to an image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}
{noformat}

will be translated to,

{noformat}
Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}
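
As a rough illustration of the translation above, the sketch below parses the binding reference and maps it to a name and digest pair. It is plain Python written for this example, not the actual binding-helper: the function names are made up, and {{lookup_digest}} is a hypothetical callable standing in for whatever registry query a real helper would perform.

```python
import re

# Matches the binding form shown above: {{docker.resolve["<image>"]["<tag>"]}}
_BINDING = re.compile(r'\{\{docker\.resolve\["([^"]+)"\]\["([^"]+)"\]\}\}')


def parse_docker_binding(ref):
    """Return (image, tag) for a docker.resolve binding reference, else None."""
    match = _BINDING.fullmatch(ref)
    return (match.group(1), match.group(2)) if match else None


def resolve_binding(ref, lookup_digest):
    """Translate a binding reference into a name/digest mapping.

    lookup_digest is a hypothetical callable (image, tag) -> digest; a real
    helper would ask a Docker registry instead of taking it as an argument.
    """
    parsed = parse_docker_binding(ref)
    if parsed is None:
        raise ValueError("not a docker.resolve binding: %r" % ref)
    image, tag = parsed
    return {"name": image, "digest": lookup_digest(image, tag)}
```

Passing the lookup in as an argument keeps the parsing testable without network access; the resulting name and digest correspond to the arguments of {{DockerImage}} in the translated config.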


  was:
Similar to the binding-helper that was introduced for DockerContainerizer 
(introduced in https://reviews.apache.org/r/52479/), we need another 
binding-helper that will resolve the MesosContainerizers docker config's image 
tag to image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (introduced in https://reviews.apache.org/r/52479/), we need another 
> binding-helper that will resolve the MesosContainerizer's docker config's 
> image tag to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}




[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(introduced in https://reviews.apache.org/r/52479/), we need another 
binding-helper that will resolve the MesosContainerizer's docker config's image 
tag to an image digest.

MesosContainerizer:

Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}


  was:
Similar to the binding-helper introduced for DockerContainerizer
(https://reviews.apache.org/r/52479/), we need another binding-helper that
resolves the image tag in the MesosContainerizer's docker config to an image
digest.



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper introduced for DockerContainerizer
> (https://reviews.apache.org/r/52479/), we need another binding-helper that
> resolves the image tag in the MesosContainerizer's docker config to an
> image digest.
> MesosContainerizer:
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> will be translated to,
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }




[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684470#comment-15684470
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1014:
---

Filed https://issues.apache.org/jira/browse/AURORA-1824 to track a similar
binding-helper for MesosContainerizer.

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/




[jira] [Resolved] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1014.
---
Resolution: Fixed

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/




[jira] [Created] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

Santhosh Kumar Shanmugham created AURORA-1824:
-

 Summary: Create a binding-helper to resolve docker tags to 
concrete image digests for MesosContainerizer
 Key: AURORA-1824
 URL: https://issues.apache.org/jira/browse/AURORA-1824
 Project: Aurora
  Issue Type: Story
Reporter: Santhosh Kumar Shanmugham


Similar to the binding-helper introduced for DockerContainerizer
(https://reviews.apache.org/r/52479/), we need another binding-helper that
resolves the image tag in the MesosContainerizer's docker config to an image
digest.





[jira] [Resolved] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1225.
---
Resolution: Fixed

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.
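
The transition described above can be sketched as a small counter-driven state machine. This is a minimal illustration only: the class name HealthGate and the choice to count successes cumulatively (not consecutively) are assumptions, since the ticket does not specify either, and failed checks are simply ignored here.

```python
class HealthGate:
    """Hypothetical sketch: stay in STARTING until enough health checks pass."""

    def __init__(self, required_successes: int) -> None:
        if required_successes < 1:
            raise ValueError("required_successes must be at least 1")
        self.required = required_successes
        self.successes = 0
        self.state = "STARTING"

    def record(self, healthy: bool) -> str:
        """Record one health check result and return the resulting state."""
        if self.state == "STARTING" and healthy:
            self.successes += 1
            if self.successes >= self.required:
                self.state = "RUNNING"
        return self.state


gate = HealthGate(required_successes=3)
history = [gate.record(result) for result in (True, False, True, True)]
```

With three required successes, the task stays in STARTING through the first three results and moves to RUNNING on the fourth, which is the third passing check.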




[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684367#comment-15684367
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1225:
---

Tested this in our internal test cluster.

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.




[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-11-01 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624622#comment-15624622
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1225:
---

Revised proposal - 
https://docs.google.com/document/d/1KOO0LC046k75TqQqJ4c0FQcVGbxvrn71E10wAjMorVY/edit?usp=sharing

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.




[jira] [Assigned] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-31 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1225:
-

Assignee: Santhosh Kumar Shanmugham

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of successful health checks is reached.




[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-09-28 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531456#comment-15531456
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1014:
---

Took a shot at this, but unfortunately it is blocked by a lack of support in
Mesos. See https://issues.apache.org/jira/browse/MESOS-3505

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/




[jira] [Commented] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-13 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488902#comment-15488902
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1688:
---

https://reviews.apache.org/r/51874

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Sub-task
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>





[jira] [Assigned] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-13 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1688:
-

Assignee: Santhosh Kumar Shanmugham

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Sub-task
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>





[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421854#comment-15421854
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1711:
---

+1 on extending {{StartJobUpdateResult}}

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 




[jira] [Comment Edited] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419776#comment-15419776
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1711 at 8/15/16 4:53 PM:


Incorporating [~maximk] and [~davmclau]'s suggestions to put forward a proposal.

Client will send the UUID as part of the metadata field in {{JobUpdateRequest}} 
like,
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
The Scheduler returns the active updates for the {{JobKey}}, along with each
update's {{Disambiguator}}, to the Client, which can then reconcile its requests.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing
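
The client-side reconciliation in this proposal could look roughly like the Python sketch below. It assumes, for illustration only, that the active updates returned by the scheduler are available as dictionaries with a "metadata" mapping; new_disambiguator and find_own_update are hypothetical helpers, not part of the Aurora API.

```python
import uuid
from typing import Dict, Iterable, Optional


def new_disambiguator() -> str:
    """Generate a fresh identifier to attach to an update request."""
    return uuid.uuid4().hex


def find_own_update(active_updates: Iterable[Dict],
                    disambiguator: str) -> Optional[Dict]:
    """Return the active update carrying our disambiguator, if any.

    Intended for use after a request timed out or failed, to learn whether
    the scheduler started the update anyway.
    """
    for update in active_updates:
        if update.get("metadata", {}).get("Disambiguator") == disambiguator:
            return update
    return None


# Example: the request timed out, so the client asks for active updates
# on the job key and looks for its own identifier among them.
mine = new_disambiguator()
active = [
    {"update_id": "u-1", "metadata": {"Disambiguator": "someone-else"}},
    {"update_id": "u-2", "metadata": {"Disambiguator": mine}},
]
found = find_own_update(active, mine)
```

If no update carries the client's identifier, the client can treat the original request as lost and retry it.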


was (Author: santhk):
Incorporating [~maximk] and [~davmclau]'s suggestions to put forward a proposal.

Client will send the UUID as part of the metadata field in {{JobUpdateRequest}} 
like,
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
Scheduler looks up the active updates for the {{JobKey}} and compares the
{{Disambiguator}} to make sure the request is not a duplicate.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 




[jira] [Comment Edited] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-12 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419776#comment-15419776
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1711 at 8/13/16 1:46 AM:


Incorporating [~maximk] and [~davmclau]'s suggestions to put forward a proposal.

Client will send the UUID as part of the metadata field in {{JobUpdateRequest}} 
like,
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
Scheduler looks up the active updates for the {{JobKey}} and compares the
{{Disambiguator}} to make sure the request is not a duplicate.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing


was (Author: santhk):
Incorporating [~maximk] and [~davmclau]'s suggestions to put forward a proposal.

Client will send the UUID as part of the metadata field in JobUpdateRequest 
like,
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
Scheduler looks up the active updates for the {{JobKey}} and compares the
{{Disambiguator}} to make sure the request is not a duplicate.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 




[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-12 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419776#comment-15419776
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1711:
---

Incorporating [~maximk] and [~davmclau]'s suggestions to put forward a proposal.

Client will send the UUID as part of the metadata field in JobUpdateRequest 
like,
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
Scheduler looks up the active updates for the {{JobKey}} and compares the
{{Disambiguator}} to make sure the request is not a duplicate.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing
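
The scheduler-side duplicate check in this version of the proposal might be sketched as follows. The dictionary shapes and the is_duplicate helper are assumptions made for illustration; the real scheduler is not written in Python and its storage API is not shown in this thread.

```python
from typing import Dict, Iterable


def is_duplicate(request: Dict, active_updates: Iterable[Dict]) -> bool:
    """Return True if an active update for the same job key already carries
    the request's Disambiguator.

    Requests without a Disambiguator are never treated as duplicates.
    """
    token = request.get("metadata", {}).get("Disambiguator")
    if token is None:
        return False
    return any(
        update.get("job_key") == request.get("job_key")
        and update.get("metadata", {}).get("Disambiguator") == token
        for update in active_updates
    )


active = [
    {"job_key": "role/env/name",
     "metadata": {"Disambiguator": "FGFJAGGFIAHKHGFAK"}},
]
retry = {"job_key": "role/env/name",
         "metadata": {"Disambiguator": "FGFJAGGFIAHKHGFAK"}}
fresh = {"job_key": "role/env/name",
         "metadata": {"Disambiguator": "SOMETHINGELSE"}}
```

Here the retried request is flagged as a duplicate while the request with a different identifier is not. The lookup over active updates is the "additional query to the store" listed under the cons.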

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 



