[jira] [Commented] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514181#comment-16514181
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1990:
---

https://reviews.apache.org/r/67613/

> SlaManager's AsyncHttpClient can keep scheduler from shutting down
> --
>
> Key: AURORA-1990
> URL: https://issues.apache.org/jira/browse/AURORA-1990
> Project: Aurora
>  Issue Type: Bug
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> We observed a situation where the scheduler was unable to shut down fully 
> because a single non-daemon thread from the AsyncHttpClient's thread pool 
> stayed alive.
> We should convert the SlaManager to be an AbstractIdleService to ensure that 
> the thread pool is closed on scheduler shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1990:
-

Assignee: Santhosh Kumar Shanmugham

> SlaManager's AsyncHttpClient can keep scheduler from shutting down
> --
>
> Key: AURORA-1990
> URL: https://issues.apache.org/jira/browse/AURORA-1990
> Project: Aurora
>  Issue Type: Bug
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> We observed a situation where the scheduler was unable to shut down fully 
> because a single non-daemon thread from the AsyncHttpClient's thread pool 
> stayed alive.
> We should convert the SlaManager to be an AbstractIdleService to ensure that 
> the thread pool is closed on scheduler shutdown.





[jira] [Created] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down

2018-06-15 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1990:
-

 Summary: SlaManager's AsyncHttpClient can keep scheduler from 
shutting down
 Key: AURORA-1990
 URL: https://issues.apache.org/jira/browse/AURORA-1990
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


We observed a situation where the scheduler was unable to shut down fully 
because a single non-daemon thread from the AsyncHttpClient's thread pool 
stayed alive.

We should convert the SlaManager to be an AbstractIdleService to ensure that 
the thread pool is closed on scheduler shutdown.
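
The proposal above amounts to giving the HTTP client's thread pool an explicit
lifecycle. A minimal Python sketch of that idea follows. It is an illustration
only, not Aurora's Java code: the class name IdleServiceSketch, its method
names, and the "sla-http" thread prefix are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IdleServiceSketch:
    """Hypothetical stand-in for an idle service with start and stop hooks."""

    def __init__(self, workers=2):
        self._workers = workers
        self._pool = None

    def start_up(self):
        # The executor creates its worker threads as tasks arrive.
        self._pool = ThreadPoolExecutor(
            max_workers=self._workers, thread_name_prefix="sla-http")

    def submit(self, fn, *args):
        return self._pool.submit(fn, *args)

    def shut_down(self):
        # Wait for in-flight work and join every worker thread, so no
        # pool thread can outlive the service.
        self._pool.shutdown(wait=True)


def pool_threads_alive(prefix="sla-http"):
    """List live threads whose name starts with the pool prefix."""
    return [t for t in threading.enumerate() if t.name.startswith(prefix)]


service = IdleServiceSketch()
service.start_up()
result = service.submit(lambda: "ok").result()
service.shut_down()
leftover = pool_threads_alive()
```

The point of the design is that whoever owns the service calls shut_down() on
the shutdown path, so thread cleanup no longer depends on the client being
garbage collected or on its threads being daemon threads.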





[jira] [Resolved] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1979.
---
Resolution: Implemented

https://reviews.apache.org/r/66716/

> Introduce new SLA-aware maintenance APIs
> 
>
> Key: AURORA-1979
> URL: https://issues.apache.org/jira/browse/AURORA-1979
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Introduce the new SLA-aware thrift endpoints on the Scheduler. More 
> information in the proposal document.





[jira] [Resolved] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1976.
---
Resolution: Implemented

> SLA-based Maintenance in Scheduler
> --
>
> Key: AURORA-1976
> URL: https://issues.apache.org/jira/browse/AURORA-1976
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Proposal to create SLA-aware maintenance support in the Scheduler.
> https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j





[jira] [Resolved] (AURORA-1978) Implement SLA computation utilities

2018-06-05 Thread Santhosh Kumar Shanmugham (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1978.
---
Resolution: Implemented

https://reviews.apache.org/r/66716/

> Implement SLA computation utilities
> ---
>
> Key: AURORA-1978
> URL: https://issues.apache.org/jira/browse/AURORA-1978
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Create utilities and helper methods required for computing SLA in the 
> Scheduler.





[jira] [Created] (AURORA-1986) Deprecate client-driven host maintenance.

2018-05-25 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1986:
-

 Summary: Deprecate client-driven host maintenance.
 Key: AURORA-1986
 URL: https://issues.apache.org/jira/browse/AURORA-1986
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


After AURORA-1978 the client-driven host maintenance is no longer required.

`sla_host_drain` is the newly introduced admin command. Merge this under the 
`host_drain` command and remove `sla_host_drain`.





[jira] [Created] (AURORA-1985) Offers are not updated after host_activate immediately

2018-05-25 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1985:
-

 Summary: Offers are not updated after host_activate immediately
 Key: AURORA-1985
 URL: https://issues.apache.org/jira/browse/AURORA-1985
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


After activating a host, the {{HostAttributes}} on its {{HostOffers}} are not 
updated immediately, so the offer is not considered for scheduling tasks. It 
takes until the next offer cycle to update the offer and use it for scheduling.





[jira] [Resolved] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1977.
---
Resolution: Implemented

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>






[jira] [Assigned] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1979:
-

Assignee: Santhosh Kumar Shanmugham

> Introduce new SLA-aware maintenance APIs
> 
>
> Key: AURORA-1979
> URL: https://issues.apache.org/jira/browse/AURORA-1979
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Introduce the new SLA-aware thrift endpoints on the Scheduler. More 
> information in the proposal document.





[jira] [Commented] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476463#comment-16476463
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1977:
---

[~jingc] I have been working on this as part of 
[https://reviews.apache.org/r/66716]. I forgot to assign this to myself. 
Apologies.

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Jing Chen
>Priority: Major
>






[jira] [Assigned] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1977:
-

Assignee: Santhosh Kumar Shanmugham  (was: Jing Chen)

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>






[jira] [Assigned] (AURORA-1978) Implement SLA computation utilities

2018-05-15 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1978:
-

Assignee: Santhosh Kumar Shanmugham

> Implement SLA computation utilities
> ---
>
> Key: AURORA-1978
> URL: https://issues.apache.org/jira/browse/AURORA-1978
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Create utilities and helper methods required for computing SLA in the 
> Scheduler.





[jira] [Created] (AURORA-1979) Introduce new SLA-aware maintenance APIs

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1979:
-

 Summary: Introduce new SLA-aware maintenance APIs
 Key: AURORA-1979
 URL: https://issues.apache.org/jira/browse/AURORA-1979
 Project: Aurora
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Santhosh Kumar Shanmugham


Introduce the new SLA-aware thrift endpoints on the Scheduler. More information 
in the proposal document.





[jira] [Created] (AURORA-1978) Implement SLA computation utilities

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1978:
-

 Summary: Implement SLA computation utilities
 Key: AURORA-1978
 URL: https://issues.apache.org/jira/browse/AURORA-1978
 Project: Aurora
  Issue Type: Sub-task
  Components: Scheduler
Reporter: Santhosh Kumar Shanmugham


Create utilities and helper methods required for computing SLA in the Scheduler.





[jira] [Updated] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1977:
--
Component/s: Scheduler

> Introduce thrift changes for describing SLAPolicies
> ---
>
> Key: AURORA-1977
> URL: https://issues.apache.org/jira/browse/AURORA-1977
> Project: Aurora
>  Issue Type: Sub-task
>  Components: Scheduler
>Reporter: Santhosh Kumar Shanmugham
>Priority: Major
>






[jira] [Created] (AURORA-1977) Introduce thrift changes for describing SLAPolicies

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1977:
-

 Summary: Introduce thrift changes for describing SLAPolicies
 Key: AURORA-1977
 URL: https://issues.apache.org/jira/browse/AURORA-1977
 Project: Aurora
  Issue Type: Sub-task
Reporter: Santhosh Kumar Shanmugham








[jira] [Assigned] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1976:
-

Assignee: Santhosh Kumar Shanmugham

> SLA-based Maintenance in Scheduler
> --
>
> Key: AURORA-1976
> URL: https://issues.apache.org/jira/browse/AURORA-1976
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Major
>
> Proposal to create SLA-aware maintenance support in the Scheduler.
> https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j





[jira] [Created] (AURORA-1976) SLA-based Maintenance in Scheduler

2018-03-08 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1976:
-

 Summary: SLA-based Maintenance in Scheduler
 Key: AURORA-1976
 URL: https://issues.apache.org/jira/browse/AURORA-1976
 Project: Aurora
  Issue Type: Story
Reporter: Santhosh Kumar Shanmugham


Proposal to create SLA-aware maintenance support in the Scheduler.

https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j





[jira] [Created] (AURORA-1965) LaunchException needs transient task timeout for reschedule

2018-01-24 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1965:
-

 Summary: LaunchException needs transient task timeout for 
reschedule
 Key: AURORA-1965
 URL: https://issues.apache.org/jira/browse/AURORA-1965
 Project: Aurora
  Issue Type: Bug
Reporter: Santhosh Kumar Shanmugham


In {{TaskAssignerImpl}}, a {{LaunchException}} caused either by a race condition 
while grabbing an {{Offer}} or by a Scheduler Driver {{acceptOffers}} failure 
will result in the task being held up in the {{ASSIGNED}} state until the 
transient task timeout kicks in and moves the task to LOST, resulting in a 
reschedule.

[https://github.com/apache/aurora/blob/dbe71374399d86d77baa35efbeca4fffa34f380e/src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java#L135]

The impact of this is minimal due to the fact that multiple conditions need to 
align for this situation to happen.

This happens because the call to {{changeState}} fails: the current state 
supplied in the method call (PENDING) is different from the actual state of 
the task (ASSIGNED).
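
The failure mode described above is a compare-and-set on the task state. A
minimal Python sketch of that behavior follows; it is not the scheduler's Java
code, and the change_state function, the tasks dict, and the task id are
hypothetical names chosen for illustration.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    ASSIGNED = "ASSIGNED"
    LOST = "LOST"


def change_state(tasks, task_id, expected, new):
    """Apply the transition only if the stored state equals `expected`."""
    if tasks.get(task_id) is not expected:
        return False
    tasks[task_id] = new
    return True


tasks = {"task-1": Status.ASSIGNED}  # the assignment was already recorded

# The launch failed, but the caller still believes the task is PENDING,
# so the transition is rejected and the task stays ASSIGNED.
stale = change_state(tasks, "task-1", Status.PENDING, Status.LOST)

# Supplying the state the task is actually in lets the transition succeed.
fresh = change_state(tasks, "task-1", Status.ASSIGNED, Status.LOST)
```

In this sketch the first call returns False and leaves the task untouched,
which mirrors the task sitting in ASSIGNED until a timeout intervenes.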

 





[jira] [Commented] (AURORA-1946) Make STARTING a transient state

2017-09-19 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172383#comment-16172383
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1946:
---

The underlying issue in the above case was that the Task's status updates to 
{{RUNNING}} and then to {{FAILED}} were never communicated to the Master and 
eventually to the Scheduler. So the issue lies with the Agent.

> Make STARTING a transient state
> ---
>
> Key: AURORA-1946
> URL: https://issues.apache.org/jira/browse/AURORA-1946
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> We saw a case where an update was stuck in {{IN_PROGRESS}} state after a 
> task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal 
> scenario the {{Task}} should have been transitioned into {{LOST}} by the 
> transient state timeout. But {{STARTING}} is not a transient state.
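
A rough Python sketch of a transient state timeout, to show why a task stuck in
STARTING is never expired. This is an illustration under stated assumptions,
not Aurora's implementation: the function name, the task tuples, and the
contents of ASSUMED_TRANSIENT are hypothetical.

```python
def expire_stuck_tasks(tasks, now, timeout_secs, transient_states):
    """Move tasks that sat in a transient state past the timeout to LOST.

    `tasks` maps a task id to a (state, entered_at) tuple.
    Returns the ids that were moved.
    """
    moved = []
    for task_id, (state, entered_at) in list(tasks.items()):
        if state in transient_states and now - entered_at > timeout_secs:
            tasks[task_id] = ("LOST", now)
            moved.append(task_id)
    return moved


def sample_tasks():
    return {
        "stuck-assigned": ("ASSIGNED", 0),
        "stuck-starting": ("STARTING", 0),
        "fresh-assigned": ("ASSIGNED", 90),
    }


# Assumed set for illustration; the key point is that STARTING is absent.
ASSUMED_TRANSIENT = {"ASSIGNED", "KILLING"}

today = expire_stuck_tasks(sample_tasks(), 100, 60, ASSUMED_TRANSIENT)
proposed = expire_stuck_tasks(
    sample_tasks(), 100, 60, ASSUMED_TRANSIENT | {"STARTING"})
```

With the assumed set, only the stuck ASSIGNED task is expired; adding STARTING
to the set lets the stuck STARTING task be expired as well.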



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1912) DbSnapShot may remove enum values

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1912:
--
Fix Version/s: 0.18.0

> DbSnapShot may remove enum values
> -
>
> Key: AURORA-1912
> URL: https://issues.apache.org/jira/browse/AURORA-1912
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> The dbsnapshot restore may truncate enum tables and cause referential 
> integrity issues. From the code, it restores from the SQL dump by first 
> dropping all tables:
> {noformat}
> try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
>   LOG.info("Dropping all tables");
>   try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) {
>     drop.executeUpdate();
>   }
> {noformat}
> However a freshly started leader will have some data in there from preparing 
> the storage:
> {noformat}
>   @Override
>   @Transactional
>   protected void startUp() throws IOException {
>     Configuration configuration = sessionFactory.getConfiguration();
>     String createStatementName = "create_tables";
>     configuration.setMapUnderscoreToCamelCase(true);
>     // The ReuseExecutor will cache jdbc Statements with equivalent SQL, improving performance
>     // slightly when redundant queries are made.
>     configuration.setDefaultExecutorType(ExecutorType.REUSE);
>     addMappedStatement(
>         configuration,
>         createStatementName,
>         CharStreams.toString(new InputStreamReader(
>             DbStorage.class.getResourceAsStream("schema.sql"),
>             StandardCharsets.UTF_8)));
>     try (SqlSession session = sessionFactory.openSession()) {
>       session.update(createStatementName);
>     }
>     for (CronCollisionPolicy policy : CronCollisionPolicy.values()) {
>       enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name());
>     }
>     for (MaintenanceMode mode : MaintenanceMode.values()) {
>       enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name());
>     }
>     for (JobUpdateStatus status : JobUpdateStatus.values()) {
>       enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), status.name());
>     }
>     for (JobUpdateAction action : JobUpdateAction.values()) {
>       enumValueMapper.addEnumValue(
>           "job_instance_update_actions", action.getValue(), action.name());
>     }
>     for (ScheduleStatus status : ScheduleStatus.values()) {
>       enumValueMapper.addEnumValue("task_states", status.getValue(), status.name());
>     }
>     for (ResourceType resourceType : ResourceType.values()) {
>       enumValueMapper.addEnumValue(
>           "resource_types", resourceType.getValue(), resourceType.name());
>     }
>     for (Mode mode : Mode.values()) {
>       enumValueMapper.addEnumValue("volume_modes", mode.getValue(), mode.name());
>     }
>     createPoolMetrics();
>   }
> {noformat}
> Consider the case where we add a new value to an existing enum. Restoring 
> from a snapshot taken before that change will leave the enum table without 
> the new value. 
> To fix this, we could add a migration for every enum value we add. However, 
> it seems to me that a better idea would be to update the enum tables after 
> we restore from a snapshot.
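
The second option above (refresh the enum tables after a restore) can be
sketched in a few lines. The example below is an assumption-laden illustration
in Python with an in-memory SQLite database standing in for the scheduler's
store; the table contents, OLD_DUMP, and all function names are hypothetical.

```python
import sqlite3

SCHEMA = ("CREATE TABLE IF NOT EXISTS maintenance_modes "
          "(id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

# Enum values known to the running build (illustrative values).
CURRENT_MODES = [(1, "NONE"), (2, "SCHEDULED"), (3, "DRAINING"), (4, "DRAINED")]

# Hypothetical dump written by an older build that knew only two values.
OLD_DUMP = """
CREATE TABLE maintenance_modes (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
INSERT INTO maintenance_modes VALUES (1, 'NONE');
INSERT INTO maintenance_modes VALUES (2, 'SCHEDULED');
"""


def populate_enums(conn):
    """Create the enum table if needed and insert any missing values."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO maintenance_modes (id, name) VALUES (?, ?)",
        CURRENT_MODES)


def restore_snapshot(conn, dump, refresh_enums):
    """Drop the table, replay the dump, and optionally refresh enum rows."""
    conn.execute("DROP TABLE IF EXISTS maintenance_modes")
    conn.executescript(dump)
    if refresh_enums:
        populate_enums(conn)


def mode_names(conn):
    rows = conn.execute("SELECT name FROM maintenance_modes ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
populate_enums(conn)  # what a freshly started leader does
restore_snapshot(conn, OLD_DUMP, refresh_enums=False)
without_refresh = mode_names(conn)
restore_snapshot(conn, OLD_DUMP, refresh_enums=True)
with_refresh = mode_names(conn)
```

Without the refresh step the restored table holds only the values from the
dump; with it, the values added since the snapshot are present again.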



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1680) Aurora 0.16.0 deprecations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1680:
--
Fix Version/s: 0.18.0

> Aurora 0.16.0 deprecations
> --
>
> Key: AURORA-1680
> URL: https://issues.apache.org/jira/browse/AURORA-1680
> Project: Aurora
>  Issue Type: Epic
>Reporter: Maxim Khutornenko
> Fix For: 0.18.0
>
>






[jira] [Updated] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1909:
--
Fix Version/s: 0.18.0

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
> Fix For: 0.18.0
>
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is a bug in how the code determines the user under which 
> to run the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the Python "os" module does not provide a 
> "getusername()" function.
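
One way to see the problem and a possible replacement call is sketched below.
getpass.getuser() is a standard-library function; whether it is the fix that
was actually applied to health_checker.py is not stated in this thread, so
treat it as an assumption. The health_check_user function is a hypothetical
stand-alone version of the quoted expression.

```python
import getpass
import os


def health_check_user(nosetuid_health_checks, role):
    """Return the user the health check process should run as."""
    # getpass.getuser() exists in the standard library;
    # os.getusername() does not, so calling it raises AttributeError.
    return getpass.getuser() if nosetuid_health_checks else role


# Confirms the claim in the report: the os module has no getusername().
has_getusername = hasattr(os, "getusername")
```

getpass.getuser() consults the LOGNAME, USER, LNAME and USERNAME environment
variables before falling back to the password database on Unix-like systems.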





[jira] [Updated] (AURORA-1860) Fix bug in scheduler driver disconnect stats

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1860:
--
Fix Version/s: 0.18.0

> Fix bug in scheduler driver disconnect stats
> 
>
> Key: AURORA-1860
> URL: https://issues.apache.org/jira/browse/AURORA-1860
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
>
> Correct the refactoring mistake introduced in 
> [https://reviews.apache.org/r/31550/] that has disabled 
> {{scheduler_framework_disconnects}} stats:
> {code:title=MesosSchedulerImpl.disconnected()}
> counters.get("scheduler_framework_disconnects").get();
> {code}
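
The quoted line reads the counter without changing it, which is presumably why
the stat stopped moving. The Python sketch below illustrates the difference
between reading and incrementing; AtomicCounter and both handler functions are
hypothetical, and the assumption that the intended call was an increment is
inferred from the ticket text rather than stated in it.

```python
import threading


class AtomicCounter:
    """Small thread-safe counter, loosely analogous to Java's AtomicLong."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


counters = {"scheduler_framework_disconnects": AtomicCounter()}


def disconnected_as_quoted():
    # Only reads the value; the stat never changes.
    counters["scheduler_framework_disconnects"].get()


def disconnected_incrementing():
    # Records the disconnect.
    counters["scheduler_framework_disconnects"].increment_and_get()


disconnected_as_quoted()
after_read = counters["scheduler_framework_disconnects"].get()
disconnected_incrementing()
after_increment = counters["scheduler_framework_disconnects"].get()
```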





[jira] [Updated] (AURORA-1923) Aurora client should not automatically retry non-idempotent operations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1923:
--
Fix Version/s: 0.18.0

> Aurora client should not automatically retry non-idempotent operations
> --
>
> Key: AURORA-1923
> URL: https://issues.apache.org/jira/browse/AURORA-1923
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
> Fix For: 0.18.0
>
>
> Aurora client has a built in mechanism to automatically retry thrift API 
> operations if the connection with scheduler times out, experiences transport 
> exception, or encounters a transient exception on the scheduler side.
> Retrying thrift calls due to scheduler connection timeout and transient 
> exceptions (see [AURORA-187]) is safe. However, as Aurora has no concept of 
> idempotency, its client can retry non-idempotent operations upon encountering 
> transport exceptions which can lead to nondeterministic situations.
> For example, if client requests go through a proxy to reach scheduler, client 
> might consider a non-idempotent request failed and automatically retry it 
> while the original request has been received and processed by the scheduler.
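
The retry policy argued for above can be sketched as follows. This is a minimal
illustration, not the Aurora client's code: the exception classes,
call_with_retry, and the flaky helper are hypothetical names.

```python
class TransportError(Exception):
    """The request may or may not have reached the scheduler."""


class TransientError(Exception):
    """The scheduler reported a transient failure; retrying is safe."""


def call_with_retry(operation, idempotent, max_attempts=3):
    """Retry transient errors always, transport errors only if idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
        except TransportError:
            # The original request may already have been processed, so a
            # non-idempotent operation must not be sent again.
            if not idempotent or attempt == max_attempts:
                raise


def flaky(calls):
    """Build an operation that fails with TransportError on its first call."""
    def operation():
        calls.append(1)
        if len(calls) == 1:
            raise TransportError("connection reset")
        return "done"
    return operation
```

For example, call_with_retry(flaky([]), idempotent=True) succeeds on the second
attempt, while the same call with idempotent=False surfaces the TransportError
after a single attempt and leaves the decision to the user.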





[jira] [Updated] (AURORA-1911) HTTP Scheduler Driver does not reliably re subscribe

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1911:
--
Fix Version/s: 0.18.0

> HTTP Scheduler Driver does not reliably re subscribe
> 
>
> Key: AURORA-1911
> URL: https://issues.apache.org/jira/browse/AURORA-1911
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> I observed this issue in a large production cluster during a period of Mesos 
> Master instability:
> 1. Mesos master crashes or restarts.
> 2. {{V1Mesos}} driver detects this and reconnects.
> 3. Aurora does the {{SUBSCRIBE}} call again.
> 4. The {{SUBSCRIBE}} Call fails silently in the driver.
> 5. All future calls are silently dropped by the driver.
> 6. Aurora has no offers because it is not subscribed.
> Logs:
> {noformat}
> I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at 
> http://10.162.14.30:5050/master/api/v1/scheduler
> W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service 
> Unavailable' () for SUBSCRIBE
> 
> W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> ...
> {noformat}
> To fix this, the {{VersionedSchedulerDriver}} needs to do two things:
> 1. Block calls when unsubscribed not just disconnected.
> 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff.
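
The two proposed changes can be sketched together. The Python example below is
an illustration only and does not use the Mesos driver API: DriverSketch,
backoff_delays, and the injected send_subscribe and sleep callables are
hypothetical, and rejecting a call with an exception is a simplification of
"blocking" it.

```python
def backoff_delays(initial, maximum, factor=2.0):
    """Yield exponentially growing delays, capped at `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


class DriverSketch:
    """Hypothetical driver wrapper that tracks subscription state."""

    def __init__(self, send_subscribe, sleep):
        self._send_subscribe = send_subscribe  # returns True on success
        self._sleep = sleep                    # injected so tests need not wait
        self.subscribed = False

    def subscribe_with_retry(self, initial=1.0, maximum=30.0, max_attempts=10):
        delays = backoff_delays(initial, maximum)
        for _attempt, delay in zip(range(max_attempts), delays):
            if self._send_subscribe():
                self.subscribed = True
                return True
            self._sleep(delay)
        return False

    def call(self, name):
        # Refuse calls while unsubscribed instead of silently dropping them.
        if not self.subscribed:
            raise RuntimeError("not subscribed; refusing " + name)
        return name
```

Passing time.sleep as the sleep argument gives real waiting; the test-friendly
injection point is a design choice of this sketch.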





[jira] [Updated] (AURORA-1904) Support Mesos Maintenance

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1904:
--
Fix Version/s: 0.18.0

> Support Mesos Maintenance
> -
>
> Key: AURORA-1904
> URL: https://issues.apache.org/jira/browse/AURORA-1904
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
> Fix For: 0.18.0
>
>
> Support Mesos Maintenance primitives in Aurora per the design 
> [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx].





[jira] [Updated] (AURORA-1920) Add pluggable scheduling logic to Aurora

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1920:
--
Fix Version/s: 0.18.0

> Add pluggable scheduling logic to Aurora
> 
>
> Key: AURORA-1920
> URL: https://issues.apache.org/jira/browse/AURORA-1920
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.18.0
>
>
> On the mailing list recently there was a desire to have custom scheduling 
> logic (e.g. to implement scheduling spread). This ticket tracks the proposal 
> and implementation of custom scheduling. 





[jira] [Updated] (AURORA-1924) Aurora client should reconcile idempotent job creations

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1924:
--
Fix Version/s: 0.18.0

> Aurora client should reconcile idempotent job creations
> ---
>
> Key: AURORA-1924
> URL: https://issues.apache.org/jira/browse/AURORA-1924
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
> Fix For: 0.18.0
>
>
> Aurora scheduler rejects a request to create a job if a job with the same key 
> already exists (see {{SchedulerThriftInterface.createJob()}}). Aurora client 
> exits with an error once it receives a response with 
> {{ResponseCode.INVALID_REQUEST}} from scheduler in this case.
> However, an attempt to create a job with the exact same configuration and 
> number of instances is essentially idempotent. Scheduler can detect this 
> situation, ignore it, and signal client to treat operation as successful; 
> client warns user about existing job but does not fail the operation.
> This helps the Aurora client and scheduler reconcile state when creating 
> jobs in the presence of transport layer exceptions; the {{aurora job create}} 
> command can then be marked as idempotent after [AURORA-1923] is fixed.
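
The proposed scheduler behavior can be sketched as a small decision function.
This is a Python illustration rather than the scheduler's thrift handler: the
create_job function, the jobs dict, and the response tuples are hypothetical,
with only the INVALID_REQUEST code name taken from the description above.

```python
def create_job(jobs, key, config):
    """Create a job, treating an identical re-creation as a success."""
    existing = jobs.get(key)
    if existing is None:
        jobs[key] = config
        return ("OK", "job created")
    if existing == config:
        # Same key, same configuration and instance count: nothing to do.
        return ("OK", "job already exists with an identical configuration")
    return ("INVALID_REQUEST",
            "job already exists with a different configuration")


jobs = {}
config = {"instances": 3, "cmd": "run-server"}
first = create_job(jobs, "role/env/name", config)
repeat = create_job(jobs, "role/env/name", dict(config))
conflict = create_job(jobs, "role/env/name", {"instances": 5, "cmd": "run-server"})
```

The client would then print the message from the repeated call as a warning
instead of exiting with an error.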





[jira] [Updated] (AURORA-1921) Create proposal for which components should be pluggable

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1921:
--
Fix Version/s: 0.18.0

> Create proposal for which components should be pluggable
> 
>
> Key: AURORA-1921
> URL: https://issues.apache.org/jira/browse/AURORA-1921
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.18.0
>
>
> Create a proposal on the dev list for how pluggable scheduling should work. 
> Is TaskAssigner the class we want to customize? Should preemption also factor 
> into the customization since it is very closely tied to the Scheduling logic? 
> We also need to confirm the plugin mechanism. Most likely we should base this 
> on the current auth support which is pluggable by passing in a Guice module 
> via cmdline. 





[jira] [Updated] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1925:
--
Fix Version/s: 0.18.0

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Minor
>  Labels: features, newbie
> Fix For: 0.18.0
>
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1915) Add automatic browser tab open feature for aurora update start

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1915:
--
Fix Version/s: 0.18.0

> Add automatic browser tab open feature for aurora update start
> --
>
> Key: AURORA-1915
> URL: https://issues.apache.org/jira/browse/AURORA-1915
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Usability
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
>
> The Aurora client automatically opens a browser tab following the {{aurora job 
> create}} and {{aurora cron schedule}} commands. Provide a similar ability for 
> the {{aurora update start}} command as well.
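
A minimal sketch of the requested behaviour, assuming the client already has the URL of the update page. The function name `maybe_open_update_page` and its `open_browser` flag are hypothetical and not part of the Aurora client; only `webbrowser.open_new_tab` is a real standard-library call.

```python
import webbrowser


def maybe_open_update_page(url, open_browser, opener=webbrowser.open_new_tab):
    """Open `url` in a new browser tab when `open_browser` is set.

    `opener` is injectable so the function can be exercised without a browser.
    Returns True if an attempt to open the page was made, False otherwise.
    """
    if not open_browser:
        return False
    opener(url)
    return True
```

Keeping the opener injectable means the same code path can be shared by the create, cron schedule and update start commands and tested without side effects.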





[jira] [Updated] (AURORA-1910) framework_registered metric isn't reset when scheduler disconnects

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1910:
--
Fix Version/s: 0.18.0

> framework_registered metric isn't reset when scheduler disconnects
> --
>
> Key: AURORA-1910
> URL: https://issues.apache.org/jira/browse/AURORA-1910
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Right now the {{framework_registered}} metric transitions from 0 -> 1 when 
> the scheduler registers successfully the first time. It never transitions 
> from 1 -> 0 when it loses a connection.
> This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the 
> gauge as the scheduler loses registration and re-registers.
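
The intended behaviour can be sketched as follows. This is an illustration in Python rather than the scheduler's Java code: `RegistrationGauge` and its method names are hypothetical, and a lock-protected boolean stands in for the {{AtomicBoolean}} mentioned above.

```python
import threading


class RegistrationGauge:
    """Thread-safe boolean exported as a 0/1 gauge (stand-in for AtomicBoolean)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registered = False

    def on_registered(self):
        # Called on first registration and on every re-registration.
        with self._lock:
            self._registered = True

    def on_disconnected(self):
        # Called when the connection is lost; this is the missing 1 -> 0 step.
        with self._lock:
            self._registered = False

    def sample(self):
        with self._lock:
            return 1 if self._registered else 0
```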





[jira] [Updated] (AURORA-1928) Aurora should prioritize adding instances over updating instances during an update

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1928:
--
Fix Version/s: 0.18.0

> Aurora should prioritize adding instances over updating instances during an 
> update
> --
>
> Key: AURORA-1928
> URL: https://issues.apache.org/jira/browse/AURORA-1928
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Minor
> Fix For: 0.18.0
>
>
> Often times, when adding capacity to a service, I increase the number of 
> instances and the batch_size that the updates roll out at.
> For the first deploy, this updates the old instances with the more aggressive 
> batch size before the new instances are started.
> It might be better for Aurora to prioritize the "adding instances" step of 
> upgrades.
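
One way to picture the proposed ordering, as a rough sketch only: instance IDs that exist only in the desired state are treated as additions and scheduled ahead of the IDs that already exist. The function names are hypothetical, and the sketch ignores removed instances and whether an existing instance's config actually changed.

```python
def order_update_instances(current_ids, desired_ids):
    """Return instance IDs to act on, with brand-new instances first.

    IDs only in `desired_ids` are additions; IDs in both sets are updates.
    """
    current = set(current_ids)
    desired = set(desired_ids)
    additions = sorted(desired - current)
    updates = sorted(desired & current)
    return additions + updates


def batches(ids, batch_size):
    """Split the ordered IDs into consecutive batches of at most `batch_size`."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```

With three existing instances and a desired count of five, the first batch then contains only the two new instances, so extra capacity comes up before any old instance is restarted.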





[jira] [Updated] (AURORA-1933) Scheduler can process rescind before offer

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1933:
--
Fix Version/s: 0.18.0

> Scheduler can process rescind before offer
> --
>
> Key: AURORA-1933
> URL: https://issues.apache.org/jira/browse/AURORA-1933
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> I observed the following in production:
> {noformat}
> Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 
> [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer 
> rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 
> [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received 
> offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> Jun  6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 
> [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 
> 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> {noformat}
> Notice the rescind was processed before the offer was given. This means the 
> offer is in the offer storage, but using it is invalid. It will cause 
> whatever task launched with it to fail with {{Task launched with invalid 
> offers: Offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer 
> valid}}
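
A minimal sketch of one possible mitigation, not necessarily the fix that was applied in Aurora: remember rescinds that arrive for unknown offers and drop the matching offer when it shows up later. `OfferStore` and its methods are hypothetical names; a real implementation would also need to bound or expire the set of early rescinds.

```python
class OfferStore:
    """Toy offer store that tolerates a rescind arriving before its offer."""

    def __init__(self):
        self._offers = {}
        self._early_rescinds = set()

    def on_offer(self, offer_id, offer):
        """Store the offer unless it was already rescinded. Returns True if stored."""
        if offer_id in self._early_rescinds:
            # The rescind was seen first, so the offer must never become usable.
            self._early_rescinds.discard(offer_id)
            return False
        self._offers[offer_id] = offer
        return True

    def on_rescind(self, offer_id):
        if self._offers.pop(offer_id, None) is None:
            # Offer not seen yet: remember the rescind for when it arrives.
            self._early_rescinds.add(offer_id)

    def available(self):
        return sorted(self._offers)
```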





[jira] [Updated] (AURORA-1907) Thermos unresponsive on hosts with many active task

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1907:
--
Fix Version/s: 0.18.0

> Thermos unresponsive on hosts with many active task
> ---
>
> Key: AURORA-1907
> URL: https://issues.apache.org/jira/browse/AURORA-1907
> Project: Aurora
>  Issue Type: Story
>  Components: Observer
>Reporter: Stephan Erb
>Assignee: Stephan Erb
> Fix For: 0.18.0
>
>
> We have noticed that on hosts with lots of active tasks (~100) and many 
> terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
> CPU but does not respond to any HTTP requests.
> Dumping {{/threads}} indicates we might be blocked by the hundred 
> {{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
> {code}
> # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
> [TID=45241], 140682825963264)
>   File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
> self.__bootstrap_inner()
>   File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
> self.run()
>   File: 
> "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
>  line 115, in identified
> return instancemethod(self, *args, **kwargs)
>   File: 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File: "apache/thermos/monitoring/resource.py", line 204, in run
> collector.sample()
>   File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
> sample
> for child in parent.children(recursive=True)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 326, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 861, in children
> table[p.ppid()].append(p)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 545, in ppid
> return self._proc.ppid()
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 962, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1459, in ppid
> return int(self._parse_stat_file()[2])
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1001, in _parse_stat_file
> return [name] + fields_after_name
> {code}
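
The trace shows each monitor thread walking the whole process table (the `table[p.ppid()].append(p)` step inside `children(recursive=True)`), so the cost grows with tasks times processes. A rough sketch of the alternative, scanning once and sharing the parent-to-children table across all tasks, is shown below. It is an illustration only: the function names are hypothetical, the (pid, ppid) pairs are passed in rather than read from {{/proc}}, and this is not claimed to be the change that was made in Thermos.

```python
from collections import defaultdict


def build_children_table(pid_ppid_pairs):
    """Build a parent pid -> [child pids] table from one scan of the process list."""
    table = defaultdict(list)
    for pid, ppid in pid_ppid_pairs:
        table[ppid].append(pid)
    return table


def descendants(table, root_pid):
    """Return all descendants of `root_pid` using the shared table."""
    found = []
    seen = {root_pid}
    stack = list(table.get(root_pid, []))
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against malformed input containing cycles
        seen.add(pid)
        found.append(pid)
        stack.extend(table.get(pid, []))
    return sorted(found)
```

Each task lookup then only touches its own subtree instead of re-reading the stat file of every process on the host.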





[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1929:
--
Fix Version/s: 0.18.0

> Improve explicit task history pruning.
> --
>
> Key: AURORA-1929
> URL: https://issues.apache.org/jira/browse/AURORA-1929
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Minor
> Fix For: 0.18.0
>
>
> There are currently two types of task history pruning run by Aurora:
> # Implicit task history pruning, run by TaskHistoryPruner in the 
> background, which registers all inactive tasks for pruning upon terminal 
> state change.
> # Explicit task history pruning, initiated by the `aurora_admin prune_tasks` 
> command, which prunes inactive tasks in the cluster.
> The prune_tasks endpoint seems to be very slow when the cluster has a large 
> number of inactive tasks. 
> For example, when we use $ aurora_admin prune_tasks for 135k running tasks 
> (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed 
> seems to max out at 3k tasks per minute.
> Currently, Aurora uses StreamManager to manage a single log stream append 
> transaction for task history pruning. Local storage ops can be added to the 
> transaction and then later committed as an atomic unit. However, the 
> StateManager removes tasks one by one in a for-loop 
> (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376),
> and each RemoveTasks operation is coalesced with its previous operation, 
> which seems inefficient and unnecessary 
> (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324).
> We need to batch all removeTasks operations and execute them all at once to 
> avoid the additional cost of coalescing. The fix will also benefit implicit 
> task history pruning, since it has a similar underlying implementation.
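
The difference between the two approaches can be illustrated with a toy model. This is a Python sketch, not the scheduler's Java code: `ToyTransaction` and the two prune functions are hypothetical, and the coalescing rule (merge a remove op into the previous remove op) is a simplified assumption about what the stream manager does.

```python
class ToyTransaction:
    """Records storage ops; consecutive remove ops are coalesced into one."""

    def __init__(self):
        self.ops = []
        self.coalesce_calls = 0

    def add_remove_tasks(self, task_ids):
        task_ids = list(task_ids)
        if self.ops and self.ops[-1][0] == "remove_tasks":
            # Merge into the previous op; this copies the accumulated ID list.
            self.coalesce_calls += 1
            self.ops[-1] = ("remove_tasks", self.ops[-1][1] + task_ids)
        else:
            self.ops.append(("remove_tasks", task_ids))


def prune_one_by_one(txn, task_ids):
    # One op per task: every op after the first has to be coalesced.
    for task_id in task_ids:
        txn.add_remove_tasks([task_id])


def prune_batched(txn, task_ids):
    # A single op for the whole batch: nothing to coalesce.
    txn.add_remove_tasks(task_ids)
```

Both paths leave the same single op in the transaction, but the one-by-one path performs a merge per task, and in this toy model each merge copies a growing list.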





[jira] [Updated] (AURORA-1916) Incompatibility with mesos 1.2

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1916:
--
Fix Version/s: 0.18.0

> Incompatibility with mesos 1.2
> --
>
> Key: AURORA-1916
> URL: https://issues.apache.org/jira/browse/AURORA-1916
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.17.0
> Environment: Ubuntu 16.04, Mesos 1.2
>Reporter: Kostiantyn Bokhan
> Fix For: 0.18.0
>
>
> The list of mesos-containerizer arguments has changed as of Mesos 1.2:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                   Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE  The target 'pid' of the process whose mount namespace we'd like
>                                 to enter before executing the command.
>   --pipe_read=VALUE             The read end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE     The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount namespace. (default: false)
> {code}
> Mesos 1.1.0:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --capabilities=VALUE          Capabilities the command can use.
>   --command=VALUE               The command to execute.
>   --environment=VALUE           The environment variables for the command.
>   --[no-]help                   Prints this help message (default: false)
>   --pipe_read=VALUE             The read end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --pre_exec_commands=VALUE     The additional preparation commands to execute before
>                                 executing the command.
>   --rootfs=VALUE                Absolute path to the container root filesystem. The command will be
>                                 interpreted relative to this path
>   --runtime_directory=VALUE     The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount namespace. (default: false)
>   --user=VALUE                  The user to change to.
>   --working_directory=VALUE     The working directory for the command. It has to be an absolute path
>                                 w.r.t. the root filesystem used for the command.
> {code}
> This causes the following error:
> {code}
> Failed to parse the flags: Failed to load unknown flag 'command'
> {code}





[jira] [Updated] (AURORA-1887) Create Driver implementation around V0Mesos.

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1887:
--
Fix Version/s: 0.18.0

> Create Driver implementation around V0Mesos.
> 
>
> Key: AURORA-1887
> URL: https://issues.apache.org/jira/browse/AURORA-1887
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Create an implementation of the {{org.apache.aurora.scheduler.mesos.Driver}} 
> interface which uses the {{V0Mesos}} shim under the hood. Provide a flag to 
> switch between the two to show there is no regression.





[jira] [Updated] (AURORA-1888) Create Driver implementation around V1Mesos

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1888:
--
Fix Version/s: 0.18.0

> Create Driver implementation around V1Mesos
> ---
>
> Key: AURORA-1888
> URL: https://issues.apache.org/jira/browse/AURORA-1888
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Create an implementation of {{org.apache.aurora.scheduler.mesos.Driver}} that 
> uses {{V1Mesos}} under the hood.





[jira] [Updated] (AURORA-1905) Set "webui_url" field of FrameworkInfo

2017-06-09 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1905:
--
Fix Version/s: 0.18.0

> Set "webui_url" field of FrameworkInfo
> --
>
> Key: AURORA-1905
> URL: https://issues.apache.org/jira/browse/AURORA-1905
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.18.0
>
>
> Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI 
> can link to the current leader.





[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-05-10 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1925:
-

Assignee: (was: Santhosh Kumar Shanmugham)

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Priority: Minor
>  Labels: features, newbie
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write
> For example, I can use this to use agents or tools not present in the default 
> image, or are something I don't want to clutter packer with.
> For example, I tried to use byteman to narrow down what's spawning threads.
> Just getting the zip to aurora is more difficult and mysterious than writing 
> a script to trace thread lifecycle. This should be easy... agreed?"





[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance

2017-05-10 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1925:
-

Assignee: Santhosh Kumar Shanmugham

> Easily copy files to/from an aurora task instance
> -
>
> Key: AURORA-1925
> URL: https://issues.apache.org/jira/browse/AURORA-1925
> Project: Aurora
>  Issue Type: Task
>  Components: Client
>Reporter: Jordan Ly
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
>  Labels: features, newbie
>
> "So, we have "aurora task ssh" which is handy...We have the web ui, which can 
> allow us to download files from the instance.
> I'd love to have a straightforward, non-hack way to copy files to an aurora 
> shard. Ex. aurora task copy source relative-path-in-chroot or something that 
> could mount the chroot as read/write
> For example, I can use this to use agents or tools not present in the default 
> image, or are something I don't want to clutter packer with.
> For example, I tried to use byteman to narrow down what's spawning threads.
> Just getting the zip to aurora is more difficult and mysterious than writing 
> a script to trace thread lifecycle. This should be easy... agreed?"





[jira] [Assigned] (AURORA-1913) Aurora Client Job Update Diff output is erroneous

2017-03-29 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1913:
-

Assignee: Santhosh Kumar Shanmugham

> Aurora Client Job Update Diff output is erroneous
> -
>
> Key: AURORA-1913
> URL: https://issues.apache.org/jira/browse/AURORA-1913
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
>
> The {{TaskConfig}} diff shown by the Aurora client does not sort the 
> individual entities (which are sets), which results in confusing output. We 
> need to sort the individual fields inside the config to return more 
> meaningful output.
> Although the {{host}} and {{rack}} limit constraints have not changed, the 
> diff still outputs the following:
> {code}
> <   'constraints': set([ Constraint(name='rack', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> ---
> >   'constraints': set([ Constraint(name='host', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> 4,5c4,5
>  constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None))]),
> ---
> >Constraint(name='rack', 
> > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
> {code}





[jira] [Created] (AURORA-1913) Aurora Client Job Update Diff output is erroneous

2017-03-29 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1913:
-

 Summary: Aurora Client Job Update Diff output is erroneous
 Key: AURORA-1913
 URL: https://issues.apache.org/jira/browse/AURORA-1913
 Project: Aurora
  Issue Type: Bug
  Components: Client
Reporter: Santhosh Kumar Shanmugham
Priority: Minor


The {{TaskConfig}} diff shown by the Aurora client does not sort the individual 
entities (which are sets), which results in confusing output. We need to sort 
the individual fields inside the config to return more meaningful output.

Although the {{host}} and {{rack}} limit constraints have not changed, the diff 
still outputs the following:
{code}
<   'constraints': set([ Constraint(name='rack', 
constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
---
>   'constraints': set([ Constraint(name='host', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
4,5c4,5
Constraint(name='rack', 
> constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)),
{code}
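
The proposed fix can be sketched as follows: normalize set-valued fields into sorted lists before pretty-printing, so two configs with equal sets render identically. This is a minimal illustration with hypothetical helper names (`normalize`, `config_diff`) and plain Python containers standing in for the Thrift {{TaskConfig}} structures; it is not the client's actual diff code.

```python
import difflib
import pprint


def normalize(value):
    """Recursively replace sets with sorted lists so output order is stable."""
    if isinstance(value, (set, frozenset)):
        # Sort by repr so members of mixed or unorderable types still sort.
        return sorted((normalize(v) for v in value), key=repr)
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    return value


def config_diff(left, right):
    """Return unified-diff lines between the normalized renderings."""
    left_lines = pprint.pformat(normalize(left)).splitlines()
    right_lines = pprint.pformat(normalize(right)).splitlines()
    return list(difflib.unified_diff(left_lines, right_lines, lineterm=""))
```

With this normalization, two configs whose constraint sets hold the same {{host}} and {{rack}} entries produce an empty diff regardless of set iteration order.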





[jira] [Updated] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1908:
--
Description: When matching a {{ResourceRequest}} against a 
{{UnusedResource}} in {{PreemptionVictimFilter.filterPreemptionVictims}} there 
are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply 
to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
{{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
short-circuit and return early and move on to the next host to consider.  (was: 
When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
{{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
{{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or 
{{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early 
and move on to the next host to consider.)

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of 
> {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the 
> entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
> {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
> short-circuit and return early and move on to the next host to consider.





[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935435#comment-15935435
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1908:
---

Plus we can improve the way resource increment happens now to explicitly check 
that either there was no Veto or the Veto was due to insufficient resources.

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
> returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
> {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} 
> or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return 
> early and move on to the next host to consider.





[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935431#comment-15935431
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1908:
---

This code here - 
https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java#L214-L227

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
> returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
> {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} 
> or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return 
> early and move on to the next host to consider.





[jira] [Created] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1908:
-

 Summary: Short-circuit preemption filtering when a Veto applies to 
entire host
 Key: AURORA-1908
 URL: https://issues.apache.org/jira/browse/AURORA-1908
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham
Priority: Minor


When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
{{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
{{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or 
{{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early 
and move on to the next host to consider.
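
A rough sketch of the short-circuit, in Python rather than the scheduler's Java. The function `find_victims`, the shape of the `check` callback, and the veto name `INSUFFICIENT_RESOURCES` used in the usage note are assumptions made for illustration; the host-wide veto names are the ones listed above.

```python
HOST_WIDE_VETOES = frozenset([
    "DEDICATED_CONSTRAINT_MISMATCH",
    "MAINTENANCE",
    "LIMIT_NOT_SATISFIED",
    "CONSTRAINT_MISMATCH",
])


def find_victims(hosts, check):
    """Pick the first host where preempting some victims makes the task fit.

    `hosts` maps a host name to its candidate victims in preemption order.
    `check(host, victims_so_far)` returns a set of veto names (empty = fits).
    Returns (host, victims) for the first fit, or None if no host fits.
    """
    for host, candidates in hosts.items():
        chosen = []
        for victim in candidates:
            chosen.append(victim)
            vetoes = check(host, chosen)
            if not vetoes:
                return host, list(chosen)
            if vetoes & HOST_WIDE_VETOES:
                # Freeing more resources cannot lift a host-wide veto,
                # so stop adding victims and move on to the next host.
                break
    return None
```

A host in maintenance is then evaluated once instead of once per candidate victim, while a host that only lacks resources (for example a hypothetical `INSUFFICIENT_RESOURCES` veto) keeps accumulating victims as before.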





[jira] [Resolved] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-03-13 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1882.
---
   Resolution: Fixed
 Assignee: Santhosh Kumar Shanmugham
Fix Version/s: 0.18.0

> Add support for Mesos ContainerLaunchInfo
> -
>
> Key: AURORA-1882
> URL: https://issues.apache.org/jira/browse/AURORA-1882
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>Priority: Critical
> Fix For: 0.18.0
>
>
> Mesos 1.2.0 changes the interface of the MesosContainerizer binary and drops 
> support for multiple switches. Without support for this, the Thermos Executor 
> will not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                   Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE  The target 'pid' of the process whose mount namespace we'd like
>                                 to enter before executing the command.
>   --pipe_read=VALUE             The read end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a file descriptor
>                                 on Posix, or a handle on Windows. It's caller's responsibility
>                                 to make sure the file descriptor or the handle is inherited
>                                 properly in the subprocess. It's used to synchronize with the
>                                 parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE     The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount namespace. (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648





[jira] [Reopened] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reopened AURORA-1824:
---

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper for DockerContainerizer (introduced in 
> https://reviews.apache.org/r/52479/), we need another binding-helper that 
> will resolve the image tag in the MesosContainerizer's Docker config to an 
> image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}
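
The translation above can be sketched as a small parsing step. This is an illustration only: `resolve_binding` and the `lookup_digest` callback are hypothetical names, the result is a plain dict standing in for `DockerImage(name=..., digest=...)`, and the registry query that would produce the digest is left to the caller.

```python
import re

# Matches a binding of the form {{docker.resolve["<image>"]["<tag>"]}}.
_RESOLVE = re.compile(r'\{\{docker\.resolve\["([^"]+)"\]\["([^"]+)"\]\}\}')


def resolve_binding(binding, lookup_digest):
    """Turn a docker.resolve binding into a name/digest pair.

    `lookup_digest(name, tag)` is supplied by the caller and is expected to
    return the digest the tag currently points to.
    """
    match = _RESOLVE.fullmatch(binding)
    if match is None:
        raise ValueError("not a docker.resolve binding: %r" % binding)
    name, tag = match.groups()
    return {"name": name, "digest": lookup_digest(name, tag)}
```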



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1824.
---
Resolution: Fixed

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper for DockerContainerizer (introduced in 
> https://reviews.apache.org/r/52479/), we need another binding-helper that 
> will resolve the image tag in the MesosContainerizer's Docker config to an 
> image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}





[jira] [Assigned] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2017-02-14 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1824:
-

Assignee: Santhosh Kumar Shanmugham

> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (in https://reviews.apache.org/r/52479/), we need another binding-helper 
> that will resolve the image tag in the MesosContainerizer's Docker config 
> to an image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}





[jira] [Created] (AURORA-1892) TaskQuery `limit` and `offset` must be applied at TaskStore

2017-02-13 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1892:
-

 Summary: TaskQuery `limit` and `offset` must be applied at 
TaskStore
 Key: AURORA-1892
 URL: https://issues.apache.org/jira/browse/AURORA-1892
 Project: Aurora
  Issue Type: Task
Reporter: Santhosh Kumar Shanmugham


{{TaskQuery}}'s {{limit}} and {{offset}} are currently applied after the 
results have been fetched from the {{TaskStore}}, which is inefficient. Both 
{{MemTaskStore}} and {{DBTaskStore}} should apply the {{limit}} and 
{{offset}} conditions themselves, at the {{TaskStore}} level.
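
To illustrate the difference, here is a minimal sketch in Python (not the scheduler's Java). The {{InMemoryTaskStore}} class and both helper functions are hypothetical stand-ins, not Aurora APIs; the counter only exists to show how many rows each approach materializes.

```python
from itertools import islice


class InMemoryTaskStore:
    """Hypothetical stand-in for a task store; not the Aurora TaskStore API."""

    def __init__(self, tasks):
        self._tasks = list(tasks)
        self.rows_materialized = 0  # rows handed back to callers so far

    def fetch(self, predicate, offset=0, limit=None):
        """Apply offset and limit while scanning the matching tasks."""
        matching = (t for t in self._tasks if predicate(t))
        stop = None if limit is None else offset + limit
        rows = list(islice(matching, offset, stop))
        self.rows_materialized += len(rows)
        return rows


def fetch_then_slice(store, predicate, offset, limit):
    # Behaviour described in the ticket: fetch everything, then trim.
    return store.fetch(predicate)[offset:offset + limit]


def slice_in_store(store, predicate, offset, limit):
    # Proposed behaviour: the store applies offset and limit itself.
    return store.fetch(predicate, offset=offset, limit=limit)
```

Both helpers return the same page, but only the second avoids materializing every matching row before trimming.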





[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:01 PM:
-

It looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. 
Although {{BatchWorker}} is designed to reduce lock contention, it does not 
provide any rate limiting and suffers under bursty workloads. Responsiveness 
to bursty workloads makes sense for scheduling work, but the same cannot be 
said for housekeeping work.

History pruning ({{TaskHistoryPruner}}), job update pruning 
({{JobUpdateHistoryPruner}}) and database garbage collection 
({{RowGarbageCollector}}) can all be characterized as housekeeping work that 
is not in the critical scheduling path. It would therefore make sense to 
rate-limit these ambient activities, so that the scheduler is protected from 
bursts of non-critical work (for example, job updates with a large number of 
instances, network partitions, or cleanup after a scale test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with the necessary changes to {{Work}}). 
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and 
{{RowGarbageCollector}} can then enqueue work into the new 
{{RateLimitedBatchWorker}}, which in turn will release the work into the 
underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html].
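
A rough sketch of the proposed shape, written in Python for brevity rather than the scheduler's Java. All class names here are hypothetical, and the limiter is a simple fixed-interval gate swapped in for Guava's {{RateLimiter}}; the clock is injected so the behaviour can be checked deterministically.

```python
import heapq
import itertools
from collections import deque

CRITICAL, HOUSEKEEPING = 0, 1  # lower value is served first


class PriorityBatchQueue:
    """Stand-in for a BatchWorker queue: ordered by priority, FIFO within one."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserving insertion order

    def put(self, priority, work):
        heapq.heappush(self._heap, (priority, next(self._seq), work))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


class RateLimitedFeeder:
    """Buffers housekeeping work and releases at most one item per interval."""

    def __init__(self, downstream, permits_per_second, clock):
        self._downstream = downstream
        self._interval = 1.0 / permits_per_second
        self._clock = clock
        self._next_free = clock()
        self._pending = deque()

    def submit(self, work):
        self._pending.append(work)

    def pump(self):
        """Release one buffered item if the interval has elapsed."""
        now = self._clock()
        if not self._pending or now < self._next_free:
            return False
        self._downstream.put(HOUSEKEEPING, self._pending.popleft())
        self._next_free = now + self._interval
        return True
```

In this sketch the pruners would call {{submit()}}, a timer would call {{pump()}}, and scheduling code would keep calling {{put(CRITICAL, ...)}} directly, so critical work is always dequeued ahead of any housekeeping item already released.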




> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> Current implementation of {{TaskHistoryPruner}} registers all inactive tasks 
> upon terminal _state_ change 

[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1837:
---

It looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. 
Although {{BatchWorker}} is designed to reduce lock contention, it does not 
provide any rate limiting and suffers under bursty workloads. Responsiveness 
to bursty workloads makes sense for scheduling work, but the same cannot be 
said for housekeeping work.

History pruning ({{TaskHistoryPruner}}), job update pruning 
({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can all be 
characterized as housekeeping work that is not in the critical scheduling 
path. It would therefore make sense to rate-limit these ambient activities, 
so that the scheduler is protected from bursts of non-critical work (for 
example, job updates with a large number of instances, network partitions, or 
cleanup after a scale test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with the necessary changes to {{Work}}). 
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and 
{{RowGarbageCollector}} can then enqueue work into the new 
{{RateLimitedBatchWorker}}, from which it will be released into the 
underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]


> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive 
> tasks for pruning upon a terminal _state_ change. 
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to 
> schedule the pruning of tasks. However, we have noticed that most pruning 
> takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to follow a design similar to 
> {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state 
> transitions, have it wake up at preconfigured intervals, find all terminal 
> state tasks that meet the pruning criteria and delete them.
> # Make the initial task history pruning delay configurable so that it does 
> not hamper the scheduler upon start.
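
The two steps above could look roughly like the following Python sketch. Everything in it is illustrative: the task dictionaries, the set of terminal states, and the {{run_pruner}} parameters are assumptions, not the scheduler's actual data model or flags.

```python
import time

# Assumed set of terminal states, for illustration only.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED", "LOST"}


def select_prunable(tasks, now, retention_secs):
    """Return ids of terminal tasks whose last state change is old enough."""
    return [
        t["id"]
        for t in tasks
        if t["status"] in TERMINAL_STATES
        and now - t["changed_at"] >= retention_secs
    ]


def run_pruner(store, interval_secs, initial_delay_secs, retention_secs,
               sleep=time.sleep, clock=time.time, max_rounds=None):
    """Wake up on a fixed interval and delete prunable tasks.

    `store` is a dict of task id -> task. `max_rounds` bounds the loop so the
    sketch can be exercised; a real pruner would run until shutdown.
    """
    sleep(initial_delay_secs)  # configurable delay so startup is not hampered
    rounds = 0
    while True:
        for task_id in select_prunable(store.values(), clock(), retention_secs):
            del store[task_id]
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            return rounds
        sleep(interval_secs)
```

Because the pruner scans on its own schedule, nothing has to be registered per task state transition, and the flood of startup events no longer translates into a flood of pruning work.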





[jira] [Commented] (AURORA-1882) Add support for Mesos ContainerLaunchInfo

2017-01-25 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837484#comment-15837484
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1882:
---

https://reviews.apache.org/r/55916/

> Add support for Mesos ContainerLaunchInfo
> -
>
> Key: AURORA-1882
> URL: https://issues.apache.org/jira/browse/AURORA-1882
> Project: Aurora
>  Issue Type: Task
>  Components: Executor, Thermos
>Reporter: Santhosh Kumar Shanmugham
>Priority: Critical
>
> Mesos 1.2.0 changes the interface of the MesosContainerizer binary and drops 
> support for multiple switches. Without support for the new interface, the 
> Thermos Executor will not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                   Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE  The target 'pid' of the process whose mount
>                                 namespace we'd like to enter before
>                                 executing the command.
>   --pipe_read=VALUE             The read end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --pipe_write=VALUE            The write end of the control pipe. This is a
>                                 file descriptor on Posix, or a handle on
>                                 Windows. It's caller's responsibility to
>                                 make sure the file descriptor or the handle
>                                 is inherited properly in the subprocess.
>                                 It's used to synchronize with the parent
>                                 process. If not specified, no
>                                 synchronization will happen.
>   --runtime_directory=VALUE     The runtime directory for the container
>                                 (used for checkpointing)
>   --[no-]unshare_namespace_mnt  Whether to launch the command in a new mount
>                                 namespace. (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy

2016-12-09 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734721#comment-15734721
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1850:
---

https://reviews.apache.org/r/54299/

> Raw StatusResult passed to the scheduler when tasks are healthy
> ---
>
> Key: AURORA-1850
> URL: https://issues.apache.org/jira/browse/AURORA-1850
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Santhosh Kumar Shanmugham
>Priority: Minor
> Attachments: 
> 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png
>
>
> As part of the recent health check changes, we now pass a message to the 
> scheduler along with the RUNNING transition when the task is healthy. 
> Unfortunately it looks like this message is a stringified `StatusResult`, 
> rather than the message from the `StatusResult` (see attached screenshot for 
> details).





[jira] [Commented] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-06 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727190#comment-15727190
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1841:
---

https://reviews.apache.org/r/54299/

> Update HealthChecks has a backward incompatibility
> --
>
> Key: AURORA-1841
> URL: https://issues.apache.org/jira/browse/AURORA-1841
> Project: Aurora
>  Issue Type: Bug
>  Components: Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> The current implementation of HealthCheck-based updates has a backward 
> incompatibility when the HealthCheckConfig is used in such a way that the 
> initial grace period is extended beyond `initial_interval_secs`.
> The bug causes the new fail-fast feature to fail updates prematurely when 
> `initial_interval_secs` does not account for the entire task warmup time. 
> In the older version, the actual grace period for an update to succeed was 
> `initial_interval_secs + interval_secs * max_consecutive_failures`.
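
As a worked example of the formula above, here is a small Python sketch. The function names are hypothetical, and the second function assumes that fail-fast gives up as soon as `initial_interval_secs` elapses without a healthy check, which is a simplification of the behaviour described in this ticket.

```python
def legacy_grace_period_secs(initial_interval_secs, interval_secs,
                             max_consecutive_failures):
    """Effective grace period under the older update behaviour."""
    return initial_interval_secs + interval_secs * max_consecutive_failures


def regressed_by_fail_fast(warmup_secs, initial_interval_secs, interval_secs,
                           max_consecutive_failures):
    """True when a task used to warm up in time but now exceeds the window."""
    legacy = legacy_grace_period_secs(
        initial_interval_secs, interval_secs, max_consecutive_failures)
    return initial_interval_secs < warmup_secs <= legacy
```

With `initial_interval_secs=15`, `interval_secs=10` and `max_consecutive_failures=3`, the older grace period is 15 + 10 * 3 = 45 seconds, so a task that needs 40 seconds to warm up succeeded before and would be failed early under the assumption above.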





[jira] [Updated] (AURORA-1844) Force a snapshot at the end of Scheduler startup.

2016-12-02 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1844:
--
Summary: Force a snapshot at the end of Scheduler startup.  (was: Force a 
snapshot at the end of startup.)

> Force a snapshot at the end of Scheduler startup.
> -
>
> Key: AURORA-1844
> URL: https://issues.apache.org/jira/browse/AURORA-1844
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When the scheduler starts up, it replays the entries from the replicated log 
> to catch up with the current state before announcing itself as the leader to 
> the outside world. If the scheduler dies after this replay, having added more 
> log entries, the next startup has to redo that work. This becomes a problem 
> when the amount of additional work is not trivial, and it can send the 
> scheduler into a death spiral. One example is when the TaskHistoryPruner 
> cleans up the DB but adds to the log entries. To avoid the repeated work, the 
> scheduler should force a snapshot after the initial replay.
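
A toy model of the effect, in Python. This is not how the replicated log is implemented; {{ReplicatedLogModel}} and {{entries_replayed_after_crash}} are hypothetical names, and the only point is to count how many entries the next startup would have to replay.

```python
class ReplicatedLogModel:
    """Toy model: state is the latest snapshot plus entries appended after it."""

    def __init__(self):
        self.snapshot = {}
        self.entries = []  # (key, value) mutations appended since the snapshot

    def append(self, key, value):
        self.entries.append((key, value))

    def replay(self):
        """Rebuild state; returns (state, number_of_entries_replayed)."""
        state = dict(self.snapshot)
        for key, value in self.entries:
            if value is None:
                state.pop(key, None)  # None models a deletion, e.g. a pruned task
            else:
                state[key] = value
        return state, len(self.entries)

    def take_snapshot(self):
        state, _ = self.replay()
        self.snapshot = state
        self.entries = []


def entries_replayed_after_crash(force_snapshot, task_count=1000):
    """Replay, optionally snapshot, let the pruner append deletions, then crash."""
    log = ReplicatedLogModel()
    for i in range(task_count):
        log.append("task-%d" % i, "FINISHED")
    log.replay()                      # initial startup replay
    if force_snapshot:
        log.take_snapshot()
    for i in range(task_count):
        log.append("task-%d" % i, None)  # pruner adds deletion entries
    _, replayed = log.replay()        # next startup after the crash
    return replayed
```

In this model the restart replays 2000 entries without the forced snapshot and 1000 with it, which is the repeated work the ticket wants to avoid.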



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1841) Update HealthChecks has a backward incompatibility

2016-12-01 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1841:
--
Component/s: Thermos

> Update HealthChecks has a backward incompatibility
> --
>
> Key: AURORA-1841
> URL: https://issues.apache.org/jira/browse/AURORA-1841
> Project: Aurora
>  Issue Type: Bug
>  Components: Thermos
>Reporter: Santhosh Kumar Shanmugham
>Assignee: Santhosh Kumar Shanmugham
>
> The current implementation of HealthCheck-based updates has a backward 
> incompatibility when the HealthCheckConfig is used in such a way that the 
> initial grace period is extended beyond `initial_interval_secs`.
> The bug causes the new fail-fast feature to fail updates prematurely when 
> `initial_interval_secs` does not account for the entire task warmup time. 
> In the older version, the actual grace period for an update to succeed was 
> `initial_interval_secs + interval_secs * max_consecutive_failures`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}


  was:
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.

MesosContainerizer:

Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (in https://reviews.apache.org/r/52479/), we need another binding-helper that 
> will resolve the image tag in the MesosContainerizer's Docker config to an 
> image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> will be translated to,
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}





[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}
{noformat}

will be translated to,

{noformat}
Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}


  was:
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.

DSL Translation:

{noformat}
Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}
{noformat}



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (in https://reviews.apache.org/r/52479/), we need another binding-helper that 
> will resolve the image tag in the MesosContainerizer's Docker config to an 
> image digest.
> DSL Translation:
> {noformat}
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> {noformat}
> will be translated to,
> {noformat}
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }
> {noformat}
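
A minimal sketch of the translation shown above, under stated assumptions: the `resolve` function, the `REF` pattern, and the `DIGESTS` table are hypothetical names introduced here, and the table stands in for whatever registry lookup a real helper would perform. This is not Aurora's binding-helper API.

```python
import re

# Hypothetical lookup table standing in for a registry query.
DIGESTS = {("my-image", "my-tag"): "my-digest"}

# Matches {{docker.resolve["<image>"]["<tag>"]}} as written in the DSL example.
REF = re.compile(r'\{\{docker\.resolve\["([^"]+)"\]\["([^"]+)"\]\}\}')


def resolve(image_ref, digests=DIGESTS):
    """Return (name, digest) for a docker.resolve reference string."""
    match = REF.fullmatch(image_ref)
    if match is None:
        raise ValueError(f"not a docker.resolve reference: {image_ref!r}")
    name, tag = match.groups()
    try:
        return name, digests[(name, tag)]
    except KeyError:
        raise LookupError(f"no digest known for {name}:{tag}") from None


name, digest = resolve('{{docker.resolve["my-image"]["my-tag"]}}')
print(f'DockerImage(name="{name}", digest="{digest}")')
```

Failing loudly on an unknown tag, rather than passing the tag through, keeps a job from silently running on a mutable reference.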





[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham updated AURORA-1824:
--
Description: 
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.

MesosContainerizer:

Job {
...
mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
...
}

will be translated to,

Job {
...
mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
...
}


  was:
Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.



> Create a binding-helper to resolve docker tags to concrete image digests for 
> MesosContainerizer
> ---
>
> Key: AURORA-1824
> URL: https://issues.apache.org/jira/browse/AURORA-1824
> Project: Aurora
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Similar to the binding-helper that was introduced for DockerContainerizer 
> (in https://reviews.apache.org/r/52479/), we need another binding-helper that 
> will resolve the image tag in the MesosContainerizer's Docker config to an 
> image digest.
> MesosContainerizer:
> Job {
> ...
> mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}')
> ...
> }
> will be translated to,
> Job {
> ...
> mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest"))
> ...
> }





[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684470#comment-15684470
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1014:
---

Created https://issues.apache.org/jira/browse/AURORA-1824 to add a similar 
binding-helper for MesosContainerizer.

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/





[jira] [Resolved] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1014.
---
Resolution: Fixed

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/





[jira] [Created] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)
Santhosh Kumar Shanmugham created AURORA-1824:
-

 Summary: Create a binding-helper to resolve docker tags to 
concrete image digests for MesosContainerizer
 Key: AURORA-1824
 URL: https://issues.apache.org/jira/browse/AURORA-1824
 Project: Aurora
  Issue Type: Story
Reporter: Santhosh Kumar Shanmugham


Similar to the binding-helper that was introduced for DockerContainerizer 
(in https://reviews.apache.org/r/52479/), we need another binding-helper that 
will resolve the image tag in the MesosContainerizer's Docker config to an 
image digest.






[jira] [Resolved] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham resolved AURORA-1225.
---
Resolution: Fixed

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition 
> to RUNNING once the required number of successful health checks is reached.
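
The transition rule above can be sketched as follows. `TaskStatus` and `record_health_check` are hypothetical names, not the Thermos executor's actual classes, and the sketch assumes that successes are simply counted; whether they must be consecutive is not stated in the issue.

```python
STARTING, RUNNING = "STARTING", "RUNNING"


class TaskStatus:
    """Hypothetical tracker for the STARTING -> RUNNING transition."""

    def __init__(self, required_successes):
        self.required_successes = required_successes
        self.successes = 0
        self.state = STARTING   # user content is already executing here

    def record_health_check(self, healthy):
        """Count a health check result and promote the task when ready."""
        if self.state == STARTING and healthy:
            self.successes += 1
            if self.successes >= self.required_successes:
                self.state = RUNNING
        return self.state


status = TaskStatus(required_successes=2)
states = [status.record_health_check(ok) for ok in (False, True, True)]
print(states)  # ['STARTING', 'STARTING', 'RUNNING']
```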





[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-11-21 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684367#comment-15684367
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1225:
---

Tested this in our internal test cluster.

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition 
> to RUNNING once the required number of successful health checks is reached.





[jira] [Assigned] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-31 Thread Santhosh Kumar Shanmugham (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Kumar Shanmugham reassigned AURORA-1225:
-

Assignee: Santhosh Kumar Shanmugham

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
>
> The executor needs to start executing user content in STARTING and transition 
> to RUNNING once the required number of successful health checks is reached.





[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-09-28 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531456#comment-15531456
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1014:
---

I took a shot at this, but unfortunately it is blocked by missing support in 
Mesos. See https://issues.apache.org/jira/browse/MESOS-3505

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/





[jira] [Commented] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'

2016-09-13 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488902#comment-15488902
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1688:
---

https://reviews.apache.org/r/51874

> Change framework_name default value from 'TwitterScheduler' to 'aurora'
> ---
>
> Key: AURORA-1688
> URL: https://issues.apache.org/jira/browse/AURORA-1688
> Project: Aurora
>  Issue Type: Sub-task
>Reporter: Stephan Erb
>Assignee: Santhosh Kumar Shanmugham
>






[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-15 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421854#comment-15421854
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1711:
---

+1 on extending {{StartJobUpdateResult}}

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 





[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity

2016-08-12 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419776#comment-15419776
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1711:
---

Here is a proposal that incorporates [~maximk]'s and [~davmclau]'s suggestions.

The client sends a UUID in the metadata field of the JobUpdateRequest, like so:
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK" #UUID
    }
  }
}
{code}
The scheduler looks up the active updates for the {{JobKey}} and compares the 
{{Disambiguator}} to make sure the request is not a duplicate.
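
A rough sketch of that duplicate check, assuming plain dictionaries for requests and active updates. The `is_duplicate` function and the dictionary layout are hypothetical and do not reflect the scheduler's Thrift structures; `uuid.uuid4` is from the Python standard library.

```python
import uuid


def is_duplicate(request, active_updates):
    """Return True if an active update for the same job has the same token."""
    token = request["metadata"].get("Disambiguator")
    if token is None:
        return False  # nothing to compare against
    return any(
        update["job_key"] == request["job_key"]
        and update["metadata"].get("Disambiguator") == token
        for update in active_updates
    )


# Client side: attach a fresh UUID to the request before sending it.
request = {
    "job_key": "role/env/name",
    "metadata": {"Disambiguator": str(uuid.uuid4())},
}

active_updates = []
first_attempt = is_duplicate(request, active_updates)  # False: nothing active
active_updates.append(request)                         # scheduler starts it
retry = is_duplicate(request, active_updates)          # True: same token
```

A client that retries after a timeout would resend the same token, so the scheduler can recognise the retry instead of starting a second update.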

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing

> Allow client to store metadata on Update entity
> ---
>
> Key: AURORA-1711
> URL: https://issues.apache.org/jira/browse/AURORA-1711
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora 
> API and sometimes the request to the scheduler times out or fails, even 
> though the update is written to storage and started. 
> I'd like to be able to store some unique identifier on the update so that we 
> can reconcile this state later. We can make this generic by allowing clients 
> to store arbitrary metadata on an update (similar to how they do it with job 
> configuration). 


