[jira] [Commented] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down
[ https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514181#comment-16514181 ] Santhosh Kumar Shanmugham commented on AURORA-1990: --- https://reviews.apache.org/r/67613/ > SlaManager's AsyncHttpClient can keep scheduler from shutting down > -- > > Key: AURORA-1990 > URL: https://issues.apache.org/jira/browse/AURORA-1990 > Project: Aurora > Issue Type: Bug >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > We observed a situation where the scheduler was unable to fully shut down > after a single non-daemon thread from the AsyncHttpClient's thread pool > stayed alive. > We should convert the SlaManager to be an AbstractIdleService to ensure that > the thread pool is closed on scheduler shutdown. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down
[ https://issues.apache.org/jira/browse/AURORA-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1990: - Assignee: Santhosh Kumar Shanmugham > SlaManager's AsyncHttpClient can keep scheduler from shutting down > -- > > Key: AURORA-1990 > URL: https://issues.apache.org/jira/browse/AURORA-1990 > Project: Aurora > Issue Type: Bug >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > We observed a situation where the scheduler was unable to fully shut down > after a single non-daemon thread from the AsyncHttpClient's thread pool > stayed alive. > We should convert the SlaManager to be an AbstractIdleService to ensure that > the thread pool is closed on scheduler shutdown. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1990) SlaManager's AsyncHttpClient can keep scheduler from shutting down
Santhosh Kumar Shanmugham created AURORA-1990: - Summary: SlaManager's AsyncHttpClient can keep scheduler from shutting down Key: AURORA-1990 URL: https://issues.apache.org/jira/browse/AURORA-1990 Project: Aurora Issue Type: Bug Reporter: Santhosh Kumar Shanmugham We observed a situation where the scheduler was unable to fully shut down after a single non-daemon thread from the AsyncHttpClient's thread pool stayed alive. We should convert the SlaManager to be an AbstractIdleService to ensure that the thread pool is closed on scheduler shutdown. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AURORA-1979) Introduce new SLA-aware maintenance APIs
[ https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1979. --- Resolution: Implemented https://reviews.apache.org/r/66716/ > Introduce new SLA-aware maintenance APIs > > > Key: AURORA-1979 > URL: https://issues.apache.org/jira/browse/AURORA-1979 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Introduce the new SLA-aware thrift endpoints on the Scheduler. More > information in the proposal document. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AURORA-1976) SLA-based Maintenance in Scheduler
[ https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1976. --- Resolution: Implemented > SLA-based Maintenance in Scheduler > -- > > Key: AURORA-1976 > URL: https://issues.apache.org/jira/browse/AURORA-1976 > Project: Aurora > Issue Type: Story > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Proposal to create SLA-aware maintenance support in the Scheduler. > https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AURORA-1978) Implement SLA computation utilities
[ https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1978. --- Resolution: Implemented https://reviews.apache.org/r/66716/ > Implement SLA computation utilities > --- > > Key: AURORA-1978 > URL: https://issues.apache.org/jira/browse/AURORA-1978 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Create utilities and helper methods required for computing SLA in the > Scheduler. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1986) Deprecate client-driven host maintenance.
Santhosh Kumar Shanmugham created AURORA-1986: - Summary: Deprecate client-driven host maintenance. Key: AURORA-1986 URL: https://issues.apache.org/jira/browse/AURORA-1986 Project: Aurora Issue Type: Bug Reporter: Santhosh Kumar Shanmugham After AURORA-1978 the client-driven host maintenance is no longer required. `sla_host_drain` is the newly introduced admin command. Merge this under the `host_drain` command and remove `sla_host_drain`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1985) Offers are not updated after host_activate immediately
Santhosh Kumar Shanmugham created AURORA-1985: - Summary: Offers are not updated after host_activate immediately Key: AURORA-1985 URL: https://issues.apache.org/jira/browse/AURORA-1985 Project: Aurora Issue Type: Bug Reporter: Santhosh Kumar Shanmugham After activating a host the {{HostAttributes}} on {{HostOffers}} does not consider the offer for scheduling tasks. It takes until the next offer-cycle to update the offer and use it for scheduling. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (AURORA-1977) Introduce thrift changes for describing SLAPolicies
[ https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1977. --- Resolution: Implemented > Introduce thrift changes for describing SLAPolicies > --- > > Key: AURORA-1977 > URL: https://issues.apache.org/jira/browse/AURORA-1977 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AURORA-1979) Introduce new SLA-aware maintenance APIs
[ https://issues.apache.org/jira/browse/AURORA-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1979: - Assignee: Santhosh Kumar Shanmugham > Introduce new SLA-aware maintenance APIs > > > Key: AURORA-1979 > URL: https://issues.apache.org/jira/browse/AURORA-1979 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Introduce the new SLA-aware thrift endpoints on the Scheduler. More > information in the proposal document. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (AURORA-1977) Introduce thrift changes for describing SLAPolicies
[ https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476463#comment-16476463 ] Santhosh Kumar Shanmugham commented on AURORA-1977: --- [~jingc] I have been working on this as part of [https://reviews.apache.org/r/66716]. I forgot to assign this to myself. Apologies. > Introduce thrift changes for describing SLAPolicies > --- > > Key: AURORA-1977 > URL: https://issues.apache.org/jira/browse/AURORA-1977 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Jing Chen >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AURORA-1977) Introduce thrift changes for describing SLAPolicies
[ https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1977: - Assignee: Santhosh Kumar Shanmugham (was: Jing Chen) > Introduce thrift changes for describing SLAPolicies > --- > > Key: AURORA-1977 > URL: https://issues.apache.org/jira/browse/AURORA-1977 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AURORA-1978) Implement SLA computation utilities
[ https://issues.apache.org/jira/browse/AURORA-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1978: - Assignee: Santhosh Kumar Shanmugham > Implement SLA computation utilities > --- > > Key: AURORA-1978 > URL: https://issues.apache.org/jira/browse/AURORA-1978 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Create utilities and helper methods required for computing SLA in the > Scheduler. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1979) Introduce new SLA-aware maintenance APIs
Santhosh Kumar Shanmugham created AURORA-1979: - Summary: Introduce new SLA-aware maintenance APIs Key: AURORA-1979 URL: https://issues.apache.org/jira/browse/AURORA-1979 Project: Aurora Issue Type: Sub-task Components: Scheduler Reporter: Santhosh Kumar Shanmugham Introduce the new SLA-aware thrift endpoints on the Scheduler. More information in the proposal document. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1978) Implement SLA computation utilities
Santhosh Kumar Shanmugham created AURORA-1978: - Summary: Implement SLA computation utilities Key: AURORA-1978 URL: https://issues.apache.org/jira/browse/AURORA-1978 Project: Aurora Issue Type: Sub-task Components: Scheduler Reporter: Santhosh Kumar Shanmugham Create utilities and helper methods required for computing SLA in the Scheduler. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (AURORA-1977) Introduce thrift changes for describing SLAPolicies
[ https://issues.apache.org/jira/browse/AURORA-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1977: -- Component/s: Scheduler > Introduce thrift changes for describing SLAPolicies > --- > > Key: AURORA-1977 > URL: https://issues.apache.org/jira/browse/AURORA-1977 > Project: Aurora > Issue Type: Sub-task > Components: Scheduler >Reporter: Santhosh Kumar Shanmugham >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1977) Introduce thrift changes for describing SLAPolicies
Santhosh Kumar Shanmugham created AURORA-1977: - Summary: Introduce thrift changes for describing SLAPolicies Key: AURORA-1977 URL: https://issues.apache.org/jira/browse/AURORA-1977 Project: Aurora Issue Type: Sub-task Reporter: Santhosh Kumar Shanmugham -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (AURORA-1976) SLA-based Maintenance in Scheduler
[ https://issues.apache.org/jira/browse/AURORA-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1976: - Assignee: Santhosh Kumar Shanmugham > SLA-based Maintenance in Scheduler > -- > > Key: AURORA-1976 > URL: https://issues.apache.org/jira/browse/AURORA-1976 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Major > > Proposal to create SLA-aware maintenance support in the Scheduler. > https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1976) SLA-based Maintenance in Scheduler
Santhosh Kumar Shanmugham created AURORA-1976: - Summary: SLA-based Maintenance in Scheduler Key: AURORA-1976 URL: https://issues.apache.org/jira/browse/AURORA-1976 Project: Aurora Issue Type: Story Reporter: Santhosh Kumar Shanmugham Proposal to create SLA-aware maintenance support in the Scheduler. https://docs.google.com/document/d/1T2ejxbYeEGDcDemHfHTHVsDhJDq2w4_pIYQqj6Uxr0s/edit#heading=h.x53cx1kgdp3j -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (AURORA-1965) LaunchException needs transient task timeout for reschedule
Santhosh Kumar Shanmugham created AURORA-1965: - Summary: LaunchException needs transient task timeout for reschedule Key: AURORA-1965 URL: https://issues.apache.org/jira/browse/AURORA-1965 Project: Aurora Issue Type: Bug Reporter: Santhosh Kumar Shanmugham In {{TaskAssignerImpl}} a {{LaunchException}}, caused either by a race condition while grabbing an {{Offer}} or by a Scheduler Driver {{acceptOffers}} failure, will result in the task being held up in the {{ASSIGNED}} state until the transient task timeout kicks in to move the task to LOST, resulting in a reschedule. [https://github.com/apache/aurora/blob/dbe71374399d86d77baa35efbeca4fffa34f380e/src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java#L135] The impact of this is minimal due to the fact that multiple conditions need to align for this situation to happen. This happens because the call to {{changeState}} fails, since the supplied current state (PENDING) in the method call is different from the actual state of the task (ASSIGNED). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
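The failure mode described in AURORA-1965 can be illustrated with a minimal sketch of a compare-and-set style state transition. It is written in Python for brevity rather than the scheduler's Java, and `change_state` and the `tasks` dict are hypothetical stand-ins, not Aurora's actual `changeState` API.

```python
def change_state(tasks, task_id, expected, new_state):
    """Move a task to new_state only if it is currently in `expected`.

    Hypothetical stand-in for a compare-and-set transition; returns True
    when the transition was applied and False when it was rejected.
    """
    if tasks.get(task_id) != expected:
        # The caller's view of the task is stale, so nothing changes.
        return False
    tasks[task_id] = new_state
    return True


tasks = {"task-1": "ASSIGNED"}

# The caller still believes the task is PENDING, so the call is rejected
# and the task stays in ASSIGNED until a timeout moves it on.
stale = change_state(tasks, "task-1", "PENDING", "LOST")

# Supplying the actual current state lets the transition go through.
fresh = change_state(tasks, "task-1", "ASSIGNED", "LOST")
```

In this sketch the first call leaves the task untouched, which mirrors how a mismatched current state leaves the task waiting for the transient task timeout.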
[jira] [Commented] (AURORA-1946) Make STARTING a transient state
[ https://issues.apache.org/jira/browse/AURORA-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172383#comment-16172383 ] Santhosh Kumar Shanmugham commented on AURORA-1946: --- The underlying issue in the above case was that the Task's status updates to {{RUNNING}} and then to {{FAILED}} were never communicated to the Master and eventually to the Scheduler. So the issue lies with the Agent. > Make STARTING a transient state > --- > > Key: AURORA-1946 > URL: https://issues.apache.org/jira/browse/AURORA-1946 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > We saw a case where an update was stuck in {{IN_PROGRESS}} state, after a > task's status update from {{STARTING}} to {{FAILED}} was lost. In the ideal > scenario the {{Task}} should have been transitioned into {{LOST}} due to a > transient state. But {{STARTING}} is not a transient state. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AURORA-1912) DbSnapShot may remove enum values
[ https://issues.apache.org/jira/browse/AURORA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1912: -- Fix Version/s: 0.18.0 > DbSnapShot may remove enum values > - > > Key: AURORA-1912 > URL: https://issues.apache.org/jira/browse/AURORA-1912 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > The dbsnapshot restore may truncate enum tables and cause referential > integrity issues. From the code it restores from the SQL dump by first > dropping all tables: > {noformat} > try (Connection c = ((DataSource) > store.getUnsafeStoreAccess()).getConnection()) { > LOG.info("Dropping all tables"); > try (PreparedStatement drop = c.prepareStatement("DROP ALL > OBJECTS")) { > drop.executeUpdate(); > } > {noformat} > However a freshly started leader will have some data in there from preparing > the storage: > {noformat} > @Override > @Transactional > protected void startUp() throws IOException { > Configuration configuration = sessionFactory.getConfiguration(); > String createStatementName = "create_tables"; > configuration.setMapUnderscoreToCamelCase(true); > // The ReuseExecutor will cache jdbc Statements with equivalent SQL, > improving performance > // slightly when redundant queries are made. 
> configuration.setDefaultExecutorType(ExecutorType.REUSE); > addMappedStatement( > configuration, > createStatementName, > CharStreams.toString(new InputStreamReader( > DbStorage.class.getResourceAsStream("schema.sql"), > StandardCharsets.UTF_8))); > try (SqlSession session = sessionFactory.openSession()) { > session.update(createStatementName); > } > for (CronCollisionPolicy policy : CronCollisionPolicy.values()) { > enumValueMapper.addEnumValue("cron_policies", policy.getValue(), > policy.name()); > } > for (MaintenanceMode mode : MaintenanceMode.values()) { > enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), > mode.name()); > } > for (JobUpdateStatus status : JobUpdateStatus.values()) { > enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), > status.name()); > } > for (JobUpdateAction action : JobUpdateAction.values()) { > enumValueMapper.addEnumValue("job_instance_update_actions", > action.getValue(), action.name()); > } > for (ScheduleStatus status : ScheduleStatus.values()) { > enumValueMapper.addEnumValue("task_states", status.getValue(), > status.name()); > } > for (ResourceType resourceType : ResourceType.values()) { > enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), > resourceType.name()); > } > for (Mode mode : Mode.values()) { > enumValueMapper.addEnumValue("volume_modes", mode.getValue(), > mode.name()); > } > createPoolMetrics(); > } > {noformat} > Consider the case where we add a new value to an existing enum. This means > restoring from a snapshot will not allow us to have that value in the enum > table. > To fix this we should have a migration for every enum value we add. However > to me it seems that the better idea would be to update the enum tables after > we restore from a snapshot. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
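The second option proposed in AURORA-1912 (updating the enum tables after a snapshot restore) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's `sqlite3` in place of the scheduler's actual database, and the table layout, the enum values, and the `restore_snapshot` and `populate_enums` helpers are all hypothetical.

```python
import sqlite3

# Hypothetical enum contents: the snapshot predates the third value.
SNAPSHOT_STATES = [(0, "PENDING"), (1, "ASSIGNED")]
CURRENT_STATES = [(0, "PENDING"), (1, "ASSIGNED"), (2, "PARTITIONED")]


def restore_snapshot(conn, snapshot_rows):
    """Drop and recreate the enum table from the snapshot's rows."""
    conn.execute("DROP TABLE IF EXISTS task_states")
    conn.execute(
        "CREATE TABLE task_states (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.executemany("INSERT INTO task_states VALUES (?, ?)", snapshot_rows)


def populate_enums(conn, values):
    """Idempotently add any enum values missing from the table."""
    conn.executemany("INSERT OR IGNORE INTO task_states VALUES (?, ?)", values)


def state_names(conn):
    rows = conn.execute("SELECT name FROM task_states ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
restore_snapshot(conn, SNAPSHOT_STATES)
after_restore = state_names(conn)    # the newer enum value is missing here

populate_enums(conn, CURRENT_STATES)
after_populate = state_names(conn)   # the newer enum value is back
```

Because the repopulation step ignores rows that already exist, it can run after every restore without a per-value migration.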
[jira] [Updated] (AURORA-1680) Aurora 0.16.0 deprecations
[ https://issues.apache.org/jira/browse/AURORA-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1680: -- Fix Version/s: 0.18.0 > Aurora 0.16.0 deprecations > -- > > Key: AURORA-1680 > URL: https://issues.apache.org/jira/browse/AURORA-1680 > Project: Aurora > Issue Type: Epic >Reporter: Maxim Khutornenko > Fix For: 0.18.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set
[ https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1909: -- Fix Version/s: 0.18.0 > Thermos Health Check fails for MesosContainerizer if > `--nosetuid-health-checks` is set > -- > > Key: AURORA-1909 > URL: https://issues.apache.org/jira/browse/AURORA-1909 > Project: Aurora > Issue Type: Bug > Components: Executor >Reporter: Charles Raimbert >Assignee: Charles Raimbert > Labels: easyfix > Fix For: 0.18.0 > > > With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and > the health check is performed using a "mesos-containerizer launch" process, > but there is actually a code bug in the way of getting the user under which > to run the health check process: > https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370 > {code} > health_check_user = (os.getusername() if self._nosetuid_health_checks > else assigned_task.task.job.role) > {code} > If the Aurora scheduler is configured with `--nosetuid-health-checks` then > "os.getusername()" is executed, but the python "os" module does not present a > "getusername()" function. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
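For reference, a minimal sketch of the expression from AURORA-1909 using a call that does exist in the Python standard library. `getpass.getuser()` is one possible replacement and is an assumption here, not necessarily the fix that was committed; `health_check_user` is a hypothetical helper that takes the flag and role as plain arguments.

```python
import getpass
import os

# The "os" module has no getusername() function, so the original expression
# would raise AttributeError whenever that branch is evaluated.
OS_HAS_GETUSERNAME = hasattr(os, "getusername")


def health_check_user(nosetuid_health_checks, job_role):
    """Return the user a health check process should run as.

    Hypothetical helper mirroring the conditional quoted in the issue, with
    getpass.getuser() standing in for the non-existent os.getusername().
    """
    return getpass.getuser() if nosetuid_health_checks else job_role
```

`getpass.getuser()` consults the `LOGNAME`, `USER`, `LNAME` and `USERNAME` environment variables before falling back to the password database, so its result depends on the environment the executor runs in.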
[jira] [Updated] (AURORA-1860) Fix bug in scheduler driver disconnect stats
[ https://issues.apache.org/jira/browse/AURORA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1860: -- Fix Version/s: 0.18.0 > Fix bug in scheduler driver disconnect stats > > > Key: AURORA-1860 > URL: https://issues.apache.org/jira/browse/AURORA-1860 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Mehrdad Nurolahzade >Assignee: Ilya Pronin >Priority: Minor > Labels: newbie > Fix For: 0.18.0 > > > Correct the refactoring mistake introduced in > [https://reviews.apache.org/r/31550/] that has disabled > {{scheduler_framework_disconnects}} stats: > {code:title=MesosSchedulerImpl.disconnected()} > counters.get("scheduler_framework_disconnects").get(); > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
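The snippet quoted in AURORA-1860 only reads the counter, which is presumably why the stat stopped moving. A small Python analogue of the difference between reading and incrementing is sketched below; the `Counter` class and both handler functions are hypothetical, and the assumption that the intended call is an increment is mine rather than something stated in the issue.

```python
import threading


class Counter:
    """Tiny thread-safe counter, loosely analogous to an atomic long."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value


counters = {"scheduler_framework_disconnects": Counter()}


def disconnected_read_only():
    # Mirrors the quoted snippet: the value is read and then discarded.
    counters["scheduler_framework_disconnects"].get()


def disconnected_counting():
    # Assumed intent: record one more disconnect.
    counters["scheduler_framework_disconnects"].increment_and_get()
```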
[jira] [Updated] (AURORA-1923) Aurora client should not automatically retry non-idempotent operations
[ https://issues.apache.org/jira/browse/AURORA-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1923: -- Fix Version/s: 0.18.0 > Aurora client should not automatically retry non-idempotent operations > -- > > Key: AURORA-1923 > URL: https://issues.apache.org/jira/browse/AURORA-1923 > Project: Aurora > Issue Type: Story > Components: Client >Reporter: Mehrdad Nurolahzade >Assignee: Mehrdad Nurolahzade > Fix For: 0.18.0 > > > Aurora client has a built-in mechanism to automatically retry thrift API > operations if the connection with scheduler times out, experiences a transport > exception, or encounters a transient exception on the scheduler side. > Retrying thrift calls due to scheduler connection timeout and transient > exceptions (see [AURORA-187]) is safe. However, as Aurora has no concept of > idempotency, its client can retry non-idempotent operations upon encountering > transport exceptions which can lead to nondeterministic situations. > For example, if client requests go through a proxy to reach scheduler, client > might consider a non-idempotent request failed and automatically retry it > while the original request has been received and processed by the scheduler. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
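The retry policy argued for in AURORA-1923 could look roughly like the sketch below. The exception classes, the `call_with_retry` function and its `idempotent` flag are hypothetical names for illustration and do not correspond to the Aurora client's real retry code.

```python
class TransportError(Exception):
    """The connection broke mid-call; the request may have been applied."""


class TransientError(Exception):
    """The scheduler reported a transient failure (treated as safe to retry)."""


def call_with_retry(operation, idempotent, max_attempts=3):
    """Call operation(), retrying only when a retry cannot double-apply it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # Retried regardless of idempotency, until attempts run out.
            if attempt == max_attempts:
                raise
        except TransportError:
            # The outcome is unknown, so only idempotent calls are retried.
            if not idempotent or attempt == max_attempts:
                raise
```

With this shape, a non-idempotent call that hits a transport exception surfaces the error to the user on the first failure instead of being silently re-sent.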
[jira] [Updated] (AURORA-1911) HTTP Scheduler Driver does not reliably re subscribe
[ https://issues.apache.org/jira/browse/AURORA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1911: -- Fix Version/s: 0.18.0 > HTTP Scheduler Driver does not reliably re subscribe > > > Key: AURORA-1911 > URL: https://issues.apache.org/jira/browse/AURORA-1911 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > I observed this issue in a large production cluster during a period of Mesos > Master instability: > 1. Mesos master crashes or restarts. > 2. {{V1Mesos}} driver detects this and reconnects. > 3. Aurora does the {{SUBSCRIBE}} call again. > 4. The {{SUBSCRIBE}} Call fails silently in the driver. > 5. All future calls are silently dropped by the driver. > 6. Aurora has no offers because it is not subscribed. > Logs: > {noformat} > I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at > http://10.162.14.30:5050/master/api/v1/scheduler > W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service > Unavailable' () for SUBSCRIBE > > W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > > W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > > W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > ... > {noformat} > To fix this, the {{VersionedSchedulerDriver}} needs to do two things: > 1. Block calls when unsubscribed not just disconnected. > 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
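The second fix proposed in AURORA-1911, retrying {{SUBSCRIBE}} with exponential backoff, can be sketched as below. This is a Python illustration, not the {{VersionedSchedulerDriver}} code: the `subscribe` callable, the injected `sleep` function and all default values are assumptions chosen for the example.

```python
def backoff_delays(initial=1.0, factor=2.0, cap=60.0):
    """Yield an endless series of delays that grow by `factor` up to `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= factor


def subscribe_with_backoff(subscribe, sleep, max_attempts=10, **backoff):
    """Call subscribe() until it returns True, sleeping between failures.

    Returns the number of attempts used. `sleep` is injected so callers can
    pass time.sleep in production or a recorder in tests.
    """
    delays = backoff_delays(**backoff)
    for attempt in range(1, max_attempts + 1):
        if subscribe():
            return attempt
        if attempt < max_attempts:
            sleep(next(delays))
    raise RuntimeError("SUBSCRIBE did not succeed after %d attempts" % max_attempts)
```

A caller would typically pass `time.sleep` as `sleep`; the cap keeps the wait bounded while the master is unavailable for a long period.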
[jira] [Updated] (AURORA-1904) Support Mesos Maintenance
[ https://issues.apache.org/jira/browse/AURORA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1904: -- Fix Version/s: 0.18.0 > Support Mesos Maintenance > - > > Key: AURORA-1904 > URL: https://issues.apache.org/jira/browse/AURORA-1904 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > Fix For: 0.18.0 > > > Support Mesos Maintenance primitives in Aurora per the design > [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1920) Add pluggable scheduling logic to Aurora
[ https://issues.apache.org/jira/browse/AURORA-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1920: -- Fix Version/s: 0.18.0 > Add pluggable scheduling logic to Aurora > > > Key: AURORA-1920 > URL: https://issues.apache.org/jira/browse/AURORA-1920 > Project: Aurora > Issue Type: Epic > Components: Scheduler >Reporter: David McLaughlin >Assignee: David McLaughlin > Fix For: 0.18.0 > > > On the mailing list recently there was a desire to have custom scheduling > logic (e.g. to implement scheduling spread). This ticket tracks the proposal > and implementation of custom scheduling. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1924) Aurora client should reconcile idempotent job creations
[ https://issues.apache.org/jira/browse/AURORA-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1924: -- Fix Version/s: 0.18.0 > Aurora client should reconcile idempotent job creations > --- > > Key: AURORA-1924 > URL: https://issues.apache.org/jira/browse/AURORA-1924 > Project: Aurora > Issue Type: Story > Components: Client, Scheduler >Reporter: Mehrdad Nurolahzade >Assignee: Mehrdad Nurolahzade >Priority: Minor > Fix For: 0.18.0 > > > Aurora scheduler rejects a request to create a job if a job with the same key > already exists (see {{SchedulerThriftInterface.createJob()}}). Aurora client > exits with an error once it receives a response with > {{ResponseCode.INVALID_REQUEST}} from scheduler in this case. > However, an attempt to create a job with the exact same configuration and > number of instances is essentially idempotent. Scheduler can detect this > situation, ignore it, and signal client to treat operation as successful; > client warns user about existing job but does not fail the operation. > This helps Aurora client and scheduler reconcile state when creating jobs in > presence of transport layer exceptions; the {{aurora job create}} > command can then be marked as idempotent after [AURORA-1923] is fixed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1921) Create proposal for which components should be pluggable
[ https://issues.apache.org/jira/browse/AURORA-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1921: -- Fix Version/s: 0.18.0 > Create proposal for which components should be pluggable > > > Key: AURORA-1921 > URL: https://issues.apache.org/jira/browse/AURORA-1921 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: David McLaughlin >Assignee: David McLaughlin > Fix For: 0.18.0 > > > Create a proposal on the dev list for how pluggable scheduling should work. > Is TaskAssigner the class we want to customize? Should preemption also factor > into the customization since it is very closely tied to the Scheduling logic? > We also need to confirm the plugin mechanism. Most likely we should base this > on the current auth support which is pluggable by passing in a Guice module > via cmdline. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1925) Easily copy files to/from an aurora task instance
[ https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1925: -- Fix Version/s: 0.18.0 > Easily copy files to/from an aurora task instance > - > > Key: AURORA-1925 > URL: https://issues.apache.org/jira/browse/AURORA-1925 > Project: Aurora > Issue Type: Task > Components: Client >Reporter: Jordan Ly >Assignee: Jordan Ly >Priority: Minor > Labels: features, newbie > Fix For: 0.18.0 > > > "So, we have "aurora task ssh" which is handy...We have the web ui, which can > allow us to download files from the instance. > I'd love to have a straightforward, non-hack way to copy files to an aurora > shard. Ex. aurora task copy source relative-path-in-chroot or something that > could mount the chroot as read/write" -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1915) Add automatic browser tab open feature for aurora update start
[ https://issues.apache.org/jira/browse/AURORA-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1915: -- Fix Version/s: 0.18.0 > Add automatic browser tab open feature for aurora update start > -- > > Key: AURORA-1915 > URL: https://issues.apache.org/jira/browse/AURORA-1915 > Project: Aurora > Issue Type: Story > Components: Client, Usability >Reporter: Mehrdad Nurolahzade >Priority: Minor > Labels: newbie > Fix For: 0.18.0 > > > Aurora client automatically opens a browser tab following {{aurora job > create}} and {{aurora cron schedule}} commands. Provide similar ability for > {{aurora update start}} command as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1910) framework_registered metric isn't reset when scheduler disconnects
[ https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1910: -- Fix Version/s: 0.18.0 > framework_registered metric isn't reset when scheduler disconnects > -- > > Key: AURORA-1910 > URL: https://issues.apache.org/jira/browse/AURORA-1910 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > Right now the {{framework_registered}} metric transitions from 0 -> 1 when > the scheduler registers successfully the first time. It never transitions > from 1 -> 0 when it loses a connection. > This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the > gauge as the scheduler loses registration and re-registers. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
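The behaviour requested in AURORA-1910 amounts to flipping the boolean behind the gauge in both directions. A minimal Python sketch follows; `RegistrationGauge` and its method names are hypothetical and only illustrate the intended 0 -> 1 -> 0 transitions, not the scheduler's actual stats code.

```python
class RegistrationGauge:
    """Boolean-backed gauge that reports 1 while registered, else 0."""

    def __init__(self):
        self._registered = False

    def on_registered(self):
        # Called on the first registration and on every re-registration.
        self._registered = True

    def on_disconnected(self):
        # The missing transition described in the issue: 1 -> 0.
        self._registered = False

    def sample(self):
        return 1 if self._registered else 0
```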
[jira] [Updated] (AURORA-1928) Aurora should prioritize adding instances over updating instances during an update
[ https://issues.apache.org/jira/browse/AURORA-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1928: -- Fix Version/s: 0.18.0 > Aurora should prioritize adding instances over updating instances during an > update > -- > > Key: AURORA-1928 > URL: https://issues.apache.org/jira/browse/AURORA-1928 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Jordan Ly >Assignee: Jordan Ly >Priority: Minor > Fix For: 0.18.0 > > > Oftentimes, when adding capacity to a service, I increase the number of > instances and the batch_size that the updates roll out at. > For the first deploy, this updates the old instances with the more aggressive > batch size before the new instances are started. > It might be better for Aurora to prioritize the "adding instances" step of > upgrades. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1933) Scheduler can process rescind before offer
[ https://issues.apache.org/jira/browse/AURORA-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1933: -- Fix Version/s: 0.18.0 > Scheduler can process rescind before offer > -- > > Key: AURORA-1933 > URL: https://issues.apache.org/jira/browse/AURORA-1933 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > I observed the following in production: > {noformat} > Jun 6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 > [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer > rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 > Jun 6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 > [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received > offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 > Jun 6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 > [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer > 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH] > {noformat} > Notice the rescind was processed before the offer was given. This means the > offer is in the offer storage, but using it is invalid. It will cause > whatever task launched with it to fail with {{Task launched with invalid > offers: Offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer > valid}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
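One possible way to tolerate the ordering seen in AURORA-1933 is to remember rescinds that arrive for offers not yet seen and to drop those offers when they show up. The sketch below assumes a hypothetical `OfferTracker` class with string offer IDs; it is not how Aurora's offer storage is implemented, and a real version would need to expire remembered rescinds.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker; offer IDs are plain strings for illustration.
class OfferTracker {
  private final Set<String> available = new HashSet<>();
  private final Set<String> rescindedEarly = new HashSet<>();

  // A rescind for an unknown offer is remembered instead of ignored.
  synchronized void onRescind(String offerId) {
    if (!available.remove(offerId)) {
      rescindedEarly.add(offerId);
    }
  }

  // Returns false when the offer was already rescinded and must not be stored.
  synchronized boolean onOffer(String offerId) {
    if (rescindedEarly.remove(offerId)) {
      return false;
    }
    available.add(offerId);
    return true;
  }

  synchronized boolean isUsable(String offerId) {
    return available.contains(offerId);
  }
}
```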
[jira] [Updated] (AURORA-1907) Thermos unresponsive on hosts with many active task
[ https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1907: -- Fix Version/s: 0.18.0 > Thermos unresponsive on hosts with many active task > --- > > Key: AURORA-1907 > URL: https://issues.apache.org/jira/browse/AURORA-1907 > Project: Aurora > Issue Type: Story > Components: Observer >Reporter: Stephan Erb >Assignee: Stephan Erb > Fix For: 0.18.0 > > > We have noticed that on hosts with lots of active tasks (~100) and many > terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% > CPU but does not render any HTTP requests. > Dumping {{/threads}} indicates we might be blocked by the hundred > {{TaskResourceMonitor}} threads trying to read values from {{/proc}}: > {code} > # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] > [TID=45241], 140682825963264) > File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap > self.__bootstrap_inner() > File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner > self.run() > File: > "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", > line 115, in identified > return instancemethod(self, *args, **kwargs) > File: > "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", > line 126, in _excepting_run > self.__real_run(*args, **kw) > File: "apache/thermos/monitoring/resource.py", line 204, in run > collector.sample() > File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in > sample > for child in parent.children(recursive=True) > File: > 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", > line 326, in wrapper > return fun(self, *args, **kwargs) > File: > "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", > line 861, in children > table[p.ppid()].append(p) > File: > "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", > line 545, in ppid > return self._proc.ppid() > File: > "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", > line 962, in wrapper > return fun(self, *args, **kwargs) > File: > "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", > line 1459, in ppid > return int(self._parse_stat_file()[2]) > File: > "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", > line 1001, in _parse_stat_file > return [name] + fields_after_name > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1929) Improve explicit task history pruning.
[ https://issues.apache.org/jira/browse/AURORA-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1929: -- Fix Version/s: 0.18.0 > Improve explicit task history pruning. > -- > > Key: AURORA-1929 > URL: https://issues.apache.org/jira/browse/AURORA-1929 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Kai Huang >Assignee: Kai Huang >Priority: Minor > Fix For: 0.18.0 > > > There are currently two types of task history pruning run by Aurora: > # The implicit task history pruning run by TaskHistoryPruner in the > background, which registers all inactive tasks upon terminal state change for > pruning. > # The explicit task history pruning initiated by the `aurora_admin prune_tasks` > command, which prunes inactive tasks in the cluster. > The prune_tasks endpoint seems to be very slow when the cluster has a large > number of inactive tasks. > For example, when we use $ aurora_admin prune_tasks for 135k running tasks > (1k jobs), it takes about 30 minutes to prune all tasks; the pruning speed > seems to max out at 3k tasks per minute. > Currently, Aurora uses StreamManager to manage a single log stream append > transaction for task history pruning. Local storage ops can be added to the > transaction and then later committed as an atomic unit. However, the > StateManager removes tasks one by one in a > for-loop (https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java#L376), > and each RemoveTasks operation is coalesced with its previous operation, > which seems inefficient and unnecessary > (https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/storage/log/StreamManagerImpl.java#L324). > We need to batch all removeTasks operations and execute them all at once to > avoid the additional cost of coalescing. 
The fix will also benefit implicit task > history pruning since it has similar underlying implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
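The difference between per-task removal and the batching proposed in AURORA-1929 can be shown with a toy model. `PruneLog` is a hypothetical class that only counts how many removal ops would be appended to a transaction; it does not reproduce `StateManagerImpl` or `StreamManagerImpl`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: each list entry is one removal op added to the transaction.
class PruneLog {
  private final List<Set<String>> ops = new ArrayList<>();

  // Models the loop described in the issue: one op per task ID.
  void removeOneByOne(Collection<String> taskIds) {
    for (String id : taskIds) {
      ops.add(Collections.singleton(id));
    }
  }

  // Models the proposed fix: a single op carrying all task IDs.
  void removeBatched(Collection<String> taskIds) {
    if (!taskIds.isEmpty()) {
      ops.add(new LinkedHashSet<>(taskIds));
    }
  }

  int opCount() {
    return ops.size();
  }
}
```

With N task IDs the first method produces N ops that each have to be considered for coalescing, while the second produces one op regardless of N.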
[jira] [Updated] (AURORA-1916) Incompatibility with mesos 1.2
[ https://issues.apache.org/jira/browse/AURORA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1916: -- Fix Version/s: 0.18.0 > Incompatibility with mesos 1.2 > -- > > Key: AURORA-1916 > URL: https://issues.apache.org/jira/browse/AURORA-1916 > Project: Aurora > Issue Type: Bug > Components: Executor >Affects Versions: 0.17.0 > Environment: Ubuntu 16.04, Mesos 1.2 >Reporter: Kostiantyn Bokhan > Fix For: 0.18.0 > > > The list of mesos-containerizer arguments has been changed since 1.2: > {code} > /usr/libexec/mesos/mesos-containerizer launch --help > Usage: launch [options] > --[no-]help Prints this help message (default: false) > --launch_info=VALUE > --namespace_mnt_target=VALUE The target 'pid' of the process whose > mount namespace we'd like >to enter before executing the command. > --pipe_read=VALUEThe read end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --pipe_write=VALUE The write end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --runtime_directory=VALUEThe runtime directory for the container > (used for checkpointing) > --[no-]unshare_namespace_mnt Whether to launch the command in a new > mount namespace. (default: false) > {code} > Mesos 1.1.0: > {code} > /usr/libexec/mesos/mesos-containerizer launch --help > Usage: launch [options] > --capabilities=VALUE Capabilities the command can use. > --command=VALUE The command to execute. 
> --environment=VALUE The environment variables for the command. > --[no-]help Prints this help message (default: false) > --pipe_read=VALUEThe read end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --pipe_write=VALUE The write end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --pre_exec_commands=VALUEThe additional preparation commands to > execute before >executing the command. > --rootfs=VALUE Absolute path to the container root > filesystem. The command will be >interpreted relative to this path > --runtime_directory=VALUEThe runtime directory for the container > (used for checkpointing) > --[no-]unshare_namespace_mnt Whether to launch the command in a new > mount namespace. (default: false) > --user=VALUE The user to change to. > --working_directory=VALUEThe working directory for the command. It > has to be an absolute path >w.r.t. the root filesystem used for the > command. > {code} > It causes the next error: > {code} > Failed to parse the flags: Failed to load unknown flag 'command' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1887) Create Driver implementation around V0Mesos.
[ https://issues.apache.org/jira/browse/AURORA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1887: -- Fix Version/s: 0.18.0 > Create Driver implementation around V0Mesos. > > > Key: AURORA-1887 > URL: https://issues.apache.org/jira/browse/AURORA-1887 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > Create an implementation of the {{org.apache.aurora.scheduler.mesos.Driver}} > interface which uses the {{V0Mesos}} shim under the hood. Provide a flag to > switch between the two to show there is no regression. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1888) Create Driver implementation around V1Mesos
[ https://issues.apache.org/jira/browse/AURORA-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1888: -- Fix Version/s: 0.18.0 > Create Driver implementation around V1Mesos > --- > > Key: AURORA-1888 > URL: https://issues.apache.org/jira/browse/AURORA-1888 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > Create an implementation of {{org.apache.aurora.scheduler.mesos.Driver}} that > uses {{V1Mesos}} under the hood. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1905) Set "webui_url" field of FrameworkInfo
[ https://issues.apache.org/jira/browse/AURORA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1905: -- Fix Version/s: 0.18.0 > Set "webui_url" field of FrameworkInfo > -- > > Key: AURORA-1905 > URL: https://issues.apache.org/jira/browse/AURORA-1905 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > Fix For: 0.18.0 > > > Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI > can link to the current leader. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance
[ https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1925: - Assignee: (was: Santhosh Kumar Shanmugham) > Easily copy files to/from an aurora task instance > - > > Key: AURORA-1925 > URL: https://issues.apache.org/jira/browse/AURORA-1925 > Project: Aurora > Issue Type: Task > Components: Client >Reporter: Jordan Ly >Priority: Minor > Labels: features, newbie > > "So, we have "aurora task ssh" which is handy...We have the web ui, which can > allow us to download files from the instance. > I'd love to have a straightforward, non-hack way to copy files to an aurora > shard. Ex. aurora task copy source relative-path-in-chroot or something that > could mount the chroot as read/write > For example, I can use this to use agents or tools not present in the default > image, or are something I don't want to clutter packer with. > For example, I tried to use byteman to narrow down what's spawning threads. > Just getting the zip to aurora is more difficult and mysterious than writing > a script to trace thread lifecycle. This should be easy... agreed?" -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1925) Easily copy files to/from an aurora task instance
[ https://issues.apache.org/jira/browse/AURORA-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1925: - Assignee: Santhosh Kumar Shanmugham > Easily copy files to/from an aurora task instance > - > > Key: AURORA-1925 > URL: https://issues.apache.org/jira/browse/AURORA-1925 > Project: Aurora > Issue Type: Task > Components: Client >Reporter: Jordan Ly >Assignee: Santhosh Kumar Shanmugham >Priority: Minor > Labels: features, newbie > > "So, we have "aurora task ssh" which is handy...We have the web ui, which can > allow us to download files from the instance. > I'd love to have a straightforward, non-hack way to copy files to an aurora > shard. Ex. aurora task copy source relative-path-in-chroot or something that > could mount the chroot as read/write > For example, I can use this to use agents or tools not present in the default > image, or are something I don't want to clutter packer with. > For example, I tried to use byteman to narrow down what's spawning threads. > Just getting the zip to aurora is more difficult and mysterious than writing > a script to trace thread lifecycle. This should be easy... agreed?" -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1913) Aurora Client Job Update Diff output is erroneous
[ https://issues.apache.org/jira/browse/AURORA-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1913: - Assignee: Santhosh Kumar Shanmugham > Aurora Client Job Update Diff output is erroneous > - > > Key: AURORA-1913 > URL: https://issues.apache.org/jira/browse/AURORA-1913 > Project: Aurora > Issue Type: Bug > Components: Client >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Minor > > The {{TaskConfig}} diff shown by Aurora Client is not sorting the individual > entities (which are sets) which results in a confusing output. We need to > sort the individual fields inside the config to return a more meaningful > output. > Although the {{host}} and {{rack}} limit constraints have not changed, the > diff still outputs the below, > {code} > < 'constraints': set([ Constraint(name='rack', > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), > --- > > 'constraints': set([ Constraint(name='host', > > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), > 4,5c4,5 >constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None))]), > --- > >Constraint(name='rack', > > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1913) Aurora Client Job Update Diff output is erroneous
Santhosh Kumar Shanmugham created AURORA-1913: - Summary: Aurora Client Job Update Diff output is erroneous Key: AURORA-1913 URL: https://issues.apache.org/jira/browse/AURORA-1913 Project: Aurora Issue Type: Bug Components: Client Reporter: Santhosh Kumar Shanmugham Priority: Minor The {{TaskConfig}} diff shown by Aurora Client is not sorting the individual entities (which are sets) which results in a confusing output. We need to sort the individual fields inside the config to return a more meaningful output. Although the {{host}} and {{rack}} limit constraints have not changed, the diff still outputs the below, {code} < 'constraints': set([ Constraint(name='rack', constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), --- > 'constraints': set([ Constraint(name='host', > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), 4,5c4,5Constraint(name='rack', > constraint=TaskConstraint(limit=LimitConstraint(limit=1), value=None)), {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host
[ https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1908: -- Description: When matching a {{ResourceRequest}} against a {{UnusedResource}} in {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early and move on to the next host to consider. (was: When matching a {{ResourceRequest}} against a {{UnusedResource}} in {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early and move on to the next host to consider.) > Short-circuit preemption filtering when a Veto applies to entire host > - > > Key: AURORA-1908 > URL: https://issues.apache.org/jira/browse/AURORA-1908 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When matching a {{ResourceRequest}} against a {{UnusedResource}} in > {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of > {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the > entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, > {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can > short-circuit and return early and move on to the next host to consider. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host
[ https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935435#comment-15935435 ] Santhosh Kumar Shanmugham commented on AURORA-1908: --- Plus we can improve the way resource increment happens now to explicitly check that either there was no Veto or the Veto was due to insufficient resources. > Short-circuit preemption filtering when a Veto applies to entire host > - > > Key: AURORA-1908 > URL: https://issues.apache.org/jira/browse/AURORA-1908 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When matching a {{ResourceRequest}} against a {{UnusedResource}} in > {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be > returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely > {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} > or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return > early and move on to the next host to consider. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
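The short-circuit and the explicit check from the comment above can be sketched as two predicates. The enum below lists the four host-level veto names from the issue plus `INSUFFICIENT_RESOURCES`, which is inferred from the comment; the class `VetoShortCircuit` and both method names are hypothetical and do not come from `PreemptionVictimFilter`.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper; enum values mirror the names quoted in the issue.
class VetoShortCircuit {
  enum VetoType {
    INSUFFICIENT_RESOURCES,
    CONSTRAINT_MISMATCH,
    LIMIT_NOT_SATISFIED,
    MAINTENANCE,
    DEDICATED_CONSTRAINT_MISMATCH
  }

  // Vetoes that cannot be cured by freeing more resources on the same host.
  static final Set<VetoType> HOST_LEVEL = Collections.unmodifiableSet(EnumSet.of(
      VetoType.CONSTRAINT_MISMATCH,
      VetoType.LIMIT_NOT_SATISFIED,
      VetoType.MAINTENANCE,
      VetoType.DEDICATED_CONSTRAINT_MISMATCH));

  // True when the caller can stop examining this host and move to the next one.
  static boolean shouldSkipHost(Set<VetoType> vetoes) {
    for (VetoType veto : vetoes) {
      if (HOST_LEVEL.contains(veto)) {
        return true;
      }
    }
    return false;
  }

  // True when adding another victim's resources could still help:
  // either there was no veto, or the only veto was insufficient resources.
  static boolean mayAccumulate(Set<VetoType> vetoes) {
    return vetoes.isEmpty()
        || vetoes.equals(EnumSet.of(VetoType.INSUFFICIENT_RESOURCES));
  }
}
```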
[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host
[ https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935431#comment-15935431 ] Santhosh Kumar Shanmugham commented on AURORA-1908: --- This code here - https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java#L214-L227 > Short-circuit preemption filtering when a Veto applies to entire host > - > > Key: AURORA-1908 > URL: https://issues.apache.org/jira/browse/AURORA-1908 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When matching a {{ResourceRequest}} against a {{UnusedResource}} in > {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be > returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely > {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} > or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return > early and move on to the next host to consider. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host
Santhosh Kumar Shanmugham created AURORA-1908: - Summary: Short-circuit preemption filtering when a Veto applies to entire host Key: AURORA-1908 URL: https://issues.apache.org/jira/browse/AURORA-1908 Project: Aurora Issue Type: Task Reporter: Santhosh Kumar Shanmugham Priority: Minor When matching a {{ResourceRequest}} against a {{UnusedResource}} in {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return early and move on to the next host to consider. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AURORA-1882) Add support for Mesos ContainerLaunchInfo
[ https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1882. --- Resolution: Fixed Assignee: Santhosh Kumar Shanmugham Fix Version/s: 0.18.0 > Add support for Mesos ContainerLaunchInfo > - > > Key: AURORA-1882 > URL: https://issues.apache.org/jira/browse/AURORA-1882 > Project: Aurora > Issue Type: Task > Components: Executor, Thermos >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham >Priority: Critical > Fix For: 0.18.0 > > > Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops > support for multiple switches. Without support for this Thermos Executor will > not be able to launch tasks successfully. > {noformat} > /usr/local/libexec/mesos/mesos-containerizer launch --help > Usage: launch [options] > --[no-]help Prints this help message (default: false) > --launch_info=VALUE > --namespace_mnt_target=VALUE The target 'pid' of the process whose > mount namespace we'd like >to enter before executing the command. > --pipe_read=VALUEThe read end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --pipe_write=VALUE The write end of the control pipe. This is > a file descriptor >on Posix, or a handle on Windows. It's > caller's responsibility >to make sure the file descriptor or the > handle is inherited >properly in the subprocess. It's used to > synchronize with the >parent process. If not specified, no > synchronization will happen. > --runtime_directory=VALUEThe runtime directory for the container > (used for checkpointing) > --[no-]unshare_namespace_mnt Whether to launch the command in a new > mount namespace. 
(default: false) > {noformat} > https://issues.apache.org/jira/browse/MESOS-6648 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Reopened] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reopened AURORA-1824: --- > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > Similar to the binding-helper that was introduced for DockerContainerizer > (introduced in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the MesosContainerizers docker config's > image tag to image digest. > DSL Translation: > {noformat} > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > {noformat} > will be translated to, > {noformat} > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1824. --- Resolution: Fixed > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > Similar to the binding-helper that was introduced for DockerContainerizer > (introduced in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the MesosContainerizers docker config's > image tag to image digest. > DSL Translation: > {noformat} > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > {noformat} > will be translated to, > {noformat} > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1824: - Assignee: Santhosh Kumar Shanmugham > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > Similar to the binding-helper that was introduced for DockerContainerizer > (introduced in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the MesosContainerizers docker config's > image tag to image digest. > DSL Translation: > {noformat} > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > {noformat} > will be translated to, > {noformat} > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1892) TaskQuery `limit` and `offset` must be applied at TaskStore
Santhosh Kumar Shanmugham created AURORA-1892: - Summary: TaskQuery `limit` and `offset` must be applied at TaskStore Key: AURORA-1892 URL: https://issues.apache.org/jira/browse/AURORA-1892 Project: Aurora Issue Type: Task Reporter: Santhosh Kumar Shanmugham {{TaskQuery}}'s {{limit}} and {{offset}} are currently applied after the results have been fetched from the {{TaskStore}}, which is inefficient. Apply the {{limit}} and {{offset}} conditions inside the {{TaskStore}} itself, in both {{MemTaskStore}} and {{DBTaskStore}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
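What applying {{limit}} and {{offset}} inside the store could look like for an in-memory store is sketched below. `PagedTaskStore` and its `fetch` method are hypothetical, task records are reduced to string IDs, and a stable iteration order is assumed; this is not the `MemTaskStore` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical in-memory store keyed only by task ID strings.
class PagedTaskStore {
  private final List<String> taskIds;

  PagedTaskStore(List<String> taskIds) {
    this.taskIds = new ArrayList<>(taskIds);
  }

  // Filtering, offset and limit are applied in one lazy pipeline, so the
  // store stops scanning as soon as `limit` matching tasks have been found.
  List<String> fetch(Predicate<String> filter, int offset, int limit) {
    return taskIds.stream()
        .filter(filter)
        .skip(offset)
        .limit(limit)
        .collect(Collectors.toList());
  }
}
```

For a database-backed store the analogous change would be to push the two values into the query itself rather than trimming the result set afterwards.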
[jira] [Comment Edited] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566 ] Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:01 PM: - Looks like the {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] {{TaskStateChange}} event for every known task on startup. Note: how the {{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on startup); this causes the delay to become ZERO. Due to the inefficiency in the implementation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although {{BatchWorker}} is designed to reduce lock-contention it does not provide any rate-limiting and suffers from bursty workloads. Responsiveness to bursty workload makes sense for scheduling work, however the same cannot be said for house-keeping work. Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning ({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection ({{RowGarbageCollector}}) can be characterized as house-keeping work that is not in the critical scheduling path, it would make sense to rate-limit these ambient activities, so that the scheduler is protected from bursts of non-critical work (like - job updates with large number of instances, network-partition, cleaning up after scale-test). One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with necessary changes to {{Work}}). 
{{TaskHistoryPruner}}, {{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the new {{RateLimitedBatchWorker}} which in-turn will release the work into the underlying {{BatchWorker}} at a steady rate. We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html] was (Author: santhk): Looks like the {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] {{TaskStateChange}} event for every known task on startup. Note: how the {{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on startup); this causes the delay to become ZERO. Due to the inefficiency in the implementation we enqueue ~ O(N^2) items into {{BatchWorker}} queue. Although {{BatchWorker}} is designed to reduce lock-contention it does not provide any rate-limiting and suffers from bursty workloads. Responsiveness to bursty workload makes sense for scheduling work, however the same cannot be said for house-keeping work. Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning ({{JobUpdateHistoryPruner}}) and DataBase Garbage Collection ({{RowGarbageCollector}}) can be characterized as house-keeping work that is not in the critical scheduling path, it would make sense to rate-limit these ambient activities, so that the scheduler is protected from bursts of non-critical work (like - job updates with large number of instances, network-partition, cleaning up after scale-test). One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. 
To provide priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc, {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, {{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the new {{RateLimitedBatchWorker}} which in-turn will be release the work into the underlying {{BatchWorker}} at a steady rate. We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html] > Improve task history pruning > > > Key: AURORA-1837 > URL: https://issues.apache.org/jira/browse/AURORA-1837 > Project: Aurora > Issue Type: Task >Reporter: Reza Motamedi >Assignee: Mehrdad Nurolahzade >Priority: Minor > Labels: scheduler > > Current implementation of {{TaskHistoryPrunner}} registers all inactive tasks > upon terminal _state_ change
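The design proposed in the comment above (a rate-limited feeder in front of a priority-ordered work queue) can be sketched roughly as follows. This is a minimal, standalone Python illustration and not Aurora code: `PriorityWorkQueue`, `RateLimitedFeeder`, `release_due` and the two priority constants are hypothetical names, the heap stands in for Java's `PriorityQueue`, and the simple interval check stands in for Guava's `RateLimiter`.

```python
import collections
import heapq
import itertools
import time

# Hypothetical priorities: a lower number is served first.
CRITICAL = 0      # e.g. scheduling work
HOUSEKEEPING = 1  # e.g. history pruning, row garbage collection


class PriorityWorkQueue:
    """Stand-in for the worker's queue, ordered by priority, then arrival."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def put(self, priority, work):
        heapq.heappush(self._heap, (priority, next(self._counter), work))

    def get(self):
        _, _, work = heapq.heappop(self._heap)
        return work

    def __len__(self):
        return len(self._heap)


class RateLimitedFeeder:
    """Buffers housekeeping work and releases it at a bounded rate."""

    def __init__(self, target, permits_per_second, clock=time.monotonic):
        self._target = target
        self._interval = 1.0 / permits_per_second
        self._clock = clock
        self._next_release = clock()
        self._pending = collections.deque()

    def submit(self, work):
        self._pending.append(work)

    def release_due(self):
        """Releases at most one pending item if the interval has elapsed."""
        now = self._clock()
        if self._pending and now >= self._next_release:
            self._target.put(HOUSEKEEPING, self._pending.popleft())
            self._next_release = now + self._interval
            return True
        return False
```

A caller would invoke `release_due()` periodically, so a burst of pruning requests drains slowly into the queue while any `CRITICAL` item added in the meantime is still dequeued first.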
[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566 ] Santhosh Kumar Shanmugham commented on AURORA-1837: --- Looks like {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] a {{TaskStateChange}} event for every known task on startup. Note how the {{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on startup); this causes the delay to become zero. Due to the inefficiency in the implementation we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although {{BatchWorker}} is designed to reduce lock contention, it does not provide any rate limiting and suffers under bursty workloads. Responsiveness to bursty workloads makes sense for scheduling work; however, the same cannot be said for housekeeping work. Since history pruning ({{TaskHistoryPruner}}), job update pruning ({{JobUpdateHistoryPruner}}) and database garbage collection ({{RowGarbageCollector}}) can be characterized as housekeeping work that is not in the critical scheduling path, it would make sense to rate-limit these ambient activities, so that the scheduler is protected from bursts of non-critical work (such as job updates with a large number of instances, network partitions, or cleanup after a scale test). One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc., {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, {{JobHistoryPruner}} and {{RowGarbageCollector}} can now enqueue work into the new {{RateLimitedBatchWorker}}, which will release it into the underlying {{BatchWorker}} at a steady rate. 
We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]. > Improve task history pruning > > > Key: AURORA-1837 > URL: https://issues.apache.org/jira/browse/AURORA-1837 > Project: Aurora > Issue Type: Task >Reporter: Reza Motamedi >Assignee: Mehrdad Nurolahzade >Priority: Minor > Labels: scheduler > > The current implementation of {{TaskHistoryPruner}} registers all inactive tasks > for pruning upon terminal _state_ change. > {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to > schedule the pruning of _tasks_. However, we have noticed that most of the > pruning takes place after the scheduler recovers from a fail-over. > Modify {{TaskHistoryPruner}} to a design similar to > {{JobUpdateHistoryPruner}}: > # Instead of registering delay executors upon terminal task state > transitions, have it wake up at preconfigured intervals, find all terminal > state tasks that meet the pruning criteria and delete them. > # Make the initial task history pruning delay configurable so that it does > not hamper the scheduler upon start. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
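The two changes proposed in the AURORA-1837 description (wake up at fixed intervals instead of registering a delayed task per state transition, and make the initial delay configurable) can be sketched as below. This is a simplified Python illustration under stated assumptions, not the scheduler's implementation: `Task`, `select_prunable`, `run_pruner`, the in-memory `store` dict, the retention criterion and the listed terminal states are all hypothetical stand-ins.

```python
import time
from dataclasses import dataclass

# Assumed set of terminal states, for illustration only.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED", "LOST"}


@dataclass
class Task:
    task_id: str
    state: str
    finished_at: float  # seconds; when the task reached its current state


def select_prunable(tasks, now, retention_secs):
    """Returns ids of terminal tasks older than the retention period."""
    return [t.task_id for t in tasks
            if t.state in TERMINAL_STATES
            and now - t.finished_at >= retention_secs]


def run_pruner(store, initial_delay_secs, interval_secs, retention_secs,
               iterations, sleep=time.sleep, clock=time.time):
    """Waits for the initial delay, then prunes once per interval."""
    sleep(initial_delay_secs)  # keeps pruning out of the startup path
    for _ in range(iterations):
        now = clock()
        for task_id in select_prunable(list(store.values()), now,
                                       retention_secs):
            del store[task_id]
        sleep(interval_secs)
```

Because each pass queries the store for everything that currently qualifies, a fail-over does not produce one pending callback per inactive task; the work is bounded by the pass interval.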
[jira] [Commented] (AURORA-1882) Add support for Mesos ContainerLaunchInfo
[ https://issues.apache.org/jira/browse/AURORA-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837484#comment-15837484 ] Santhosh Kumar Shanmugham commented on AURORA-1882: --- https://reviews.apache.org/r/55916/ > Add support for Mesos ContainerLaunchInfo > - > > Key: AURORA-1882 > URL: https://issues.apache.org/jira/browse/AURORA-1882 > Project: Aurora > Issue Type: Task > Components: Executor, Thermos >Reporter: Santhosh Kumar Shanmugham >Priority: Critical > > Mesos-1.2.0 changes the interface for MesosContainerizer binary and drops > support for multiple switches. Without support for this Thermos Executor will > not be able to launch tasks successfully.
> {noformat}
> /usr/local/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help                     Prints this help message (default: false)
>   --launch_info=VALUE
>   --namespace_mnt_target=VALUE    The target 'pid' of the process whose mount namespace we'd like
>                                   to enter before executing the command.
>   --pipe_read=VALUE               The read end of the control pipe. This is a file descriptor
>                                   on Posix, or a handle on Windows. It's caller's responsibility
>                                   to make sure the file descriptor or the handle is inherited
>                                   properly in the subprocess. It's used to synchronize with the
>                                   parent process. If not specified, no synchronization will happen.
>   --pipe_write=VALUE              The write end of the control pipe. This is a file descriptor
>                                   on Posix, or a handle on Windows. It's caller's responsibility
>                                   to make sure the file descriptor or the handle is inherited
>                                   properly in the subprocess. It's used to synchronize with the
>                                   parent process. If not specified, no synchronization will happen.
>   --runtime_directory=VALUE       The runtime directory for the container (used for checkpointing)
>   --[no-]unshare_namespace_mnt    Whether to launch the command in a new mount namespace.
>                                   (default: false)
> {noformat}
> https://issues.apache.org/jira/browse/MESOS-6648 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1850) Raw StatusResult passed to the scheduler when tasks are healthy
[ https://issues.apache.org/jira/browse/AURORA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734721#comment-15734721 ] Santhosh Kumar Shanmugham commented on AURORA-1850: --- https://reviews.apache.org/r/54299/ > Raw StatusResult passed to the scheduler when tasks are healthy > --- > > Key: AURORA-1850 > URL: https://issues.apache.org/jira/browse/AURORA-1850 > Project: Aurora > Issue Type: Bug > Components: Executor >Reporter: Joshua Cohen >Assignee: Santhosh Kumar Shanmugham >Priority: Minor > Attachments: > 9ab25047-4418-4121-906c-380ded9e1962__Screen_Shot_2016-12-06_at_4.56.43_PM.png > > > As part of the recent health check changes, we now pass a message to the > scheduler along with the RUNNING transition when the task is healthy. > Unfortunately it looks like this message is a stringified `StatusResult`, > rather than the message from the `StatusResult` (see attached screenshot for > details). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
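The difference described in AURORA-1850 (sending the stringified result object instead of the message it carries) can be illustrated with a small sketch. The `StatusResult` class below is a hypothetical stand-in written for this example, including its `reason` and `status` attribute names; it is not the executor's actual class, and the two helper functions are likewise illustrative.

```python
class StatusResult:
    """Hypothetical stand-in for the executor's status result object."""

    def __init__(self, reason, status):
        self.reason = reason  # human-readable message
        self.status = status  # task state the result maps to

    def __repr__(self):
        return "StatusResult(reason=%r, status=%r)" % (self.reason,
                                                       self.status)


def running_message_raw(result):
    """Reproduces the reported behaviour: the whole object is stringified."""
    return str(result)


def running_message(result):
    """Sends only the message carried by the result."""
    return result.reason
```

With a result whose message is "Task is healthy.", the first helper yields the object's representation while the second yields just the message, which is what a user would expect to see next to the RUNNING transition.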
[jira] [Commented] (AURORA-1841) Update HealthChecks has a backward incompatibility
[ https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727190#comment-15727190 ] Santhosh Kumar Shanmugham commented on AURORA-1841: --- https://reviews.apache.org/r/54299/ > Update HealthChecks has a backward incompatibility > -- > > Key: AURORA-1841 > URL: https://issues.apache.org/jira/browse/AURORA-1841 > Project: Aurora > Issue Type: Bug > Components: Thermos >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > The current implementation of HealthCheck-based updates has a backward > incompatibility issue when the HealthCheckConfig is used in such a way that > the initial grace period is extended beyond the `initial_interval_secs`. > This bug prematurely fails updates through the new fail-fast feature when > the `initial_interval_secs` does not account for the entire task warm-up time. > In the older version the actual grace period for an update to succeed was > `initial_interval_secs + interval_secs * max_consecutive_failures`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
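The grace-period formula quoted in AURORA-1841 can be worked through with concrete numbers. The sketch below is only arithmetic: the function names and the example values (15, 10, 3, and a 30-second warm-up) are hypothetical, and `warms_up_in_time` is a deliberately simplified model of whether a task becomes healthy within a given window.

```python
def legacy_grace_period_secs(initial_interval_secs, interval_secs,
                             max_consecutive_failures):
    """Effective grace period in the older behaviour, per the issue text."""
    return initial_interval_secs + interval_secs * max_consecutive_failures


def warms_up_in_time(warmup_secs, window_secs):
    """Simplified model: the task is healthy once its warm-up has elapsed."""
    return warmup_secs <= window_secs


# Hypothetical configuration values, chosen only for illustration.
INITIAL_INTERVAL_SECS = 15
INTERVAL_SECS = 10
MAX_CONSECUTIVE_FAILURES = 3

legacy_window = legacy_grace_period_secs(
    INITIAL_INTERVAL_SECS, INTERVAL_SECS, MAX_CONSECUTIVE_FAILURES)
```

Here `legacy_window` is 15 + 10 * 3 = 45 seconds, so a task needing 30 seconds to warm up fits in the older window but not in a window of `initial_interval_secs` alone, which is the incompatibility the issue describes.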
[jira] [Updated] (AURORA-1844) Force a snapshot at the end of Scheduler startup.
[ https://issues.apache.org/jira/browse/AURORA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1844: -- Summary: Force a snapshot at the end of Scheduler startup. (was: Force a snapshot at the end of startup.) > Force a snapshot at the end of Scheduler startup. > - > > Key: AURORA-1844 > URL: https://issues.apache.org/jira/browse/AURORA-1844 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When the scheduler starts up, it replays the entries from the replicated log to > catch up with the current state, before announcing itself as the leader to > the outside world. If for any reason the scheduler dies after this replay, > having added more log entries, the next startup will have to redo the same work. > This becomes a problem when the amount of additional work is not > trivial, and can send the scheduler into a death spiral. One > example of this is when the TaskHistoryPruner cleans up the DB but adds to > the log entries. In order to avoid the repeated work, the scheduler should > force a snapshot after the initial replay. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
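The effect of forcing a snapshot after the initial replay, as proposed in AURORA-1844, can be shown with a toy model of log-based recovery. Everything here is a hypothetical simplification written for this example (the tuple-based log entries, `replay` and `take_snapshot`); it does not reflect the scheduler's actual storage or replicated-log API.

```python
def replay(snapshot, log_entries):
    """Rebuilds state from a snapshot plus the log entries written after it.

    Returns the recovered state and the number of entries replayed.
    """
    state = dict(snapshot)
    for op, key, value in log_entries:
        if op == "save":
            state[key] = value
        elif op == "remove":
            state.pop(key, None)
    return state, len(log_entries)


def take_snapshot(state):
    """Captures the current state and starts a fresh, empty log tail."""
    return dict(state), []
```

If a snapshot is taken right after startup, a later restart replays only the entries appended since then; without it, the restart replays the full log plus whatever housekeeping entries were appended before the crash.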
[jira] [Updated] (AURORA-1841) Update HealthChecks has a backward incompatibility
[ https://issues.apache.org/jira/browse/AURORA-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1841: -- Component/s: Thermos > Update HealthChecks has a backward incompatibility > -- > > Key: AURORA-1841 > URL: https://issues.apache.org/jira/browse/AURORA-1841 > Project: Aurora > Issue Type: Bug > Components: Thermos >Reporter: Santhosh Kumar Shanmugham >Assignee: Santhosh Kumar Shanmugham > > The current implementation of HealthCheck-based updates has a backward > incompatibility issue when the HealthCheckConfig is used in such a way that > the initial grace period is extended beyond the `initial_interval_secs`. > This bug prematurely fails updates through the new fail-fast feature when > the `initial_interval_secs` does not account for the entire task warm-up time. > In the older version the actual grace period for an update to succeed was > `initial_interval_secs + interval_secs * max_consecutive_failures`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1824: -- Description: Similar to the binding-helper that was introduced for DockerContainerizer (introduced in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the MesosContainerizers docker config's image tag to image digest. DSL Translation: {noformat} Job { ... mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') ... } will be translated to, Job { ... mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) ... } {noformat} was: Similar to the binding-helper that was introduced for DockerContainerizer (introduced in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the MesosContainerizers docker config's image tag to image digest. MesosContainerizer: Job { ... mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') ... } will be translated to, Job { ... mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) ... } > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham > > Similar to the binding-helper that was introduced for DockerContainerizer > (introduced in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the MesosContainerizers docker config's > image tag to image digest. > DSL Translation: > {noformat} > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > will be translated to, > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1824: -- Description: Similar to the binding-helper introduced for DockerContainerizer (in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the image tag in the MesosContainerizer's Docker config to an image digest. DSL Translation: {noformat} Job { ... mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') ... } {noformat} will be translated to: {noformat} Job { ... mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) ... } {noformat} was: Similar to the binding-helper introduced for DockerContainerizer (in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the image tag in the MesosContainerizer's Docker config to an image digest. DSL Translation: {noformat} Job { ... mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') ... } will be translated to, Job { ... mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) ... } {noformat} > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham > > Similar to the binding-helper introduced for DockerContainerizer > (in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the image tag in the MesosContainerizer's > Docker config to an image digest. > DSL Translation: > {noformat} > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > {noformat} > will be translated to: > {noformat} > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... 
> } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
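The DSL translation described in AURORA-1824 amounts to replacing a `docker.resolve` expression with a name and digest pair. The sketch below shows that substitution in isolation, under stated assumptions: `resolve_image`, the regular expression and the `lookup_digest` callback are hypothetical, the `DockerImage` tuple only mirrors the shape shown in the issue, and no real registry or Aurora client API is involved.

```python
import re
from collections import namedtuple

# Mirrors the shape of the translated value shown in the issue text.
DockerImage = namedtuple("DockerImage", ["name", "digest"])

# Matches expressions of the form {{docker.resolve["image"]["tag"]}}.
_RESOLVE_PATTERN = re.compile(
    r'\{\{docker\.resolve\["([^"]+)"\]\["([^"]+)"\]\}\}')


def resolve_image(expression, lookup_digest):
    """Turns a docker.resolve expression into a DockerImage.

    lookup_digest is a caller-supplied function (name, tag) -> digest;
    how it talks to a registry is outside the scope of this sketch.
    """
    match = _RESOLVE_PATTERN.fullmatch(expression)
    if match is None:
        raise ValueError("not a docker.resolve expression: %r" % expression)
    name, tag = match.groups()
    return DockerImage(name=name, digest=lookup_digest(name, tag))
```

Resolving at job-creation time pins the job to one immutable digest, so later restarts run the same image even if the tag is moved.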
[jira] [Updated] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
[ https://issues.apache.org/jira/browse/AURORA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham updated AURORA-1824: -- Description: Similar to the binding-helper that was introduced for DockerContainerizer (introduced in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the MesosContainerizers docker config's image tag to image digest. MesosContainerizer: Job { ... mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') ... } will be translated to, Job { ... mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) ... } was: Similar to the binding-helper that was introduced for DockerContainerizer (introduced in https://reviews.apache.org/r/52479/), we need another binding-helper that will resolve the MesosContainerizers docker config's image tag to image digest. > Create a binding-helper to resolve docker tags to concrete image digests for > MesosContainerizer > --- > > Key: AURORA-1824 > URL: https://issues.apache.org/jira/browse/AURORA-1824 > Project: Aurora > Issue Type: Story >Reporter: Santhosh Kumar Shanmugham > > Similar to the binding-helper that was introduced for DockerContainerizer > (introduced in https://reviews.apache.org/r/52479/), we need another > binding-helper that will resolve the MesosContainerizers docker config's > image tag to image digest. > MesosContainerizer: > Job { > ... > mesos = Mesos(image='{{docker.resolve["my-image"]["my-tag"]}}') > ... > } > will be translated to, > Job { > ... > mesos = Mesos(image=DockerImage(name="my-image", digest="my-digest")) > ... > } -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create
[ https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684470#comment-15684470 ] Santhosh Kumar Shanmugham commented on AURORA-1014: --- Created https://issues.apache.org/jira/browse/AURORA-1014 to create a similar binding-helper for MesosContainerizer. > Client binding_helper to resolve docker label to a stable ID at create > -- > > Key: AURORA-1014 > URL: https://issues.apache.org/jira/browse/AURORA-1014 > Project: Aurora > Issue Type: Story > Components: Client, Packaging >Reporter: Kevin Sweeney >Assignee: Santhosh Kumar Shanmugham > Fix For: 0.17.0 > > > Follow-up from discussion on IRC: > Some docker labels are mutable, meaning the image a task runs in could change > from restart to restart even if the rest of the task config doesn't change. > This breaks assumptions that make rolling updates the safe and preferred way > to deploy a new Aurora job > Add a binding helper that resolves a docker label to an immutable image > identifier at create time and make it the default for the Docker helper > introduced in https://reviews.apache.org/r/28920/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create
[ https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1014. --- Resolution: Fixed > Client binding_helper to resolve docker label to a stable ID at create > -- > > Key: AURORA-1014 > URL: https://issues.apache.org/jira/browse/AURORA-1014 > Project: Aurora > Issue Type: Story > Components: Client, Packaging >Reporter: Kevin Sweeney >Assignee: Santhosh Kumar Shanmugham > Fix For: 0.17.0 > > > Follow-up from discussion on IRC: > Some docker labels are mutable, meaning the image a task runs in could change > from restart to restart even if the rest of the task config doesn't change. > This breaks assumptions that make rolling updates the safe and preferred way > to deploy a new Aurora job > Add a binding helper that resolves a docker label to an immutable image > identifier at create time and make it the default for the Docker helper > introduced in https://reviews.apache.org/r/28920/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1824) Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer
Santhosh Kumar Shanmugham created AURORA-1824: - Summary: Create a binding-helper to resolve docker tags to concrete image digests for MesosContainerizer Key: AURORA-1824 URL: https://issues.apache.org/jira/browse/AURORA-1824 Project: Aurora Issue Type: Story Reporter: Santhosh Kumar Shanmugham Similar to the binding-helper introduced for DockerContainerizer (see https://reviews.apache.org/r/52479/), we need another binding-helper that resolves the image tag in the MesosContainerizer's Docker config to an image digest. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham resolved AURORA-1225. --- Resolution: Fixed > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Santhosh Kumar Shanmugham > > The executor needs to start executing user content in STARTING and transition to > RUNNING once the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
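The transition rule in AURORA-1225 can be illustrated with a small sketch. This is not the executor's code: the class `HealthGatedTask` and the parameter `required_successes` are hypothetical names, and the sketch assumes that successes are simply counted (the ticket does not say whether they must be consecutive) and leaves failure handling out entirely.

```python
STARTING = "STARTING"
RUNNING = "RUNNING"


class HealthGatedTask:
    """Stays in STARTING until enough health checks have succeeded."""

    def __init__(self, required_successes):
        if required_successes < 1:
            raise ValueError("required_successes must be at least 1")
        self.required_successes = required_successes
        self.successes = 0
        # User content is already executing while the task is STARTING.
        self.state = STARTING

    def record_health_check(self, healthy):
        """Record one health check result and return the resulting state."""
        if self.state == STARTING and healthy:
            self.successes += 1
            if self.successes >= self.required_successes:
                self.state = RUNNING
        return self.state


task = HealthGatedTask(required_successes=2)
history = [task.record_health_check(ok) for ok in (False, True, True)]
print(history)
```

With two required successes, the task reports STARTING after the failed check and after the first success, and RUNNING only after the second success.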
[jira] [Commented] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684367#comment-15684367 ] Santhosh Kumar Shanmugham commented on AURORA-1225: --- Tested this in our internal test cluster. > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Santhosh Kumar Shanmugham > > The executor needs to start executing user content in STARTING and transition to > RUNNING once the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santhosh Kumar Shanmugham reassigned AURORA-1225: - Assignee: Santhosh Kumar Shanmugham > Modify executor state transition logic to rely on health checks (if enabled) > > > Key: AURORA-1225 > URL: https://issues.apache.org/jira/browse/AURORA-1225 > Project: Aurora > Issue Type: Task > Components: Executor >Reporter: Maxim Khutornenko >Assignee: Santhosh Kumar Shanmugham > > The executor needs to start executing user content in STARTING and transition to > RUNNING once the required number of successful health checks is reached. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create
[ https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531456#comment-15531456 ] Santhosh Kumar Shanmugham commented on AURORA-1014: --- Took a shot at this but unfortunately it is blocked by lack of support from Mesos. See - https://issues.apache.org/jira/browse/MESOS-3505 > Client binding_helper to resolve docker label to a stable ID at create > -- > > Key: AURORA-1014 > URL: https://issues.apache.org/jira/browse/AURORA-1014 > Project: Aurora > Issue Type: Story > Components: Client, Packaging >Reporter: Kevin Sweeney >Assignee: Santhosh Kumar Shanmugham > > Follow-up from discussion on IRC: > Some docker labels are mutable, meaning the image a task runs in could change > from restart to restart even if the rest of the task config doesn't change. > This breaks assumptions that make rolling updates the safe and preferred way > to deploy a new Aurora job > Add a binding helper that resolves a docker label to an immutable image > identifier at create time and make it the default for the Docker helper > introduced in https://reviews.apache.org/r/28920/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1688) Change framework_name default value from 'TwitterScheduler' to 'aurora'
[ https://issues.apache.org/jira/browse/AURORA-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488902#comment-15488902 ] Santhosh Kumar Shanmugham commented on AURORA-1688: --- https://reviews.apache.org/r/51874 > Change framework_name default value from 'TwitterScheduler' to 'aurora' > --- > > Key: AURORA-1688 > URL: https://issues.apache.org/jira/browse/AURORA-1688 > Project: Aurora > Issue Type: Sub-task >Reporter: Stephan Erb >Assignee: Santhosh Kumar Shanmugham > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity
[ https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421854#comment-15421854 ] Santhosh Kumar Shanmugham commented on AURORA-1711: --- +1 on extending {{StartJobUpdateResult}} > Allow client to store metadata on Update entity > --- > > Key: AURORA-1711 > URL: https://issues.apache.org/jira/browse/AURORA-1711 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: David McLaughlin > > I have a use case where I'm programmatically starting updates via the Aurora > API and sometimes the request to the scheduler times out or fails, even > though the update is written to storage and started. > I'd like to be able to store some unique identifier on the update so that we > can reconcile this state later. We can make this generic by allowing clients > to store arbitrary metadata on an update (similar to how they do it with job > configuration). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1711) Allow client to store metadata on Update entity
[ https://issues.apache.org/jira/browse/AURORA-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419776#comment-15419776 ]

Santhosh Kumar Shanmugham commented on AURORA-1711:
---

Incorporating [~maximk]'s and [~davmclau]'s suggestions to put forward a proposal.

The client will send a UUID as part of the metadata field in JobUpdateRequest, like:
{code}
JobUpdateRequest {
  TaskConfig {
    JobKey : "role/env/name"
    Metadata : {
      "Disambiguator" : "FGFJAGGFIAHKHGFAK"  # UUID
    }
  }
}
{code}
The scheduler looks up the active updates for the {{JobKey}} and compares the {{Disambiguator}} to make sure the request is not a duplicate.

*Pros:*
- Logic is centralized and can benefit all the clients
- No costly diffs are needed
- No explicit API change

*Cons:*
- Scheduler and Client will need changes
- Possibility of identifier collision
- Additional query to the store

https://docs.google.com/a/twitter.com/document/d/1Ih-WXACZiPB0Z8EAQw_8cnAaf4eHMsUGG3B3OKia0k4/edit?usp=sharing

> Allow client to store metadata on Update entity
> ---
>
>                 Key: AURORA-1711
>                 URL: https://issues.apache.org/jira/browse/AURORA-1711
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler
>            Reporter: David McLaughlin
>
> I have a use case where I'm programmatically starting updates via the Aurora
> API and sometimes the request to the scheduler times out or fails, even
> though the update is written to storage and started.
> I'd like to be able to store some unique identifier on the update so that we
> can reconcile this state later. We can make this generic by allowing clients
> to store arbitrary metadata on an update (similar to how they do it with job
> configuration).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
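The duplicate check proposed in this comment might be sketched as follows. This is a toy in-memory model rather than scheduler code: `UpdateRegistry` and `start_update` are hypothetical names, the store is a plain dictionary, and only the `Disambiguator` metadata key and the `role/env/name` job key come from the proposal. The sketch also assumes that requests without a disambiguator are always accepted.

```python
import uuid


class UpdateRegistry:
    """Toy stand-in for the scheduler's store of active updates."""

    def __init__(self):
        # Maps a job key to the disambiguators of its active updates.
        self._active = {}

    def start_update(self, job_key, metadata):
        """Return True if the update was started, False if it is a duplicate."""
        token = metadata.get("Disambiguator")
        seen = self._active.setdefault(job_key, set())
        if token is not None and token in seen:
            # Same job key and same disambiguator: treat as a retried request.
            return False
        if token is not None:
            seen.add(token)
        return True


registry = UpdateRegistry()
token = str(uuid.uuid4())

# A client whose first request timed out retries with the same token.
first = registry.start_update("role/env/name", {"Disambiguator": token})
retry = registry.start_update("role/env/name", {"Disambiguator": token})
print(first, retry)
```

Because the client generates the token before the first attempt, a retry after a timeout carries the same value and is recognised as a duplicate instead of starting a second update.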