[ 
https://issues.apache.org/jira/browse/AURORA-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124597#comment-15124597
 ] 

Maxim Khutornenko commented on AURORA-1603:
-------------------------------------------

Finally figured it out. When we 
[check|https://github.com/apache/aurora/blob/33d7e2170a86f54722a02a2dc9cb1e09fb52df25/src/main/java/org/apache/aurora/scheduler/storage/db/TaskConfigManager.java#L42-L51]
 if a task config row exists, we use the {{ITaskConfig}} instance from the 
thrift snapshot, which does *not* have the removed fields. Specifically, 
{{jobName}}, {{environment}} and {{identity.owner}} are all null. At the same 
time, the SQL select statement correctly populates all fields. As a result, the 
comparison never matches and the return statement always yields 
{{Optional.absent()}} during snapshot restore. In the end, we have as many 
{{TaskConfig}} copies as there are unique entries in the {{job_update_configs}} 
table.
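
The mismatch can be illustrated with a minimal, self-contained sketch. It does 
not use the actual Aurora classes: {{Config}}, {{findExisting}} and {{restore}} 
are hypothetical stand-ins for {{ITaskConfig}}, the row lookup and the snapshot 
restore loop, reduced to two fields with made-up values. It only assumes that 
the lookup compares configs by equality.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

public class SnapshotMismatchSketch {
  // Hypothetical stand-in for ITaskConfig, reduced to two fields.
  static final class Config {
    final String jobName;  // null when the config comes from the thrift snapshot
    final String command;

    Config(String jobName, String command) {
      this.jobName = jobName;
      this.command = command;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof Config)) {
        return false;
      }
      Config other = (Config) o;
      return Objects.equals(jobName, other.jobName)
          && Objects.equals(command, other.command);
    }

    @Override public int hashCode() {
      return Objects.hash(jobName, command);
    }
  }

  // Stand-in for the row lookup: rows read back from SQL are fully populated.
  static Optional<Config> findExisting(List<Config> rows, Config candidate) {
    return rows.stream().filter(candidate::equals).findFirst();
  }

  // Stand-in for restore: insert a row unless an equal one already exists.
  // Assumption: stored rows always carry the populated job name "job".
  static List<Config> restore(List<Config> snapshotConfigs) {
    List<Config> rows = new ArrayList<>();
    for (Config fromSnapshot : snapshotConfigs) {
      if (!findExisting(rows, fromSnapshot).isPresent()) {
        rows.add(new Config("job", fromSnapshot.command));
      }
    }
    return rows;
  }

  // Restores `copies` identical snapshot configs and returns the row count.
  static int rowsAfterRestore(int copies, String snapshotJobName) {
    Config fromSnapshot = new Config(snapshotJobName, "echo hello");
    return restore(Collections.nCopies(copies, fromSnapshot)).size();
  }

  public static void main(String[] args) {
    // Null jobName never equals the populated row: every copy is inserted.
    System.out.println(rowsAfterRestore(3, null));
    // Populated jobName matches the stored row: only one row is kept.
    System.out.println(rowsAfterRestore(3, "job"));
  }
}
```

With a null {{jobName}} the lookup never finds a match, so the row count grows 
with the number of configs restored instead of collapsing to one.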

One solution could be to always keep the StorageBackfill around and only remove 
the backfilling of specific task config values in the same commit as the 
corresponding api.thrift changes.
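
A rough sketch of the backfill idea follows. It is not the actual 
StorageBackfill code: the {{Config}} class, the {{jobKeyName}} field and the 
{{backfill}} helper are hypothetical, and it assumes the missing value can be 
recovered from another field that is still present in the snapshot.

```java
import java.util.Objects;

public class BackfillSketch {
  // Hypothetical config with one surviving field and one removed field.
  static final class Config {
    final String jobKeyName;  // assumed to still be present in the snapshot
    final String jobName;     // removed field, null in newer snapshots

    Config(String jobKeyName, String jobName) {
      this.jobKeyName = jobKeyName;
      this.jobName = jobName;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof Config)) {
        return false;
      }
      Config other = (Config) o;
      return Objects.equals(jobKeyName, other.jobKeyName)
          && Objects.equals(jobName, other.jobName);
    }

    @Override public int hashCode() {
      return Objects.hash(jobKeyName, jobName);
    }
  }

  // Fills the removed field from the surviving one when it is missing.
  static Config backfill(Config config) {
    if (config.jobName != null) {
      return config;
    }
    return new Config(config.jobKeyName, config.jobKeyName);
  }

  // Compares a fully populated row with a snapshot config, with or without backfill.
  static boolean matches(boolean applyBackfill) {
    Config row = new Config("job", "job");
    Config fromSnapshot = new Config("job", null);
    return row.equals(applyBackfill ? backfill(fromSnapshot) : fromSnapshot);
  }

  public static void main(String[] args) {
    System.out.println(matches(false));  // snapshot config compared as-is
    System.out.println(matches(true));   // snapshot config backfilled first
  }
}
```

Applying the backfill before the lookup makes the snapshot config comparable 
to the populated row, which is why the backfill would need to outlive the 
thrift field removal.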

> Investigate RB:42922 reversal
> -----------------------------
>
>                 Key: AURORA-1603
>                 URL: https://issues.apache.org/jira/browse/AURORA-1603
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>            Priority: Critical
>
> We had to roll back the scheduler due to duplicate instances in the UI. When 
> we tried to restart on the older version 
> (8d3fb2413306387bc533b1b800bbc97149f96b26), we got the following error, which 
> prevented the scheduler from loading the snapshot:
> {noformat}
> To index multiple values under a key, use Multimaps.index.
>         at com.google.common.collect.Maps.uniqueIndex(Maps.java:1215) ~[guava-19.0.jar:na]
>         at com.google.common.collect.Maps.uniqueIndex(Maps.java:1173) ~[guava-19.0.jar:na]
>         at org.apache.aurora.scheduler.storage.db.TaskConfigManager.getConfigRow(TaskConfigManager.java:46) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.TaskConfigManager.insert(TaskConfigManager.java:57) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbJobUpdateStore.saveJobUpdate(DbJobUpdateStore.java:125) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$7.restoreFromSnapshot(SnapshotStoreImpl.java:208) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.lambda$applySnapshot$238(SnapshotStoreImpl.java:278) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:137) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:146) ~[aurora-113.jar:na]
>         at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) ~[mybatis-guice-3.7.jar:3.7]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$203(DbStorage.java:160) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62) ~[aurora-113.jar:na]
>         at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:158) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:274) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
>         at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:63) ~[aurora-113.jar:na]
>         at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> ...
> {noformat}
> We attributed that to fee5943a95c4f08e148dc5f1366486a8c23d5773 and reverted 
> it in https://reviews.apache.org/r/42922/. I have been unable to reproduce it 
> in unit tests so far. Further investigation is needed.


