[ https://issues.apache.org/jira/browse/AURORA-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124597#comment-15124597 ]
Maxim Khutornenko commented on AURORA-1603:
-------------------------------------------

Finally figured it out. When we [check|https://github.com/apache/aurora/blob/33d7e2170a86f54722a02a2dc9cb1e09fb52df25/src/main/java/org/apache/aurora/scheduler/storage/db/TaskConfigManager.java#L42-L51] if a task config row exists, we use the {{ITaskConfig}} instance from the thrift snapshot, which does *not* have the removed fields. Specifically, {{jobName}}, {{environment}} and {{identity.owner}} are all null. At the same time, the SQL select statement correctly populates all fields. The two therefore never compare equal, and the return statement always returns {{Optional.absent()}} during snapshot restore. In the end, we have as many {{TaskConfig}} copies as there are unique entries in the {{job_update_configs}} table.

One solution could be to always keep the StorageBackfill around and only remove backfilling of specific task config values within the same commit as the api.thrift changes.

> Investigate RB:42922 reversal
> -----------------------------
>
>                 Key: AURORA-1603
>                 URL: https://issues.apache.org/jira/browse/AURORA-1603
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>            Priority: Critical
>
> We had to roll back the scheduler due to duplicate instances in the UI, and
> when we tried to restart on the older version
> (8d3fb2413306387bc533b1b800bbc97149f96b26) we got the following error, which
> prevented the scheduler from loading the snapshot:
> {noformat}
> To index multiple values under a key, use Multimaps.index.
> 	at com.google.common.collect.Maps.uniqueIndex(Maps.java:1215) ~[guava-19.0.jar:na]
> 	at com.google.common.collect.Maps.uniqueIndex(Maps.java:1173) ~[guava-19.0.jar:na]
> 	at org.apache.aurora.scheduler.storage.db.TaskConfigManager.getConfigRow(TaskConfigManager.java:46) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.db.TaskConfigManager.insert(TaskConfigManager.java:57) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.db.DbJobUpdateStore.saveJobUpdate(DbJobUpdateStore.java:125) ~[aurora-113.jar:na]
> 	at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl$7.restoreFromSnapshot(SnapshotStoreImpl.java:208) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.lambda$applySnapshot$238(SnapshotStoreImpl.java:278) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:137) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:132) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:146) ~[aurora-113.jar:na]
> 	at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) ~[mybatis-guice-3.7.jar:3.7]
> 	at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$203(DbStorage.java:160) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62) ~[aurora-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:158) ~[aurora-113.jar:na]
> 	at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:274) ~[aurora-113.jar:na]
> 	at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> 	at org.apache.aurora.scheduler.storage.log.SnapshotStoreImpl.applySnapshot(SnapshotStoreImpl.java:63) ~[aurora-113.jar:na]
> 	at org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83) ~[commons-113.jar:na]
> 	...
> {noformat}
> We blamed that on fee5943a95c4f08e148dc5f1366486a8c23d5773 and reverted it in
> https://reviews.apache.org/r/42922/. I have been unable to reproduce it in
> unit tests yet. Needs some further investigation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
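
For illustration only, the mismatch described in the comment can be modelled with a small standalone sketch. This is not the Aurora code: {{Config}}, {{findRow}} and {{insert}} are hypothetical stand-ins, the table is an in-memory list, and the sketch assumes that a stored row always reads back with the removed fields populated. The duplicate-key check imitates the behaviour of Guava's {{Maps.uniqueIndex}}, which throws {{IllegalArgumentException}} on duplicate keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

public class SnapshotMismatchSketch {
  /** Hypothetical stand-in for ITaskConfig: three removed fields plus one surviving field. */
  public static final class Config {
    final String jobName;      // null when read from the thrift snapshot
    final String environment;  // null when read from the thrift snapshot
    final String owner;        // null when read from the thrift snapshot
    final String jobKey;       // present in both representations

    public Config(String jobName, String environment, String owner, String jobKey) {
      this.jobName = jobName;
      this.environment = environment;
      this.owner = owner;
      this.jobKey = jobKey;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof Config)) {
        return false;
      }
      Config c = (Config) o;
      return Objects.equals(jobName, c.jobName)
          && Objects.equals(environment, c.environment)
          && Objects.equals(owner, c.owner)
          && Objects.equals(jobKey, c.jobKey);
    }

    @Override public int hashCode() {
      return Objects.hash(jobName, environment, owner, jobKey);
    }
  }

  /** Models the existence check: index the rows by config, then look up by equality. */
  public static Optional<Integer> findRow(List<Config> rows, Config config) {
    Map<Config, Integer> rowsByConfig = new HashMap<>();
    for (int i = 0; i < rows.size(); i++) {
      // Like a unique index, refuse two rows that map to the same key.
      if (rowsByConfig.put(rows.get(i), i) != null) {
        throw new IllegalArgumentException("Multiple entries with same key: row " + i);
      }
    }
    return Optional.ofNullable(rowsByConfig.get(config));
  }

  /** Inserts a row only when the existence check finds no match. Returns the row index. */
  public static int insert(List<Config> rows, Config fromSnapshot, Config asReadFromDb) {
    Optional<Integer> existing = findRow(rows, fromSnapshot);
    if (existing.isPresent()) {
      return existing.get();
    }
    rows.add(asReadFromDb);  // assumption: the stored row reads back fully populated
    return rows.size() - 1;
  }

  public static void main(String[] args) {
    List<Config> rows = new ArrayList<>();
    Config fromSnapshot = new Config(null, null, null, "role/env/job");
    Config asReadFromDb = new Config("job", "env", "role", "role/env/job");

    insert(rows, fromSnapshot, asReadFromDb);  // table is empty, row is added
    insert(rows, fromSnapshot, asReadFromDb);  // lookup misses, a second copy is added
    System.out.println("rows stored: " + rows.size());

    try {
      findRow(rows, fromSnapshot);  // two equal rows now break the unique index
    } catch (IllegalArgumentException e) {
      System.out.println("index failed: " + e.getMessage());
    }
  }
}
```

In this model the second {{insert}} adds a duplicate because the snapshot-side config never equals the populated row, and any later index over the rows fails on the duplicate key, which is consistent with the {{Maps.uniqueIndex}} frames in the stack trace quoted in the issue.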