Alright, the guess last night was correct, and so was the candidate fix. I've managed to reproduce the problem by stressing the described scenario with 4 concurrent runners executing the following two operations, while the chaos mechanism injects random slowdowns at various critical points:
    []txn.Op{{
        C:      "accounts",
        Id:     0,
        Update: M{"$inc": M{"balance": 1}},
    }, {
        C:      "accounts",
        Id:     1,
        Update: M{"$inc": M{"balance": 1}},
    }}

To reach the bug, the stress test also has to run half of the transactions with the operations in this order, and the other half with the same operations in the opposite order, so that dependency cycles are created between the transactions. Note that the txn package guarantees that operations are always executed in the order provided in the transaction.

The fix and the complete test are available in this change:

    https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lower so the test runs in a reasonable amount of time, but to give some confidence in the fix and in the code in general, I've run this test with 100k transactions being concurrently executed, with no problems.

Also, to give a better perspective on the sort of outcome the logic for concurrent runners produces, this output was generated by that test while running 100 transactions:

    http://paste.ubuntu.com/7906618/

Tokens like "a)" in those lines are the unique identifier of a given transaction runner. Note how every single operation is executed in precise lock-step despite the concurrency and the ordering issues, even assigning the same revision to both documents since they were created together.

Also, perhaps most interestingly, note occurrences such as:

    [LOG] 0:00.180 b) Applying 53d92a4bca654539e7000003_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE
    [LOG] 0:00.186 d) Applying 53d92a4bca654539e7000003_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE

The first line comes from runner "b)" while the second comes from "d)", which means there are two completely independent runners, in different goroutines (they might as well be on different machines), collaborating towards the completion of a single transaction.

So, I believe this is sorted. Please let me know how it goes there.
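P.S. For anyone without a MongoDB setup handy, here is a minimal standalone Go sketch of the access pattern the stress test exercises: 4 concurrent runners, half applying the increments in one order and half in the opposite order. It does not use mgo at all (the Op struct below is a simplified stand-in for txn.Op, and plain atomic adds replace the real transaction machinery), so it only illustrates the ordering pattern, not the bug or the fix:

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    // Op loosely mirrors the shape of txn.Op; the real struct carries
    // Assert/Insert/Update/Remove fields rather than a plain Inc.
    type Op struct {
        C   string // collection name
        Id  int    // document id
        Inc int64  // stand-in for Update: M{"$inc": M{"balance": Inc}}
    }

    func main() {
        // Two "documents", each holding a balance counter.
        balances := map[int]*int64{0: new(int64), 1: new(int64)}

        // Forward order (account 0 then 1) and its reverse; mixing the
        // two across runners is what creates dependency cycles between
        // transactions in the real test.
        fwd := []Op{{C: "accounts", Id: 0, Inc: 1}, {C: "accounts", Id: 1, Inc: 1}}
        rev := []Op{fwd[1], fwd[0]}

        const runners = 4
        const txnsEach = 1000

        var wg sync.WaitGroup
        for r := 0; r < runners; r++ {
            ops := fwd
            if r%2 == 1 {
                ops = rev // half the runners apply the ops in the opposite order
            }
            wg.Add(1)
            go func(ops []Op) {
                defer wg.Done()
                for i := 0; i < txnsEach; i++ {
                    for _, op := range ops {
                        atomic.AddInt64(balances[op.Id], op.Inc)
                    }
                }
            }(ops)
        }
        wg.Wait()

        // Every runner incremented both balances txnsEach times.
        fmt.Printf("balance0=%d balance1=%d\n", *balances[0], *balances[1])
    }

Regardless of how the goroutines interleave, both balances must end up at runners*txnsEach; the real test makes the equivalent check against the database.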
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1318366

Title:
  jujud on state server panic misses transaction in queue

To manage notifications about this bug go to:
https://bugs.launchpad.net/juju-core/+bug/1318366/+subscriptions
--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs