Alright, the guess last night was correct, and so was the candidate
fix. I've managed to reproduce the problem by stressing the scenario
described, with 4 concurrent runners executing the following two
operations while the chaos mechanism injects random slowdowns at
various critical points:

        []txn.Op{{
                C:      "accounts",
                Id:     0,
                Update: M{"$inc": M{"balance": 1}},
        }, {
                C:      "accounts",
                Id:     1,
                Update: M{"$inc": M{"balance": 1}},
        }}


To reach the bug, the stress test also has to run half of the transactions in 
this order, and the other half with these same operations in the opposite 
order, so that dependency cycles are created between the transactions. Note 
that the txn package guarantees that operations are always executed in the 
order provided in the transaction.
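
As a rough illustration of that alternation (not the code from the actual test), the sketch below builds the same two operations and flips their order for every other transaction. M and Op here are local stand-ins for bson.M and txn.Op so that the example compiles on its own, and opsFor is a hypothetical helper name:

```go
package main

import "fmt"

// M and Op are local stand-ins for bson.M and txn.Op, mirroring only
// the fields used in this message, so the sketch needs no mgo import.
type M map[string]interface{}

type Op struct {
	C      string
	Id     interface{}
	Update interface{}
}

// opsFor (hypothetical helper) returns the two balance increments.
// Even-numbered transactions touch account 0 first, odd-numbered ones
// touch account 1 first, so concurrent transactions can end up
// depending on each other in a cycle.
func opsFor(i int) []Op {
	ops := []Op{
		{C: "accounts", Id: 0, Update: M{"$inc": M{"balance": 1}}},
		{C: "accounts", Id: 1, Update: M{"$inc": M{"balance": 1}}},
	}
	if i%2 == 1 {
		ops[0], ops[1] = ops[1], ops[0]
	}
	return ops
}

func main() {
	for i := 0; i < 4; i++ {
		ops := opsFor(i)
		fmt.Printf("txn %d: account %v, then account %v\n", i, ops[0].Id, ops[1].Id)
	}
}
```

With the real package, each slice would be built from txn.Op values and handed to a txn.Runner via Run, from several goroutines at once.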

The fix and the complete test are available in this change:

    https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lower so that it runs in a reasonable amount of
time, but to give some confidence in the fix and the code in general,
I've run this test with 100k transactions being concurrently executed
with no problems.

Also, to give a better perspective of the sort of outcome that the logic
for concurrent runners produces, this output was generated by that test
while running for 100 transactions:

    http://paste.ubuntu.com/7906618/

Tokens like "a)" in these lines are the unique identifiers of the
transaction runners. Note how every single operation is executed in
precise lock-step despite the concurrency and the ordering issues, even
assigning the same revision to both documents since they were created
together.

Also, perhaps most interestingly, note the occurrences such as:

    [LOG] 0:00.180 b) Applying 53d92a4bca654539e7000003_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE
    [LOG] 0:00.186 d) Applying 53d92a4bca654539e7000003_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE

Note the first one is "b)" while the second one is "d)", which means
there are two completely independent runners, in different goroutines
(might as well be different machines), collaborating towards the
completion of a single transaction.
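
For anyone wanting to find such cases in the full output, here is a small sketch that groups "Applying" lines by transaction token and reports the tokens handled by more than one runner. The regular expression is inferred only from the two log lines quoted above, so other lines in the paste may need adjustments, and runnersByTxn is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// applyLine matches lines shaped like the two quoted above:
// [LOG] <time> <runner>) Applying <token> op <n> ...
// The pattern is an assumption based on those two samples only.
var applyLine = regexp.MustCompile(`^\[LOG\] \S+ (\w+)\) Applying (\S+) op \d+ `)

// runnersByTxn maps each transaction token to the distinct runner
// identifiers that applied at least one of its operations, in the
// order in which they first appear.
func runnersByTxn(lines []string) map[string][]string {
	seen := make(map[string]map[string]bool)
	out := make(map[string][]string)
	for _, line := range lines {
		m := applyLine.FindStringSubmatch(line)
		if m == nil {
			continue // not an "Applying" line
		}
		runner, token := m[1], m[2]
		if seen[token] == nil {
			seen[token] = make(map[string]bool)
		}
		if !seen[token][runner] {
			seen[token][runner] = true
			out[token] = append(out[token], runner)
		}
	}
	return out
}

func main() {
	lines := []string{
		"[LOG] 0:00.180 b) Applying 53d92a4bca654539e7000003_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE",
		"[LOG] 0:00.186 d) Applying 53d92a4bca654539e7000003_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE",
	}
	for token, runners := range runnersByTxn(lines) {
		if len(runners) > 1 {
			fmt.Printf("%s was applied by runners %v\n", token, runners)
		}
	}
}
```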

So, I believe this is sorted. Please let me know how it goes there.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1318366

Title:
  jujud on state server panic misses transaction in queue
