Alright, the guess last night was correct, and so was the candidate
fix. I've managed to reproduce the problem by stressing the scenario
described, with 4 concurrent runners executing transactions made of the
following two operations while a chaos mechanism injects random
slowdowns at various critical points:

        []txn.Op{{
                C: "accounts",
                Id: 0,
                Update: M{"$inc": M{"balance": 1}},
        }, {
                C: "accounts",
                Id: 1,
                Update: M{"$inc": M{"balance": 1}},
        }}

To reach the bug, the stress test also has to run half of the
transactions with the operations in this order, and the other half with
the same operations in the opposite order, so that dependency cycles
are created between the transactions (a minimal sketch follows below).
Note that the txn package guarantees that operations are always
executed in the order provided in the transaction.
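
To illustrate the shape of it, here is a sketch of what each of the
concurrent runners executes. This is not the actual test from the
commit below, and the chaos injection is omitted; the function name is
made up, and imports come from gopkg.in/mgo.v2 and its bson and txn
packages:

        import (
                "gopkg.in/mgo.v2"
                "gopkg.in/mgo.v2/bson"
                "gopkg.in/mgo.v2/txn"
        )

        // runnerLoop is what each concurrent runner executes; the stress
        // test starts several of these in separate goroutines.
        func runnerLoop(db *mgo.Database, n int) error {
                runner := txn.NewRunner(db.C("txns"))
                incr := func(id int) txn.Op {
                        return txn.Op{
                                C:      "accounts",
                                Id:     id,
                                Update: bson.M{"$inc": bson.M{"balance": 1}},
                        }
                }
                for i := 0; i < n; i++ {
                        ops := []txn.Op{incr(0), incr(1)}
                        if i%2 == 1 {
                                // Half of the transactions use the opposite
                                // order, creating dependency cycles between
                                // the transactions.
                                ops[0], ops[1] = ops[1], ops[0]
                        }
                        // An empty id asks the runner to generate one.
                        if err := runner.Run(ops, "", nil); err != nil {
                                return err
                        }
                }
                return nil
        }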

The fix and the complete test are available in this change:

    https://github.com/go-mgo/mgo/commit/3bc3ddaa

The numbers there are lower so the test runs in a reasonable amount of
time, but to give some confidence in the fix and the code in general,
I've run this test with 100k transactions being executed concurrently,
with no problems.

Also, to give a better perspective of the sort of outcome the logic
for concurrent runners produces, this output was generated by that
test while running 100 transactions:

    http://paste.ubuntu.com/7906618/

Tokens like "a)" in these lines are the unique identifiers of the
transaction runners. Note how every single operation is executed in
precise lock-step despite the concurrency and the ordering issues,
even assigning the same revision to both documents since they were
created together.

Also, perhaps most interestingly, note occurrences such as:

    [LOG] 0:00.180 b) Applying 53d92a4bca654539e7000003_7791e1dc op 0 (update) on {accounts 0} with txn-revno 2: DONE
    [LOG] 0:00.186 d) Applying 53d92a4bca654539e7000003_7791e1dc op 1 (update) on {accounts 1} with txn-revno 2: DONE

Note that the first one is "b)" while the second one is "d)", which
means two completely independent runners, in different goroutines
(they might as well be on different machines), are collaborating
towards the completion of a single transaction.

So, I believe this is sorted. Please let me know how it goes there.



On Wed, Jul 30, 2014 at 4:14 AM, Gustavo Niemeyer
<gustavo.nieme...@canonical.com> wrote:
> Okay, I couldn't resist investigating a bit. I've been looking at the
> database dump from earlier today, and it smells like a simpler bug in
> the txn package; I might have found the cause already.
>
> Here is a quick walkthrough of the debugging, to also serve as a
> future aid in similar quests.
>
> Enabling full debug for the txn package with SetDebug and SetLogger,
> and doing a ResumeAll to flush all pending transactions, we can
> quickly get to the affected document and transaction:
>
> 2014/07/30 02:19:23 Resuming all unfinished transactions
> 2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
> 2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
> 2014/07/30 02:19:23 a) Rescanned queue with
> 53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
> 2014/07/30 02:19:23 a) Rescanned queue with
> 53d6057930009a01ba0002eb_98124806: has prereqs, not forced
> 2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
> 2014/07/30 02:19:23 a) Rescanned document {services ntp} misses
> 53d6057930009a01ba0002ee_a83bd775 in queue:
> [53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41
> 53d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd
> 53d6057e30009a01ba000303_a26cb429]
> 2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
> panic: rescanned document misses transaction in queue
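>
> For reference, the setup described above boils down to roughly this
> (a minimal sketch using gopkg.in/mgo.v2/txn and the standard log and
> os packages; the session variable and database name are assumptions,
> not the actual tool used here):
>
>     // Enable txn debug output and flush all pending transactions.
>     txn.SetDebug(true)
>     txn.SetLogger(log.New(os.Stderr, "", log.LstdFlags))
>     runner := txn.NewRunner(session.DB("juju").C("txns"))
>     // ResumeAll flushes every unfinished transaction; in the broken
>     // state above it panics while rescanning one of them.
>     if err := runner.ResumeAll(); err != nil {
>             log.Fatal(err)
>     }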
>
> So this error actually means something slightly different from what I
> pointed out in the bug.
>
> The transaction runner state machine creates transactions in the
> "preparing" state, and then moves them over to "prepared" once all
> affected documents have been tagged with the transaction id+nonce. So
> what this means is that there is a transaction in progress in the
> prepared state while the actual document misses that id in its local
> queue, which is an impossible situation unless the document was
> fiddled with, there was corruption, or there is a bug in the code.
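>
> In other words, the invariant that broke is roughly the following (a
> hedged illustration, not code from the txn package; the db variable
> and error handling are made up):
>
>     // A transaction in the "prepared" state must have its token
>     // ("<txn id>_<nonce>") in the txn-queue field of every document
>     // it touches.
>     token := "53d6057930009a01ba0002ee_a83bd775"
>     var doc struct {
>             Queue []string `bson:"txn-queue"`
>     }
>     if err := db.C("services").FindId("ntp").One(&doc); err != nil {
>             panic(err)
>     }
>     found := false
>     for _, t := range doc.Queue {
>             found = found || t == token
>     }
>     // found == false is the "impossible" situation that triggers the
>     // panic above.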
>
> So, let's have a look at the affected documents. First, the document
> being changed:
>
>     db.services.findOne({_id: "ntp"})
>     http://paste.ubuntu.com/7902134/
>
> We can see a few transactions in the queue, but the one raising the
> issue is not there as reported by the error.
>
> And this is the full transaction referenced by the error:
>
>     db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
>     http://paste.ubuntu.com/7902095/
>
> One interesting thing we can do from here is verify that the other
> documents which are part of this transaction are properly tagged, and
> indeed they are, so this one document alone has the issue in this
> transaction. Good hint.
>
> Another detail worth checking is getting the set of all transactions
> involving the affected document since it was created, stripping
> operations from unrelated documents:
>
>     db.txns.aggregate(
>             {$match: {"o.c": "services", "o.d": "ntp"}},
>             {$unwind: "$o"},
>             {$match: {"o.c": "services", "o.d": "ntp"}})
>     http://paste.ubuntu.com/7902097/
>
> This has quite a few useful hints about the problem. The document was
> created quite a while ago, so that's most probably not related to the
> issue. The timing is also available, encoded inside the transaction id
> itself, and this is perhaps the best hint: around the missing
> transaction there were a number of transactions being applied within
> the same second, most of them still pending, but one of them actually
> succeeded:
>
>     53d6057930009a01ba0002e9 (prepared and missing)
>     53d6057930009a01ba0002ea (prepared and in the queue)
>     53d6057930009a01ba0002eb (prepared and in the queue)
>     53d6057930009a01ba0002ec (applied)
>     53d6057930009a01ba0002ed (prepared and missing)
>     53d6057930009a01ba0002ee (prepared and missing)
>     53d6057930009a01ba0002ef (prepared and missing)
>
> Note how all of these are exactly the same except for the last bytes,
> which contain a simple incrementing counter. This means they were
> created concurrently, at about the same time, and from the same client
> machine. So it's not one but several missing transactions on this one
> document, which had an applied transaction interleaved with the other
> pending ones. This sounds very much like a problem in the logic that
> pops completed transactions from the queue.
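>
> As an aside, those details can be read straight out of the
> transaction ids with gopkg.in/mgo.v2/bson (a small sketch, not
> something that was run above):
>
>     id := bson.ObjectIdHex("53d6057930009a01ba0002ee")
>     id.Time()    // creation time, with one-second precision
>     id.Machine() // 3-byte machine identifier
>     id.Pid()     // process id of the creating client
>     id.Counter() // incrementing counter, the only part varying above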
>
> Looking through the code in that area, I could indeed find some logic
> that comes from the previous version of the algorithm and apparently
> should not be there anymore. The fix is simple, and now that we know
> the scenario, writing a test case that exercises the race shouldn't be
> hard either.
>
> But it's quite late here and it'd be unwise to not get some sleep.
> More on this tomorrow.
>
>
> On Wed, Jul 30, 2014 at 3:21 AM, John Meinel <j...@arbash-meinel.com> wrote:
>> Could this be something where we are getting transactions faster than we can
>> finalize them?
>>
>> John
>> =:->
>>
>>
>> On Wed, Jul 30, 2014 at 9:17 AM, Gustavo Niemeyer <gust...@niemeyer.net>
>> wrote:
>>>
>>> We've got a database dump yesterday, which gives me something to
>>> investigate. I'll spend some time on this tomorrow (today) and report
>>> back.
>>>
>>> On Wed, Jul 30, 2014 at 1:34 AM, Menno Smits <menno.sm...@canonical.com>
>>> wrote:
>>> > All,
>>> >
>>> > Various people have been seeing the machine agents panic with the
>>> > following message:
>>> >
>>> >    panic: rescanned document misses transaction in queue
>>> >
>>> > The error message comes from mgo but the actual cause is unknown.
>>> > There's plenty of detail in the comments for the LP bug that's
>>> > tracking this. If you have any ideas about a possible cause or how
>>> > to debug this further, please weigh in.
>>> >
>>> > https://bugs.launchpad.net/juju-core/+bug/1318366
>>> >
>>> > Thanks,
>>> > Menno
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> gustavo @ http://niemeyer.net
>>>
>>
>>
>
> --
> gustavo @ http://niemeyer.net



-- 
gustavo @ http://niemeyer.net

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev
