As posted in juju-dev last night:

Okay, I couldn't resist investigating a bit. I've been looking at the
database dump from earlier today and it's smelling like a simpler bug
in the txn package, and I might have found the cause already.

Here is a quick walkthrough while debugging the problem, to also serve
as future aid in similar quests.

Enabling full debug for the txn package with SetDebug and SetLogger,
and doing a ResumeAll to flush all pending transactions, we can
quickly get to the affected document and transaction:

2014/07/30 02:19:23 Resuming all unfinished transactions
2014/07/30 02:19:23 Resuming 53d6057930009a01ba0002e7 from "prepared"
2014/07/30 02:19:23 a) Processing 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002e7_dcdbc894
2014/07/30 02:19:23 a) Rescanned queue with
53d6057930009a01ba0002e7_dcdbc894: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002eb_98124806
2014/07/30 02:19:23 a) Rescanned queue with
53d6057930009a01ba0002eb_98124806: has prereqs, not forced
2014/07/30 02:19:23 a) Rescanning 53d6057930009a01ba0002ee_a83bd775
2014/07/30 02:19:23 a) Rescanned document {services ntp} misses
53d6057930009a01ba0002ee_a83bd775 in queue:
[53d6057930009a01ba0002eb_98124806 53d6057930009a01ba0002ea_4ca6ed41
53
d6057c30009a01ba0002fd_4d8d9123 53d6057e30009a01ba000301_ba0b61dd
53d6057e30009a01ba000303_a26cb429]
2014/07/30 02:19:23 a) Reloaded 53d6057930009a01ba0002ee_a83bd775: "prepared"
panic: rescanned document misses transaction in queue

So this error actually means something slightly different from what I
pointed out in the bug.

The transaction runner state machine creates transactions in the
"preparing" state, and then moves it over to "prepared" when all
affected documents were tagged with the transaction id+nonce. So what
this means is that there is a transaction in progress in the prepared
state, while the actual document misses the id in its local queue,
which is an impossible situation unless the document was fiddled with,
there was corruption, or a bug in the code.

So, let's have a look at the affected documents. First, the document
being changed:

    db.services.findOne({_id: "ntp"})
    http://paste.ubuntu.com/7902134/

We can see a few transactions in the queue, but the one raising the
issue is not there as reported by the error.

And this is full transaction raised by the error:

    db.txns.findOne({_id: ObjectId("53d6057930009a01ba0002ee")})
    http://paste.ubuntu.com/7902095/

One interesting thing we can do from here is verifying that other
documents which are part of this transaction are properly tagged, and
indeed they are, so this one document has the issue alone in this
transaction. Good hint.

Another detail worth checking is getting the set of all transactions
involving the affected document since it was created, stripping
operations from unrelated documents:

    db.txns.aggregate(
            {$match: {"o.c": "services", "o.d": "ntp"}},
            {$unwind: "$o"},
            {$match: {"o.c": "services", "o.d": "ntp"}})
    http://paste.ubuntu.com/7902097/

This has quite a few useful hints about the problem. The document was
created for quite a while, so that's most probably not related to the
issue. The timing is also available, encoded inside the transaction id
itself, and this is perhaps the best hint: around the missing
transaction there were a number of transactions being applied on the
same second, most of them pending, but one of them actually succeeds:

    53d6057930009a01ba0002e9 (prepared and missing)
    53d6057930009a01ba0002ea (prepared and in the queue)
    53d6057930009a01ba0002eb (prepared and in the queue)
    53d6057930009a01ba0002ec (applied)
    53d6057930009a01ba0002ed (prepared and missing)
    53d6057930009a01ba0002ee (prepared and missing)
    53d6057930009a01ba0002ef (prepared and missing)

Note how all of these are exactly the same except for the last bytes,
which contain a simple incrementing counter. This means they were done
concurrently and about the exact same time, and from the same client
machine. So it's not one, but several missing transactions, on this
one document, which had an applied transaction interleaved with other
pending ones. This sounds very much like a problem on the logic that
pops completed transactions from the queue.

Looking through the code on that area, I could indeed find some logic
that comes from the previous version of the algorithm, and apparently
should not be there anymore. The fix is simple, and now that we know
the scenario writing a test case that exercises the race shouldn't be
hard either.

More on this soon.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to juju-core in Ubuntu.
https://bugs.launchpad.net/bugs/1318366

Title:
  jujud on state server panic misses transaction in queue

To manage notifications about this bug go to:
https://bugs.launchpad.net/juju-core/+bug/1318366/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs

Reply via email to