On Mon, 18 Jun 2012 10:45:26 +0200, Vincent Ladeuil <vila+...@canonical.com> 
wrote:
> >>>>> James Westby <james.wes...@canonical.com> writes:
> 
>     > On Fri, 15 Jun 2012 12:34:12 +0200, Vincent Ladeuil 
> <vila+...@canonical.com> wrote:
>     >> > It's not magic. It's moving from a database that's not designed for
>     >> > concurrent use to one that is designed for concurrent use.
>     >> 
>     >> Despite not being designed for concurrent use, it *is* used this
>     >> way and lock contention has been encountered, leading me to
>     >> believe that the actual *design* needs to be fixed. The fact that
>     >> changing the db is triggering more contention is a symptom of a
>     >> deeper issue.
> 
>     > Changing the db access layer is triggering that; we were still
>     > running on the same (non-multi-user) db. I agree that the design
>     > needs to be fixed, and that's exactly what we're talking about:
>     > fixing it by moving to a db that is designed for multi-user use.
> 
> It looks like your understanding of the issue is better than mine here;
> would you mind sharing that knowledge in an automated test (with the
> added benefit that we won't regress in this area)?
> 
> Just this week-end we had an add-import-jobs failure:
> 
> Traceback (most recent call last):
>   File "/srv/package-import.canonical.com/new/scripts/bin/add-import-jobs", line 5, in <module>
>     sys.exit(main())
>   File "/srv/package-import.canonical.com/new/scripts/udd/scripts/add_import_jobs.py", line 17, in main
>     icommon.create_import_jobs(lp, status_db)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 304, in create_import_jobs
>     status_db.add_import_jobs(checked, newest_published)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 633, in add_import_jobs
>     self._add_job(c, package, self.JOB_TYPE_NEW)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 615, in _add_job
>     datetime.utcnow(), None, None))
> sqlite3.OperationalError: database is locked
> 
> So we already know that add_import_jobs is involved in the bug (with the
> current sqlite-based implementation); who is the other user in this case,
> and how can this be reproduced?

Each connection to sqlite is another "user", so each of the cron
scripts, the imports themselves, and the several connections within
mass-import are all users.

When a write operation is started, a global lock is acquired that locks
out any other writers until the operation is complete.

If the lock is held then the library will wait up to a timeout
(configured to be 30s for udd) for the lock to be released before giving
up.

Errors like the one above occur when the timeout is reached, so either
another transaction took more than 30s to release the lock, or there
were lots of connections trying to take the lock and this one didn't
win before the 30s was up.

When we change to storm it forces pysqlite into a higher isolation level,
so that transactions are started when any statement is executed. My
guess is that this means locks are taken more frequently and are held
for longer, giving more contention errors.
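
Roughly, the difference looks like this with the stdlib sqlite3 driver
(an illustration of the behaviour I mean, not storm's actual code; the
file and table names are made up):

    import sqlite3

    # Autocommit: each statement commits on its own, so the write lock
    # is only held for the duration of that single statement.
    auto = sqlite3.connect("demo.db", isolation_level=None)
    auto.execute("CREATE TABLE IF NOT EXISTS jobs (id INTEGER PRIMARY KEY)")

    # Transactional: a transaction is opened implicitly at the first
    # INSERT/UPDATE/DELETE and the write lock is then held until
    # commit(), so any other work done in between extends the window in
    # which other writers are locked out.
    txn = sqlite3.connect("demo.db")
    txn.execute("INSERT INTO jobs DEFAULT VALUES")
    # ... other work happens here while other writers are locked out ...
    txn.commit()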

Postgres doesn't have a global lock; it has table and row locks, so
clients will only hit lock contention if they are changing the same
data, which will be much less frequent.
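
For example, with something like psycopg2 (the connection string and
the table are placeholders, this is only to illustrate the row-level
locking):

    import psycopg2

    a = psycopg2.connect("dbname=udd_test")
    b = psycopg2.connect("dbname=udd_test")

    cur_a = a.cursor()
    cur_b = b.cursor()

    # Both connections have an open transaction, but each one only locks
    # the row it touches, so neither blocks the other.
    cur_a.execute("UPDATE jobs SET active = TRUE WHERE id = %s", (1,))
    cur_b.execute("UPDATE jobs SET active = TRUE WHERE id = %s", (2,))

    # Updating the *same* row from both connections would make the
    # second one wait until the first commits, which is the "changing
    # the same data" case above.
    a.commit()
    b.commit()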

How can I show that in an automated test? I can write an XFAIL test
showing that if two connections are opened and one starts a transaction,
the other hits a locking exception if it tries to do anything, but that
doesn't seem to prove much about the operation of the system.
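
For concreteness, such a test would be a sketch along these lines
(stdlib only, the file and table names are invented), and it proves
about as little as it looks like it does:

    import sqlite3
    import unittest

    class TestSqliteWriteContention(unittest.TestCase):

        @unittest.expectedFailure
        def test_second_writer_not_locked_out(self):
            db = "contention-demo.db"
            setup = sqlite3.connect(db)
            setup.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
            setup.commit()
            setup.close()

            writer = sqlite3.connect(db)
            writer.execute("INSERT INTO t VALUES (1)")  # holds the write lock

            # Short timeout so the test fails quickly; this raises
            # sqlite3.OperationalError: database is locked.
            other = sqlite3.connect(db, timeout=0.5)
            other.execute("INSERT INTO t VALUES (2)")

    if __name__ == "__main__":
        unittest.main()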

> I.e. reproducing the add_import_jobs failure in a test that will fail
> with sqlite and succeed with your changes will demonstrate we've
> captured (and fixed) at least one lock contention.

We are dealing with probabilistic failure though. I can demonstrate that
in a deterministic situation writes to two separate tables under sqlite
will still take the global lock, but I can't prove that we will never
get contention under postgres.

> If the test suite cannot be trusted to catch most of the issues that
> happen in production, the test suite should be fixed.
> 
> You're not implying that, because testing in production is needed, the
> test suite is useless, right?

No, I'm saying that the only measure of whether something runs correctly
in production is whether it runs in production.

> From that, can we imagine a test that will import a few packages and
> compare the corresponding dbs?

We can do that as part of testing the migration script.

>     > It can be restarted with the dbs from whenever the transition starts,
>     > and it will catch up in roughly the time between starting the
>     > transition and rolling back. There may be a few bugs due to replaying
>     > things, but we do it all the time (e.g. removing revids and
>     > re-importing when someone does push --overwrite).
> 
> As in requeue --full? requeue --zap-revids? Neither is used on a daily
> basis, but in my limited experience they have never triggered issues
> either.

As in just start it back up pointing at the old database copies again,
and it will continue on from where it was stopped.

> Fire-fighting in production is wasted time if it's caused by a bug that
> could have been caught before deployment. Improving the test suite is
> the best answer to that: the time is spent on making deployment more
> reliable so that, ideally, no fire-fighting (or at least less of it) is
> needed.

Absolutely, but it's a tradeoff: taken to the extreme, we would spend all
of our time improving the test suite and no time actually deploying any
changes.

> It's hard for me to judge without being able to redo these tests or to
> understand how they demonstrate that the issue is fixed :-/

You can do the tests if you want. I deployed to ec2 using jml's fab task
and then logged in to the system. I reverted the revert of the storm
changes. I then wrote a very simple import script that just wrote some
canned data to the revids db and exited (though it would probably work
with /bin/true as the import script).
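
Something along these lines is enough (the path, table and columns are
placeholders rather than the real revids schema; the point is only that
each "import" does one quick write and exits):

    #!/usr/bin/env python
    import sys
    import sqlite3
    from datetime import datetime

    def main():
        # Placeholder path and schema, standing in for the real revids db.
        db = sqlite3.connect("/path/to/revids.db", timeout=30.0)
        db.execute("CREATE TABLE IF NOT EXISTS canned (package TEXT, stamp TEXT)")
        db.execute("INSERT INTO canned VALUES (?, ?)",
                   (sys.argv[1] if len(sys.argv) > 1 else "dummy",
                    datetime.utcnow().isoformat()))
        db.commit()
        return 0

    if __name__ == "__main__":
        sys.exit(main())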

I then ran add_import_jobs and started mass-import.

This meant that each import took a couple of seconds, so mass-import was
running queries much more frequently than in production.

Leave it running for a few minutes and you should see the exceptions in
the mass-import log. Left for long enough, it will enter the deadlock
condition.

> I don't mind having higher-level tests to start with (writing unit tests
> requires a precise understanding of the bug at the lowest level); what I
> care about is that we add tests as we better understand how the importer
> works and how it fails, i.e. we reduce the space where *we don't know*
> whether the importer works or fails.
> 
> Since https://bugs.launchpad.net/udd/+bug/724893 was filed, we haven't
> understood *when* the importer fails, but we know it can fail and it has
> failed more often recently (and a lot during your first attempt, if I
> remember correctly). If you have a fix for that, great! Show us how it's
> fixed with a test.

I'm still not sure what test would be satisfactory for doing this. The
above setup isn't amenable to a fast unit test.

> And the more tests we add the easier it becomes to add a new one.

Agreed, but I'm not sure that we are working in an area where unit tests
are a good fix. A unit test can't prove the absence of deadlocks in a
multi-process system.

Thanks,

James
