Re: Fwd: Re: Fwd: Re: Upgrading pristine-xz on jubany

2012-06-18 Thread Vincent Ladeuil
> Dmitrijs Ledkovs  writes:

> FYI.
> Stephane is highly experienced with LXC and does a lot of work with it.

> On 06/15/2012 04:43 AM, Dmitrijs Ledkovs wrote:
>> Dear Stephane,
>> 
>> Can you comment about running quantal LXC container on Lucid host?
>> It would help package importer a lot.
>> 
>> Regards,
>> Dmitrijs

> Hi Dmitrijs,

> I wouldn't recommend running LXC on 10.04, the kernel lacks some
> required features and the userspace is really quite behind.
> Not to mention that these are not secured by apparmor.

> Instead I'd strongly recommend going with an Ubuntu 12.04 host and
> running the quantal container on top of that.
> This way you get the "supported" LXC stack, with apparmor and a working
> quantal template.

Thanks, that was my gut feeling. I'm glad you confirm it so I can focus
on a precise-based setup.

   Vincent



Re: Moving udd away from sqlite

2012-06-18 Thread Robert Collins
On Tue, Jun 19, 2012 at 3:09 AM, James Westby wrote:
>> Since https://bugs.launchpad.net/udd/+bug/724893 has been filed, we
>> haven't understood *when* the importer fails but we know it can fail and
>> it failed more often recently (and a lot during your first attempt if I
>> remember correctly). If you have a fix for that, great ! Show us how
>> it's fixed with a test.
>
> I'm still not sure what test would be satisfactory for doing this. The
> above setup isn't amenable to a fast unit test.
>
>> And the more tests we add the easier it becomes to add a new one.
>
> Agreed, but I'm not sure that we are working in an area where unit tests
> are a good fix. A unit test can't prove the absence of deadlocks in a
> multi-process system.

Further to that, no amount of testing can show whether an architecture
is good or bad: the architecture speaks to things like *tendency to
fail* and expected or predictable emergent behaviour. Testing, of
various sorts, can help you measure particular aspects like 'in
situation X, does a failure occur', but it cannot tell you which
situation X's to test, nor which situation Y's don't need a test
because the structure supports the desired use.

So in this case it's very simple: SQLite is designed for a
single-process embedded DB use case. The import is not such a use
case, and *all* the failure modes you are experiencing are predicted
by the statement 'using SQLite with multiple processes'.

Postgresql is designed for multiple processes working on the system at
once, as long as no schema changes are being made: schema changes will
cause lock contention, as will reading the same rows at maximum
isolation - which is why LP, for instance, explicitly loosens the
isolation we have.
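
For illustration only, here is a minimal Python sketch of a client
explicitly choosing a looser isolation level with psycopg2 (the DSN and
table name are assumptions, not the real udd or LP configuration):

    import psycopg2
    import psycopg2.extensions

    # Connection parameters are illustrative.
    conn = psycopg2.connect("dbname=udd user=importer host=localhost")

    # READ COMMITTED is PostgreSQL's default level; setting it explicitly
    # documents that strict serializability is traded for less lock
    # contention.
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_READ_COMMITTED)

    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM jobs")  # table name is made up
    print(cur.fetchone()[0])
    conn.close()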

AIUI we want, and we know we want, to move to postgresql, in
production, as soon as possible. The storm migration is an attempt to
do that.

Some routes are:
 - move to postgresql using the lower level APIs and add storm later
 - work on the storm patchset to make it reliable with sqlite
 - do both postgresql and sqlite at the same time.

My recommendation, given the existing investment, is to bite the
bullet: bring up a postgresql instance in staging and prod, dump and
import the data to it, and switch over to postgresql. Debug any issues
in real-time, and move on. The depth of potential issues is sharply
limited, because postgresql handles many more combinations of client
interaction than SQLite (by design and necessity).
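
A rough Python sketch of that dump-and-import step could look like the
following; the database path, DSN, table and column names are all made
up for illustration and the real udd schema differs:

    import sqlite3
    import psycopg2

    src = sqlite3.connect("/srv/package-import/meta.db")  # path assumed
    dst = psycopg2.connect("dbname=udd user=importer")    # DSN assumed

    src_cur = src.cursor()
    dst_cur = dst.cursor()

    # Copy one table row by row; the same loop is repeated per table.
    src_cur.execute("SELECT package, active, date_requested FROM jobs")
    for row in src_cur:
        dst_cur.execute(
            "INSERT INTO jobs (package, active, date_requested)"
            " VALUES (%s, %s, %s)", row)

    dst.commit()
    src.close()
    dst.close()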

-Rob



Re: landscape-client in oneiric-updates

2012-06-18 Thread Dmitrijs Ledkovs
On 18/06/12 22:45, Andreas Hasenack wrote:
> 
> Hi,
> 
> landscape-client got an update for oneiric a while ago, but the
> lp:ubuntu/oneiric-updates/landscape-client branch doesn't exist. Is it
> a case of the importer failing? How can we trigger it to run again for
> that package?
> 


http://package-import.ubuntu.com/status/landscape-client.html#2012-04-22 00:43:56.697761

Should explain.


> The updated package:
> http://packages.ubuntu.com/oneiric-updates/landscape-client
> 
> But the branch:
> 
> $ bzr branch lp:ubuntu/oneiric-updates/landscape-client
> bzr: ERROR: Not a branch:
> "bzr+ssh://bazaar.launchpad.net/+branch/ubuntu/oneiric-updates/landscape-client/".
> 
> lp:ubuntu/oneiric/landscape-client exists, but it has the released
> version as expected, not the update.
> 
> Thanks!


-- 
Regards,
Dmitrijs.



landscape-client in oneiric-updates

2012-06-18 Thread Andreas Hasenack

Hi,

landscape-client got an update for oneiric a while ago, but the
lp:ubuntu/oneiric-updates/landscape-client branch doesn't exist. Is it
a case of the importer failing? How can we trigger it to run again for
that package?

The updated package:
http://packages.ubuntu.com/oneiric-updates/landscape-client

But the branch:

$ bzr branch lp:ubuntu/oneiric-updates/landscape-client
bzr: ERROR: Not a branch:
"bzr+ssh://bazaar.launchpad.net/+branch/ubuntu/oneiric-updates/landscape-client/".

lp:ubuntu/oneiric/landscape-client exists, but it has the released
version as expected, not the update.

Thanks!

-- 
Andreas Hasenack
andr...@canonical.com




Fwd: Re: Fwd: Re: Upgrading pristine-xz on jubany

2012-06-18 Thread Dmitrijs Ledkovs
FYI.

Stephane is highly experienced with LXC and does a lot of work with it.

On 06/15/2012 04:43 AM, Dmitrijs Ledkovs wrote:
> Dear Stephane,
> 
> Can you comment about running quantal LXC container on Lucid host?
> It would help package importer a lot.
> 
> Regards,
> Dmitrijs

Hi Dmitrijs,

I wouldn't recommend running LXC on 10.04, the kernel lacks some
required features and the userspace is really quite behind.
Not to mention that these are not secured by apparmor.

Instead I'd strongly recommend going with an Ubuntu 12.04 host and
running the quantal container on top of that.
This way you get the "supported" LXC stack, with apparmor and a working
quantal template.

Stéphane

>  Original Message 
> Subject: Re: Upgrading pristine-xz on jubany
> Date: Fri, 15 Jun 2012 10:32:59 +0200
> From: Vincent Ladeuil 
> To: Barry Warsaw 
> CC: ubuntu-distributed-devel@lists.ubuntu.com
> 
>> Barry Warsaw  writes:
> 
> > On Jun 14, 2012, at 05:21 PM, Vincent Ladeuil wrote:
> >> - I'm already running successful tests inside a quantal lxc
> container :)
> 
> > It has become for many of us not just a nice-to-have but a
> > must-have for Ubuntu development.
> 
> That's my understanding as well.
> 
> Here are my last achievements for the week:
> 
> - I got in touch with pristine-tar maintainers resulting in a trivial
>   bugfix included in 1.25. This is a small step in getting *known* as a
>   primary consumer but it also demonstrates that we can get fixes
>   upstream quickly (1.25 has already been uploaded to sid and quantal).
> 
> - I got in touch with xz maintainers and a fix is on its way there
>   (many thanks to Lasse Collin for his invaluable help here). This will
>   require an additional fix to pristine-xz, which I will submit as soon
>   as I can test the xz fix.
> 
> With these fixes in place, on quantal, only < 10 pristine-tar import
> failures should remain out of the current 338 on jubany. Said failures
> include crazy stuff like tarballs containing files with
> chmod bits... I haven't looked more precisely at how to fix that (and I'm
> not sure it's worth digging into for now).
> 
> And don't forget that when a package fails to import one release, all the
> subsequent ones are blocked as well. When we fixed the bzip2 issue last
> January, ~70 packages were blocked accounting for ~800 releases (don't
> quote me on these numbers, it's just a vague recollection but the scales
> should be ok).
> 
> I also have a pending patch for bzr-builddeb that makes it easier to
> test against pristine-tar failures (will probably submit an mp for that
> today). Roughly, both builddeb and pristine-tar use temp files so when
> the import fails, the context is lost. The fix is to save enough of the
> temp files to be able to re-run pristine-tar alone without re-trying an
> import (the test cycle is then reduced to seconds instead of hours).
> 
> With these 3 fixes, we'll be in a far better position to be more
> reactive to pristine-tar failures in the future (running quantal will
> then mean that getting fixes will be as simple as stopping the importer,
> running apt-get upgrade and restarting the importer).
> 
> It also means that testing can occur on quantal without the need to
> install a bunch of pre-requisites in sync with what is deployed on
> jubany (which can quickly get totally out of control).
> 
> I'm still investigating running a quantal lxc container right now on
> jubany (any feedback about lxc on *lucid* is welcome, especially known
> issues that have been fixed in precise).
> 
> Once I validate this we can look at deploying a quantal lxc
> container on jubany.
> 
>   Vincent
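
The temp-file idea sketched above (keep enough of the working files on
failure to re-run pristine-tar by hand) could look roughly like this in
Python; the helper name, paths and commands are illustrative and this is
not the actual bzr-builddeb patch:

    import os
    import shutil
    import subprocess
    import tempfile

    KEEP_DIR = "/tmp/pristine-tar-failures"  # destination is an assumption

    def import_tarball(tarball, branch_dir):
        # Work on a private copy so the original state is untouched.
        workdir = tempfile.mkdtemp(prefix="pristine-import-")
        work_tarball = os.path.join(workdir, os.path.basename(tarball))
        shutil.copy(tarball, work_tarball)
        try:
            subprocess.check_call(
                ["pristine-tar", "commit", work_tarball], cwd=branch_dir)
        except subprocess.CalledProcessError:
            # On failure, keep the context so pristine-tar can be re-run
            # alone in seconds instead of redoing a full import.
            if not os.path.isdir(KEEP_DIR):
                os.makedirs(KEEP_DIR)
            shutil.move(workdir, os.path.join(
                KEEP_DIR, os.path.basename(workdir)))
            raise
        else:
            shutil.rmtree(workdir)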







Re: Moving udd away from sqlite

2012-06-18 Thread James Westby
On Mon, 18 Jun 2012 10:45:26 +0200, Vincent Ladeuil wrote:
> > James Westby  writes:
> 
> > On Fri, 15 Jun 2012 12:34:12 +0200, Vincent Ladeuil wrote:
> >> > It's not magic. It's moving from a database that's not designed for
> >> > concurrent use to one that is designed for concurrent use.
> >> 
> >> Despite not being designed for concurrent use, it *is* used this
> >> way and lock contentions have been encountered leading me to
> >> believe that the actual *design* needs to be fixed. The fact that
> >> changing the db is triggering more contentions is a symptom of a
> >> deeper issue.
> 
> > Changing the db access layer is triggering that, we were still
> > running on the same (non-multi-user) db. I agree that the design
> > needs to be fixed, and that's exactly what we're talking about,
> > fixing it by moving to a db that is designed for multi-user use.
> 
> It looks like your understanding of the issue is better than mine here,
> would you mind sharing that knowledge in an automated test (with the
> added benefit that we won't regress in this area) ?
> 
> Just this week-end we had an add-import-jobs failure:
> 
> Traceback (most recent call last):
>   File "/srv/package-import.canonical.com/new/scripts/bin/add-import-jobs", line 5, in <module>
>     sys.exit(main())
>   File "/srv/package-import.canonical.com/new/scripts/udd/scripts/add_import_jobs.py", line 17, in main
>     icommon.create_import_jobs(lp, status_db)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 304, in create_import_jobs
>     status_db.add_import_jobs(checked, newest_published)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 633, in add_import_jobs
>     self._add_job(c, package, self.JOB_TYPE_NEW)
>   File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 615, in _add_job
>     datetime.utcnow(), None, None))
> sqlite3.OperationalError: database is locked
> 
> So we already know that add_import_jobs is involved in the bug (with the
> current sqlite-based implementation), but who is the other user in this
> case and how can this be reproduced ?

Each connection to sqlite is another "user", so each of the cron
scripts, as well as the imports themselves, and several connections
within mass-import are all the users.

When a write operation is started a global lock is acquired that locks
out any other writers until the operation is complete.

If the lock is held then the library will wait up to a timeout
(configured to be 30s for udd) for the lock to be released before giving
up.

The errors like the above occur when the timeout is reached, so either
another transaction took more than 30s to release the lock, or there
were lots of connections trying to take the lock, and this one didn't
win before the 30s was up.
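
In pysqlite terms that timeout is just a connect parameter; a minimal
sketch (the database path is made up, the 30s value is the udd setting
described above):

    import sqlite3

    # timeout is how long pysqlite waits for the write lock before raising
    # "sqlite3.OperationalError: database is locked".
    conn = sqlite3.connect("/srv/package-import/meta.db", timeout=30.0)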

When we change to storm it forces pysqlite into a higher isolation level,
so that transactions are started when any statement is executed. My
guess is that this means locks are taken more frequently and are held
for longer, giving more contention errors.

Postgres doesn't have a global lock, it has table or row locks, so that
clients will only hit lock contention if they are changing the same
data, which will be much less frequent.

How can I show that in an automated test? I can write an XFAIL test
showing that if two connections are opened and one starts a transaction,
the other hits a locking exception if it tries to do anything, but that
doesn't seem to prove much about the operation of the system.
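
Such a test would be little more than the following sketch (the path,
table name and expected-failure framing are illustrative):

    import os
    import sqlite3
    import tempfile
    import unittest

    class TestSqliteWriteLock(unittest.TestCase):

        def test_second_writer_hits_lock(self):
            db = os.path.join(tempfile.mkdtemp(), "lock-demo.db")
            # isolation_level=None means we manage transactions ourselves.
            first = sqlite3.connect(db, timeout=0.1, isolation_level=None)
            second = sqlite3.connect(db, timeout=0.1, isolation_level=None)
            first.execute("CREATE TABLE jobs (id INTEGER)")

            # The first connection takes the global write lock and holds it.
            first.execute("BEGIN IMMEDIATE")
            first.execute("INSERT INTO jobs VALUES (1)")

            # The second connection now times out trying to write.
            self.assertRaises(sqlite3.OperationalError,
                              second.execute, "BEGIN IMMEDIATE")
            first.execute("ROLLBACK")

    if __name__ == "__main__":
        unittest.main()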

> I.e. reproducing the add_import_jobs failure in a test that will fail
> with sqlite and succeed with your changes will demonstrate we've
> captured (and fixed) at least one lock contention.

We are dealing with probabilistic failure though. I can demonstrate that
in a deterministic situation changing two separate tables under sqlite
will take global locks, but I can't prove that we will never get
contention under postgres.

> If the test suite cannot be trusted to catch most of the issues that
> happen in production, the test suite should be fixed.
> 
> You're not implying that, since testing in production is needed, the test
> suite is useless, right ?

No, I'm saying that the only measure of whether something runs correctly
in production is whether it runs in production.

> From that, can we imagine a test that will import a few packages and
> compare the corresponding dbs ?

We can do that as part of testing the migration script.

> > It can be restarted with the dbs from whenever the transition starts and
> > it will catch up in roughly the time between starting the transition and
> > rolling back. There may be a few bugs due to replaying things, but we do
> > it all the time (e.g. removing revids and re-importing when someone does
> > push --overwrite)
> 
> As in requeue --full ? requeue --zap-revids ? None of them is used on a
> daily basis but my limited experience there never triggered issues either.

Re: Moving udd away from sqlite

2012-06-18 Thread Vincent Ladeuil
> James Westby  writes:

> On Fri, 15 Jun 2012 12:34:12 +0200, Vincent Ladeuil wrote:
>> > It's not magic. It's moving from a database that's not designed for
>> > concurrent use to one that is designed for concurrent use.
>> 
>> Despite not being designed for concurrent use, it *is* used this
>> way and lock contentions have been encountered leading me to
>> believe that the actual *design* needs to be fixed. The fact that
>> changing the db is triggering more contentions is a symptom of a
>> deeper issue.

> Changing the db access layer is triggering that, we were still
> running on the same (non-multi-user) db. I agree that the design
> needs to be fixed, and that's exactly what we're talking about,
> fixing it by moving to a db that is designed for multi-user use.

It looks like your understanding of the issue is better than mine here,
would you mind sharing that knowledge in an automated test (with the
added benefit that we won't regress in this area) ?

Just this week-end we had an add-import-jobs failure:

Traceback (most recent call last):
  File "/srv/package-import.canonical.com/new/scripts/bin/add-import-jobs", line 5, in <module>
    sys.exit(main())
  File "/srv/package-import.canonical.com/new/scripts/udd/scripts/add_import_jobs.py", line 17, in main
    icommon.create_import_jobs(lp, status_db)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 304, in create_import_jobs
    status_db.add_import_jobs(checked, newest_published)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 633, in add_import_jobs
    self._add_job(c, package, self.JOB_TYPE_NEW)
  File "/srv/package-import.canonical.com/new/scripts/udd/icommon.py", line 615, in _add_job
    datetime.utcnow(), None, None))
sqlite3.OperationalError: database is locked

So we already know that add_import_jobs is involved in the bug (with the
current sqlite-based implementation), but who is the other user in this
case and how can this be reproduced ?

>> Well, when the correctness and safety is demonstrated, the context (and
>> hence my own answer) will probably be different but until then I just
>> can't say.
>> 
>> > And I'm very reluctant to fork without an actual plan for merging
>> > back: how to know when it's safe & how to actually achieve it.
>> 
>> And I have no idea (nor time right now) to debug the fallouts of such a
>> change that the actual package importer doesn't need. Hence my tendency
>> to consider that demonstrating the validity of this change should be
>> achieved first.

> But you just said above that you *do* think it needs to be fixed?

Yes, it needs to be fixed. It doesn't cause blocking issues currently
which is why https://bugs.launchpad.net/udd/+bug/724893 hasn't been
fixed yet.

> How can we demonstrate the validity of the change?

With an automated test (or several ;) ! What else ? ;)

This would make it more comfortable to run in production to catch the
other *fallouts*, being confident the known issue *is* fixed.

I.e. reproducing the add_import_jobs failure in a test that will fail
with sqlite and succeed with your changes will demonstrate we've
captured (and fixed) at least one lock contention.

> We can only demonstrate that it doesn't break production by
> running the changes in production.

If the test suite cannot be trusted to catch most of the issues that
happen in production, the test suite should be fixed.

You're not implying that, since testing in production is needed, the test
suite is useless, right ?

> What would satisfy you that it was unlikely to break production?

Bugs that happen in production have a far higher cost than test
failures, that's where automated tests get most of their value.

But see below.

>> Would there be a script to migrate from sqlite to PG ?

> Yes.

Cool.

From that, can we imagine a test that will import a few packages and
compare the corresponding dbs ?

>> Can the package importer be re-started with empty dbs and catch up (how
>> long will it take ? Days ? Weeks ?). Can this trigger bugs because the
>> importer don't remember what it pushed to lp ?

> It can be restarted with the dbs from whenever the transition starts and
> it will catch up in roughly the time between starting the transition and
> rolling back. There may be a few bugs due to replaying things, but we do
> it all the time (e.g. removing revids and re-importing when someone does
> push --overwrite)

As in requeue --full ? requeue --zap-revids ? None of them is used on a
daily basis but my limited experience there never triggered issues
either.

>> Or do you expect us to see another peak like
>> http://webnumbr.com/ubuntu-package-import-failures.from%282012-01-24%29 ?

> Hopefully not.

>> Yes, that's why we're not in a position to safely accept such a change !
>> 
>> And all the time spent

Re: Upgrading pristine-xz on jubany

2012-06-18 Thread Vincent Ladeuil
Some good news after my tests Friday and this week-end:

> Vincent Ladeuil  writes:



> - I got in touch with xz maintainers and a fix is on its way there
> (many thanks to Lasse Collin for his invaluable help here). This
> will require an additional fix to pristine-xz, which I will submit
> as soon as I can test the xz fix.

Lasse Collin gave me a patch that I tested successfully: 135 out of the 150
failures are fixed.

> And don't forget that when a package fails to import one release,
> all the subsequent ones are blocked as well. When we fixed the
> bzip2 issue last January, ~70 packages were blocked accounting for
> ~800 releases (don't quote me on these numbers, it's just a vague
> recollection but the scales should be ok).

 135 packages were blocking 4437 releases here \o/ So the scale
is even bigger than I thought.

The fix has not been committed upstream yet[1] so we'll need to carry it
ourselves.

Merge proposal/patch submission time ;)

I'll followup with more explanations about the root cause later.

   Vincent

[1]: In order to re-create the compressed file, one has to provide xz
with a list of block sizes. Since this is a feature that only a handful
of use cases will benefit from, it will need to be cooked a bit before
landing in xz-utils. I had a deep enough discussion with Lasse to be
confident that the feature will find its way into an upcoming release in
one form or another but until this happen, we'll need to carry a patch
in both xz-utils and pristine-tar. But I don't expect significant
changes except for the option name (which is why I can't send the full
patch to pristine-tar for now).
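
To make the footnote concrete, the intended use looks roughly like the
following Python sketch; the --block-list option name is the one being
discussed upstream and may well change, and the sizes and file name are
made up:

    import subprocess

    # Block sizes recovered from the original .xz file, in bytes
    # (illustrative values only).
    block_sizes = [1048576, 1048576, 524288]

    # Re-compress the tarball forcing the same block boundaries, so the
    # result can be byte-identical to the original upload.
    subprocess.check_call([
        "xz", "--compress", "-9",
        "--block-list=" + ",".join(str(s) for s in block_sizes),
        "foo_1.0.orig.tar",
    ])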
