Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-10-26 Thread Niels Thykier
On 2011-10-05 10:45, Niels Thykier wrote:
 On 2011-09-13 18:04, Niels Thykier wrote:
 Package: lintian
 Severity: important


 Jakub realized that a lot of our errors on lintian.d.o are
 caused by limitations in the file-system.  We should probably use
 a pool or something similar to reduce the number of elements in
 each directory.

 ~Niels



 
 I guess it might be a good time for a little status update here.
 

Since no one has commented so far I have applied do-cracy and done stuff...

 The lab-refactor branch is now working for simple use cases[1].
 However, the lintian.d.o-style usage needs some attention.
 
 In the master branch we use $lab/info/* as a list of what was in the
 mirror last time we checked.  Those files have been repurposed in the
 lab-refactor branch, where their new meaning is what is currently in
 the lab.  This means that dist search[2] is currently broken.
   To my knowledge there are *2* known cases where dist searches make
 sense - lintian.d.o and lintian.debathena.o.  I feel we should move that
 functionality to a new frontend (such as the lintian-harness[3]) that
 would focus on lintian.d.o-like setups.
 
 Note that repurposing is not entirely complete and therefore
 reporting/harness is more or less broken right now.  One of the issues
 is that unpack/* still use the files in info/* as a dist list and not a
 lab list.
 

dist search is now removed from lintian - the reporting stuff was left
untouched and is therefore still broken.  Yay for progress! :)

When I was looking at this, I realised two things.  First a lot of
variables (and cmd-options) now appear to be redundant in
frontend/lintian.  Namely all of LINTIAN_{ARCHIVEDIR,AREA,DIST} and
possibly also LINTIAN_ARCH.  It has not been double checked, but I
strongly suspect them of being unused now.

Secondly, the current search rules were not sufficient.  Basically, it
was only possible to match all packages with a given name.  I have
solved this by creating a simple way of referring to packages in the Lab.

Originally I planned to accept both the current britney-style
format[1] and the filename-style[2].  However it occurred to me that
the filename-style is (for obvious reasons) impossible to reliably
distinguish from a normal file.  As this could lead to confusion for the
users (i.e. principle of least surprise), I decided to not include the
filename-style.  The britney-style format is described in
man/lintian.pod.in.

[1] [type:]package[/version[/arch]]

[2] package_version[_arch].ext
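
As a rough illustration of the britney-style format in [1], the query can be split as sketched below.  This is not the code in the lab-refactor branch; the names LabQuery, parse_lab_query and the KNOWN_TYPES set are made up for this example.

```python
from typing import NamedTuple, Optional

# Assumption: the set of accepted type prefixes.  The message only gives the
# general shape "[type:]package[/version[/arch]]".
KNOWN_TYPES = {"binary", "udeb", "source", "changes"}


class LabQuery(NamedTuple):
    pkg_type: Optional[str]
    name: str
    version: Optional[str]
    arch: Optional[str]


def parse_lab_query(spec: str) -> LabQuery:
    """Split '[type:]package[/version[/arch]]' into its parts."""
    pkg_type = None
    head, sep, rest = spec.partition(":")
    # Only accept the prefix as a type if it is a known one: versions may
    # contain ':' themselves (epochs), so a blind split would be wrong.
    if sep and head in KNOWN_TYPES:
        pkg_type, spec = head, rest
    parts = spec.split("/")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"not a valid package query: {spec!r}")
    name = parts[0]
    version = parts[1] if len(parts) > 1 else None
    arch = parts[2] if len(parts) > 2 else None
    return LabQuery(pkg_type, name, version, arch)
```

A bare name leaves every optional part as None, which matches the old "all packages with a given name" behaviour; each extra component narrows the match.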

 
 I also considered adding a file in info/ to keep track of lab-wide
 (meta)data, such as the lab-format.  In the old lab format, this is
 stored in every entry.  This makes it slightly more difficult to check
 if we are dealing with a compatible lab.
   Consider running an old lintian on a lab in the new style - the two
 do not store the entries in the same place, so the old lintian has no
 reliable way to detect that the lab is not compatible.  I would prefer
 that an old lintian would always be able to say "The lab uses a newer
 lab-format than this version of lintian supports" - even if this case
 will probably never happen.
 

I have added a lab-wide data file stored as $LAB/info/lab-info.  It
uses the deb822-style syntax and has two fields in the first paragraph:


Lab-Format: $format
Layout: pool


The Lab-Format field describes the current format of the lab[3].  The
Layout field describes how the packages are placed in the lab.
Currently only one layout exists (namely pool), which reflects the
layout in the current branch.
  The Layout field allows us to implement and play with a new layout
side-by-side with the current one.  Hopefully we will never need this
feature, but probably we will.

[3] Will be 11 when the development is done.  Currently it is 10.1.
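
To make the compatibility check concrete, here is a minimal sketch of reading the first paragraph of such a lab-info file and rejecting labs that are too new.  It is an illustration only: the function names are hypothetical, continuation lines are not handled, and comparing formats as dotted numbers is an assumption based on the values 10.1 and 11 mentioned above.

```python
SUPPORTED_LAYOUTS = {"pool"}  # the only layout that exists so far


def read_first_paragraph(text: str) -> dict:
    """Parse the first deb822-style paragraph (simplified: no continuation lines)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first paragraph
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line in lab-info: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def format_key(value: str) -> tuple:
    """Turn '10.1' into (10, 1) so that formats compare numerically (assumption)."""
    return tuple(int(part) for part in value.split("."))


def check_lab_info(text: str, supported_format: str = "11") -> dict:
    """Return the parsed fields, or raise if this lab cannot be used."""
    fields = read_first_paragraph(text)
    for required in ("Lab-Format", "Layout"):
        if required not in fields:
            raise ValueError(f"lab-info lacks the {required} field")
    if format_key(fields["Lab-Format"]) > format_key(supported_format):
        raise RuntimeError(
            "The lab uses a newer lab-format than this version of lintian supports"
        )
    if fields["Layout"] not in SUPPORTED_LAYOUTS:
        raise RuntimeError(f"unsupported lab layout: {fields['Layout']}")
    return fields
```

The point of keeping this in one lab-wide file is that the check needs a single read, independent of where the entries themselves are stored.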

 
 I am also wondering what we need in the per-entry lintian-status file.
  In the master branch, we store Lintian-Version, Lab-Format, Package
 (name), Version (package), Type (package) and Timestamp.
   When we read the status file, we compare lab-format, package version
 and timestamp.  With the changes in lab-refactor branch, the lab always
 supports multiple versions of the same package, thus the package version
 comparison is a no-op.
 
 As I understand it, the timestamp is there to make lintian re-unpack
 the package if it changed since the last run.  Currently it completely
 removes the entry if the timestamp does not match.  Though this code
 only makes sense for personal static labs - on the lintian.d.o case,
 the version of a package can not be reused (at least not in general).
   The timestamp-part is not in the lab-refactor branch (yet?).
 
 I am considering replacing the Lab-format value with an
 entry-format-version.  Not sure it makes sense, but I am thinking it may
 make migration to newer formats easier.
   If I had not (ab)used the opportunity to do optimizations in the
 .lintian-status file (see below), the migration from the current to the
 lab-format would basically just have been a 

Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-10-05 Thread Niels Thykier
On 2011-09-13 18:04, Niels Thykier wrote:
 Package: lintian
 Severity: important
 
 
 Jakub realized that a lot of our errors on lintian.d.o are
 caused by limitations in the file-system.  We should probably use
 a pool or something similar to reduce the number of elements in
 each directory.
 
 ~Niels
 
 
 

I guess it might be a good time for a little status update here.

The lab-refactor branch is now working for simple use cases[1].
However, the lintian.d.o-style usage needs some attention.

In the master branch we use $lab/info/* as a list of what was in the
mirror last time we checked.  Those files have been repurposed in the
lab-refactor branch, where their new meaning is what is currently in
the lab.  This means that dist search[2] is currently broken.
  To my knowledge there are *2* known cases where dist searches make
sense - lintian.d.o and lintian.debathena.o.  I feel we should move that
functionality to a new frontend (such as the lintian-harness[3]) that
would focus on lintian.d.o-like setups.

Note that repurposing is not entirely complete and therefore
reporting/harness is more or less broken right now.  One of the issues
is that unpack/* still use the files in info/* as a dist list and not a
lab list.


I also considered adding a file in info/ to keep track of lab-wide
(meta)data, such as the lab-format.  In the old lab format, this is
stored in every entry.  This makes it slightly more difficult to check
if we are dealing with a compatible lab.
  Consider running an old lintian on a lab in the new style - the two
do not store the entries in the same place, so the old lintian has no
reliable way to detect that the lab is not compatible.  I would prefer
that an old lintian would always be able to say "The lab uses a newer
lab-format than this version of lintian supports" - even if this case
will probably never happen.


I am also wondering what we need in the per-entry lintian-status file.
 In the master branch, we store Lintian-Version, Lab-Format, Package
(name), Version (package), Type (package) and Timestamp.
  When we read the status file, we compare lab-format, package version
and timestamp.  With the changes in lab-refactor branch, the lab always
supports multiple versions of the same package, thus the package version
comparison is a no-op.

As I understand it, the timestamp is there to make lintian re-unpack
the package if it changed since the last run.  Currently it completely
removes the entry if the timestamp does not match.  Though this code
only makes sense for personal static labs - on the lintian.d.o case,
the version of a package can not be reused (at least not in general).
  The timestamp-part is not in the lab-refactor branch (yet?).

I am considering replacing the Lab-format value with an
entry-format-version.  Not sure it makes sense, but I am thinking it may
make migration to newer formats easier.
  If I had not (ab)used the opportunity to do optimizations in the
.lintian-status file (see below), the migration from the current to the
lab-format would basically just have been a bunch of mv X Y + updating
info/*.

Finally, I have added a Collections entry to the .lintian-status file.
 This is used to keep track of which collections have been run and
removes the need for .$coll-$ver files.
  This will reduce our (expected) file-creation from 18 to 1 per binary
package[4].  For a full mirror run, 18 files per binary package roughly
translates to 630 000 files[5].  For udebs and sources we go from 10 and
8 to 1.
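
The figures can be checked with a few lines of arithmetic.  The inputs below are the ones given in footnotes [4] and [5]; the variable names are only for this illustration.

```python
# Inputs taken from footnotes [4] and [5] of this message.
BINARY_COLLECTIONS = 17           # collections run per binary package
STATUS_FILES = 1                  # the .lintian-status file itself
BINARY_PACKAGES = 35_000          # assumed size of a full mirror
LINK_LIMITED_PACKAGES = 32_000    # roughly what fits in the lab today

old_files_per_binary = BINARY_COLLECTIONS + STATUS_FILES          # 18
new_files_per_binary = STATUS_FILES                               # 1

old_files_full_mirror = old_files_per_binary * BINARY_PACKAGES    # 630 000
old_files_today = old_files_per_binary * LINK_LIMITED_PACKAGES    # 576 000
new_files_full_mirror = new_files_per_binary * BINARY_PACKAGES    # 35 000

print(old_files_full_mirror, old_files_today, new_files_full_mirror)
```

So for binary packages alone, tracking collections in the status file avoids creating about 595 000 marker files on a full mirror run.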


So to sum it up:  I am repurposing $lab/info/* files to be a manifest of
what is in the lab (rather than what is on the mirror).  I am breaking
dist search and suggest we create a separate frontend for
archive-checks that supports dist search.
  I am considering adding a metadata file in $lab/info/ to include stuff
like Lab format version.  I have removed data from the (per-entry)
.lintian-status files.  The (per-entry) .$coll-$ver files will be
removed and the .lintian-status file will track those.

Any comments?  If not I will (hopefully) get the branch ready to be
merged into master within 2-3 weeks - so if you have not reviewed the
branch yet, now would be a good time to start.  :)

~Niels

[1] That would be single package checks:
lintian $pkg

but also simple static-lab usage

lintian --lab $lab --setup-static
lintian --lab $lab --unpack $pkg[,..., $pkgN]
lintian --lab $lab -r $pkg[,..., $pkgN]
etc.

[2] The check packages from mirror search, i.e.

lintian --lab $lab $pkg[,...,$pkgN]

will first check the mirror and then fallback to the lab.  I suggest we
only check the lab in this case.

[3] http://lists.debian.org/debian-lint-maint/2011/08/msg00285.html

[4] 17 binary collections + 1 lintian status file.

[5]  Assumes 35 000 binary packages.  Though currently only 576 000
files are created due to the file system limitations (~32 000 binary
packages).




-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact 

Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-09-15 Thread Niels Thykier
On 2011-09-14 12:34, Niels Thykier wrote:
 On 2011-09-13 19:21, Jakub Wilk wrote:
 * Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:
 Jakub realized that a lot of our errors on lintian.d.o are
 caused by limitations in the file-system.  We should probably use a
 pool or something similar to reduce the number of elements in each
 directory.

 Just to shed more light on what the problem is:

 $ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
 Device: 807h/2055d  Inode: 7512069 Links: 32000

 On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the
 number of hard links, so we can't create more directories in binary/.

 ext4 doesn't have this limitation, so a work-around would be to convert
 the filesystem.

 
 Upgrading to ext4 might be a solution, but I personally think that
 changing the Lab layout is the right thing(tm) to do in this case.
 Considering we want derivatives to do archive-wide Lintian runs, it may
 be prudent to be file-system agnostic.
 
 Also, we can (ab)use this opportunity to enable multi-version +
 multi-architectures in static labs as well.  Hopefully we can also
 clean up the Lab API while we are at it. XD
 
 ~Niels
 
 
 
 

Okay, so before diving into this - can anyone elaborate on the current
Lab design?  The User Manual does not give me a lot here.  I am asking
because I want to know if there is something we should pay attention to
when working on this.


Beyond this I have been spending some time looking at the situation and
possible solutions and extensions.  My basic idea does not affect the layout
of entries/unpacked packages (i.e. collections should be unaffected).
  It appears that we have been bumping the LAB_FORMAT about once a year
the last two years (in 01-2010 for changes and 03-2003 due to many
recent changes[LFB]).  Hopefully format 11 can last far longer than a
year. :)

So, a simple solution is to use a mirror-like pool, so something like:

 $LINTIAN_LAB/pool/l/lintian/lintian_2.5.3_all_binary/
 $LINTIAN_LAB/pool/l/lintian/lintian_2.5.3_source/

The last entry would be ${name}_${version}(_${arch})_${type}, where
the ${arch} part is not relevant for source.  Not sure what to do about
changes and architecture though (as they may have multiple architectures).
  The above has the advantage of trivially allowing multiple versions
(and architectures) of the same package in the pool.
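
A minimal sketch of how such a pool path could be computed is shown below.  The function name pool_entry is hypothetical, and two things are assumptions rather than decisions from this thread: the "lib" prefix rule is borrowed from the Debian mirror layout (libfoo goes under libf/), and every non-source type is required to carry an architecture.  The open question about changes files is not addressed.

```python
from pathlib import PurePosixPath
from typing import Optional


def pool_entry(lab: str, name: str, version: str, pkg_type: str,
               arch: Optional[str] = None) -> PurePosixPath:
    """Return $LAB/pool/<prefix>/<name>/<name>_<version>[_<arch>]_<type>."""
    # Assumption: same prefix rule as the Debian mirror pool.
    prefix = name[:4] if name.startswith("lib") else name[:1]
    if pkg_type == "source":
        entry = f"{name}_{version}_{pkg_type}"  # arch is not relevant for source
    else:
        if arch is None:
            raise ValueError(f"{pkg_type} entries need an architecture")
        entry = f"{name}_{version}_{arch}_{pkg_type}"
    return PurePosixPath(lab) / "pool" / prefix / name / entry


print(pool_entry("/srv/lab", "lintian", "2.5.3", "binary", "all"))
print(pool_entry("/srv/lab", "lintian", "2.5.3", "source"))
```

Because version and architecture are part of the last path component, two versions of the same package never collide, and no single directory has to hold one subdirectory per package in the archive.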

I am thinking this would be a good time to make the Lab maintain its own
state files (info/$type-packages).  This implies updating the state
files when adding or removing a package from the lab.  If the lab
maintains these, we will most likely have an easier time providing a
sane API for accessing packages in the Lab.
  If we go down this route I would probably use the opportunity to empty
the standards-version and the architecture field in
info/source-packages.
  Furthermore, each entry in these files has enough information to
create a Lintian::Processable.  If we extend L::Processable we
could be looking at a standard way of requesting and submitting
packages to the Lab.

I am also considering creating a migration utility that could
upgrade a lab from LAB_FORMAT 10 to 11.  Since we have gotten rid of the
unpack scripts and cross-package symlinks, this change will basically
just be a bunch of mkdir followed by tons of mv (and updating the
Lab-Format in the .lintian-status file).
  Admittedly lintian.d.o probably has relatively little to gain from it
as we will need to do a full run once this bug is fixed anyway.


~Niels

[LFB] Commits:
fc10b608
8b1f0cfe







Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-09-14 Thread Niels Thykier
On 2011-09-13 19:21, Jakub Wilk wrote:
 * Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:
 Jakub realized that a lot of our errors on lintian.d.o are
 caused by limitations in the file-system.  We should probably use a
 pool or something similar to reduce the number of elements in each
 directory.
 
 Just to shed more light on what the problem is:
 
 $ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
 Device: 807h/2055d  Inode: 7512069 Links: 32000
 
 On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the
 number of hard links, so we can't create more directories in binary/.
 
 ext4 doesn't have this limitation, so a work-around would be to convert
 the filesystem.
 

Upgrading to ext4 might be a solution, but I personally think that
changing the Lab layout is the right thing(tm) to do in this case.
Considering we want derivatives to do archive-wide Lintian runs, it may
be prudent to be file-system agnostic.

Also, we can (ab)use this opportunity to enable multi-version +
multi-architectures in static labs as well.  Hopefully we can also
clean up the Lab API while we are at it. XD

~Niels







Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-09-13 Thread Niels Thykier
Package: lintian
Severity: important


Jakub realized that a lot of our errors on lintian.d.o are
caused by limitations in the file-system.  We should probably use
a pool or something similar to reduce the number of elements in
each directory.

~Niels






Bug#641468: lintian: update the lab layout (i.e. use pools)

2011-09-13 Thread Jakub Wilk

* Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:
Jakub realized that a lot of our errors on lintian.d.o are
caused by limitations in the file-system.  We should probably use a
pool or something similar to reduce the number of elements in each
directory.


Just to shed more light on what the problem is:

$ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
Device: 807h/2055d  Inode: 7512069 Links: 32000

On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the
number of hard links, so we can't create more directories in binary/.


ext4 doesn't have this limitation, so a work-around would be to convert 
the filesystem.
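
The same check can be scripted.  The sketch below is only an illustration with hypothetical names; it relies on the link count that stat reports (a directory's own entry, its "." entry, and one ".." per subdirectory on ext3) and on the 32000 limit quoted above.  Other filesystems may report directory link counts differently, so the result is only meaningful on ext3-like filesystems.

```python
import os

EXT3_LINK_MAX = 32000  # the per-inode limit quoted above


def remaining_subdirs(nlink: int, link_max: int = EXT3_LINK_MAX) -> int:
    """How many more subdirectories fit before the link limit is hit."""
    return max(link_max - nlink, 0)


def directory_headroom(path: str) -> int:
    """Read the directory's link count and report the remaining headroom."""
    return remaining_subdirs(os.stat(path).st_nlink)


if __name__ == "__main__":
    # "." is just a placeholder; point this at the lab's binary/ directory.
    print(directory_headroom("."))
```

With the link count of 32000 shown above, remaining_subdirs returns 0, which is why no further package directories can be created in binary/.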


--
Jakub Wilk


