Bug#641468: lintian: update the lab layout (i.e. use pools)
On 2011-10-05 10:45, Niels Thykier wrote:
On 2011-09-13 18:04, Niels Thykier wrote:

Package: lintian
Severity: important

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

~Niels

I guess it might be a good time for a little status update here.

Since no one has commented so far I have applied do-cracy and done stuff...

The lab-refactor branch is now working for simple use cases[1]. However, the lintian.d.o-style usage needs some attention. In the master branch we use $lab/info/* as a list of what was in the mirror last time we checked. Those files have been repurposed in the lab-refactor branch, where their new meaning is what is currently in the lab. This means that dist search[2] is currently broken.

To my knowledge there are *2* known cases where dist searches make sense - lintian.d.o and lintian.debathena.o. I feel we should move that functionality to a new frontend (such as the lintian-harness[3]) that would focus on lintian.d.o-like setups.

Note that repurposing is not entirely complete and therefore reporting/harness is more or less broken right now. One of the issues is that unpack/* still use the files in info/* as a dist list and not a lab list.

dist search is now removed from lintian - the reporting stuff was left untouched and is therefore still broken. Yay for progress! :)

When I was looking at this, I realised two things. First, a lot of variables (and cmd-options) now appear to be redundant in frontend/lintian, namely all of LINTIAN_{ARCHIVEDIR,AREA,DIST} and possibly also LINTIAN_ARCH. It has not been double checked, but I strongly suspect them of being unused now.

Secondly, the current search rules were not sufficient. Basically, it was only possible to match all packages with a given name. I have solved this by creating a simple way of referring to packages in the Lab.
Originally I planned to accept both the current britney-style format[1] and the filename-style[2]. However, it occurred to me that the filename-style is (for obvious reasons) impossible to reliably distinguish from a normal file. As this could lead to confusion for the users (i.e. principle of least surprise), I decided not to include the filename-style. The britney-style format is described in man/lintian.pod.in.

[1] [type:]package[/version[/arch]]
[2] package_version[_arch].ext

I also considered adding a file in info/ to keep track of lab-wide (meta)data, such as the lab-format. In the old lab format, this is stored in every entry. This makes it slightly more difficult to check if we are dealing with a compatible lab. Consider if you use an old lintian with the new lab style - they do not store the entries in the same place, so it has no reliable way to detect that it is not compatible. I would prefer that an old lintian would always be able to say "The lab uses a newer lab-format than this version of lintian supports" - even if this case will probably never happen.

I have added a lab-wide data file stored as $LAB/info/lab-info. It uses the deb822-style syntax and has two fields in the first paragraph:

  Lab-Format: $format
  Layout: pool

The Lab-Format field describes the current format of the lab[3]. The Layout field describes how the packages are placed in the lab. Currently only one layout exists (namely pool), which reflects the layout in the current branch. The Layout field allows us to implement and play with a new layout side-by-side with the current one. Hopefully we will never need this feature, but probably we will.

[3] Will be 11 when the development is done. Currently it is 10.1.

I am also wondering what we need in the per-entry lintian-status file. In the master branch, we store Lintian-Version, Lab-Format, Package (name), Version (package), Type (package) and Timestamp. When we read the status file, we compare lab-format, package version and timestamp.
With the changes in the lab-refactor branch, the lab always supports multiple versions of the same package, thus the package version comparison is a no-op.

As I understand it, the timestamp is there to make lintian re-unpack the package if it changed since the last run. Currently it completely removes the entry if the timestamp does not match. Though this code only makes sense for personal static labs - in the lintian.d.o case, the version of a package can not be reused (at least not in general). The timestamp-part is not in the lab-refactor branch (yet?).

I am considering replacing the Lab-format value with an entry-format-version. Not sure it makes sense, but I think it may make migration to newer formats easier. If I had not (ab)used the opportunity to do optimizations in the .lintian-status file (see below), the migration from the current to the new lab-format would basically just have been a
Bug#641468: lintian: update the lab layout (i.e. use pools)
On 2011-09-13 18:04, Niels Thykier wrote:

Package: lintian
Severity: important

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

~Niels

I guess it might be a good time for a little status update here.

The lab-refactor branch is now working for simple use cases[1]. However, the lintian.d.o-style usage needs some attention. In the master branch we use $lab/info/* as a list of what was in the mirror last time we checked. Those files have been repurposed in the lab-refactor branch, where their new meaning is what is currently in the lab. This means that dist search[2] is currently broken.

To my knowledge there are *2* known cases where dist searches make sense - lintian.d.o and lintian.debathena.o. I feel we should move that functionality to a new frontend (such as the lintian-harness[3]) that would focus on lintian.d.o-like setups.

Note that repurposing is not entirely complete and therefore reporting/harness is more or less broken right now. One of the issues is that unpack/* still use the files in info/* as a dist list and not a lab list.

I also considered adding a file in info/ to keep track of lab-wide (meta)data, such as the lab-format. In the old lab format, this is stored in every entry. This makes it slightly more difficult to check if we are dealing with a compatible lab. Consider if you use an old lintian with the new lab style - they do not store the entries in the same place, so it has no reliable way to detect that it is not compatible. I would prefer that an old lintian would always be able to say "The lab uses a newer lab-format than this version of lintian supports" - even if this case will probably never happen.

I am also wondering what we need in the per-entry lintian-status file. In the master branch, we store Lintian-Version, Lab-Format, Package (name), Version (package), Type (package) and Timestamp.
When we read the status file, we compare lab-format, package version and timestamp. With the changes in the lab-refactor branch, the lab always supports multiple versions of the same package, thus the package version comparison is a no-op.

As I understand it, the timestamp is there to make lintian re-unpack the package if it changed since the last run. Currently it completely removes the entry if the timestamp does not match. Though this code only makes sense for personal static labs - in the lintian.d.o case, the version of a package can not be reused (at least not in general). The timestamp-part is not in the lab-refactor branch (yet?).

I am considering replacing the Lab-format value with an entry-format-version. Not sure it makes sense, but I think it may make migration to newer formats easier. If I had not (ab)used the opportunity to do optimizations in the .lintian-status file (see below), the migration from the current to the new lab-format would basically just have been a bunch of mv X Y + updating info/*.

Finally, I have added a Collections entry to the .lintian-status file. This is used to keep track of which collections have been run and removes the need for .$coll-$ver files. This will reduce our (expected) file-creation from 18 to 1 per binary package[4]. For a full mirror run, 18 files per binary package roughly translates to 630 000 files[5]. For udebs and sources we go from 10 and 8 to 1.

So to sum it up:

 * I am repurposing $lab/info/* files to be a manifest of what is in the lab (rather than what is on the mirror).
 * I am breaking dist search and suggest we create a separate frontend for archive-checks that supports dist search.
 * I am considering adding a metadata file in $lab/info/ to include stuff like Lab format version.
 * I have removed data from the (per-entry) .lintian-status files.
 * The (per-entry) .$coll-$ver files will be removed and the .lintian-status file will track those.

Any comments?
If not, I will (hopefully) get the branch ready to be merged into master within 2-3 weeks - so if you have not reviewed the branch yet, now would be a good time to start. :)

~Niels

[1] That would be single package checks:

      lintian $pkg

    but also simple static-lab usage:

      lintian --lab $lab --setup-static
      lintian --lab $lab --unpack $pkg[,..., $pkgN]
      lintian --lab $lab -r $pkg[,..., $pkgN]

    etc.

[2] The "check packages from mirror" search, i.e.

      lintian --lab $lab $pkg[,...,$pkgN]

    will first check the mirror and then fall back to the lab. I suggest we only check the lab in this case.

[3] http://lists.debian.org/debian-lint-maint/2011/08/msg00285.html

[4] 17 binary collections + 1 lintian status file.

[5] Assumes 35 000 binary packages. Though currently only 576 000 files are created due to the file system limitations (~32 000 binary packages).
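The file counts in footnotes [4] and [5] can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function name marker_files is made up here, and the package counts are the rough estimates quoted in the mail, not measured values.

```python
# Rough file-count arithmetic from footnotes [4] and [5].
# All names are illustrative; the numbers are the estimates quoted in the mail.

FILES_PER_BINARY_OLD = 17 + 1  # 17 binary collections + 1 lintian status file
FILES_PER_BINARY_NEW = 1       # only the status file, which gains a Collections field


def marker_files(packages: int, files_per_package: int) -> int:
    """Return the number of bookkeeping files for a given number of packages."""
    return packages * files_per_package


full_mirror_old = marker_files(35_000, FILES_PER_BINARY_OLD)
limited_old = marker_files(32_000, FILES_PER_BINARY_OLD)
full_mirror_new = marker_files(35_000, FILES_PER_BINARY_NEW)

print(full_mirror_old)  # 630000
print(limited_old)      # 576000
print(full_mirror_new)  # 35000
```

Under these assumptions, folding the per-collection marker files into the status file would leave one bookkeeping file per binary package instead of eighteen.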
Bug#641468: lintian: update the lab layout (i.e. use pools)
On 2011-09-14 12:34, Niels Thykier wrote:
On 2011-09-13 19:21, Jakub Wilk wrote:
* Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

Just to shed more light on what the problem is:

  $ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
  Device: 807h/2055d Inode: 7512069 Links: 32000

On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the number of hard links, so we can't create more directories in binary/. ext4 doesn't have this limitation, so a work-around would be to convert the filesystem.

Upgrading to ext4 might be a solution, but I personally think that changing the Lab layout is the right thing(tm) to do in this case. Considering we want derivatives to do archive-wide Lintian runs, it may be prudent to be file-system agnostic.

Also, we can (ab)use this opportunity to enable multi-version + multi-architectures in static labs as well. Hopefully we can also clean up the Lab API while we are at it. XD

~Niels

Okay, so before diving into this - can anyone elaborate on the current Lab design? The User Manual does not give me a lot here. I am asking because I want to know if there is something we should pay attention to when working on this.

Beyond this I have been spending some time looking at the situation and possible solutions and extensions. My basic idea does not affect the layout of entries/unpacked packages (i.e. collections should be unaffected).

It appears that we have been bumping the LAB_FORMAT about once a year the last two years (in 01-2010 for changes and 03-2003 due to many recent changes[LFB]). Hopefully format 11 can last far longer than a year.
:)

So, a simple solution is to use a mirror-like pool, so something like:

  $LINTIAN_LAB/pool/l/lintian/lintian_2.5.3_all_binary/
  $LINTIAN_LAB/pool/l/lintian/lintian_2.5.3_source/

The last path component would be ${name}_${version}(_${arch})_${type}, where the ${arch} part is not relevant for source. Not sure what to do about changes and architecture though (as they may have multiple architectures).

The above has the advantage of trivially allowing multiple versions (and architectures) of the same package in the pool.

I am thinking this would be a good time to make the Lab maintain its own state files (info/$type-packages). This implies updating the state files when adding or removing a package from the lab. If the lab maintains these, we will most likely have an easier time providing a sane API for accessing packages in the Lab.

If we go down this route I would probably use the opportunity to empty the standards-version and the architecture field in info/source-packages. Furthermore, each entry in these files has enough information to create a Lintian::Processable. If we extend L::Processable we could be looking at a standard way of requesting and submitting packages to the Lab.

I am also considering the option of creating a migration utility that could upgrade a LAB_FORMAT 10 to 11. Since we have gotten rid of the unpack scripts and cross-package symlinks, this change will basically just be a bunch of mkdir followed by a ton of mv (and updating the Lab-Format in the .lintian-status file). Admittedly lintian.d.o probably has relatively little to gain from it as we will need to do a full run once this bug is fixed anyway.

~Niels

[LFB] Commits: fc10b608 8b1f0cfe
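The pool naming scheme proposed in this message can be sketched in a few lines. This is a minimal illustration, not Lintian's actual code: the function name pool_entry and the /srv/lab path are invented for the example, bucketing by the first letter of the package name only is an assumption, and the open question about changes files with multiple architectures is not addressed.

```python
from pathlib import PurePosixPath
from typing import Optional


def pool_entry(lab: str, name: str, version: str, pkg_type: str,
               arch: Optional[str] = None) -> PurePosixPath:
    """Return <lab>/pool/<first letter>/<name>/<name>_<version>[_<arch>]_<type>."""
    parts = [name, version]
    # The arch part is not relevant for source packages, so it is left out there.
    if arch is not None and pkg_type != "source":
        parts.append(arch)
    parts.append(pkg_type)
    # Assumption: bucket by the first letter of the package name only.
    return PurePosixPath(lab, "pool", name[0], name, "_".join(parts))


binary_dir = pool_entry("/srv/lab", "lintian", "2.5.3", "binary", arch="all")
source_dir = pool_entry("/srv/lab", "lintian", "2.5.3", "source")
print(binary_dir)
print(source_dir)
```

Because version and architecture are part of the directory name, several versions of the same package get distinct entries under one per-package directory, which is the property the proposal relies on.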
Bug#641468: lintian: update the lab layout (i.e. use pools)
On 2011-09-13 19:21, Jakub Wilk wrote:
* Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

Just to shed more light on what the problem is:

  $ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
  Device: 807h/2055d Inode: 7512069 Links: 32000

On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the number of hard links, so we can't create more directories in binary/. ext4 doesn't have this limitation, so a work-around would be to convert the filesystem.

Upgrading to ext4 might be a solution, but I personally think that changing the Lab layout is the right thing(tm) to do in this case. Considering we want derivatives to do archive-wide Lintian runs, it may be prudent to be file-system agnostic.

Also, we can (ab)use this opportunity to enable multi-version + multi-architectures in static labs as well. Hopefully we can also clean up the Lab API while we are at it. XD

~Niels
Bug#641468: lintian: update the lab layout (i.e. use pools)
Package: lintian
Severity: important

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

~Niels

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Bug#641468: lintian: update the lab layout (i.e. use pools)
* Niels Thykier ni...@thykier.net, 2011-09-13, 18:04:

Jakub realized that a lot of our errors on lintian.d.o are caused by limitations in the file-system. We should probably use a pool or something similar to reduce the number of elements in each directory.

Just to shed more light on what the problem is:

  $ stat /srv/lintian.debian.org/laboratory/binary/ | grep Links
  Device: 807h/2055d Inode: 7512069 Links: 32000

On an ext3 filesystem, at least in squeeze, 32K is a hard limit on the number of hard links, so we can't create more directories in binary/. ext4 doesn't have this limitation, so a work-around would be to convert the filesystem.

--
Jakub Wilk
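The check done with stat above can also be scripted. The following is a rough sketch under stated assumptions: link_headroom is a hypothetical helper name, the 32000 figure is the ext3 limit quoted in this message, and it only means something on filesystems where each subdirectory adds one link to its parent (as on ext3); some other filesystems do not report subdirectories through the link count.

```python
import os

EXT3_LINK_MAX = 32000  # limit quoted in the mail for ext3 on squeeze


def link_headroom(directory: str, limit: int = EXT3_LINK_MAX) -> int:
    """Return how many more links the directory can take before hitting limit.

    Uses the same "Links" value that stat(1) prints, read via os.stat().
    """
    return limit - os.stat(directory).st_nlink


# Example: inspect the current directory. For the lab this would be
# something like the binary/ directory shown in the stat output above.
print(link_headroom("."))
```

For the directory in the stat output above (Links: 32000) this would report 0, which matches the observation that no further subdirectories can be created there.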