A new metric for source package importance in ports

2013-11-27 Thread Johannes Schauer
Hi,

the following is a report on a successful implementation of what I discussed
with Niels Thykier during DebConf13. The question was how important it is for
a source package to be compilable, or to exist in the first place, given an
incomplete port which is in the process of being bootstrapped. This work
serves a different purpose than the identification of key packages by Lucas
Nussbaum [1]. Instead of attaching a binary value to each source package, this
method associates an integer value with each of them. Once bootstrapping of
the whole archive becomes more important, or even possible in real life
through an implementation of build profiles, this heuristic could also be used
to extend the meaning of key packages.

This heuristic attaches to each source package A the number of source packages
which need A to be compilable so that they become compilable themselves. The
dependency graph which is needed to extract this information is conveniently
created by the service I run as http://bootstrap.debian.net - I'm using a
simple Python script to walk this graph to extract the information.
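
A minimal sketch of such a count, assuming the graph is available as a plain
mapping from each source package to the source packages it needs. The function
name, the data layout and the toy package names are illustrative assumptions,
not the actual script or the format served by bootstrap.debian.net:

```python
from collections import deque


def count_dependents(needs):
    """Map each source package A to the number of other source packages
    that directly or transitively need A to be compilable.

    `needs` maps a source package to the set of source packages it needs
    (a hypothetical in-memory form of the dependency graph).
    """
    # Invert the edges: needed_by[a] holds the packages that directly need a.
    needed_by = {pkg: set() for pkg in needs}
    for pkg, deps in needs.items():
        for dep in deps:
            needed_by.setdefault(dep, set()).add(pkg)

    counts = {}
    for start in needed_by:
        # Breadth-first walk over the inverted edges.
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for parent in needed_by[current]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        counts[start] = len(seen) - 1  # the package itself is not counted
    return counts


# Invented toy graph, not real archive data.
toy = {
    "app": {"libfoo"},
    "libfoo": {"compiler"},
    "tool": {"compiler"},
    "compiler": set(),
}
counts = count_dependents(toy)
```

Here `counts["compiler"]` is 3, because `app`, `libfoo` and `tool` all need it
directly or indirectly. Whether a package should count itself is a design
choice; this sketch leaves it out.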

In fact, that Python script uses two different graphs. Since dependencies
contain disjunctions, there exist different choices of packages which have to
be available for something to be compilable or installable. To avoid making
this choice arbitrarily, I calculate the minimum number of dependencies that
have to be available (strong dependencies) and the maximum number that have to
be available (dependency closure). Each source package A is therefore
associated with two numbers: the minimum number of source packages which
depend on A being compilable and the maximum number of source packages which
depend on A being compilable.
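
The difference between the two bounds can be illustrated with a simplified
sketch. It assumes each package lists its dependencies as clauses of
alternatives. For the lower bound it only follows clauses with a single
alternative, which is a conservative approximation of strong dependencies (a
real computation needs a solver and may find more). For the upper bound it
follows every alternative. All names and data are invented for illustration:

```python
def reachable(graph, start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(start)
    return seen


def dependency_bounds(depends):
    """Return (lower, upper) dependency sets for every package.

    `depends` maps a package to a list of clauses; each clause is a list of
    alternatives of which one has to be available.
    """
    # Lower bound: a clause with one alternative leaves no choice.
    forced = {
        pkg: {clause[0] for clause in clauses if len(clause) == 1}
        for pkg, clauses in depends.items()
    }
    # Upper bound: pretend every alternative of every clause is needed.
    any_alternative = {
        pkg: {alt for clause in clauses for alt in clause}
        for pkg, clauses in depends.items()
    }
    lower = {pkg: reachable(forced, pkg) for pkg in depends}
    upper = {pkg: reachable(any_alternative, pkg) for pkg in depends}
    return lower, upper


# Invented toy data: "app" needs libssl and one of two mail transport agents.
depends = {
    "app": [["libssl"], ["mta-a", "mta-b"]],
    "libssl": [["libc"]],
    "mta-a": [["libc"]],
    "mta-b": [["libdb"]],
    "libc": [],
    "libdb": [],
}
lower, upper = dependency_bounds(depends)
```

For `app` the lower bound contains only `libssl` and `libc`, while the upper
bound contains all five other packages.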

To give the numbers more than purely syntactic meaning, I also added popcon
information to the output. I associate with each source package A the sum of
the popcon values of all source packages which depend on A being compilable.
Again, this is done for the minimum as well as the maximum.
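
As a sketch of this weighting, assuming the sets of dependent source packages
have already been computed for both graphs and that popcon values are available
as a simple mapping (both are assumptions, and all numbers are invented):

```python
def popcon_sums(dependents, popcon):
    """Sum the popcon values of the packages that depend on each package.

    `dependents` maps a source package to the set of source packages that
    need it; `popcon` maps a source package to its popcon value. Packages
    without a popcon entry count as 0 (an assumption of this sketch).
    """
    return {
        pkg: sum(popcon.get(dep, 0) for dep in deps)
        for pkg, deps in dependents.items()
    }


# Invented toy data, not taken from the real archive.
popcon = {"app": 120, "tool": 30, "libfoo": 500}
min_dependents = {"libfoo": {"app"}, "app": set(), "tool": set()}
max_dependents = {"libfoo": {"app", "tool"}, "app": set(), "tool": set()}

min_sums = popcon_sums(min_dependents, popcon)
max_sums = popcon_sums(max_dependents, popcon)
```

In this toy data `libfoo` gets a minimum sum of 120 and a maximum sum of 150.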

So here is the (tab-delimited) data in no particular order:

http://mister-muffin.de/p/pVxb.txt

1st column: the name of the source package
2nd column: minimum number of source packages which need this source package
            to be compilable
3rd column: maximum number of source packages which need this source package
            to be compilable
4th column: minimum sum of popcon values
5th column: maximum sum of popcon values
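
A small sketch for reading data in this layout. It assumes there is no header
line and that columns two to five are integers; the type and function names
are made up for illustration and the sample rows are invented:

```python
from typing import NamedTuple


class Row(NamedTuple):
    source: str
    min_count: int
    max_count: int
    min_popcon: int
    max_popcon: int


def parse_rows(text):
    """Parse tab-delimited lines with the five columns described above."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, min_count, max_count, min_popcon, max_popcon = line.split("\t")
        rows.append(
            Row(name, int(min_count), int(max_count),
                int(min_popcon), int(max_popcon))
        )
    return rows


# Invented sample input, not taken from the published file.
sample = "foo\t1\t2\t10\t20\nbar\t0\t0\t0\t0\n"
rows = parse_rows(sample)
```

The resulting list can then be sorted by any column, for example with
`sorted(rows, key=lambda row: row.min_count, reverse=True)`.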

Do you see any obvious error?

When sorting the data by the second column, you will see that there are 1194
source packages with the same value: 19554. This value corresponds to the total
number of source packages. It means that everything else depends on these 1194
source packages being compilable. If those 1194 source packages are not
compilable, then neither is the rest. Remember that this is only true in a
bootstrapping scenario. These 1194 source packages are also all part of the
same strongly connected component of the strong srcgraph and roughly
correspond to the smallest set of packages which are needed for a self-hosting
Debian system. We call a set of binary and source packages self-hosting if all
binary packages can be created from the source packages and all source packages
can be compiled with just the available binary packages. In my opinion it would
make sense to add all packages which are minimally required to make Debian
self-hosting to the set of key packages by extending the definition by Lucas
Nussbaum at [1].
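
The definition of self-hosting can be written down as a small check. This is
only a sketch: it assumes that the binary packages each source builds and the
binary packages each source needs for compilation are given as flat sets, with
disjunctions and transitive runtime dependencies already resolved. All names
are hypothetical:

```python
def is_self_hosting(sources, binaries, builds, build_needs):
    """Check the two conditions of the self-hosting definition.

    `builds` maps a source package to the binary packages it produces;
    `build_needs` maps a source package to the binary packages that must be
    available to compile it.
    """
    produced = set()
    for src in sources:
        produced |= builds.get(src, set())
    # Every binary package can be created from one of the source packages.
    all_binaries_buildable = binaries <= produced
    # Every source package compiles with just the available binary packages.
    all_sources_compilable = all(
        build_needs.get(src, set()) <= binaries for src in sources
    )
    return all_binaries_buildable and all_sources_compilable


# Invented toy data.
builds = {"cc-src": {"cc"}, "libc-src": {"libc"}}
build_needs = {"cc-src": {"cc", "libc"}, "libc-src": {"cc"}}

complete = is_self_hosting({"cc-src", "libc-src"}, {"cc", "libc"},
                           builds, build_needs)
incomplete = is_self_hosting({"cc-src"}, {"cc", "libc"}, builds, build_needs)
```

The first set is self-hosting. The second is not, because nothing in it
produces `libc`.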

The number of source packages which are needed to bootstrap themselves and all
the rest of Debian is that high because it includes source packages which are
only included because of the arch:all binary packages they build, because of
the essential:yes packages they build or because of the build-essential
packages they build. While it is important to include these for rebuilds of the
whole archive, they are not important in a real bootstrap situation. Arch:all
binary packages already exist and do not need to be bootstrapped, and to start
compiling packages natively, a minimal build system (essential:yes +
build-essential) is required in the first place. Therefore I created a
different graph which takes into account that arch:all packages as well as the
packages of the minimal build system do not need to be rebuilt:

http://mister-muffin.de/p/Gid8.txt
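
The effect of that second graph can be sketched as follows, assuming build
dependencies are given as sets of binary packages and that it is known which
source package builds each binary. The function and all package names are
invented for illustration and do not describe the actual tooling:

```python
def source_edges(build_needs, built_by, preexisting=frozenset()):
    """Turn build dependencies on binary packages into source-to-source edges.

    Binary packages listed in `preexisting` (for example arch:all packages
    and the minimal build system) are assumed to be available already, so
    they do not create an edge to the source package that builds them.
    """
    edges = {}
    for src, needed_binaries in build_needs.items():
        edges[src] = {
            built_by[binary]
            for binary in needed_binaries
            if binary not in preexisting and binary in built_by
        }
    return edges


# Invented toy data.
build_needs = {"app": {"libfoo-dev", "docs-common", "cc"}}
built_by = {"libfoo-dev": "libfoo", "docs-common": "docs", "cc": "cc-src"}

full = source_edges(build_needs, built_by)
pruned = source_edges(build_needs, built_by,
                      preexisting={"docs-common", "cc"})
```

In the full graph `app` needs three source packages. In the pruned graph only
`libfoo` remains.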

One can see that the number of source packages needed to build the rest of the
archive is now only 383. It is important that these source packages remain
compilable (in addition to essential:yes + build-essential being
cross-compilable) because otherwise a bootstrap of any new architecture cannot
be done. The service at http://bootstrap.debian.net will indicate that an
architecture is not bootstrappable at all if they are not.
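
One way to read such a figure off the data is to find the largest value in the
second column and count the packages that share it. A sketch, assuming the
rows have been reduced to (name, minimum count) pairs; the sample values are
invented:

```python
def packages_at_maximum(rows):
    """Return the largest count and the sorted names of packages sharing it.

    `rows` is a non-empty list of (source package name, count) pairs.
    """
    top = max(count for _name, count in rows)
    names = sorted(name for name, count in rows if count == top)
    return top, names


# Invented toy rows, not taken from the published files.
rows = [("a", 7), ("b", 3), ("c", 7), ("d", 0)]
top, names = packages_at_maximum(rows)
```

Here `top` is 7 and `names` is `["a", "c"]`. On the real data, `len(names)`
would be the number of packages that everything else depends on.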

Does anybody see enough value in these numbers for source package importance in
the light of bootstrapping Debian (either for a new port or for rebuilding the
archive from scratch)? If so, then I can generate these 

Re: A new metric for source package importance in ports

2013-11-27 Thread Steven Chamberlain
Hi josch!

On 27/11/13 17:58, Johannes Schauer wrote:
> http://mister-muffin.de/p/Gid8.txt
>
> One can see that now the amount of source packages which is needed to build
> the rest of the archive is only 383.

So, there are 383 packages that share the same, maximum value (in this
case 11657) in the second column?

> Does anybody see enough value in these numbers for source package importance
> in the light of bootstrapping Debian (either for a new port or for rebuilding
> the archive from scratch)?

I find the list of 383 packages interesting, at least.  I think this
closure is what I had in mind[0] for regular testing of ports'
toolchains and reproducibility of builds.  Because each Debian port
depends in some indirect way on the authenticity of these packages.  And
likewise any toolchain bugs are most critical here.  I just didn't think
there would be so many packages.

Does the list vary by architecture?  I see many odd things in here such
as 'systemd' and 'redhat-cluster' which would be unavailable if trying
to bootstrap a non-Linux port, for example.

I also find it interesting to see openjdk-7 listed but not gcj;  or even
gcc-4.8.  Was this computed for jessie or sid?

[0]: http://lists.debian.org/5266df9d.9040...@pyro.eu.org

Regards,
-- 
Steven Chamberlain
ste...@pyro.eu.org


-- 
To UNSUBSCRIBE, email to debian-sparc-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/529688a8.8080...@pyro.eu.org



Re: A new metric for source package importance in ports

2013-11-27 Thread peter green

Johannes Schauer wrote:

> Hi,
>
> the following is a report of a successful implementation of what I have been
> talking about with Niels Thykier during debconf13. The question was how
> important it is for a source package to be compilable or exist in the first
> place given an incomplete port which is in the process of being bootstrapped.
> This work is solving a different purpose than the identification of key
> packages by Lucas Nussbaum [1]. Instead of attaching a binary value to each
> source package, this method is associating integer values to them. Once
> bootstrapping of the whole archive becomes more important or even possible in
> real life through an implementation of build profiles, this heuristic could be
> used to further extend the meaning of key packages as well.

One problem with these metrics is that you get source packages whose
importance is artificially inflated because of the way our source
packages work. If anything in a source package needs x then the whole
source package has to build-depend on x, even if x is only needed for
some peripheral functionality that could easily be removed in the event
that x was unavailable (either on a particular port or in general). In
the case of libraries there may be a binary dependency too for rarely
used functionality.

For example some of the mesa libraries drag in libwayland0, which means
wayland ends up with a very high importance even though afaict hardly
anyone uses it right now.

Another big example is languages. Lots of packages build language
bindings for lots of languages, dragging those languages into the
important set.

So these metrics are a good guide to which packages are unimportant,
but determining whether a package is really important or just
pseudo-important still requires human judgement.





Re: A new metric for source package importance in ports

2013-11-27 Thread Dmitrijs Ledkovs
On 28 November 2013 00:04, Steven Chamberlain ste...@pyro.eu.org wrote:
> Hi josch!
>
> On 27/11/13 17:58, Johannes Schauer wrote:
> > http://mister-muffin.de/p/Gid8.txt
> >
> > One can see that now the amount of source packages which is needed to build
> > the rest of the archive is only 383.
>
> So, there are 383 packages that share the same, maximum value (in this
> case 11657) in the second column?
>
> > Does anybody see enough value in these numbers for source package
> > importance in the light of bootstrapping Debian (either for a new port or
> > for rebuilding the archive from scratch)?
>
> I find the list of 383 packages interesting, at least.  I think this
> closure is what I had in mind[0] for regular testing of ports'
> toolchains and reproducibility of builds.  Because each Debian port
> depends in some indirect way on the authenticity of these packages.  And
> likewise any toolchain bugs are most critical here.  I just didn't think
> there would be so many packages.
>
> Does the list vary by architecture?  I see many odd things in here such
> as 'systemd' and 'redhat-cluster' which would be unavailable if trying
> to bootstrap a non-Linux port, for example.
>
> I also find it interesting to see openjdk-7 listed but not gcj;  or even
> gcc-4.8.  Was this computed for jessie or sid?

I guess implicit relationships are not considered: build-essential
build-dependencies, and essential dependencies. I would expect
packages in those two sets to have the highest rank, since,
hypothetically, all packages in Debian build-depend and depend on those.

Regards,

Dmitrijs.





Re: A new metric for source package importance in ports

2013-11-27 Thread Leslie S Satenstein
Instead of dwelling on this discovery, which is not productive, why not
concentrate on what to do to improve Debian?

The analysis has shown faults. Has Debian stopped working? Has the world
crashed?

The problems have been identified, and the patches to address the issues are
being evaluated and planned for retesting.

By January 15, 2014, Debian, Ubuntu, SUSE13.1, Fedora, RedHat, and probably
every distribution that has an old or recent kernel will be corrected.

So, what's the next topic?


 
Regards 

 Leslie

Mr. Leslie Satenstein
An experienced Information Technology specialist.
Yesterday was a good day, today is a better day,
and tomorrow will be even better.
lsatenst...@yahoo.com
SENT FROM MY OPEN SOURCE LINUX SYSTEM.











Re: A new metric for source package importance in ports

2013-11-27 Thread Johannes Schauer
Hi,

Quoting peter green (2013-11-28 01:12:57)
> One problem with these metrics is that you get source packages whose
> importance is artificially inflated because of the way our source packages
> work. If anything in a source package needs x then the whole source package
> has to build-depend on x.  Even if x is only needed for some peripheral
> functionality that could easily be removed in the event that x was
> unavailable (either on a particular port or in general). In the case of
> libraries there may be a binary dependency too for rarely used functionality.
>
> For example some of the mesa libraries drag in libwayland0 which means
> wayland ends up with a very high importance even though afaict hardly
> anyone uses it right now.
>
> Another big example is languages. Lots of packages build language
> bindings for lots of languages dragging those languages into the
> important set.
>
> So these metrics are a good guide to what packages are unimportant
> but to determine whether a package is really important or just
> pseudo-important still requires human judgement.

Correct.

The situation can be greatly improved once build profiles allow build
dependencies to be marked as less important or non-essential.

cheers, josch

