Re: [Reproducible-builds] proposal: store information in one place instead of multiple ones

2015-07-30 Thread Jérémy Bobbio
Johannes Schauer:
> here are several questions I have which, for me, boil down to information being
> duplicated and stored in different locations, leading to possible confusion for
> contributors and added work when adding new bugs and issues:

Before I go further with answering: it seems you assume there are
well-thought-out reasons for the current state of things. For most of
your questions, that is not the case. Things grew organically from
experiments and from different people making things better when they saw
they could.

> 1. Why is the set of bts usertags different from the set of r-b issues? The bts
>    usertags seem to be way more broad.

That was their point initially. I wanted to be able to compile statistics
on which classes of issues were most prevalent.

>    A solution would be to ditch the current usertags and use the issue names
>    instead. This would allow a one-to-one mapping between issue and bug number.

This would make creating a new issue much harder. Usertags are not a
nice part of the BTS to interact with. We have been adding a couple of
issues every week for a good while. See the weekly reports.

> 2. Why does packages.yml store the bug number(s) for each package? This
>    information can easily be retrieved from the bts and then will also not be
>    outdated. packages.yml easily lags behind the actual bts information if not
>    regularly updated by someone.

packages.yml was meant to be self-contained at first. Some bugs
affecting reproducibility might not be reproducibility issues per se.

> 3. Why are the issues explained in issues.yml *and* in the wiki? There should
>    be one canonical place to describe them because currently, any new issue
>    that is identified requires editing multiple resources and then linking
>    between the two. This not only requires more work when creating the issue
>    but when looking up issues it is also unclear which resource is the
>    authoritative one and which one will give the desired information. Instead,
>    the information should be stored in one place only.

Here I can see a real reason: they have different audiences. issues.yml
is mainly for people involved in the whole effort, whereas the wiki page
should be accessible to maintainers of a single package. Some issues are
systemic and individual maintainers need not really care about these.

The wiki has a richer syntax and makes for nicer pages.

> So my proposal is:
> 
> 1. Instead of using the current usertags "toolchain", "infrastructure",
>    "timestamps" and so on, use issue names instead.
> 
>    Since each bug can have multiple usertags, the old tags could even be kept
>    and the issue names be added in addition.
> 
>    Since packages.yml exists, much of this conversion could probably be even
>    automated (except for packages with more than one bug open for them).
> 
>    Sometimes, reproducibility problems only affect a single package and in that
>    case it would create too much overhead to create a new issue for it. But in
>    that case, why not just create a dummy issue just for the purpose of
>    associating this kind of bug with the reproducible builds team?

I tend to feel this would be much less flexible than how we currently do
things. We don't have an issue for every single type of patch.

> 2. Do not add bug numbers to packages.yml. The bts already stores the
>    information which source package has which bugs filed by the reproducible
>    builds team.

That means we have to tag every bug that affects the build in our
environment. I don't like the idea that much, but since Faux started
adding `ftbfs`, I guess this opened the gates.

> 3. Use the wiki only to describe issues and ditch issues.yml. The advantages
>    are that the Debian wiki offers a much richer syntax and is also editable by
>    everybody in Debian and not only the reproducible builds team.

Creating a page on the wiki is much more work than adding a couple of
lines in issues.yml. Categorizing issues is not a super-fun task, and
the less friction there is, the better. I've caught myself being lazy:
even when I saw a pattern, I did not create an issue straight away
because I wanted to avoid interacting with the wiki.

> 4. After this is done, it is hard to say why the notes.git is useful in the
>    first place. The content of issues.yml is described in the Debian wiki and
>    the bug numbers are stored in the bts. One last task of packages.yml would
>    probably be to store some tiny notes for packages for which there doesn't
>    exist a bug. But I'd say to also move these notes into the bts. I think that
>    filing a bug about a package's unreproducibility should be done even without
>    having a fix for it. In fact many packages with such bugs exist simply for
>    the reason that at the time the bug was filed, jenkins did fewer checks than
>    it does now, so the patch which is currently in the bts does not make the
>    package fully reproducible anymore. Furthermore, storing th

[Reproducible-builds] proposal: store information in one place instead of multiple ones

2015-07-28 Thread Johannes Schauer
Hi,

here are several questions I have which, for me, boil down to information being
duplicated and stored in different locations, leading to possible confusion for
contributors and added work when adding new bugs and issues:

1. Why is the set of bts usertags different from the set of r-b issues? The bts
   usertags seem to be way more broad. For example there is the usertag
   "timestamps" which matches many issues. It is currently impossible to get a
   machine-readable mapping from bug number to the issue it fixes because the
   usertags are much too broad.

   A solution would be to ditch the current usertags and use the issue names
   instead. This would allow a one-to-one mapping between issue and bug number.

   What is the utility of the current usertags? Are they used for anything?

   Currently I'd rather say that it's confusing to have a package associated
   with two disjoint sets of tags: the usertags and the issues. Why is that
   useful?

   Another helpful thing would be if the bug subject line wasn't as generic
   (i.e. if it was not just copied verbatim from the template) but that's
   another problem.

2. Why does packages.yml store the bug number(s) for each package? This
   information can easily be retrieved from the bts and then will also not be
   outdated. packages.yml easily lags behind the actual bts information if not
   regularly updated by someone.

3. Why are the issues explained in issues.yml *and* in the wiki? There should
   be one canonical place to describe them because currently, any new issue
   that is identified requires editing multiple resources and then linking
   between the two. This not only requires more work when creating the issue
   but when looking up issues it is also unclear which resource is the
   authoritative one and which one will give the desired information. Instead,
   the information should be stored in one place only.

So my proposal is:

1. Instead of using the current usertags "toolchain", "infrastructure",
   "timestamps" and so on, use issue names instead.

   Since each bug can have multiple usertags, the old tags could even be kept
   and the issue names be added in addition.

   Since packages.yml exists, much of this conversion could probably be even
   automated (except for packages with more than one bug open for them).

   Sometimes, reproducibility problems only affect a single package and in that
   case it would create too much overhead to create a new issue for it. But in
   that case, why not just create a dummy issue just for the purpose of
   associating this kind of bug with the reproducible builds team?

2. Do not add bug numbers to packages.yml. The bts already stores the
   information which source package has which bugs filed by the reproducible
   builds team.

3. Use the wiki only to describe issues and ditch issues.yml. The advantages
   are that the Debian wiki offers a much richer syntax and is also editable by
   everybody in Debian and not only the reproducible builds team.

4. After this is done, it is hard to say why the notes.git is useful in the
   first place. The content of issues.yml is described in the Debian wiki and
   the bug numbers are stored in the bts. One last task of packages.yml would
   probably be to store some tiny notes for packages for which there doesn't
   exist a bug. But I'd say to also move these notes into the bts. I think that
   filing a bug about a package's unreproducibility should be done even without
   having a fix for it. In fact many packages with such bugs exist simply for
   the reason that at the time the bug was filed, jenkins did fewer checks than
   it does now, so the patch which is currently in the bts does not make the
   package fully reproducible anymore. Furthermore, storing these notes in the
   bts might make the package maintainer aware of the issue and gives them a
   chance to comment on these notes. I would say it gives maintainers more
   incentive to react to the issue themselves that way.
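
To make the automated conversion from proposal point 1 a bit more concrete,
here is a rough Python sketch. It assumes packages.yml has already been loaded
into a dict that maps each package name to optional "issues" and "bugs" lists.
That layout, the function name plan_usertags, and the sample package names,
issue names and bug numbers are all made up for illustration and are not taken
from notes.git. Packages with more than one bug are set aside for manual
handling, as mentioned above.

```python
def plan_usertags(packages):
    """Split packages into automatic usertag assignments and manual cases.

    `packages` maps a package name to a dict with optional "issues" and
    "bugs" lists (an assumed layout, not necessarily the real packages.yml).
    """
    automatic = {}  # bug number -> sorted issue names to add as usertags
    manual = []     # packages with several bugs: ambiguous, needs a human
    for name, note in sorted(packages.items()):
        bugs = note.get("bugs", [])
        issues = note.get("issues", [])
        if not bugs or not issues:
            continue  # nothing to tag for this package
        if len(bugs) > 1:
            manual.append(name)
            continue
        automatic[bugs[0]] = sorted(issues)
    return automatic, manual


# Hypothetical sample data, for illustration only.
sample = {
    "foo": {"issues": ["issue_a"], "bugs": [100001]},
    "bar": {"issues": ["issue_a", "issue_b"], "bugs": [100002, 100003]},
    "baz": {"issues": ["issue_b"]},
}
automatic, manual = plan_usertags(sample)
print(automatic)
print(manual)
```

The sketch only computes which issue names would be attached to which bug; it
does not talk to the bts at all. Sending the actual usertag changes would be a
separate step.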

On IRC the following problems were raised:

 - "you cannot grep packages.yml if it's in the bts because query-ing it takes
   ages"

 * true, but then just cache it. In fact, that's what you are already doing
   semi-manually in the notes.git by running the clean-notes script over
   it. So instead of having to wait until somebody runs clean-notes and
   then does `git add packages.yml && git commit && git push` from time to
   time, how about automating this process and then publishing a fully
   machine generated packages.yml which can then be grepped?

 - "there is more information in packages.yml than in the bts"

 * true, and I think that's a bug. By having this information in
   packages.yml and through that on reproducible.d.n, you are not informing
   the maintainer of the package about the little info you just found
   analyzing their package for reproducibility issues. Instead, I think you
   should file a bug and write the small information