Bug#848859: FTBFS randomly (failing tests)

2017-01-14 Thread Santiago Vila
> >> My experience is that the buildds that are currently in use cause more
> >> build problems than the packages themselves. BTW, why don't you count this
> >> as RC?
> > 
> > Can you clarify the question? I don't understand what exactly you are referring to.
> 
> It does not make sense to have 100% reliability on the package itself to
> build but only 95% on the buildds. If we want to have reliability, then
> every member of the chain counts -- and at the moment the weakest one, for
> me, does not seem to be the packages themselves, but the buildd setups we
> currently have. So, if you consider unreliable package builds an RC
> problem, you should also consider the unreliable buildd setup an RC
> problem.

Ok, I now understand the question, but now I don't understand the
purpose of the question :-)

First of all, I don't see that the buildds are unreliable, or at least
I don't see that they fail randomly 5% of the time.

In any case, if they were that unreliable, nothing prevents us from
building the same set of packages on another autobuilder farm that is
not the official one.

This is precisely what is done in reproducible builds, what Lucas Nussbaum does,
and what I do as well (on a much smaller scale).

Because we mainly care about the quality of the packages themselves,
which is what we ship to our users, not about the quality of the
official autobuilders, which we don't ship to our users.

In other words, it may be interesting to consider the whole chain
that you mention, but I'm only interested in the packages themselves.

> > We provide software which is free to be modified. If people are going
> > to modify it and then it fails because the package only builds ok on
> > buildd.debian.org, then we are removing one of the essential freedoms,
> > the freedom to modify it, and the package becomes de facto non-free,
> > even if we still distribute it in main.
> 
> First, the multi-CPU buildd setup is free (as in speech), isn't it? So,
> depending on this does not make anything non-free. I don't see why the
> dependency on a specific build environment would change the DFSG
> freeness at all (as long as the build environment itself is DFSG free).
> 
> Then, we are speaking about random failures. If you use a single-core
> machine and see a build failure, you can just pragmatically repeat it.
> 
> Sorry, I don't see why this would restrict DFSG freedom here.

The single-core / multi-core issue was just an example of a possible
reason why a package might build fine on buildd.debian.org and not on my
autobuilders (or the opposite).

But it was just an example. The real problem is when a package fails to
build on every autobuilder except buildd.debian.org and people still say
"it builds fine on buildd.debian.org so it's not RC", *without*
giving the reason why it fails on every other autobuilder.

So this is like having a "build-depends: buildd.debian.org". The
maintainer does not investigate enough and basically says that all
the other autobuilders are wrong "for not being buildd.debian.org".

This is what IMO makes the package non-free: the need to build it
on one of the official autobuilders for *unknown* reasons.

Thanks.



Bug#848859: FTBFS randomly (failing tests)

2017-01-05 Thread Ole Streicher
On 05.01.2017 11:36, Santiago Vila wrote:
> On Thu, Jan 05, 2017 at 11:06:50AM +0100, Ole Streicher wrote:
>> Hi all,
>>
>> On 04.01.2017 20:57, Santiago Vila wrote:
>>> I still want to build all packages and have 0 or 1 failures,
>>> so in this case the probability should be 1/50/2, i.e. 1%.
>>>
>>> I think this is still feasible.
>>
>> My experience is that the buildds that are currently in use cause more
>> build problems than the packages themselves. BTW, why don't you count this
>> as RC?
> 
> Can you clarify the question? I don't understand what exactly you are referring to.

It does not make sense to have 100% reliability on the package itself to
build but only 95% on the buildds. If we want to have reliability, then
every member of the chain counts -- and at the moment the weakest one, for
me, does not seem to be the packages themselves, but the buildd setups we
currently have. So, if you consider unreliable package builds an RC
problem, you should also consider the unreliable buildd setup an RC
problem.

> >> But as long as there are people in this project who consider that a
> >> package which FTBFS on single-CPU machines more than 50% of the time
> >> is ok "because it does not usually fail on buildd.debian.org",
> >> we are doomed.
>>
>> We have 10 archs, and with a 50% chance of failure you will not get any
>> version built, even if the buildds try two or three times.
> 
> It's actually more subtle than that.
> 
> The package may be ready to be built on multi-core machines, but not on
> single-core machines. So, the probability that I measure (greater than 50%)
> might not be the same as the probability on the official autobuilders.

Usually it is the other way around: packages build fine
single-threaded, but don't do so with many threads (or CPUs). In the end
it also depends on how many CPUs you actually get for the build -- if
several packages are built in parallel, the result may be the same as on
a single CPU.

> Then there is the other subtle thing: The package may be Arch:all, in
> which case a 50% probability of failure (this time independent of the number
> of CPUs) may go unnoticed very easily, either because the maintainer
> uploaded the .deb him/herself, or, if the maintainer uploaded
> in source-only form, because we were simply lucky that time.

I would much prefer to ignore any prebuilt binaries and to rebuild
everything anyway.

> We provide software which is free to be modified. If people are going
> to modify it and then it fails because the package only builds ok on
> buildd.debian.org, then we are removing one of the essential freedoms,
> the freedom to modify it, and the package becomes de facto non-free,
> even if we still distribute it in main.

First, the multi-CPU buildd setup is free (as in speech), isn't it? So,
depending on this does not make anything non-free. I don't see why the
dependency on a specific build environment would change the DFSG
freeness at all (as long as the build environment itself is DFSG free).

Then, we are speaking about random failures. If you use a single-core
machine and see a build failure, you can just pragmatically repeat it.

Sorry, I don't see why this would restrict DFSG freedom here.

Best regards

Ole



Bug#848859: FTBFS randomly (failing tests)

2017-01-05 Thread Santiago Vila
On Thu, Jan 05, 2017 at 11:06:50AM +0100, Ole Streicher wrote:
> Hi all,
> 
> On 04.01.2017 20:57, Santiago Vila wrote:
> > I still want to build all packages and have 0 or 1 failures,
> > so in this case the probability should be 1/50/2, i.e. 1%.
> > 
> > I think this is still feasible.
> 
> My experience is that the buildds that are currently in use cause more
> build problems than the packages themselves. BTW, why don't you count this
> as RC?

Can you clarify the question? I don't understand what exactly you are referring to.

> [statistical]
> >> The "fix" for such cases is to increase the threshold or to disable
> >> the test completely, because nothing else can be done due to the
> >> nature of numerical simulations.
> 
> Just disabling would be bad, since the test still can point to some
> problem in the implementation, like optimization problems or such.
> 
> IMO the correct solution here would be to remove the randomness and to
> seed with a fixed value (which is known to give a result within the
> expectations).

Indeed. Sometimes I've asked that in my bug reports: "I would love to
provide a way to reproduce it, but you are using a different seed each
time..."

Removing randomness is certainly the right path to achieve both builds
that always work and also reproducible builds.
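The fixed-seed pattern discussed above can be sketched in a few lines. This is a hypothetical example (the function names and the tolerance are invented for illustration, not taken from any real package): a test of a noisy numerical result that seeds its generator with a fixed value known to give a result within the expected tolerance, so the build outcome becomes deterministic.

```python
import random

def noisy_statistic(rng, n=100):
    # Stand-in for a numerical simulation whose result fluctuates per run.
    return sum(rng.random() for _ in range(n)) / n

def test_statistic_within_threshold():
    # Fixed seed: the outcome is now deterministic, so the build either
    # always passes or always fails -- never randomly.
    rng = random.Random(42)
    result = noisy_statistic(rng)
    # Mean of 100 uniform draws; a 0.15 tolerance is far beyond
    # normal statistical fluctuation for this sample size.
    assert abs(result - 0.5) < 0.15, f"statistic {result} outside threshold"

test_statistic_within_threshold()
```

With a fresh seed each run the same test would fail with some small probability per build, which is exactly the FTBFS-randomly pattern discussed in this bug.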

> > But as long as there are people in this project who consider that a
> > package which FTBFS on single-CPU machines more than 50% of the time
> > is ok "because it does not usually fail on buildd.debian.org",
> > we are doomed.
> 
> We have 10 archs, and with a 50% chance of failure you will not get any
> version built, even if the buildds try two or three times.

It's actually more subtle than that.

The package may be ready to be built on multi-core machines, but not on
single-core machines. So, the probability that I measure (greater than 50%)
might not be the same as the probability on the official autobuilders.

This is one of the reasons I insist so much that the rule "If it does
not fail on official autobuilders, it's not RC" is wrong.

We pride ourselves on being the Universal Operating System. We can't make
multi-core a requirement to build. That would also be mathematically
and logically wrong, because in an abstract sense, building a package
is like following an algorithm. It may be slower or faster, but
it should always succeed regardless of the number of CPU cores.


Then there is the other subtle thing: The package may be Arch:all, in
which case a 50% probability of failure (this time independent of the number
of CPUs) may go unnoticed very easily, either because the maintainer
uploaded the .deb him/herself, or, if the maintainer uploaded
in source-only form, because we were simply lucky that time.

> > The problem I see with this threshold thing is that every maintainer
> > seems to have his own threshold, different from the others.
> > 
> > In case we decide about RC-ness depending on probability of failure:
> > What threshold do you think we should use for a single package and why?
> > 
> > [ I say that 1% of failure is the maximum we should allow, and I've
> >   explained why, but I would love to hear your opinion on this ].
> 
> I would not put any direct number here, but be pragmatic: We need to
> build the package from source, and this has to be done on all supported
> architectures, and on the buildds. As long as this is done, I see no
> reason for an RC bug. [...]

My objection to considering the official buildds the "standard" by which
RC-ness is measured is that it gives packages an implicit
"build-depends: buildd.debian.org".

We provide software which is free to be modified. If people are going
to modify it and then it fails because the package only builds ok on
buildd.debian.org, then we are removing one of the essential freedoms,
the freedom to modify it, and the package becomes de facto non-free,
even if we still distribute it in main.

That's why I think that packages must build ok on any autobuilder which
is not misconfigured, not just on buildd.debian.org.

Thanks.



Bug#848859: FTBFS randomly (failing tests)

2017-01-05 Thread Ole Streicher
Hi all,

On 04.01.2017 20:57, Santiago Vila wrote:
> I still want to build all packages and have 0 or 1 failures,
> so in this case the probability should be 1/50/2, i.e. 1%.
> 
> I think this is still feasible.

My experience is that the buildds that are currently in use cause more
build problems than the packages themselves. BTW, why don't you count this
as RC?

[statistical]
>> The "fix" for such cases is to increase the threshold or to disable
>> the test completely, because nothing else can be done due to the
>> nature of numerical simulations.

Just disabling would be bad, since the test can still point to some
problem in the implementation, like optimization issues or the like.

IMO the correct solution here would be to remove the randomness and to
seed with a fixed value (which is known to give a result within the
expectations).

> But as long as there are people in this project who consider that a
> package which FTBFS on single-CPU machines more than 50% of the time
> is ok "because it does not usually fail on buildd.debian.org",
> we are doomed.

We have 10 archs, and with a 50% chance of failure you will not get any
version built, even if the buildds try two or three times.
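The arithmetic behind that observation can be checked quickly. A sketch under stated assumptions (an independent 50% failure chance per attempt, up to 3 buildd attempts per architecture, 10 architectures -- the retry count comes from the message above, not from any buildd configuration):

```python
# Assumed parameters, taken from the discussion above.
p_fail = 0.5     # per-attempt failure probability
attempts = 3     # buildd retries per architecture
archs = 10       # supported architectures

# An architecture ends up built if at least one attempt succeeds.
p_arch_ok = 1 - p_fail ** attempts   # 1 - 0.5^3 = 0.875
# The version is built everywhere only if all architectures succeed.
p_all_ok = p_arch_ok ** archs        # 0.875^10, roughly 0.26

print(f"per-arch success: {p_arch_ok:.3f}")
print(f"all archs built:  {p_all_ok:.3f}")
```

So even with retries, a coin-flip failure rate leaves only about a one-in-four chance of getting the version built on all architectures.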

> The problem I see with this threshold thing is that every maintainer
> seems to have his own threshold, different from the others.
> 
> In case we decide about RC-ness depending on probability of failure:
> What threshold do you think we should use for a single package and why?
> 
> [ I say that 1% of failure is the maximum we should allow, and I've
>   explained why, but I would love to hear your opinion on this ].

I would not put any direct number here, but be pragmatic: We need to
build the package from source, and this has to be done on all supported
architectures, and on the buildds. As long as this is done, I see no
reason for an RC bug. If a package fails on a supported arch on the
buildds, this is RC, independent of whether this came from an
architectural difference or from a random build failure. Your tests with
repeated builds help the maintainer to find the cause in that case, and
for this they are helpful. But they are not release-critical by themselves.

Cheers

Ole



Bug#848859: FTBFS randomly (failing tests)

2017-01-04 Thread Santiago Vila
On Wed, Jan 04, 2017 at 07:58:43PM +0100, Anton Gladky wrote:
> 2017-01-04 13:26 GMT+01:00 Santiago Vila :
> > No matter how glitch-free the autobuilder you use to build the
> > above package is, it will fail to build once every 147 times on average,
> > mathematically, because the test is wrongly designed.
> 
> That is not always true. If you look at many tests from numerical
> simulation packages, there is usually a "threshold" which the test
> result should not exceed. And the test result varies within
> the limits set by upstream authors. This result
> can be different even on the same machine, running the simulation
> several times. And that is normal.
> [...]

I know what you mean. I've seen it several times in statistical packages.

In my opinion, upstream authors may do as they wish, but Debian aims
at having reproducible builds (at some point in the future). Reproducible
builds means that each time you build the package, the same .deb is created.

This is of course not policy yet, but I see it as fundamentally
incompatible with packages which FTBFS from time to time. If we want
the end result to be always the same, then failing from time to time
(which is sometimes the end result) should never happen.

In other words, if we don't take deterministic builds seriously
(as in "every time I try to build the package, the build does not fail")
how can we expect to be reproducible in the future?

It is interesting, however, what you mention about thresholds,
statistical packages, and simulations, so here is my math
applied to Debian:

Let's say that we have 25000 source packages in stretch, and
I want to build all of them and not have a single failure.

Since, as you point out, there are quite a few statistical
packages with tests based on random numbers, the mathematical
probability that there is some failure will always be > 0.

Ok, then. Let's suppose that I'm happy enough if the expected number
of packages that fail to build is closer to 0 than to 1.

So, let's require the probability of failure for each package to be half
of 1/25000. That would be 0.002%.

Not realistic enough? Let's assume additionally that 24950 source
packages build ok all the time, and only 50 of them have a probability
of failure > 0.

I still want to build all packages and have 0 or 1 failures,
so in this case the probability should be 1/50/2, i.e. 1%.

I think this is still feasible.
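The expectation arithmetic above can be restated in a few lines (a sketch of the calculation only, nothing package-specific):

```python
# Archive-wide expectation calculation from the message above.
total_packages = 25000
flaky_packages = 50
target_expected_failures = 0.5   # "closer to 0 than to 1"

# If every package could fail: each needs p <= 0.5 / 25000 = 0.002%.
p_uniform = target_expected_failures / total_packages
# If only 50 packages are flaky: each may fail with p <= 0.5 / 50 = 1%.
p_flaky = target_expected_failures / flaky_packages

print(f"uniform case: {p_uniform:.3%}")   # 0.002%
print(f"flaky case:   {p_flaky:.0%}")     # 1%
```

The expected number of failures over one full-archive rebuild is the sum of the per-package probabilities, which is what makes both bounds come out of a single division.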

> The "fix" for such cases is to increase the threshold or to disable
> the test completely, because nothing else can be done due to the
> nature of numerical simulations.

I really wish it would always be as simple as that.

But as long as there are people in this project who consider that a
package which FTBFS on single-CPU machines more than 50% of the time
is ok "because it does not usually fail on buildd.debian.org",
we are doomed.

See the FTBFS-randomly bugs open against rygel, libsecret
or libsoup2.4, for example.

> > Really, we need more people doing QA, and not stop doing it "because
> > we are near the freeze".
> 
> If you have been maintaining a package for several years, fixing most of
> its bugs, hoping to see it in the release, and trying to avoid major
> changes in the months before the freeze... Sure, maintainers will actively
> defend it if some pseudo-reasons for its removal appear just
> before the freeze. This fact has to be considered as well.

Well, you will see that I've downgraded all the bugs of this type to
important (btw: please do not call these "pseudo-reasons").

The problem I see with this threshold thing is that every maintainer
seems to have his own threshold, different from the others.

In case we decide about RC-ness depending on probability of failure:
What threshold do you think we should use for a single package and why?

[ I say that 1% of failure is the maximum we should allow, and I've
  explained why, but I would love to hear your opinion on this ].

Thanks a lot.



Bug#848859: FTBFS randomly (failing tests)

2017-01-04 Thread Thibaut Paumard
On 04.01.2017 19:58, Anton Gladky wrote:
> 2017-01-04 13:26 GMT+01:00 Santiago Vila :
>> No matter how glitch-free the autobuilder you use to build the
>> above package is, it will fail to build once every 147 times on average,
>> mathematically, because the test is wrongly designed.
> 
> That is not always true. If you look at many tests from numerical
> simulation packages, there is usually a "threshold" which the test
> result should not exceed. And the test result varies within
> the limits set by upstream authors. This result
> can be different even on the same machine, running the simulation
> several times. And that is normal.
> 
> The "fix" for such cases is to increase the threshold or to disable
> the test completely, because nothing else can be done due to the
> nature of numerical simulations.

In addition, a build-time test is made to detect potential bugs, which may
themselves be of low severity. What makes the bug RC is that the test
causes an FTBFS. Once such a bug has been identified, I tend to file a bug
with the appropriate severity against my own package and disable the test
until the real, low-to-normal-severity bug is fixed.

>> Really, we need more people doing QA, and not stop doing it "because
>> we are near the freeze".
> 
> If you are maintaining the package several years, fixing most of its
> bugs, hoping to see it in release and trying to escape major changes
> several months before the freeze.. Sure, it will actively be defended
> from maintainers if some pseudo-reasons for its removal appear just
> before the freeze. This fact has to be considered as well.

I guess that's a case for a stretch-ignore tag. TBD with the release
team. Regards, Thibaut.




Bug#848859: FTBFS randomly (failing tests)

2017-01-04 Thread Anton Gladky
2017-01-04 13:26 GMT+01:00 Santiago Vila :
> No matter how glitch-free the autobuilder you use to build the
> above package is, it will fail to build once every 147 times on average,
> mathematically, because the test is wrongly designed.

That is not always true. If you look at many tests from numerical
simulation packages, there is usually a "threshold" which the test
result should not exceed. And the test result varies within
the limits set by upstream authors. This result
can be different even on the same machine, running the simulation
several times. And that is normal.

The "fix" for such cases is to increase the threshold or to disable
the test completely, because nothing else can be done due to the
nature of numerical simulations.

> Really, we need more people doing QA, and not stop doing it "because
> we are near the freeze".

If you have been maintaining a package for several years, fixing most of
its bugs, hoping to see it in the release, and trying to avoid major
changes in the months before the freeze... Sure, maintainers will actively
defend it if some pseudo-reasons for its removal appear just
before the freeze. This fact has to be considered as well.

Best regards

Anton



Bug#848859: FTBFS randomly (failing tests)

2017-01-04 Thread Santiago Vila
On Wed, Jan 04, 2017 at 08:44:17AM +0100, Ole Streicher wrote:

> > It's in Release Policy: Packages *must* autobuild *without* failure.
> > 
> > If a package fails to build from time to time, that's a failure.
> 
> Packages actually *do* fail from time to time, when I look into my
> autobuilder. Not due to the package, but due to glitches within the
> buildd infrastructure. Would you consider this a failure?

If the package is not to blame, of course not.

I'm speaking about packages which intrinsically fail with a
probability p such that 0 < p < 1. A funny example of what
I call "intrinsically":

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=838828

No matter how glitch-free the autobuilder you use to build the
above package is, it will fail to build once every 147 times on average,
mathematically, because the test is wrongly designed.
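The "only a matter of time" nature of such a failure rate can be sketched numerically (assuming, as stated, an independent 1-in-147 failure probability per build):

```python
# Per-build failure rate of the wrongly designed test (from the bug above).
p_fail = 1 / 147

def p_at_least_one_failure(builds):
    # Complement of "every one of the builds succeeds".
    return 1 - (1 - p_fail) ** builds

# After about 100 rebuilds the odds of having seen at least one FTBFS
# are roughly even -- no buildd glitch required.
print(f"{p_at_least_one_failure(100):.2f}")
```

This is why "it builds fine on buildd.debian.org" is weak evidence: the official builders simply have not rolled the dice often enough yet.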

> >> I totally agree that catching random failures
> >> is a good quality measure, but this is IMO severity "important" at maximum.
> > 
> > Well, would you say it's RC if it fails 99% of the time?
> > I guess you would.
> 
> I would consider a bug RC if it actually doesn't build on our buildds.

Aha! But *that* is not written in policy anywhere.

Not only is it not written anywhere, it's invalidated by current practice
every day. Examples here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=lamby%40debian.org

or here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=lucas%40debian.org

or even here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=sanvila%40debian.org


Are you proposing that Lucas Nussbaum, Chris Lamb or myself stop
reporting FTBFS bugs as serious unless we can point to a failed build
log at buildd.debian.org?

Surely that restricted way of reporting bugs cannot be right.

> [...]
> Doing release QA just before the release leads to quick hacks to keep
> things in, while continuous QA really solves the problems.

Well, I started doing QA more than a year ago, to check for
"dpkg-buildpackage -A". As a side effect, I started to report each
and every package which FTBFS for whatever reason.

Really, we need more people doing QA, and not stop doing it "because
we are near the freeze".

Thanks.



Bug#848859: FTBFS randomly (failing tests)

2017-01-03 Thread Ole Streicher
Hi Santiago,

On 04.01.2017 01:41, Santiago Vila wrote:
> On Tue, 27 Dec 2016, Ole Streicher wrote:
> Hello Ole. Thanks for your reply. Please don't forget to Cc: me if you
> expect your message to be read.

OK, however I usually assume that a bug submitter actually reads the
messages for his submitted bugs.

>>> In particular, if something happens 1 every 20 times on average, the
>>> fact that it did not happen when you try 10 times does not mean in any
>>> way that it may not happen.
>>
>>
>> I must however say that I don't see why a package that fails to build
>> once in 20 builds would have a release critical problem:
> 
> It's in Release Policy: Packages *must* autobuild *without* failure.
> 
> If a package fails to build from time to time, that's a failure.

Packages actually *do* fail from time to time, when I look into my
autobuilder. Not due to the package, but due to glitches within the
buildd infrastructure. Would you consider this a failure?

>> I totally agree that catching random failures
>> is a good quality measure, but this is IMO severity "important" at maximum.
> 
> Well, would you say it's RC if it fails 99% of the time?
> I guess you would.

I would consider a bug RC if it actually doesn't build on our buildds.

>> And it would be nice to have this QA test not just before the release,
>> but already early in the release cycle. This would help a lot to avoid
>> stressful bugfixes just before the freeze.
> 
> Well, if you see my list of FTBFS-randomly bugs here:
> 
> https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-randomly;users=sanv...@debian.org
> 
> you will see that I started to report FTBFS-randomly bugs several
> months ago, not last week, and not last month.
> 
> What would really help is to have somebody else reporting these bugs,
> because apparently very few people want to report them.
> 
> Also, you can't seriously "complain" that a bug reporter didn't
> report something earlier! Of course it would be nice to do this
> kind of QA stuff at every point in the release cycle, but like the
> software that we maintain, my bug reports come with no warranty, not
> even the warranty that the bug is reported "as soon as it exists"! :-)

What I do complain about is that doing this just before the release adds
additional pressure on the package maintainers: The problem for us
ordinary people is that most of the time my packages look fine, without
serious problems, and just before the release all those RC bugs pop up
in bunches, with quite some time pressure to solve them. If the RC bugs
were randomly distributed over time, there would be more time to solve
them, and the quality of the fixes would be better: for example, at the
moment I would just disable build-time tests that randomly fail, while
usually I would try to see why they fail and fix that. At the moment,
there is no time for this.

Doing release QA just before the release leads to quick hacks to keep
things in, while continuous QA really solves the problems.

Best regards

Ole



Bug#848859: FTBFS randomly (failing tests)

2017-01-03 Thread Santiago Vila
On Tue, 27 Dec 2016, Ole Streicher wrote:

> Hi Santiago,

Hello Ole. Thanks for your reply. Please don't forget to Cc: me if you
expect your message to be read.

> > In particular, if something happens 1 every 20 times on average, the
> > fact that it did not happen when you try 10 times does not mean in any
> > way that it may not happen.
> 
> 
> I must however say that I don't see why a package that fails to build
> once in 20 builds would have a release critical problem:

It's in Release Policy: Packages *must* autobuild *without* failure.

If a package fails to build from time to time, that's a failure.

> There is nothing in our policy that requires a reproducible or even an
> always successful build.

Please don't mix up "reproducible" with "always successful build".

They are different things, and nobody has made "reproducible builds"
policy yet.

But "always successful" has always been release policy:

https://release.debian.org/stretch/rc_policy.txt

Please read the paragraph saying "packages must autobuild without
failure". That paragraph is the very reason FTBFS bugs have been serious
for a long time.

> I totally agree that catching random failures
> is a good quality measure, but this is IMO severity "important" at maximum.

Well, would you say it's RC if it fails 99% of the time?
I guess you would.

Would you say it's RC if it fails 0.001% of the time?
I guess you would not, since you have just said that 5% does not
deserve to be serious.

So, you are implicitly saying that packages which FTBFS with a
low probability do not deserve a serious bug.

However, there is no threshold at all in policy, and if you think
about it, any such threshold would be quite arbitrary indeed:
Why would 50% of failure be RC while 49% of failure is not?

[ To me, the mere fact that we are able to *measure* the probability
  of failure already means that it is too high ].


Anyway, there will be plenty of time to discuss this, because I'm
going to downgrade all the FTBFS-randomly bugs I've detected until
the Release Managers say something about the subject.

Then I expect most (if not all) of them will become serious and RC
again.

> And it would be nice to have this QA test not just before the release,
> but already early in the release cycle. This would help a lot to avoid
> stressful bugfixes just before the freeze.

Well, if you see my list of FTBFS-randomly bugs here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-randomly;users=sanv...@debian.org

you will see that I started to report FTBFS-randomly bugs several
months ago, not last week, and not last month.

What would really help is to have somebody else reporting these bugs,
because apparently very few people want to report them.

Also, you can't seriously "complain" that a bug reporter didn't
report something earlier! Of course it would be nice to do this
kind of QA stuff at every point in the release cycle, but like the
software that we maintain, my bug reports come with no warranty, not
even the warranty that the bug is reported "as soon as it exists"! :-)

Thanks.



Bug#848859: FTBFS randomly (failing tests)

2016-12-27 Thread Ole Streicher
Hi Santiago,

> In particular, if something happens 1 every 20 times on average, the
> fact that it did not happen when you try 10 times does not mean in any
> way that it may not happen.
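The arithmetic behind that quoted claim is easy to verify (a small sketch, assuming an independent 1-in-20 failure probability per build, as in the example):

```python
# Assumed per-build failure probability from the quoted example.
p_fail = 1 / 20

# Probability of observing *no* failure across 10 build attempts:
# each of the 10 independent builds must succeed.
p_ten_clean_builds = (1 - p_fail) ** 10
print(f"{p_ten_clean_builds:.2f}")   # about 0.60
```

Ten clean builds are therefore the *more likely* outcome even for a genuinely flaky package, so they prove very little.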


I must however say that I don't see why a package that fails to build
once in 20 builds would have a release-critical problem:

There is nothing in our policy that requires a reproducible or even an
always successful build. I totally agree that catching random failures
is a good quality measure, but this is IMO severity "important" at maximum.

And it would be nice to have this QA test not just before the release,
but already early in the release cycle. This would help a lot to avoid
stressful bugfixes just before the freeze.

Best regards

Ole