Bug#848859: FTBFS randomly (failing tests)
> >> My experience is that the buildds that are currently in use provide
> >> more build problems than the packages themselves. BTW, why don't
> >> you count this as RC?
> >
> > Can you clarify the question? I don't understand what exactly you
> > are referring to.
>
> It does not make sense to have 100% reliability on the package itself
> to build but only 95% on the buildds. If we want to have reliability,
> then every chain member counts -- and the weakest one at the moment,
> for me, does not seem to be the packages themselves, but the buildd
> setups we currently have. So, if you consider unreliable package
> builds an RC problem, you should also consider the unreliable buildd
> setup an RC problem.

Ok, I now understand the question, but now I don't understand the
purpose of the question :-)

First of all, I don't see that the buildds are unreliable, or at least
I don't see them failing randomly 5% of the time. But in either case,
if they were that unreliable, nothing prevents building the same set
of packages on another autobuilder farm that is not the official one.

This is precisely what is done in reproducible builds, what Lucas
Nussbaum does, and what I do as well (on a much smaller scale).
Because we mainly care about the quality of the packages themselves,
which is what we ship to our users, not about the quality of the
official autobuilders, which we don't ship to our users.

In other words, it may be interesting to consider the whole chain that
you mention, but I'm only interested in the packages themselves.

> > We provide software which is free to be modified. If people are
> > going to modify it and then it fails because the package only builds
> > ok on buildd.debian.org, then we are removing one of the essential
> > freedoms, the freedom to modify it, and the package becomes de facto
> > non-free, even if we still distribute it in main.
>
> First, the multi-CPU buildd setup is free (as in speech), isn't it?
> So, depending on this does not make anything non-free. I don't see
> why the dependency on a specific build environment would change the
> DFSG freeness at all (as long as the build environment itself is DFSG
> free).
>
> Then, we are speaking about random failures. If you use a single core
> and see a build failure, you can just pragmatically repeat it.
>
> Sorry, I don't see why this would restrict DFSG freedom here.

The single-core / multi-core issue was just an example of a possible
reason why a package might build fine on buildd.debian.org and not on
my autobuilders (or the opposite). But it was just an example.

The real problem is when a package fails to build on every autobuilder
except buildd.debian.org and people still say "it builds fine on
buildd.debian.org so it's not RC", but *without* giving the reason why
it fails on every other autobuilder. So this is like having
"Build-Depends: buildd.debian.org". The maintainer does not
investigate enough and basically says that all the other autobuilders
are wrong "for not being buildd.debian.org".

This is what IMO makes the package non-free: the need to build the
package on some of the official autobuilders for *unknown* reasons.

Thanks.
Bug#848859: FTBFS randomly (failing tests)
On 05.01.2017 11:36, Santiago Vila wrote:
> On Thu, Jan 05, 2017 at 11:06:50AM +0100, Ole Streicher wrote:
>> Hi all,
>>
>> On 04.01.2017 20:57, Santiago Vila wrote:
>>> I still want to build all packages and have 0 or 1 failures,
>>> so in this case the probability should be 1/50/2, i.e. 1%.
>>>
>>> I think this is still feasible.
>>
>> My experience is that the buildds that are currently in use provide
>> more build problems than the packages themselves. BTW, why don't you
>> count this as RC?
>
> Can you clarify the question? I don't understand what exactly you are
> referring to.

It does not make sense to have 100% reliability on the package itself
to build but only 95% on the buildds. If we want to have reliability,
then every chain member counts -- and the weakest one at the moment,
for me, does not seem to be the packages themselves, but the buildd
setups we currently have. So, if you consider unreliable package
builds an RC problem, you should also consider the unreliable buildd
setup an RC problem.

>>> But as long as there are people in this project who consider that a
>>> package which FTBFS on single-CPU machines more than 50% of the
>>> time is ok "because it does not usually fail on buildd.debian.org",
>>> we are doomed.
>>
>> We have 10 archs, and with a 50% chance of failure you will not get
>> any version built. Even when the buildds try it two or three times.
>
> It's actually more subtle than that.
>
> The package may be ready to be built on multi-core machines, but not
> on single-core machines. So, the probability that I measure (greater
> than 50%) might not be the same as the probability on the official
> autobuilders.

Usually it is the other way around: packages build fine
single-threaded, but don't do so with many threads (or CPUs). In the
end it also depends on how many CPUs you actually get for the build --
if some packages are built in parallel, the result may be as for a
single CPU.

> Then there is the other subtle thing: The package may be Arch:all, in
> which case a 50% probability of failure (this time independent of the
> number of CPUs) may go unnoticed very easily, either because the
> maintainer uploaded the .deb him/herself, or, if the maintainer
> uploaded in source-only form, we can just be lucky that time.

I would much favour ignoring any prebuilt binaries and re-building
everything anyway.

> We provide software which is free to be modified. If people are going
> to modify it and then it fails because the package only builds ok on
> buildd.debian.org, then we are removing one of the essential
> freedoms, the freedom to modify it, and the package becomes de facto
> non-free, even if we still distribute it in main.

First, the multi-CPU buildd setup is free (as in speech), isn't it?
So, depending on this does not make anything non-free. I don't see why
the dependency on a specific build environment would change the DFSG
freeness at all (as long as the build environment itself is DFSG
free).

Then, we are speaking about random failures. If you use a single core
and see a build failure, you can just pragmatically repeat it.

Sorry, I don't see why this would restrict DFSG freedom here.

Best regards

Ole
Bug#848859: FTBFS randomly (failing tests)
On Thu, Jan 05, 2017 at 11:06:50AM +0100, Ole Streicher wrote:
> Hi all,
>
> On 04.01.2017 20:57, Santiago Vila wrote:
> > I still want to build all packages and have 0 or 1 failures,
> > so in this case the probability should be 1/50/2, i.e. 1%.
> >
> > I think this is still feasible.
>
> My experience is that the buildds that are currently in use provide
> more build problems than the packages themselves. BTW, why don't you
> count this as RC?

Can you clarify the question? I don't understand what exactly you are
referring to.

> [statistical]
> >> The "fix" for such cases is to increase the threshold or to
> >> disable the test completely. Because you can do nothing with it
> >> due to the nature of numerical simulations.
>
> Just disabling would be bad, since the test can still point to some
> problem in the implementation, like optimization problems or such.
>
> IMO the correct solution here would be to remove the randomness and
> to seed with a fixed value (which is known to give a result within
> the expectations).

Indeed. Sometimes I've asked for that in my bug reports: "I would love
to provide a way to reproduce it, but you are using a different seed
each time..."

Removing the randomness is certainly the right path to achieve both
builds that always work and also reproducible builds.

> > But as long as there are people in this project who consider that a
> > package which FTBFS on single-CPU machines more than 50% of the
> > time is ok "because it does not usually fail on buildd.debian.org",
> > we are doomed.
>
> We have 10 archs, and with a 50% chance of failure you will not get
> any version built. Even when the buildds try it two or three times.

It's actually more subtle than that.

The package may be ready to be built on multi-core machines, but not
on single-core machines. So, the probability that I measure (greater
than 50%) might not be the same as the probability on the official
autobuilders.

This is one of the reasons I insist so much that the rule "If it does
not fail on the official autobuilders, it's not RC" is wrong. We pride
ourselves on being the Universal Operating System. We can't make
multi-core a requirement to build.

That's also mathematically and logically wrong, because in an abstract
sense, building a package is like following an algorithm. It may be
slower or faster, but it should always succeed regardless of the
number of CPU cores.

Then there is the other subtle thing: The package may be Arch:all, in
which case a 50% probability of failure (this time independent of the
number of CPUs) may go unnoticed very easily, either because the
maintainer uploaded the .deb him/herself, or, if the maintainer
uploaded in source-only form, we can just be lucky that time.

> > The problem I see with this threshold thing is that every
> > maintainer seems to have his own threshold, different from the
> > others.
> >
> > In case we decide about RC-ness depending on probability of
> > failure: What threshold do you think we should use for a single
> > package and why?
> >
> > [ I say that 1% of failure is the maximum we should allow, and I've
> > explained why, but I would love to hear your opinion on this ].
>
> I would not put any direct number here, but be pragmatic: We need to
> build the package from source, and this has to be done on all
> supported architectures, and on the buildds. As long as this is done,
> I see no reason for an RC bug. [...]

My objection to considering the official buildds the "standard" by
which RC-ness is measured is that it gives packages an implicit
"Build-Depends: buildd.debian.org".

We provide software which is free to be modified. If people are going
to modify it and then it fails because the package only builds ok on
buildd.debian.org, then we are removing one of the essential freedoms,
the freedom to modify it, and the package becomes de facto non-free,
even if we still distribute it in main.

That's why I think that packages must build ok on any autobuilder
which is not misconfigured, not just on buildd.debian.org.

Thanks.
Bug#848859: FTBFS randomly (failing tests)
Hi all,

On 04.01.2017 20:57, Santiago Vila wrote:
> I still want to build all packages and have 0 or 1 failures,
> so in this case the probability should be 1/50/2, i.e. 1%.
>
> I think this is still feasible.

My experience is that the buildds that are currently in use provide
more build problems than the packages themselves. BTW, why don't you
count this as RC?

[statistical]
>> The "fix" for such cases is to increase the threshold or to disable
>> the test completely. Because you can do nothing with it due to the
>> nature of numerical simulations.

Just disabling would be bad, since the test can still point to some
problem in the implementation, like optimization problems or such.

IMO the correct solution here would be to remove the randomness and to
seed with a fixed value (which is known to give a result within the
expectations).

> But as long as there are people in this project who consider that a
> package which FTBFS on single-CPU machines more than 50% of the time
> is ok "because it does not usually fail on buildd.debian.org",
> we are doomed.

We have 10 archs, and with a 50% chance of failure you will not get
any version built. Even when the buildds try it two or three times.

> The problem I see with this threshold thing is that every maintainer
> seems to have his own threshold, different from the others.
>
> In case we decide about RC-ness depending on probability of failure:
> What threshold do you think we should use for a single package and
> why?
>
> [ I say that 1% of failure is the maximum we should allow, and I've
> explained why, but I would love to hear your opinion on this ].

I would not put any direct number here, but be pragmatic: We need to
build the package from source, and this has to be done on all
supported architectures, and on the buildds. As long as this is done,
I see no reason for an RC bug.

If a package fails on a supported arch on the buildds, this is RC,
independent of whether this comes from an architectural difference or
from a random build failure.

Your tests with repeated builds help the maintainer to find the cause
in that case, and for that they are helpful. But not release critical
by themselves.

Cheers

Ole
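The fixed-seed approach suggested above can be sketched in a few lines. This is hypothetical test code, not from any package in this bug: seeding the generator once makes the "random" input, and therefore the pass/fail outcome, identical on every build.

```python
import random
import statistics

def noisy_mean(rng, n=1000):
    # Stand-in for a numerical simulation whose result fluctuates
    # from run to run.
    return statistics.fmean(rng.gauss(5.0, 1.0) for _ in range(n))

def test_mean_is_stable():
    # Fixed seed: every build draws exactly the same samples, so the
    # computed value -- and the test outcome -- is deterministic.
    rng = random.Random(42)
    assert abs(noisy_mean(rng) - 5.0) < 0.2

test_mean_is_stable()
```

With an unseeded `random.Random()` instead, the same assertion would fail on some small fraction of builds, which is exactly the FTBFS-randomly pattern discussed in this bug.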
Bug#848859: FTBFS randomly (failing tests)
On Wed, Jan 04, 2017 at 07:58:43PM +0100, Anton Gladky wrote:
> 2017-01-04 13:26 GMT+01:00 Santiago Vila:
> > No matter how glitch-free the autobuilder you use to build the
> > above package is, it will fail to build 1 in every 147 times on
> > average, mathematically, because the test is wrongly designed.
>
> That is not always true. If you look at many tests from numerical
> simulation packages, there is usually a "threshold" for the test
> result which should not be exceeded. And the test result varies
> within the limits, which are set by the upstream authors. This result
> can be different even on the same machine, running the simulation
> several times. And that is normal.
> [...]

I know what you mean. I've seen it several times in statistical
packages.

In my opinion, upstream authors may do as they wish, but Debian aims
at having reproducible builds (in some future). Reproducible builds
means that each time you build the package, the same .deb is created.

This is of course not policy yet, but I see it as fundamentally
incompatible with packages which FTBFS from time to time. If we want
the end result to be always the same, then failing from time to time
(which is sometimes the end result) should never happen.

In other words, if we don't take deterministic builds seriously (as in
"every time I try to build the package, the build does not fail"), how
can we expect to be reproducible in the future?

It is interesting, however, what you mention about thresholds,
statistical packages, and simulations, so here is the math I do,
applied to Debian:

Let's say that we have 25000 source packages in stretch, and I want to
build all of them and not have a single failure. Since, as you point
out, there is quite a bunch of statistical packages with tests based
on random numbers, the mathematical probability that there is some
failure will always be > 0.

Ok, then. Let's suppose that I'm happy enough if the expected number
of packages that fail to build is closer to 0 than to 1. So, let's
make the probability of failure for each package half of 1/25000.
That would be 0.002%.

Not realistic enough? Let's assume additionally that 24950 source
packages build ok all the time, and only 50 of them have a probability
of failure > 0. I still want to build all packages and have 0 or 1
failures, so in this case the probability should be 1/50/2, i.e. 1%.

I think this is still feasible.

> The "fix" for such cases is to increase the threshold or to disable
> the test completely. Because you can do nothing with it due to the
> nature of numerical simulations.

I really wish it were always as simple as that.

But as long as there are people in this project who consider that a
package which FTBFS on single-CPU machines more than 50% of the time
is ok "because it does not usually fail on buildd.debian.org", we are
doomed.

See the FTBFS-randomly bugs open against rygel, libsecret or
libsoup2.4, for example.

> > Really, we need more people doing QA, and we should not stop doing
> > it "because we are near the freeze".
>
> If you have been maintaining the package for several years, fixing
> most of its bugs, hoping to see it in the release and trying to avoid
> major changes several months before the freeze... Sure, it will
> actively be defended by its maintainers if some pseudo-reasons for
> its removal appear just before the freeze. This fact has to be
> considered as well.

Well, you will see that I've downgraded all the bugs of this type to
important (btw: please do not call these "pseudo-reasons").

The problem I see with this threshold thing is that every maintainer
seems to have his own threshold, different from the others.

In case we decide about RC-ness depending on probability of failure:
What threshold do you think we should use for a single package and
why?

[ I say that 1% of failure is the maximum we should allow, and I've
explained why, but I would love to hear your opinion on this ].

Thanks a lot.
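The arithmetic in the two scenarios above can be checked directly. The figures (25000 packages, 50 flaky ones, 0.002% and 1%) are the ones from the message; the code is just a restatement:

```python
# Expected number of failures = sum of per-package failure
# probabilities, since each build fails independently.

n_packages = 25000

# Scenario 1: all packages equally flaky. Keeping the expected
# number of failures at 0.5 (closer to 0 than to 1) requires
# p = 0.5 / 25000 = 0.002% per package.
p_uniform = 0.5 / n_packages
print(f"{p_uniform:.3%}")  # 0.002%

# Scenario 2: 24950 packages never fail; only 50 are flaky.
# Then 50 * p = 0.5 gives p = 1/50/2 = 1% per flaky package.
n_flaky = 50
p_flaky = 0.5 / n_flaky
print(f"{p_flaky:.0%}")  # 1%
```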
Bug#848859: FTBFS randomly (failing tests)
Le 04/01/2017 à 19:58, Anton Gladky a écrit :
> 2017-01-04 13:26 GMT+01:00 Santiago Vila:
>> No matter how glitch-free the autobuilder you use to build the above
>> package is, it will fail to build 1 in every 147 times on average,
>> mathematically, because the test is wrongly designed.
>
> That is not always true. If you look at many tests from numerical
> simulation packages, there is usually a "threshold" for the test
> result which should not be exceeded. And the test result varies
> within the limits, which are set by the upstream authors. This result
> can be different even on the same machine, running the simulation
> several times. And that is normal.
>
> The "fix" for such cases is to increase the threshold or to disable
> the test completely. Because you can do nothing with it due to the
> nature of numerical simulations.

In addition, a build-time test is made to detect potential bugs, which
may themselves be of low severity. What makes the bug RC is that the
test causes an FTBFS. Once such a bug has been identified, I tend to
file a bug with the appropriate severity against my own package and
disable the test until the real, low-to-normal-severity bug is fixed.

>> Really, we need more people doing QA, and we should not stop doing
>> it "because we are near the freeze".
>
> If you have been maintaining the package for several years, fixing
> most of its bugs, hoping to see it in the release and trying to avoid
> major changes several months before the freeze... Sure, it will
> actively be defended by its maintainers if some pseudo-reasons for
> its removal appear just before the freeze. This fact has to be
> considered as well.

I guess that's a case for a stretch-ignore tag. To be discussed with
the release team.

Regards,

Thibaut.
Bug#848859: FTBFS randomly (failing tests)
2017-01-04 13:26 GMT+01:00 Santiago Vila:
> No matter how glitch-free the autobuilder you use to build the above
> package is, it will fail to build 1 in every 147 times on average,
> mathematically, because the test is wrongly designed.

That is not always true. If you look at many tests from numerical
simulation packages, there is usually a "threshold" for the test
result which should not be exceeded. And the test result varies within
the limits, which are set by the upstream authors. This result can be
different even on the same machine, running the simulation several
times. And that is normal.

The "fix" for such cases is to increase the threshold or to disable
the test completely. Because you can do nothing with it due to the
nature of numerical simulations.

> Really, we need more people doing QA, and we should not stop doing it
> "because we are near the freeze".

If you have been maintaining the package for several years, fixing
most of its bugs, hoping to see it in the release and trying to avoid
major changes several months before the freeze... Sure, it will
actively be defended by its maintainers if some pseudo-reasons for its
removal appear just before the freeze. This fact has to be considered
as well.

Best regards

Anton
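A toy version of the kind of test described above (hypothetical code, not from any actual package): the computed value genuinely differs between runs, and the test only asserts that it stays inside an upstream-chosen tolerance. Whether such a test fails "never" or "one build in a hundred" depends entirely on how the tolerance compares to the run-to-run spread.

```python
import math
import random

def estimate_pi(rng, n=100_000):
    # Monte Carlo estimate of pi: fraction of random points in the
    # unit square falling inside the quarter circle, times 4.
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
                 for _ in range(n))
    return 4.0 * inside / n

# The run-to-run standard deviation here is about 0.005, so a
# tolerance of 0.05 (~10 sigma) practically never fails, while a
# tolerance of 0.01 (~2 sigma) would fail a few percent of builds.
estimate = estimate_pi(random.Random())
assert abs(estimate - math.pi) < 0.05
```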
Bug#848859: FTBFS randomly (failing tests)
On Wed, Jan 04, 2017 at 08:44:17AM +0100, Ole Streicher wrote:
> > It's in Release Policy: Packages *must* autobuild *without*
> > failure.
> >
> > If a package fails to build from time to time, that's a failure.
>
> Packages actually *do* fail from time to time, when I look into my
> autobuilder. Not due to the package, but due to glitches within the
> buildd infrastructure. Would you consider this a failure?

If the package is not to blame, of course not. I'm speaking about
packages which intrinsically fail with a probability p such that
0 < p < 1.

Funny example of what I call "intrinsically":

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=838828

No matter how glitch-free the autobuilder you use to build the above
package is, it will fail to build 1 in every 147 times on average,
mathematically, because the test is wrongly designed.

> >> I totally agree that catching random failures is a good quality
> >> measure, but this is IMO severity "important" at maximum.
> >
> > Well, would you say it's RC if it fails 99% of the time?
> > I guess you would.
>
> I would consider a bug RC if it actually doesn't build on our
> buildds.

Aha! But *that* is what is not written in policy anywhere.

Not only is it not written anywhere, it is invalidated by current
practice every day. Examples here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=lamby%40debian.org

or here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=lucas%40debian.org

or even here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?include=subject%3AFTBFS;submitter=sanvila%40debian.org

Are you proposing that Lucas Nussbaum, Chris Lamb or myself stop
reporting FTBFS bugs as serious unless we can point to a failed build
log at buildd.debian.org? Such a restricted way of reporting bugs
surely cannot be right.

> [...]
> Doing release QA just before the release leads to quick hacks to keep
> things there, while continuous QA really solves them.

Well, I started doing QA more than a year ago, to check for
"dpkg-buildpackage -A". As a side effect, I started to report each and
every package which FTBFS for whatever reason.

Really, we need more people doing QA, and we should not stop doing it
"because we are near the freeze".

Thanks.
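For intuition on where a figure like "1 every 147 times" can come from, here is a hypothetical illustration with numbers of my own, not an analysis of bug #838828: if the tested statistic is roughly normal and the tolerance sits at about 2.7 standard errors, the test fails at almost exactly that rate.

```python
from statistics import NormalDist

std_normal = NormalDist()

def failure_rate(k):
    # Two-sided failure probability for a tolerance of k standard
    # errors: P(|Z| > k) = 2 * (1 - Phi(k)).
    return 2.0 * (1.0 - std_normal.cdf(k))

# A tolerance of ~2.7 sigma fails roughly once in 144 builds --
# the same order as the 1-in-147 figure above.
print(round(1.0 / failure_rate(2.7)))

# Widening the tolerance to 5 sigma makes failures negligible:
assert 1.0 / failure_rate(5.0) > 1_000_000
```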
Bug#848859: FTBFS randomly (failing tests)
Hi Santiago,

On 04.01.2017 01:41, Santiago Vila wrote:
> On Tue, 27 Dec 2016, Ole Streicher wrote:
>
> Hello Ole. Thanks for your reply. Please don't forget to Cc: me if
> you expect your message to be read.

OK, however I usually assume that a bug submitter actually reads the
messages for his submitted bugs.

> > > In particular, if something happens 1 every 20 times on average,
> > > the fact that it did not happen when you try it 10 times does not
> > > mean in any way that it may not happen.
> >
> > I must however say that I don't see why a package that fails to
> > build once in 20 builds would have a release critical problem:
>
> It's in Release Policy: Packages *must* autobuild *without* failure.
>
> If a package fails to build from time to time, that's a failure.

Packages actually *do* fail from time to time, when I look into my
autobuilder. Not due to the package, but due to glitches within the
buildd infrastructure. Would you consider this a failure?

> > I totally agree that catching random failures is a good quality
> > measure, but this is IMO severity "important" at maximum.
>
> Well, would you say it's RC if it fails 99% of the time?
> I guess you would.

I would consider a bug RC if it actually doesn't build on our buildds.

> > And it would be nice to have this QA test not just before the
> > release, but already early in the release cycle. This would help a
> > lot to avoid stressful bugfixes just before the freeze.
>
> Well, if you see my list of FTBFS-randomly bugs here:
>
> https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-randomly;users=sanv...@debian.org
>
> you will see that I started to report FTBFS-randomly bugs several
> months ago, not last week, and not last month.
>
> What would really help is to have somebody else reporting these bugs,
> because apparently very few people want to report them.
>
> Also, you can't seriously "complain" that a bug reporter didn't
> report something earlier! Of course it would be nice to do this kind
> of QA stuff at every point in the release cycle, but like the
> software we maintain, my bug reports come with no warranty, not even
> the warranty that the bug is reported "as soon as it exists"! :-)

What I do complain about is that doing this just before the release
adds additional pressure on the package maintainers: The problem for
us ordinary people is that most of the time my packages look fine,
without serious problems, and just before the release all those RC
bugs pop up in bunches, with quite some time pressure to solve them.

If the RC bugs were randomly distributed over time, there would be
more time to solve them, and the quality of the fixes would be better:
for example, at the moment, I would just disable build-time tests that
randomly fail, while usually I would try to see why they fail and fix
that. At the moment, there is no time for this.

Doing release QA just before the release leads to quick hacks to keep
things there, while continuous QA really solves them.

Best regards

Ole
Bug#848859: FTBFS randomly (failing tests)
On Tue, 27 Dec 2016, Ole Streicher wrote:
> Hi Santiago,

Hello Ole. Thanks for your reply. Please don't forget to Cc: me if you
expect your message to be read.

> > In particular, if something happens 1 every 20 times on average,
> > the fact that it did not happen when you try it 10 times does not
> > mean in any way that it may not happen.
>
> I must however say that I don't see why a package that fails to build
> once in 20 builds would have a release critical problem:

It's in Release Policy: Packages *must* autobuild *without* failure.

If a package fails to build from time to time, that's a failure.

> There is nothing in our policy that requires a reproducible or even
> an always-successful build.

Please don't mix up "reproducible" with "always builds successfully".
They are different things, and nobody is making "reproducible builds"
policy yet. But "always builds successfully" has always been release
policy:

https://release.debian.org/stretch/rc_policy.txt

Please read the paragraph saying "packages must autobuild without
failure". That paragraph is the very reason FTBFS bugs have been
serious for a long time.

> I totally agree that catching random failures is a good quality
> measure, but this is IMO severity "important" at maximum.

Well, would you say it's RC if it fails 99% of the time?
I guess you would.

Would you say it's RC if it fails 0.001% of the time? I guess you
would not, since you have just said that 5% does not deserve to be
serious.

So, you are implicitly saying that packages which FTBFS with a low
probability do not deserve a serious bug. However, there is no
threshold at all in policy, and if you think about it, any such
threshold would be quite arbitrary indeed: Why would 50% failure be RC
while 49% failure is not?

[ To me, the mere fact that we are able to *measure* the probability
of failure already means that it is too high ].

Anyway, there will be plenty of time to discuss this, because I'm
going to downgrade all the FTBFS-randomly bugs I've detected until the
Release Managers say something about the subject. Then I expect most
(if not all) of them will become serious and RC again.

> And it would be nice to have this QA test not just before the
> release, but already early in the release cycle. This would help a
> lot to avoid stressful bugfixes just before the freeze.

Well, if you see my list of FTBFS-randomly bugs here:

https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=ftbfs-randomly;users=sanv...@debian.org

you will see that I started to report FTBFS-randomly bugs several
months ago, not last week, and not last month.

What would really help is to have somebody else reporting these bugs,
because apparently very few people want to report them.

Also, you can't seriously "complain" that a bug reporter didn't report
something earlier! Of course it would be nice to do this kind of QA
stuff at every point in the release cycle, but like the software we
maintain, my bug reports come with no warranty, not even the warranty
that the bug is reported "as soon as it exists"! :-)

Thanks.
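The claim quoted at the top (not happening in 10 tries proves little about a 1-in-20 failure rate) and the difficulty of measuring low failure probabilities are easy to quantify. A sketch, using the standard "rule of three" for zero-failure samples:

```python
def all_pass(p, n):
    # Probability that a package with per-build failure probability p
    # builds cleanly n times in a row.
    return (1.0 - p) ** n

# A 1-in-20 failure rate survives 10 consecutive clean builds about
# 60% of the time, so 10 clean builds prove very little.
assert all_pass(0.05, 10) > 0.59

def upper_bound_95(n_clean):
    # Rule of three: after n clean builds with no failure, the 95%
    # upper confidence bound on the failure probability is ~3/n.
    return 3.0 / n_clean

# Establishing p < 1% therefore needs on the order of 300 clean
# builds, not 10.
assert upper_bound_95(300) <= 0.01
```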
Bug#848859: FTBFS randomly (failing tests)
Hi Santiago,

> In particular, if something happens 1 every 20 times on average, the
> fact that it did not happen when you try it 10 times does not mean in
> any way that it may not happen.

I must however say that I don't see why a package that fails to build
once in 20 builds would have a release critical problem: There is
nothing in our policy that requires a reproducible or even an
always-successful build. I totally agree that catching random failures
is a good quality measure, but this is IMO severity "important" at
maximum.

And it would be nice to have this QA test not just before the release,
but already early in the release cycle. This would help a lot to avoid
stressful bugfixes just before the freeze.

Best regards

Ole