OK, let's keep an eye on the flaky list of master and branch-2 till this weekend.
If it is in a bad state then let's discuss again.

Stack <st...@duboce.net> wrote on Thu, Mar 5, 2020 at 12:41 AM:

> On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <palomino...@gmail.com>
> wrote:
>
> > And to speak a little more on increasing the forkCount: in fact, the
> > test category is not too rough. The LargeTests just means the test will
> > run a bit long, it does not mean it will consume more resources. Maybe
> > the tests just have lots of Thread.sleep so we declare them as
> > LargeTests.
> >
>
> I've done a few passes on test categorization of late. The notion had
> rotted pretty bad but should be cleaned up now.
>
> > What I can see is that all the replication related tests are flaky now.
> > This is reasonable. In replication tests, usually we have to set up at
> > least two mini clusters, and the replication system itself will make use
> > of lots of threads. So if you run several replication related tests
> > together, it will be easy to overload the machine and cause the UTs to
> > time out or OOM.
> >
>
> We have at least one test that makes four clusters inside the one JVM.
>
> Yeah, the resource usage in general needs weeding.
>
> Perhaps you are arguing that we just leave the state of the tests as it is?
> That we let long tests run in series in case two or more might run together
> and fail because they are profligate in their resource use?
>

I mean increasing the fork count will lead to random test results, as the
test category cannot describe the resource usage clearly. You can run maybe
20+ lightweight UTs without problem, but if you run 5 tests which each set up
4 mini clusters, the resources will be exhausted and cause the tests to fail,
or at least make them really slow and fail the tests...

> > So, again, let's do this on a feature branch. It is fine to mess things
> > up on a feature branch. You can do everything you want as the
> > intermediate state does not affect others. On master and branch-2 it is
> > another story.
> > I do not think this should be a blocker for 2.3.0 or 3.0.0.
> >
>
> See previous note.
>
> Thanks,
> S
>
> > Thanks.
> >
> > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 4, 2020 at 7:34 PM:
> >
> > > Due to the resource limit I do not think it is a good idea to increase
> > > the forkCount...
> > >
> > > FWIW, can we do this on a feature branch and move master and branch-2
> > > back?
> > >
> > > See here
> > >
> > > https://github.com/apache/hbase/pull/1221
> > >
> > > We tried several times and always got a large number of failed UTs
> > > which are not related to the patch. And we even excluded hundreds of
> > > UTs due to the flaky list!
> > >
> > > This makes it almost impossible to contribute to the project. Even if
> > > we get a green result after several tries, because hundreds of UTs are
> > > excluded, no one knows if the patch breaks something.
> > >
> > > Thanks.
> > >
> > > Stack <st...@duboce.net> wrote on Wed, Mar 4, 2020 at 2:55 PM:
> > >
> > > > Upstream branch-2 and master nightlies don't look too bad currently.
> > > > There are a few bad runs where there were a bunch of hangs which
> > > > makes things look bad. I upped the number of tests we show from 5 to
> > > > 10 on branch-2 and master which makes it so a failed test shows
> > > > longer in the top half of the flakies page -- and more flakies are
> > > > listed. On the bottom half, I'd upped the ferocity with which we run
> > > > on GCE to draw out flakies. Needless to say, they fail more often
> > > > when resources are contended. I might knock the ferocity down in the
> > > > next day or so but am trying to land some patches that cut down on
> > > > resource usage and want to see how these do in the flakie runs first.
> > > >
> > > > Master I haven't looked at much... looks like branch-2? Branch-2.2
> > > > and branch-2.1 look sleepy. Similar amounts of flakies in the
> > > > nightlies.
> > > > They don't have the ferocity upped so the lower-half GCE section
> > > > looks 'better'. I can make them look like branch-2 and master if
> > > > folks want (smile) but it's probably ok letting the flakies lie in
> > > > branches that are being bypassed.
> > > >
> > > > Generally, I've been working on unit tests with inspiration and help
> > > > from Mark Miller and Nick. Our tests are in a poor state. They take
> > > > so long, they don't get run anywhere else other than up on jenkins.
> > > > They rarely pass, and only then by accident, if there is minimal
> > > > parallelism and jitter. On multi-core machines, they use 1 to 2 cores
> > > > only -- even if the machine has tens of them.
> > > >
> > > > I have been trying to burn down the flakies, make the tests complete
> > > > successfully in less time with more parallelism, using all of the
> > > > machine, and make them pass both on jenkins and locally. Of late, I
> > > > have been focused on branch-2 since it is calming down getting ready
> > > > for a 2.3.0RC0. Having some success but it's a nasty job where it is
> > > > hard to claim advances because the flakies vary w/ the context in
> > > > which the tests are run. Hopefully we'll turn a corner on jenkins
> > > > soon for folks to enjoy.
> > > >
> > > > Shout if you need more detail.
> > > > S
> > > >
> > > >
> > > > On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <palomino...@gmail.com>
> > > > wrote:
> > > >
> > > > > But why are branch-2.2 and branch-2.1 still fine?
> > > > >
> > > > > Sean Busbey <bus...@apache.org> wrote on Wed, Mar 4, 2020 at 9:24 AM:
> > > > >
> > > > > > I agree in principle that excluding 100s of UTs isn't good. But
> > > > > > we don't really have better options given the state of tests and
> > > > > > testing hardware currently available to us.
> > > > > >
> > > > > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <palomino...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I think the problem is all UTs are failing randomly...
> > > > > > >
> > > > > > > And it is also not a good idea to exclude hundreds of UTs in
> > > > > > > pre commit?
> > > > > > >
> > > > > > > Sean Busbey <bus...@apache.org> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > > > > > >
> > > > > > > > Everything in the flake list should be skipped at precommit
> > > > > > > > time. Is that not happening?
> > > > > > > >
> > > > > > > > Are we keeping a shorter flake window so things are bouncing
> > > > > > > > in and out of the list?
> > > > > > > >
> > > > > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <palomino...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I see that lots of 'flaky tests' related issues have been
> > > > > > > > > resolved recently, but it seems the situation is getting
> > > > > > > > > worse? For branch-2.2 the flaky page is fine, but for
> > > > > > > > > master it is totally a mess...
> > > > > > > > >
> > > > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > > > >
> > > > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > > > >
> > > > > > > > > Lots of UTs are in trouble and it makes it really hard to
> > > > > > > > > pass the pre commit check, which means it is really hard to
> > > > > > > > > contribute to the project...
> > > > > > > > >
> > > > > > > > > We need to fix this soon...