Re: What is the situation for our UTs now?

Duo Zhang Wed, 04 Mar 2020 15:25:23 -0800

OK, let's keep an eye on the flaky list of master and branch-2 till this
weekend.


If it is still in a bad state then, let's discuss it again.

Stack <st...@duboce.net> wrote on Thu, Mar 5, 2020 at 12:41 AM:

> On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <palomino...@gmail.com>
> wrote:
>
> > To say a little more on increasing the forkCount: in fact, the test
> > category is not too rough. LargeTests just means the test will run a
> > bit long; it does not mean it will consume more resources. Maybe the tests
> > just have lots of Thread.sleep calls, so we declare them as LargeTests.
> >
> >
> I've done a few passes on test categorization of late. The notion had
> rotted pretty bad but should be cleaned up now.
>
>
> > What I can see is that all the replication-related tests are flaky now.
> > This is reasonable. In replication tests, we usually have to set up at
> > least two mini clusters, and the replication system itself makes use of
> > lots of threads. So if you run several replication-related tests together,
> > it is easy to overload the machine and cause the UTs to time out or OOM.
> >
> >
> We have at least one test that makes four clusters inside the one JVM.
>
> Yeah, the resource usage in general needs weeding.
>
> Perhaps you are arguing that we just leave the tests in the state they are in?
> That we let long tests run in series in case two or more might run together
> and fail because they are profligate in their resource use?
>
I mean that increasing the fork count will lead to random test results, as the
test category cannot describe the resource usage clearly. You can run
maybe 20+ lightweight UTs without a problem, but if you run 5 tests which
set up 4 mini clusters, the resources will be exhausted and cause the tests
to fail, or at least make them really slow and then fail...
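
To make the arithmetic concrete, here is a minimal sketch in Java. It reads
the example above as each of the 5 tests setting up 4 mini clusters. The cost
units and the host budget are made-up numbers chosen for illustration, not
measurements from our build hosts, and the class and method names are
hypothetical.

```java
import java.util.Arrays;

final class ForkBudgetSketch {
    // Assumed, illustrative cost units; none of these are measured HBase figures.
    static final int LIGHT_TEST_COST = 1;    // a test that starts no mini cluster
    static final int MINI_CLUSTER_COST = 10; // each mini cluster a test starts
    static final int HOST_BUDGET = 100;      // what one build host can carry at once

    /** Cost of one test that starts the given number of mini clusters (0 for a light test). */
    static int testCost(int miniClusters) {
        return miniClusters == 0 ? LIGHT_TEST_COST : miniClusters * MINI_CLUSTER_COST;
    }

    /** Total cost when every listed test runs at once, one test per forked JVM. */
    static int concurrentCost(int[] miniClustersPerTest) {
        return Arrays.stream(miniClustersPerTest).map(ForkBudgetSketch::testCost).sum();
    }

    /** True if the concurrent load stays within the assumed host budget. */
    static boolean fitsOnHost(int[] miniClustersPerTest) {
        return concurrentCost(miniClustersPerTest) <= HOST_BUDGET;
    }

    public static void main(String[] args) {
        int[] twentyLight = new int[20];   // 20 forks, each running a test with no mini cluster
        int[] fiveHeavy = {4, 4, 4, 4, 4}; // 5 forks, each running a test with 4 mini clusters

        System.out.println("20 light tests: cost " + concurrentCost(twentyLight)
            + ", fits = " + fitsOnHost(twentyLight));
        System.out.println("5 heavy tests: cost " + concurrentCost(fiveHeavy)
            + ", fits = " + fitsOnHost(fiveHeavy));
    }
}
```

Under these assumed numbers the 20 light tests fit and the 5 heavy tests do
not, even though the second batch uses a quarter of the forks. The fork count
alone does not tell you whether a run will fit.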

>
>
>
> > So, again, let's do this on a feature branch. It is fine to mess things up
> > on a feature branch. You can do everything you want, as the intermediate
> > state does not affect others. On master and branch-2 it is another story. I
> > do not think this should be a blocker for 2.3.0 or 3.0.0.
> >
> > See previous note.
>
> Thanks,
> S
>
>
> > Thanks.
> >
> > 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Wed, Mar 4, 2020 at 7:34 PM:
> >
> > > Due to the resource limit I do not think it is a good idea to increase the
> > > forkCount...
> > >
> > > FWIW, can we do this on a feature branch and move master and branch-2 back?
> > >
> > > See here
> > >
> > > https://github.com/apache/hbase/pull/1221
> > >
> > > We tried several times and always got a large number of failed UTs which
> > > are not related to the patch. And we even excluded hundreds of UTs due to
> > > the flaky list!
> > >
> > > This makes it almost impossible to contribute to the project. Even if we
> > > get a green result after several tries, hundreds of UTs were excluded, so
> > > no one knows whether the patch breaks something.
> > >
> > > Thanks.
> > >
> > > Stack <st...@duboce.net> wrote on Wed, Mar 4, 2020 at 2:55 PM:
> > >
> > >> Upstream branch-2 and master nightlies don't look too bad currently. There
> > >> are a few bad runs where there were a bunch of hangs, which makes things
> > >> look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
> > >> master, which makes it so a failed test shows longer in the top half of the
> > >> flakies page -- and more flakies are listed. On the bottom half, I'd upped
> > >> the ferocity with which we run on GCE to draw out flakies. Needless to say,
> > >> they fail more often when resources are contended. I might knock the ferocity
> > >> down in the next day or so but am trying to land some patches that cut down
> > >> on resource usage and want to see how these do in the flakie runs first.
> > >>
> > >> Master I haven't looked at much... looks like branch-2? Branch-2.2 and
> > >> branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
> > >> don't have the ferocity upped so the lower-half GCE section looks 'better'.
> > >> I can make them look like branch-2 and master if folks want (smile) but it's
> > >> probably ok letting the flakies lie in branches that are being bypassed.
> > >>
> > >> Generally, I've been working on unit tests with inspiration and help from
> > >> Mark Miller and Nick. Our tests are in a poor state. They take so long
> > >> that they don't get run anywhere other than up on jenkins. They rarely pass,
> > >> and only then by accident, when there is minimal parallelism and jitter. On
> > >> multi-core machines, they use 1 to 2 cores only -- even if the machine has
> > >> tens of them.
> > >>
> > >> I have been trying to burn down the flakies, make the tests complete
> > >> successfully in less time with more parallelism, using all of the machine,
> > >> and make them pass both on jenkins and locally. Of late, I have been focused
> > >> on branch-2 since it is calming down getting ready for a 2.3.0RC0. Having
> > >> some success but it's a nasty job where it is hard to claim advances
> > >> because the flakies vary w/ the context in which the tests are run.
> > >> Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
> > >>
> > >> Shout if need more detail.
> > >> S
> > >>
> > >>
> > >> On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <palomino...@gmail.com>
> > >> wrote:
> > >>
> > >> > But why are branch-2.2 and branch-2.1 still fine?
> > >> >
> > >> > Sean Busbey <bus...@apache.org> wrote on Wed, Mar 4, 2020 at 9:24 AM:
> > >> >
> > >> > > I agree in principle that excluding 100s of UTs isn't good. But we don't
> > >> > > really have better options given the state of tests and testing hardware
> > >> > > currently available to us.
> > >> > >
> > >> > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > >> > >
> > >> > > > I think the problem is all UTs are failing randomly...
> > >> > > >
> > >> > > > And it is also not a good idea to exclude hundreds of UTs in pre commit?
> > >> > > >
> > >> > > > Sean Busbey <bus...@apache.org> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > >> > > >
> > >> > > > > Everything in the flake list should be skipped at precommit time. Is
> > >> > > > > that not happening?
> > >> > > > >
> > >> > > > > Are we keeping a shorter flake window so things are bouncing in and
> > >> > > > > out of the list?
> > >> > > > >
> > >> > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > I see that lots of 'flaky tests' related issues have been resolved
> > >> > > > > > recently, but it seems the situation is getting worse? For branch-2.2
> > >> > > > > > the flaky page is fine, but for master it is a total mess...
> > >> > > > > >
> > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > >> > > > > >
> > >> > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > >> > > > > >
> > >> > > > > > Lots of UTs are in trouble and it makes it really hard to pass the
> > >> > > > > > pre commit check, which means it is really hard to contribute to the
> > >> > > > > > project...
> > >> > > > > >
> > >> > > > > > We need to fix this soon...
