Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-19 Thread Maxime Ripard
On Mon, Sep 18, 2023 at 06:35:13PM -0300, Helen Koike wrote:
> > > > I don't quite see the same picture from your side though. For example,
> > > > my reading of what you've said is that flaky tests are utterly
> > > > unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> > > > With your concrete example (which is really helpful, so thanks), what
> > > > happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> > > > until it's perfect, or does MT8173 testing always fail because that
> > > > test does?
> > > 
> > > It's not clear to me why that test is even running in the first place?
> > > There's been some confusion on my side here about what we're going to
> > > test with this. You've mentioned Mesa and GPUs before, but that's a KMS
> > > test so there must be more to it.
> > > 
> > > Either way, it's a relevant test so I guess why not. It turns out that
> > > the test is indeed flaky, I guess we could add it to the flaky tests
> > > list.
> > > 
> > > BUT
> > > 
> > > I want to have every opportunity to fix whatever that failure is.
> > 
> > Agreed so far!
> > 
> > > So:
> > > 
> > >- Is the test broken? If so, we should report it to IGT dev and remove
> > >  it from the test suite.
> > >- If not, has that test failure been reported to the driver author?
> > >- If no answer/fix, we can add it to the flaky tests list, but do we
> > >  have some way to reproduce the test failure?
> > > 
> > > The last part is especially critical. Looking at the list itself, I have
> > > no idea what board, kernel version, configuration, or what the failure
> > > rate was. Assuming I spend some time looking at the infra to find the
> > > board and configuration, how many times do I have to run the tests to
> > > expect to reproduce the failure (and thus consider it fixed if it
> > > doesn't occur anymore).
> > > 
> > > Like, with that board and test, if my first 100 runs of the test work
> > > fine, is it reasonable for me to consider it fixed, or is it only
> > > supposed to happen once every 1000 runs?
> 
> I wonder if this should be an overall policy or just left for the
> maintainer to decide.
> 
> In any case these stress tests must be run from time to time to verify if
> flakes are still flakes. We could do it automatically but we need to
> evaluate how to do it properly so it doesn't overload the infra.

That would be a great thing to do, but we can also reasonably expect
that we will have other farms that may not run those tests on a regular
basis, and we will have some manual testing too so I think it's still
valuable.

> > > So, ideally, having some (mandatory) metadata in the test lists with a
> > > link to the bug report, the board (DT name?) it happened with, the
> > > version and configuration it was first seen with, and an approximation
> > > of the failure rate for every flaky test list.
> > > 
> > > I understand that it's probably difficult to get that after the fact on
> > > the tests that were already merged, but I'd really like to get that
> > > enforced for every new test going forward.
> > > 
> > > That should hopefully get us in a much better position to fix some of
> > > those test issues. And failing that, I can't see how that's
> > > sustainable.
> > 
> > OK yeah, and we're still agreed here. That is definitely the standard
> > we should be aiming for.  It is there for some - see
> > drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt, but should be
> > there for the rest, it's true. (The specific board/DT it was observed
> > on can be easily retconned because we only run on one specific board
> > type per driver, again to make things more predictable; we could go
> > back and retrospectively add those in a header comment?)
> > 
> > For flakes, it can be hard to pin them down, because, well, they're
> > flaky. Usually when we add things in Mesa (sorry to keep coming back
> > to Mesa - it's not to say that it's the objective best thing that
> > everything should follow, only that it's the thing we have the most
> > experience with that we know works well), we do a manual bisect and
> > try to pin the blame on a specific merge request which looks like the
> > most likely culprit. If nothing obvious jumps out, we just note when
> > it was first observed and provide some sample job logs. But yeah, it
> > should be more verbose.
> > 
> > FWIW, the reason it wasn't done here - not to say that it shouldn't
> > have been done better, but here we are - is that we just hammered a
> > load of test runs, vacuumed up the results with a script, and that's
> > what generated those files. Given the number of tests and devices, it
> > was hard to narrow each down individually, but yeah, it is something
> > which really wants further analysis and drilling into. It's a good
> > to-do, and I agree it should be the standard going forward.
> 
> Yes, during development I was just worried about getting a pipeline that
> would succeed most of the time (otherwise people 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-19 Thread Maxime Ripard
On Fri, Sep 15, 2023 at 04:08:42PM +0100, Daniel Stone wrote:
> > > I don't quite see the same picture from your side though. For example,
> > > my reading of what you've said is that flaky tests are utterly
> > > unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> > > With your concrete example (which is really helpful, so thanks), what
> > > happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> > > until it's perfect, or does MT8173 testing always fail because that
> > > test does?
> >
> > It's not clear to me why that test is even running in the first place?
> > There's been some confusion on my side here about what we're going to
> > test with this. You've mentioned Mesa and GPUs before, but that's a KMS
> > test so there must be more to it.
> >
> > Either way, it's a relevant test so I guess why not. It turns out that
> > the test is indeed flaky, I guess we could add it to the flaky tests
> > list.
> >
> > BUT
> >
> > I want to have every opportunity to fix whatever that failure is.
> 
> Agreed so far!
> 
> > So:
> >
> >   - Is the test broken? If so, we should report it to IGT dev and remove
> > it from the test suite.
> >   - If not, has that test failure been reported to the driver author?
> >   - If no answer/fix, we can add it to the flaky tests list, but do we
> > have some way to reproduce the test failure?
> >
> > The last part is especially critical. Looking at the list itself, I have
> > no idea what board, kernel version, configuration, or what the failure
> > rate was. Assuming I spend some time looking at the infra to find the
> > board and configuration, how many times do I have to run the tests to
> > expect to reproduce the failure (and thus consider it fixed if it
> > doesn't occur anymore).
> >
> > Like, with that board and test, if my first 100 runs of the test work
> > fine, is it reasonable for me to consider it fixed, or is it only
> > supposed to happen once every 1000 runs?
> >
> > So, ideally, having some (mandatory) metadata in the test lists with a
> > link to the bug report, the board (DT name?) it happened with, the
> > version and configuration it was first seen with, and an approximation
> > of the failure rate for every flaky test list.
> >
> > I understand that it's probably difficult to get that after the fact on
> > the tests that were already merged, but I'd really like to get that
> > enforced for every new test going forward.
> >
> > That should hopefully get us in a much better position to fix some of
> > those test issues. And failing that, I can't see how that's
> > sustainable.
> 
> OK yeah, and we're still agreed here. That is definitely the standard
> we should be aiming for.  It is there for some - see
> drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt, but should be
> there for the rest, it's true. (The specific board/DT it was observed
> on can be easily retconned because we only run on one specific board
> type per driver, again to make things more predictable; we could go
> back and retrospectively add those in a header comment?)

Yeah, metadata was meant to sound professional, but putting all that in
a somewhat parseable comment is good enough for me :)
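Something along these lines, for instance, would already be a huge step
up (the fields and values here are entirely made up, just to show the
kind of comment block I have in mind):

  # Bug: <link to the report, if any>
  # Board: <DT name it was seen on, e.g. mt8173-elm>
  # Kernel: <version and config it was first seen with>
  # First seen: <date>
  # Failure rate: <approximation, e.g. ~1 failure per 500 runs>
  kms_hdmi_inject@inject-4k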

> For flakes, it can be hard to pin them down, because, well, they're
> flaky. Usually when we add things in Mesa (sorry to keep coming back
> to Mesa - it's not to say that it's the objective best thing that
> everything should follow, only that it's the thing we have the most
> experience with that we know works well), we do a manual bisect and
> try to pin the blame on a specific merge request which looks like the
> most likely culprit. If nothing obvious jumps out, we just note when
> it was first observed and provide some sample job logs. But yeah, it
> should be more verbose.

Sure, I guess we should document that for flakes, it's not meant as a
precise indication. Still it provides some kind of "time anchor". If it
was first observed 10 years ago and I can't reproduce it, chances are
it's been fixed one way or the other since. If it was observed last
month, it's more likely something's up with my setup.

It's not definitive, but it provides some hints.

> FWIW, the reason it wasn't done here - not to say that it shouldn't
> have been done better, but here we are - is that we just hammered a
> load of test runs, vacuumed up the results with a script, and that's
> what generated those files. Given the number of tests and devices, it
> was hard to narrow each down individually, but yeah, it is something
> which really wants further analysis and drilling into. It's a good
> to-do, and I agree it should be the standard going forward.

Awesome :)

> > And Mesa does show what I'm talking about:
> >
> > $ find -name *-flakes.txt | xargs git diff --stat  
> > e58a10af640ba58b6001f5c5ad750b782547da76
> > [...]
> >
> > In the history of Mesa, there's never been a single test removed from a
> > flaky test list.
> 
> As Rob says, that's definitely 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-18 Thread Helen Koike




On 15/09/2023 12:08, Daniel Stone wrote:

Hey,

On Thu, 14 Sept 2023 at 10:54, Maxime Ripard  wrote:

On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:

Hopefully less mangled formatting this time: turns out Thunderbird +
plain text is utterly unreadable, so that's one less MUA that is
actually usable to send email to kernel lists without getting shouted
at.


Sorry if it felt that way, it definitely wasn't my intention to shout at
you. Email is indeed kind of a pain to deal with, and I wanted to keep
the discussion going.


My bad - I didn't mean you _at all_. I was thinking of other, much
less pleasant, kernel maintainers, and the ongoing struggle of
attempting to actually send well-formatted email to kernel lists in
2023.


I don't quite see the same picture from your side though. For example,
my reading of what you've said is that flaky tests are utterly
unacceptable, as are partial runs, and we shouldn't pretend otherwise.
With your concrete example (which is really helpful, so thanks), what
happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
until it's perfect, or does MT8173 testing always fail because that
test does?


It's not clear to me why that test is even running in the first place?
There's been some confusion on my side here about what we're going to
test with this. You've mentioned Mesa and GPUs before, but that's a KMS
test so there must be more to it.

Either way, it's a relevant test so I guess why not. It turns out that
the test is indeed flaky, I guess we could add it to the flaky tests
list.

BUT

I want to have every opportunity to fix whatever that failure is.


Agreed so far!


So:

   - Is the test broken? If so, we should report it to IGT dev and remove
 it from the test suite.
   - If not, has that test failure been reported to the driver author?
   - If no answer/fix, we can add it to the flaky tests list, but do we
 have some way to reproduce the test failure?

The last part is especially critical. Looking at the list itself, I have
no idea what board, kernel version, configuration, or what the failure
rate was. Assuming I spend some time looking at the infra to find the
board and configuration, how many times do I have to run the tests to
expect to reproduce the failure (and thus consider it fixed if it
doesn't occur anymore).

Like, with that board and test, if my first 100 runs of the test work
fine, is it reasonable for me to consider it fixed, or is it only
supposed to happen once every 1000 runs?


I wonder if this should be an overall policy or just left for the maintainer
to decide.


In any case these stress tests must be run from time to time to verify 
if flakes are still flakes. We could do it automatically but we need to 
evaluate how to do it properly so it doesn't overload the infra.
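
Even something very simple could be a first step: a periodic job that
just re-runs whatever is currently in a flake list a few hundred times
and reports the entries that never failed, so we get candidates for
removal. Rough sketch only (the list path, the runner invocation and the
run count below are made up and would have to match the real setup):

  #!/usr/bin/env python3
  # Hypothetical periodic "are the flakes still flaky?" job (sketch).
  import subprocess

  FLAKE_LIST = "drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt"  # example path
  RUNS = 200  # arbitrary; whatever the infra can afford

  def run_once(test):
      # Placeholder for however the farm really invokes a single IGT test.
      r = subprocess.run(["igt_runner", "-t", test], capture_output=True)
      return r.returncode == 0

  with open(FLAKE_LIST) as f:
      tests = [l.strip() for l in f if l.strip() and not l.startswith("#")]

  never_failed = [t for t in tests if all(run_once(t) for _ in range(RUNS))]

  print("Flake entries that never failed in %d runs:" % RUNS)
  print("\n".join(never_failed) or "(none)")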




So, ideally, having some (mandatory) metadata in the test lists with a
link to the bug report, the board (DT name?) it happened with, the
version and configuration it was first seen with, and an approximation
of the failure rate for every flaky test list.

I understand that it's probably difficult to get that after the fact on
the tests that were already merged, but I'd really like to get that
enforced for every new test going forward.

That should hopefully get us in a much better position to fix some of
those test issues. And failing that, I can't see how that's
sustainable.


OK yeah, and we're still agreed here. That is definitely the standard
we should be aiming for.  It is there for some - see
drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt, but should be
there for the rest, it's true. (The specific board/DT it was observed
on can be easily retconned because we only run on one specific board
type per driver, again to make things more predictable; we could go
back and retrospectively add those in a header comment?)

For flakes, it can be hard to pin them down, because, well, they're
flaky. Usually when we add things in Mesa (sorry to keep coming back
to Mesa - it's not to say that it's the objective best thing that
everything should follow, only that it's the thing we have the most
experience with that we know works well), we do a manual bisect and
try to pin the blame on a specific merge request which looks like the
most likely culprit. If nothing obvious jumps out, we just note when
it was first observed and provide some sample job logs. But yeah, it
should be more verbose.

FWIW, the reason it wasn't done here - not to say that it shouldn't
have been done better, but here we are - is that we just hammered a
load of test runs, vacuumed up the results with a script, and that's
what generated those files. Given the number of tests and devices, it
was hard to narrow each down individually, but yeah, it is something
which really wants further analysis and drilling into. It's a good
to-do, and I agree it should be the standard going forward.


Yes, during development I was just worried about getting a pipeline that
would succeed 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-15 Thread Daniel Stone
Hey,

On Thu, 14 Sept 2023 at 10:54, Maxime Ripard  wrote:
> On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:
> > Hopefully less mangled formatting this time: turns out Thunderbird +
> > plain text is utterly unreadable, so that's one less MUA that is
> > actually usable to send email to kernel lists without getting shouted
> > at.
>
> Sorry if it felt that way, it definitely wasn't my intention to shout at
> you. Email is indeed kind of a pain to deal with, and I wanted to keep
> the discussion going.

My bad - I didn't mean you _at all_. I was thinking of other, much
less pleasant, kernel maintainers, and the ongoing struggle of
attempting to actually send well-formatted email to kernel lists in
2023.

> > I don't quite see the same picture from your side though. For example,
> > my reading of what you've said is that flaky tests are utterly
> > unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> > With your concrete example (which is really helpful, so thanks), what
> > happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> > until it's perfect, or does MT8173 testing always fail because that
> > test does?
>
> It's not clear to me why that test is even running in the first place?
> There's been some confusion on my side here about what we're going to
> test with this. You've mentioned Mesa and GPUs before, but that's a KMS
> test so there must be more to it.
>
> Either way, it's a relevant test so I guess why not. It turns out that
> the test is indeed flaky, I guess we could add it to the flaky tests
> list.
>
> BUT
>
> I want to have every opportunity to fix whatever that failure is.

Agreed so far!

> So:
>
>   - Is the test broken? If so, we should report it to IGT dev and remove
> it from the test suite.
>   - If not, has that test failure been reported to the driver author?
>   - If no answer/fix, we can add it to the flaky tests list, but do we
> have some way to reproduce the test failure?
>
> The last part is especially critical. Looking at the list itself, I have
> no idea what board, kernel version, configuration, or what the failure
> rate was. Assuming I spend some time looking at the infra to find the
> board and configuration, how many times do I have to run the tests to
> expect to reproduce the failure (and thus consider it fixed if it
> doesn't occur anymore).
>
> Like, with that board and test, if my first 100 runs of the test work
> fine, is it reasonable for me to consider it fixed, or is it only
> supposed to happen once every 1000 runs?
>
> So, ideally, having some (mandatory) metadata in the test lists with a
> link to the bug report, the board (DT name?) it happened with, the
> version and configuration it was first seen with, and an approximation
> of the failure rate for every flaky test list.
>
> I understand that it's probably difficult to get that after the fact on
> the tests that were already merged, but I'd really like to get that
> enforced for every new test going forward.
>
> That should hopefully get us in a much better position to fix some of
> those test issues. And failing that, I can't see how that's
> sustainable.

OK yeah, and we're still agreed here. That is definitely the standard
we should be aiming for.  It is there for some - see
drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt, but should be
there for the rest, it's true. (The specific board/DT it was observed
on can be easily retconned because we only run on one specific board
type per driver, again to make things more predictable; we could go
back and retrospectively add those in a header comment?)

For flakes, it can be hard to pin them down, because, well, they're
flaky. Usually when we add things in Mesa (sorry to keep coming back
to Mesa - it's not to say that it's the objective best thing that
everything should follow, only that it's the thing we have the most
experience with that we know works well), we do a manual bisect and
try to pin the blame on a specific merge request which looks like the
most likely culprit. If nothing obvious jumps out, we just note when
it was first observed and provide some sample job logs. But yeah, it
should be more verbose.

FWIW, the reason it wasn't done here - not to say that it shouldn't
have been done better, but here we are - is that we just hammered a
load of test runs, vacuumed up the results with a script, and that's
what generated those files. Given the number of tests and devices, it
was hard to narrow each down individually, but yeah, it is something
which really wants further analysis and drilling into. It's a good
to-do, and I agree it should be the standard going forward.

> And Mesa does show what I'm talking about:
>
> $ find -name *-flakes.txt | xargs git diff --stat  
> e58a10af640ba58b6001f5c5ad750b782547da76
> [...]
>
> In the history of Mesa, there's never been a single test removed from a
> flaky test list.

As Rob says, that's definitely wrong. But there is a good point in

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-14 Thread Rob Clark
On Thu, Sep 14, 2023 at 2:54 AM Maxime Ripard  wrote:
>
> Hi,
>
> On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:
> > Hopefully less mangled formatting this time: turns out Thunderbird +
> > plain text is utterly unreadable, so that's one less MUA that is
> > actually usable to send email to kernel lists without getting shouted
> > at.
>
> Sorry if it felt that way, it definitely wasn't my intention to shout at
> you. Email is indeed kind of a pain to deal with, and I wanted to keep
> the discussion going.
>
> > On Mon, 11 Sept 2023 at 15:46, Maxime Ripard  wrote:
> > > On Mon, Sep 11, 2023 at 03:30:55PM +0200, Michel Dänzer wrote:
> > > > > There's in 6.6-rc1 around 240 reported flaky tests. None of them have
> > > > > any context. That new series adds a few dozen too, without any
> > > > > context
> > > > > either. And there's no mention about that being a plan, or a patch
> > > > > adding a new policy for all tests going forward.
> > > >
> > > > That does sound bad, would need to be raised in review.
> > > >
> > > > > Any concerns I raised were met with a giant "it worked on Mesa"
> > > > > handwave
> > > >
> > > > Lessons learned from years of experience with big real-world CI
> > > > systems like this are hardly "handwaving".
> > >
> > > Your (and others) experience certainly isn't. It is valuable, welcome,
> > > and very much appreciated.
> > >
> > > However, my questions and concerns being ignored time and time again
> > > about things like what the process is going to be like, what is going
> > > to be tested, who is going to be maintaining that test list, how that
> > > interacts with stable, how we can possibly audit the flaky tests list,
> > > etc. have felt like they were being handwaived away.
> >
> > Sorry it ended up coming across like that. It wasn't the intent.
> >
> > > I'm not saying that because I disagree, I still do on some, but that's
> > > fine to some extent. However, most of these issues are not so much an
> > > infrastructure issue, but a community issue. And I don't even expect a
> > > perfect solution right now, unlike what you seem to think. But I do
> > > expect some kind of plan instead of just ignoring that problem.
> > >
> > > Like, I had to ask the MT8173 question 3 times in order to get an
> > > answer, and I'm still not sure what is going to be done to address that
> > > particular issue.
> > >
> > > So, I'm sorry, but I certainly feel like it here.
> >
> > I don't quite see the same picture from your side though. For example,
> > my reading of what you've said is that flaky tests are utterly
> > unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> > With your concrete example (which is really helpful, so thanks), what
> > happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> > until it's perfect, or does MT8173 testing always fail because that
> > test does?
>
> It's not clear to me why that test is even running in the first place?
> There's been some confusion on my side here about what we're going to
> test with this. You've mentioned Mesa and GPUs before, but that's a KMS
> test so there must be more to it.
>
> Either way, it's a relevant test so I guess why not. It turns out that
> the test is indeed flaky, I guess we could add it to the flaky tests
> list.
>
> BUT
>
> I want to have every opportunity to fix whatever that failure is.
>
> So:
>
>   - Is the test broken? If so, we should report it to IGT dev and remove
> it from the test suite.
>   - If not, has that test failure been reported to the driver author?
>   - If no answer/fix, we can add it to the flaky tests list, but do we
> have some way to reproduce the test failure?
>
> The last part is especially critical. Looking at the list itself, I have
> no idea what board, kernel version, configuration, or what the failure
> rate was. Assuming I spend some time looking at the infra to find the
> board and configuration, how many times do I have to run the tests to
> expect to reproduce the failure (and thus consider it fixed if it
> doesn't occur anymore).

Are you looking for a link to the pipeline, where you can get job
output from the various stages (which should also include igt output)?

Perhaps someone can point you to a pipeline with mtk (I don't have
that in the drm/msm tree yet)

>
> Like, with that board and test, if my first 100 runs of the test work
> fine, is it reasonable for me to consider it fixed, or is it only
> supposed to happen once every 1000 runs?
>
> So, ideally, having some (mandatory) metadata in the test lists with a
> link to the bug report, the board (DT name?) it happened with, the
> version and configuration it was first seen with, and an approximation
> of the failure rate for every flaky test list.
>
> I understand that it's probably difficult to get that after the fact on
> the tests that were already merged, but I'd really like to get that
> enforced for every new test going forward.
>
> That should 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-14 Thread Maxime Ripard
Hi,

On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:
> Hopefully less mangled formatting this time: turns out Thunderbird +
> plain text is utterly unreadable, so that's one less MUA that is
> actually usable to send email to kernel lists without getting shouted
> at.

Sorry if it felt that way, it definitely wasn't my intention to shout at
you. Email is indeed kind of a pain to deal with, and I wanted to keep
the discussion going.

> On Mon, 11 Sept 2023 at 15:46, Maxime Ripard  wrote:
> > On Mon, Sep 11, 2023 at 03:30:55PM +0200, Michel Dänzer wrote:
> > > > There's in 6.6-rc1 around 240 reported flaky tests. None of them have
> > > > any context. That new series adds a few dozen too, without any context
> > > > either. And there's no mention about that being a plan, or a patch
> > > > adding a new policy for all tests going forward.
> > >
> > > That does sound bad, would need to be raised in review.
> > >
> > > > Any concerns I raised were met with a giant "it worked on Mesa" handwave
> > >
> > > Lessons learned from years of experience with big real-world CI
> > > systems like this are hardly "handwaving".
> >
> > Your (and others) experience certainly isn't. It is valuable, welcome,
> > and very much appreciated.
> >
> > However, my questions and concerns being ignored time and time again
> > about things like what the process is going to be like, what is going
> > to be tested, who is going to be maintaining that test list, how that
> > interacts with stable, how we can possibly audit the flaky tests list,
> > etc. have felt like they were being handwaived away.
> 
> Sorry it ended up coming across like that. It wasn't the intent.
> 
> > I'm not saying that because I disagree, I still do on some, but that's
> > fine to some extent. However, most of these issues are not so much an
> > infrastructure issue, but a community issue. And I don't even expect a
> > perfect solution right now, unlike what you seem to think. But I do
> > expect some kind of plan instead of just ignoring that problem.
> >
> > Like, I had to ask the MT8173 question 3 times in order to get an
> > answer, and I'm still not sure what is going to be done to address that
> > particular issue.
> >
> > So, I'm sorry, but I certainly feel like it here.
> 
> I don't quite see the same picture from your side though. For example,
> my reading of what you've said is that flaky tests are utterly
> unacceptable, as are partial runs, and we shouldn't pretend otherwise.
> With your concrete example (which is really helpful, so thanks), what
> happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
> until it's perfect, or does MT8173 testing always fail because that
> test does?

It's not clear to me why that test is even running in the first place?
There's been some confusion on my side here about what we're going to
test with this. You've mentioned Mesa and GPUs before, but that's a KMS
test so there must be more to it.

Either way, it's a relevant test so I guess why not. It turns out that
the test is indeed flaky, I guess we could add it to the flaky tests
list.

BUT

I want to have every opportunity to fix whatever that failure is.

So:

  - Is the test broken? If so, we should report it to IGT dev and remove
it from the test suite.
  - If not, has that test failure been reported to the driver author?
  - If no answer/fix, we can add it to the flaky tests list, but do we
have some way to reproduce the test failure?

The last part is especially critical. Looking at the list itself, I have
no idea what board, kernel version, configuration, or what the failure
rate was. Assuming I spend some time looking at the infra to find the
board and configuration, how many times do I have to run the tests to
expect to reproduce the failure (and thus consider it fixed if it
doesn't occur anymore).

Like, with that board and test, if my first 100 runs of the test work
fine, is it reasonable for me to consider it fixed, or is it only
supposed to happen once every 1000 runs?
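
For the record, the statistics are brutal here: if a flake really only
shows up about once every N runs, you need roughly 3*N clean runs before
you can be ~95% confident it's actually gone. Which is exactly why I'd
like the approximate failure rate written down. Quick back-of-the-envelope
sketch (assuming independent runs, which is itself optimistic):

  import math

  def clean_runs_needed(failure_rate, confidence=0.95):
      # Smallest n such that (1 - failure_rate)**n <= 1 - confidence,
      # i.e. enough clean runs that a still-present flake would very
      # likely have shown up at least once.
      return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

  print(clean_runs_needed(1 / 100))   # ~299 clean runs for a 1-in-100 flake
  print(clean_runs_needed(1 / 1000))  # ~2995 clean runs for a 1-in-1000 flake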

So, ideally, having some (mandatory) metadata in the test lists with a
link to the bug report, the board (DT name?) it happened with, the
version and configuration it was first seen with, and an approximation
of the failure rate for every flaky test list.

I understand that it's probably difficult to get that after the fact on
the tests that were already merged, but I'd really like to get that
enforced for every new test going forward.

That should hopefully get us in a much better position to fix some of
those test issues. And failing that, I can't see how that's
sustainable.

And Mesa does show what I'm talking about:

$ find -name *-flakes.txt | xargs git diff --stat  
e58a10af640ba58b6001f5c5ad750b782547da76
 src/amd/ci/deqp-radv-stoney-aco-flakes.txt|  18 +
 src/broadcom/ci/deqp-v3d-rpi4-flakes.txt  |   2 +
 src/broadcom/ci/deqp-v3dv-rpi4-flakes.txt |  15 
 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-12 Thread Daniel Stone
Hi Maxime,
Hopefully less mangled formatting this time: turns out Thunderbird +
plain text is utterly unreadable, so that's one less MUA that is
actually usable to send email to kernel lists without getting shouted
at.

On Mon, 11 Sept 2023 at 15:46, Maxime Ripard  wrote:
> On Mon, Sep 11, 2023 at 03:30:55PM +0200, Michel Dänzer wrote:
> > > There's in 6.6-rc1 around 240 reported flaky tests. None of them have
> > > any context. That new series adds a few dozen too, without any context
> > > either. And there's no mention about that being a plan, or a patch
> > > adding a new policy for all tests going forward.
> >
> > That does sound bad, would need to be raised in review.
> >
> > > Any concerns I raised were met with a giant "it worked on Mesa" handwave
> >
> > Lessons learned from years of experience with big real-world CI
> > systems like this are hardly "handwaving".
>
> Your (and others) experience certainly isn't. It is valuable, welcome,
> and very much appreciated.
>
> However, my questions and concerns being ignored time and time again
> about things like what the process is going to be like, what is going
> to be tested, who is going to be maintaining that test list, how that
> interacts with stable, how we can possibly audit the flaky tests list,
> etc. have felt like they were being handwaived away.

Sorry it ended up coming across like that. It wasn't the intent.

> I'm not saying that because I disagree, I still do on some, but that's
> fine to some extent. However, most of these issues are not so much an
> infrastructure issue, but a community issue. And I don't even expect a
> perfect solution right now, unlike what you seem to think. But I do
> expect some kind of plan instead of just ignoring that problem.
>
> Like, I had to ask the MT8173 question 3 times in order to get an
> answer, and I'm still not sure what is going to be done to address that
> particular issue.
>
> So, I'm sorry, but I certainly feel like it here.

I don't quite see the same picture from your side though. For example,
my reading of what you've said is that flaky tests are utterly
unacceptable, as are partial runs, and we shouldn't pretend otherwise.
With your concrete example (which is really helpful, so thanks), what
happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
until it's perfect, or does MT8173 testing always fail because that
test does?

Both have their downsides. Not doing any testing has the obvious
downside, and means that the driver can get worse until it gets
perfect. Always marking the test as failed means that the test results
are useless: if failure is expected, then red is good. I mean, say
you're contributing a patch to fix some documentation or add a helper
to common code which only v3d uses. The test results come back, and
your branch is failing tests on MT8173, specifically the
hdmi-inject@4k test. What then? Either as a senior contributor you
'know' that's the case, or as a casual contributor you get told 'oh
yeah, don't worry about the test results, they always fail'. Both lead
to the same outcome, which is that no-one pays any attention to the
results, and they get worse.

What we do agree on is that yes, those tests should absolutely be
fixed, and not just swept under the rug. Part of this is having
maintainers actually meaningfully own their test results. For example,
I'm looking at the expectation lists for the Intel gen in my laptop,
and I'm seeing a lot of breakage in blending tests, as well as
dual-display fails which include the resolution of my external
display. I'd expect the Intel driver maintainers to look at them, get
them fixed, and gradually prune those xfails/flakes down towards zero.

If the maintainers don't own it though, then it's not going to get
fixed. And we are exactly where we are today: broken plane blending
and 1440p on KBL, broken EDID injection on MT8173, and broken atomic
commits on stoney. Without stronger action from the maintainers (e.g.
throwing i915 out of the tree until it has 100% pass 100% of the
time), adding testing isn't making the situation better or worse in
and of itself. What it _is_ doing though, is giving really clear
documentation of the status of each driver, and backing that up by
verifying it.

Only maintainers can actually fix the drivers (or the tests tbf). But
doing the testing does let us be really clear to everyone what the
actual state is, and that way people can make informed decisions too.
And the only way we're going to drive the test rate down is by the
subsystem maintainers enforcing it.

Does that make sense on where I'm (and I think a lot of others are) coming from?

To answer the other question about 'where are the logs?': some of them
have the failure data in them, others don't. They all should going
forward at least though.

Cheers,
Daniel


Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-11 Thread Maxime Ripard
Replying one more time, because I certainly don't want to create any
hard feeling here.

On Mon, Sep 11, 2023 at 03:30:55PM +0200, Michel Dänzer wrote:
>  By keeping those sets of expectations, we've been able to keep Mesa 
>  pretty
>  clear of regressions, whilst having a very clear set of things that 
>  should
>  be fixed to point to. It would be great if those set of things were zero,
>  but it just isn't. Having that is far better than the two alternatives:
>  either not testing at all (obviously bad), or having the test always be 
>  red
>  so it's always ignored (might as well just not test).
> >>>
> >>> Isn't that what happens with flaky tests anyway?
> >>
> >> For a small minority of tests. Daniel was referring to whole test suites.
> >>
> >>> Even more so since we have 0 context when updating that list.
> >>
> >> The commit log can provide whatever context is needed.
> > 
> > Sure, I've yet to see that though.
> > 
> > There's in 6.6-rc1 around 240 reported flaky tests. None of them have
> > any context. That new series adds a few dozen too, without any context
> > either. And there's no mention about that being a plan, or a patch
> > adding a new policy for all tests going forward.
> 
> That does sound bad, would need to be raised in review.
>
> > Any concerns I raised were met with a giant "it worked on Mesa" handwave
> 
> Lessons learned from years of experience with big real-world CI
> systems like this are hardly "handwaving".

Your (and others) experience certainly isn't. It is valuable, welcome,
and very much appreciated.

However, my questions and concerns being ignored time and time again
about things like what the process is going to be like, what is going
to be tested, who is going to be maintaining that test list, how that
interacts with stable, how we can possibly audit the flaky tests list,
etc. have felt like they were being handwaived away.

I'm not saying that because I disagree, I still do on some, but that's
fine to some extent. However, most of these issues are not so much an
infrastructure issue, but a community issue. And I don't even expect a
perfect solution right now, unlike what you seem to think. But I do
expect some kind of plan instead of just ignoring that problem.

Like, I had to ask the MT8173 question 3 times in order to get an
answer, and I'm still not sure what is going to be done to address that
particular issue.

So, I'm sorry, but I certainly feel like it here.

Maxime




Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-11 Thread Michel Dänzer
On 9/11/23 14:51, Maxime Ripard wrote:
> On Mon, Sep 11, 2023 at 02:13:43PM +0200, Michel Dänzer wrote:
>> On 9/11/23 11:34, Maxime Ripard wrote:
>>> On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:

 Secondly, we will never be there. If we could pause for five years and sit
 down making all the current usecases for all the current hardware on the
 current kernel run perfectly, we'd probably get there. But we can't: 
 there's
 new hardware, new userspace, and hundreds of new kernel trees.
>>>
>>> [...]
>>> 
>>> I'm not sure it's actually an argument, really. 10 years ago, we would
>>> never have been at "every GPU on the market has an open-source driver"
>>> here. 5 years ago, we would never have been at this-series-here. That
>>> didn't stop anyone making progress, everyone involved in that thread
>>> included.
>>
>> Even assuming perfection is achievable at all (which is very doubtful,
>> given the experience from the last few years of CI in Mesa and other
>> projects), if you demand perfection before even taking the first step,
>> it will never get off the ground.
> 
> Perfection and scale from the get-go isn't reasonable, yes. Building a
> small, "perfect" (your words, not mine) system that you can later expand
> is doable.

I mean "perfect" as in every single available test runs, is reliable and gates 
CI. Which seems to be what you're asking for. The only possible expansion of 
such a system would be adding new 100% reliable tests.

What is being proposed here is an "imperfect" system which takes into account 
the reality that some tests are not 100% reliable, and can be improved 
gradually while already preventing some regressions from getting merged.


>>> How are we even supposed to detect those failures in the first
>>> place if tests are flagged as unreliable?
>>
>> Based on experience with Mesa, only a relatively small minority of
>> tests should need to be marked as flaky / not run at all. The majority
>> of tests are reliable and can catch regressions even while some tests
>> are not yet.
> 
> I understand and acknowledge that it worked with Mesa. That's great for
> Mesa. That still doesn't mean that it's the panacea and is for every
> project.

Not sure what you're referring to by panacea, or how it relates to "some tests 
can be useful even while others aren't yet".


>>> No matter what we do here, what you describe will always happen. Like,
>>> if we do flag those tests as unreliable, what exactly prevents another
>>> issue to come on top undetected, and what will happen when we re-enable
>>> testing?
>>
>> Any issues affecting a test will need to be fixed before (re-)enabling
>> the test for CI.
> 
> If that underlying issue is never fixed, at which point do we consider
> that it's a failure and should never be re-enabled? Who has that role?

Not sure what you're asking. Anybody can (re-)enable a test in CI, they just 
need to make sure first that it is reliable. Until somebody does that work, 
it'll stay disabled in CI.


>>> It might or might not be an issue for Linus' release, but I can
>>> definitely see the trouble already for stable releases where fixes will
>>> be backported, but the test state list certainly won't be updated.
>>
>> If the stable branch maintainers want to take advantage of CI for the
>> stable branches, they may need to hunt for corresponding state list
>> commits sometimes. They'll need to take that into account for their
>> decision.
> 
> So we just expect the stable maintainers to track each and every patch
> involved in a test run, make sure that they are in a stable tree, and
> then update the test list? Without having consulted them at all?

I don't expect them to do anything. See the If at the start of what I wrote.


 By keeping those sets of expectations, we've been able to keep Mesa pretty
 clear of regressions, whilst having a very clear set of things that should
 be fixed to point to. It would be great if those set of things were zero,
 but it just isn't. Having that is far better than the two alternatives:
 either not testing at all (obviously bad), or having the test always be red
 so it's always ignored (might as well just not test).
>>>
>>> Isn't that what happens with flaky tests anyway?
>>
>> For a small minority of tests. Daniel was referring to whole test suites.
>>
>>> Even more so since we have 0 context when updating that list.
>>
>> The commit log can provide whatever context is needed.
> 
> Sure, I've yet to see that though.
> 
> There's in 6.6-rc1 around 240 reported flaky tests. None of them have
> any context. That new series adds a few dozen too, without any context
> either. And there's no mention about that being a plan, or a patch
> adding a new policy for all tests going forward.

That does sound bad, would need to be raised in review.


> Any concerns I raised were met with a giant "it worked on Mesa" handwave

Lessons learned from years of experience with big 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-11 Thread Maxime Ripard
On Mon, Sep 11, 2023 at 02:13:43PM +0200, Michel Dänzer wrote:
> On 9/11/23 11:34, Maxime Ripard wrote:
> > On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
> >> Yeah, this is what our experience with Mesa (in particular) has taught us.
> >>
> >> Having 100% of the tests pass 100% of the time on 100% of the platforms is 
> >> a
> >> great goal that everyone should aim for. But it will also never happen.
> >>
> >> Firstly, we're just not there yet today. Every single GPU-side DRM driver
> >> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
> >> tests. Every single one. We deal with these in Mesa by retrying; if we
> >> didn't retry, across the breadth of hardware we test, I'd expect 99% of
> >> should-succeed merges to fail because of these intermittent bugs in the DRM
> >> drivers.
> > 
> > So the plan is only to ever test rendering devices? It should have been
> > made clearer then.
> > 
> >> We don't have the same figure for KMS - because we don't test it - but
> >> I'd be willing to bet no driver is 100% if you run tests often enough.
> > 
> > And I would still consider that a bug that we ought to fix, and
> > certainly not something we should sweep under the rug. If half the tests
> > are not running on a driver, then fine, they aren't. I'm not really
> > against having failing tests, I'm against not flagging unreliable tests
> > on a given hardware as failing tests.
> 
> A flaky test will by definition give a "pass" result at least some of the 
> time, which would be considered a failure by the CI if the test is marked as 
> failing.
> 
> 
> >> Secondly, we will never be there. If we could pause for five years and sit
> >> down making all the current usecases for all the current hardware on the
> >> current kernel run perfectly, we'd probably get there. But we can't: 
> >> there's
> >> new hardware, new userspace, and hundreds of new kernel trees.
> > 
> > Not with that attitude :)
> 
> Attitude is not the issue, the complexity of the multiple systems
> involved is.

FTR, that was a meme/joke.

> > I'm not sure it's actually an argument, really. 10 years ago, we would
> > never have been at "every GPU on the market has an open-source driver"
> > here. 5 years ago, we would never have been at this-series-here. That
> > didn't stop anyone making progress, everyone involved in that thread
> > included.
> 
> Even assuming perfection is achievable at all (which is very doubtful,
> given the experience from the last few years of CI in Mesa and other
> projects), if you demand perfection before even taking the first step,
> it will never get off the ground.

Perfection and scale from the get-go isn't reasonable, yes. Building a
small, "perfect" (your words, not mine) system that you can later expand
is doable. And that's very much a design choice.

> > How are we even supposed to detect those failures in the first
> > place if tests are flagged as unreliable?
> 
> Based on experience with Mesa, only a relatively small minority of
> tests should need to be marked as flaky / not run at all. The majority
> of tests are reliable and can catch regressions even while some tests
> are not yet.

I understand and acknowledge that it worked with Mesa. That's great for
Mesa. That still doesn't mean that it's the panacea and is for every
project.

> > No matter what we do here, what you describe will always happen. Like,
> > if we do flag those tests as unreliable, what exactly prevents another
> > issue to come on top undetected, and what will happen when we re-enable
> > testing?
> 
> Any issues affecting a test will need to be fixed before (re-)enabling
> the test for CI.

If that underlying issue is never fixed, at which point do we consider
that it's a failure and should never be re-enabled? Who has that role?

> > On top of that, you kind of hinted at that yourself, but what set of
> > tests will pass is a property linked to a single commit. Having that
> > list within the kernel already alters that: you'll need to merge a new
> > branch, add a bunch of fixes and then change the test list state. You
> > won't have the same tree you originally tested (and defined the test
> > state list for).
> 
> Ideally, the test state lists should be changed in the same commits
> which affect the test results. It'll probably take a while yet to get
> there for the kernel.
> 
> > It might or might not be an issue for Linus' release, but I can
> > definitely see the trouble already for stable releases where fixes will
> > be backported, but the test state list certainly won't be updated.
> 
> If the stable branch maintainers want to take advantage of CI for the
> stable branches, they may need to hunt for corresponding state list
> commits sometimes. They'll need to take that into account for their
> decision.

So we just expect the stable maintainers to track each and every patch
involved in a test run, make sure that they are in a stable tree, and
then update the test list? Without 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-11 Thread Michel Dänzer
On 9/11/23 11:34, Maxime Ripard wrote:
> On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
>> Yeah, this is what our experience with Mesa (in particular) has taught us.
>>
>> Having 100% of the tests pass 100% of the time on 100% of the platforms is a
>> great goal that everyone should aim for. But it will also never happen.
>>
>> Firstly, we're just not there yet today. Every single GPU-side DRM driver
>> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
>> tests. Every single one. We deal with these in Mesa by retrying; if we
>> didn't retry, across the breadth of hardware we test, I'd expect 99% of
>> should-succeed merges to fail because of these intermittent bugs in the DRM
>> drivers.
> 
> So the plan is only to ever test rendering devices? It should have been
> made clearer then.
> 
>> We don't have the same figure for KMS - because we don't test it - but
>> I'd be willing to bet no driver is 100% if you run tests often enough.
> 
> And I would still consider that a bug that we ought to fix, and
> certainly not something we should sweep under the rug. If half the tests
> are not running on a driver, then fine, they aren't. I'm not really
> against having failing tests, I'm against not flagging unreliable tests
> on a given hardware as failing tests.

A flaky test will by definition give a "pass" result at least some of the time, 
which would be considered a failure by the CI if the test is marked as failing.


>> Secondly, we will never be there. If we could pause for five years and sit
>> down making all the current usecases for all the current hardware on the
>> current kernel run perfectly, we'd probably get there. But we can't: there's
>> new hardware, new userspace, and hundreds of new kernel trees.
> 
> Not with that attitude :)

Attitude is not the issue, the complexity of the multiple systems involved is.


> I'm not sure it's actually an argument, really. 10 years ago, we would
> never have been at "every GPU on the market has an open-source driver"
> here. 5 years ago, we would never have been at this-series-here. That
> didn't stop anyone making progress, everyone involved in that thread
> included.

Even assuming perfection is achievable at all (which is very doubtful, given 
the experience from the last few years of CI in Mesa and other projects), if 
you demand perfection before even taking the first step, it will never get off 
the ground.


> How are we even supposed to detect those failures in the first
> place if tests are flagged as unreliable?

Based on experience with Mesa, only a relatively small minority of tests should 
need to be marked as flaky / not run at all. The majority of tests are reliable 
and can catch regressions even while some tests are not yet.


> No matter what we do here, what you describe will always happen. Like,
> if we do flag those tests as unreliable, what exactly prevents another
> issue to come on top undetected, and what will happen when we re-enable
> testing?

Any issues affecting a test will need to be fixed before (re-)enabling the test 
for CI.


> On top of that, you kind of hinted at that yourself, but what set of
> tests will pass is a property linked to a single commit. Having that
> list within the kernel already alters that: you'll need to merge a new
> branch, add a bunch of fixes and then change the test list state. You
> won't have the same tree you originally tested (and defined the test
> state list for).

Ideally, the test state lists should be changed in the same commits which 
affect the test results. It'll probably take a while yet to get there for the 
kernel.

> It might or might not be an issue for Linus' release, but I can
> definitely see the trouble already for stable releases where fixes will
> be backported, but the test state list certainly won't be updated.

If the stable branch maintainers want to take advantage of CI for the stable 
branches, they may need to hunt for corresponding state list commits sometimes. 
They'll need to take that into account for their decision.


>> By keeping those sets of expectations, we've been able to keep Mesa pretty
>> clear of regressions, whilst having a very clear set of things that should
>> be fixed to point to. It would be great if those set of things were zero,
>> but it just isn't. Having that is far better than the two alternatives:
>> either not testing at all (obviously bad), or having the test always be red
>> so it's always ignored (might as well just not test).
> 
> Isn't that what happens with flaky tests anyway?

For a small minority of tests. Daniel was referring to whole test suites.

> Even more so since we have 0 context when updating that list.

The commit log can provide whatever context is needed.


> I've asked a couple of times, I'll ask again. In that other series, on
> the MT8173, kms_hdmi_inject@inject-4k is marked as flaky (which is a KMS
> test btw).
> 
> I'm a maintainer for that part of the kernel, I'd like to look 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-11 Thread Maxime Ripard
Hi

(Removing most of the context that got scrambled)

On Thu, Sep 07, 2023 at 01:40:02PM +0200, Daniel Stone wrote:
> Yeah, this is what our experience with Mesa (in particular) has taught us.
> 
> Having 100% of the tests pass 100% of the time on 100% of the platforms is a
> great goal that everyone should aim for. But it will also never happen.
> 
> Firstly, we're just not there yet today. Every single GPU-side DRM driver
> has userspace-triggerable faults which cause occasional errors in GL/Vulkan
> tests. Every single one. We deal with these in Mesa by retrying; if we
> didn't retry, across the breadth of hardware we test, I'd expect 99% of
> should-succeed merges to fail because of these intermittent bugs in the DRM
> drivers.

So the plan is only to ever test rendering devices? It should have been
made clearer then.

> We don't have the same figure for KMS - because we don't test it - but
> I'd be willing to bet no driver is 100% if you run tests often enough.

And I would still consider that a bug that we ought to fix, and
certainly not something we should sweep under the rug. If half the tests
are not running on a driver, then fine, they aren't. I'm not really
against having failing tests, I'm against not flagging unreliable tests
on a given hardware as failing tests.

> Secondly, we will never be there. If we could pause for five years and sit
> down making all the current usecases for all the current hardware on the
> current kernel run perfectly, we'd probably get there. But we can't: there's
> new hardware, new userspace, and hundreds of new kernel trees.

Not with that attitude :)

I'm not sure it's actually an argument, really. 10 years ago, we would
never have been at "every GPU on the market has an open-source driver"
here. 5 years ago, we would never have been at this-series-here. That
didn't stop anyone making progress, everyone involved in that thread
included.

> Even without the first two, what happens when the Arm SMMU maintainers
> (choosing a random target to pick on, sorry Robin) introduce subtle
> breakage which makes a lot of tests fail some of the time? Do we
> refuse to backmerge Linus into DRM until it's fixed, or do we disable
> all testing on Arm until it's fixed? When we've done that, what
> happens when we re-enable testing, and discover that a bunch of tests
> get broken because we haven't been testing?

I guess that's another thing that needs to be made clearer then. Do you
want to test Mesa, or the kernel?

For Mesa, I'd very much expect to rely on a stable kernel, and for the
kernel on a stable Mesa.

And if we're testing the kernel, then let's turn it the other way
around. How are we even supposed to detect those failures in the first
place if tests are flagged as unreliable?

No matter what we do here, what you describe will always happen. Like,
if we do flag those tests as unreliable, what exactly prevents another
issue to come on top undetected, and what will happen when we re-enable
testing?

On top of that, you kind of hinted at that yourself, but what set of
tests will pass is a property linked to a single commit. Having that
list within the kernel already alters that: you'll need to merge a new
branch, add a bunch of fixes and then change the test list state. You
won't have the same tree you originally tested (and defined the test
state list for).

It might or might not be an issue for Linus' release, but I can
definitely see the trouble already for stable releases where fixes will
be backported, but the test state list certainly won't be updated.

> Thirdly, hardware is capricious. 'This board doesn't make it to u-boot' is a
> clear infrastructure error, but if you test at sufficient scale, cold solder
> or failing caps surface way more often than you might think. And you can't
> really pick those out by any other means than running at scale, dealing with
> non-binary results, and looking at the trends over time. (Again this is
> something we do in Mesa - we graph test failures per DUT, look for outliers,
> and pull DUTs out of the rotation when they're clearly defective. But that
> only works if you actually run enough tests on them in the first place to
> discover trends - if you stop at the first failed test, it's impossible to
> tell the difference between 'infuriatingly infrequent kernel/test bug?' and
> 'cracked main board maybe?'.)
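
(For illustration, the per-DUT trend-watching described above could be
as simple as the sketch below. This is a rough approximation, not the
actual Mesa CI code; the function name, thresholds and result format
are made up.)

from collections import defaultdict
from statistics import median

def flag_bad_duts(results, min_runs=200, factor=5.0, floor=0.01):
    """results: iterable of (dut_id, passed) tuples across many pipelines."""
    runs, fails = defaultdict(int), defaultdict(int)
    for dut, passed in results:
        runs[dut] += 1
        if not passed:
            fails[dut] += 1
    rates = {d: fails[d] / runs[d] for d in runs if runs[d] >= min_runs}
    if not rates:
        return []
    typical = median(rates.values())
    # A board failing far more often than its peers is probably cracked
    # hardware or cold solder, not a kernel regression.
    return sorted(d for d, r in rates.items() if r > max(typical * factor, floor))

The point being that a defective board only shows up as an outlier
against its peers, which requires aggregating per-DUT results over time
rather than stopping at the first failure.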
> 
> What we do know is that we _can_ classify tests four ways in expectations.
> Always-passing tests should always pass. Always-failing tests should always
> fail (and update the expectations if you make them pass). Flaking tests work
> often enough that they'll always pass if you run them a couple/few times,
> but fail often enough that you can't rely on them. Then you just skip tests
> which exhibit catastrophic failure i.e. local DoS which affects the whole
> test suite.
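
(To make those four buckets concrete: a results post-processor might
resolve a verdict per test roughly like the sketch below. The bucket
names and exact semantics are illustrative, not the actual drm/ci or
Mesa behaviour.)

def verdict(test, outcomes, always_fail, flaky, skip):
    """outcomes: pass/fail booleans from up to N attempts of a single test."""
    if test in skip:
        return "skipped"                  # catastrophic, e.g. DoSes the suite
    if test in always_fail:
        # An expected failure that starts passing should flip the expectation.
        return "unexpected-pass" if any(outcomes) else "expected-fail"
    if all(outcomes):
        return "pass"
    if any(outcomes):
        return "known-flake" if test in flaky else "new-flake"
    return "fail"                         # should have passed: a regression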
> 
> By keeping those sets of expectations, we've been able to keep Mesa pretty
> clear of regressions, whilst having a very clear set of things that 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-07 Thread Daniel Stone

Hi,

On 04/09/2023 09:54, Daniel Vetter wrote:
> On Wed, 30 Aug 2023 at 17:14, Helen Koike  wrote:
>>
>> On 30/08/2023 11:57, Maxime Ripard wrote:
>>>
>>> I agree that we need a baseline, but that baseline should be
>>> defined by the tests' own merits, not their outcome on a
>>> particular platform.
>>>
>>> In other words, I want all drivers to follow that baseline, and
>>> if they don't it's a bug we should fix, and we should be vocal
>>> about it. We shouldn't ignore the test because it's broken.
>>>
>>> Going back to the example I used previously,
>>> kms_hdmi_inject@inject-4k shouldn't fail on mt8173, ever. That's
>>> a bug. Ignoring it and reporting that "all tests are good" isn't
>>> ok. There's something wrong with that driver and we should fix
>>> it.
>>>
>>> Or at the very least, explain in much detail what the breakage
>>> is, how we noticed it, why we can't fix it, and how to
>>> reproduce it.
>>>
>>> Because in its current state, there's no chance we'll ever go
>>> over that test list and remove some of them. Or even know if, if
>>> we ever fix a bug somewhere, we should remove a flaky or failing
>>> test.
>>>
>>> [...]
>>>
>>>> we need to have a clear view about which tests are not
>>>> corresponding to it, so we can start fixing. First we need to
>>>> be aware of the issues so we can start fixing them, otherwise
>>>> we will stay in the "no tests no failures" ground :)
>>>
>>> I think we have somewhat contradicting goals. You want to make
>>> regression testing, so whatever test used to work in the past
>>> should keep working. That's fine, but it's different from
>>> "expectations about what the DRM drivers are supposed to pass in
>>> the IGT test suite" which is about validation, ie "all KMS
>>> drivers must behave this way".
>>
>> [...]
>>
>> We could have some policy: if you want to enable a certain device
>> in the CI, you need to make sure it passes all tests first to force
>> people to go fix the issues, but maybe it would be a big barrier.
>>
>> I'm afraid that, if a test fails (and it is a clear bug), people
>> would just say "it works for most of the cases, this is not a
>> priority to fix" and just start ignoring the CI, this is why I
>> think regression tests are a good way to start with.
>
> I think eventually we need to get to both goals, but currently
> driver and test quality just isn't remotely there.
>
> I think a good approach would be if CI work focuses on the pure sw
> tests first, so kunit and running igt against vgem/vkms. And then we
> could use that to polish a set of must-pass igt testcases, which
> also drivers in general are supposed to pass. Plus ideally weed out
> the bad igts that aren't reliable enough or have bad assumptions.
>
> For hardware I think it will take a very long time until we get to a
> point where CI can work without a test result list, we're nowhere
> close to that. But for virtual driver this really should be
> achievable, albeit with a huge amount of effort required to get
> there I think.

Yeah, this is what our experience with Mesa (in particular) has taught us.

Having 100% of the tests pass 100% of the time on 100% of the platforms 
is a great goal that everyone should aim for. But it will also never happen.


Firstly, we're just not there yet today. Every single GPU-side DRM 
driver has userspace-triggerable faults which cause occasional errors in 
GL/Vulkan tests. Every single one. We deal with these in Mesa by 
retrying; if we didn't retry, across the breadth of hardware we test, 
I'd expect 99% of should-succeed merges to fail because of these 
intermittent bugs in the DRM drivers. We don't have the same figure for 
KMS - because we don't test it - but I'd be willing to bet no driver is 
100% if you run tests often enough.
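
(To sketch what "retrying" means in practice: a wrapper along the lines
below is the whole idea. Illustrative only, not the actual Mesa CI
runner; "igt-run-one" is a hypothetical command standing in for whatever
executes a single test on the device under test.)

import subprocess

def run_once(test_name: str) -> bool:
    """Run one test and report whether it passed."""
    return subprocess.run(["igt-run-one", test_name]).returncode == 0

def run_with_retries(test_name: str, max_attempts: int = 3) -> str:
    """Count the test as passing if any attempt passes; note when it flaked."""
    for attempt in range(1, max_attempts + 1):
        if run_once(test_name):
            return "pass" if attempt == 1 else "flake"
    return "fail"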


Secondly, we will never be there. If we could pause for five years and 
sit down making all the current usecases for all the current hardware on 
the current kernel run perfectly, we'd probably get there. But we can't: 
there's new hardware, new userspace, and hundreds of new kernel trees. 
Even without the first two, what happens when the Arm SMMU maintainers 
(choosing a random target to pick on, sorry Robin) introduce subtle 
breakage which makes a lot of tests fail some of the time? Do we refuse 
to backmerge Linus into DRM until it's fixed, or do we disable all 
testing on Arm until it's fixed? When we've done that, what happens when 
we re-enable testing, and discover that a bunch of tests get broken 
because we haven't been testing?


Thirdly, hardware is capricious. 'This board doesn't make it to u-boot' 
is a clear infrastructure error, but if you test at sufficient scale, 
cold solder or failing caps surface way more often than you might think. 
And you can't really pick those out by any other means than running at 
scale, dealing with non-binary results, and looking at the trends over 
time. (Again this is something we do in Mesa - we graph test failures 
per 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-09-04 Thread Daniel Vetter
On Wed, 30 Aug 2023 at 17:14, Helen Koike  wrote:
>
>
>
> On 30/08/2023 11:57, Maxime Ripard wrote:
> > On Wed, Aug 30, 2023 at 10:24:49AM -0300, Helen Koike wrote:
> >> Hi all,
> >>
> >> Thanks for your comments.
> >>
> >> On 30/08/2023 08:37, Maxime Ripard wrote:
> >>> On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:
>  On Wed, 30 Aug 2023, Maxime Ripard  wrote:
> > On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
> >> On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
> >>> From: Tomeu Vizoso 
> >>>
> >>> Developers can easily execute several tests on different devices
> >>> by just pushing their branch to their fork in a repository hosted
> >>> on gitlab.freedesktop.org which has an infrastructure to run jobs
> >>> in several runners and farms with different devices.
> >>>
> >>> There are also other automated tools that uprev dependencies,
> >>> monitor the infra, and so on that are already used by the Mesa
> >>> project, and we can reuse them too.
> >>>
> >>> Also, store expectations about what the DRM drivers are supposed
> >>> to pass in the IGT test suite. By storing the test expectations
> >>> along with the code, we can make sure both stay in sync with each
> >>> other so we can know when a code change breaks those expectations.
> >>>
> >>> Also, include a configuration file that points to the out-of-tree
> >>> CI scripts.
> >>>
> >>> This will allow all contributors to drm to reuse the infrastructure
> >>> already in gitlab.freedesktop.org to test the driver on several
> >>> generations of the hardware.
> >>>
> >>> Signed-off-by: Tomeu Vizoso 
> >>> Signed-off-by: Helen Koike 
> >>> Acked-by: Daniel Stone 
> >>> Acked-by: Rob Clark 
> >>> Tested-by: Rob Clark 
> >>
> >> Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
> >> include that branch in linux-next.
> >>
> >> But also I'd like to see a lot more acks here, we should be able to at
> >> least pile up a bunch of (driver) maintainers from drm-misc in support 
> >> of
> >> this. Also maybe media, at least I've heard noises that they're maybe
> >> interested too? Plus anyone else, the more the better.
> >
> > I'm not really convinced by that approach at all, and most of the issues
> > I see are shown by the follow-up series here:
> 
>  I'm not fully convinced either, more like "let's see". In that narrow
>  sense, ack. I don't see harm in trying, if you're also open to backing
>  off in case it does not pan out.
> 
> > https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/
> >
> > * We hardcode a CI farm setup into the kernel
> >>
> >>
> >> These could be out of tree.
> >>
> >> There is a version outside the kernel tree where you just point the CI
> >> configuration to a url:
> >> https://gitlab.freedesktop.org/gfx-ci/drm-ci/-/merge_requests/1
> >>
> >> We were discussing it here 
> >> https://www.linuxtv.org/cgi-bin/mailman/private/linuxtv-ci/2023-August/27.html
> >
> > It looks like it's private
> >
> >> (I guess Sima's reply didn't get into the mailing list) but the argument of
> >> not having out of tree repo is due to historical bad experience of having 
> >> to
> >> sync the kernel with the code and it can become messy.
> >
> > My point is that even though the test strategy might be considered a
> > "property" of the kernel, how you execute it is definitely not and you
> > will have as many setups as you have CI farms. You can't put that into
> > the kernel, just like we don't put the kernel command line in it for
> > example.
> >
> > * We cannot trust that the code being run is actually the one being
> >   pushed into gitlab
> >>
> >>
> >> We can improve this if this is a requirement.
> >> For DTS configuration we can work with overlays (which is the current
> >> modification on that patchset). For other changes that are not suitable to
> >> upstream (and should be rare) we can see if we work with the
> >> `-external-fixes` approach or another approach, we can check it case by 
> >> case
> >> to understand why it is not suitable for upstream.
> >
> > The existence of that branch in itself is an issue to me. Again, it's a
> > matter of trust. How can I trust a branch I barely know about, whose
> > development process is unclear and which isn't reviewed by any of the
> > maintainers of the code that might affect the test outcomes?
> >
> > Or put another way, if I run the tests on my machine, it won't work. Why
> > should it work on the CI farm? The branch itself is broken. It might not
> > be due to any of the work I did, but it's broken still.
> >
> >
> > * IMO, and I know we disagree here, any IGT test we enable for a 
> > given
> >   platform should work, period. Allowing failures and flaky tests 
> 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Helen Koike




On 30/08/2023 11:57, Maxime Ripard wrote:

On Wed, Aug 30, 2023 at 10:24:49AM -0300, Helen Koike wrote:

Hi all,

Thanks for your comments.

On 30/08/2023 08:37, Maxime Ripard wrote:

On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:

On Wed, 30 Aug 2023, Maxime Ripard  wrote:

On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:

On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:

From: Tomeu Vizoso 

Developers can easily execute several tests on different devices
by just pushing their branch to their fork in a repository hosted
on gitlab.freedesktop.org which has an infrastructure to run jobs
in several runners and farms with different devices.

There are also other automated tools that uprev dependencies,
monitor the infra, and so on that are already used by the Mesa
project, and we can reuse them too.

Also, store expectations about what the DRM drivers are supposed
to pass in the IGT test suite. By storing the test expectations
along with the code, we can make sure both stay in sync with each
other so we can know when a code change breaks those expectations.

Also, include a configuration file that points to the out-of-tree
CI scripts.

This will allow all contributors to drm to reuse the infrastructure
already in gitlab.freedesktop.org to test the driver on several
generations of the hardware.

Signed-off-by: Tomeu Vizoso 
Signed-off-by: Helen Koike 
Acked-by: Daniel Stone 
Acked-by: Rob Clark 
Tested-by: Rob Clark 


Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
include that branch in linux-next.

But also I'd like to see a lot more acks here, we should be able to at
least pile up a bunch of (driver) maintainers from drm-misc in support of
this. Also maybe media, at least I've heard noises that they're maybe
interested too? Plus anyone else, the more the better.


I'm not really convinced by that approach at all, and most of the issues
I see are shown by the follow-up series here:


I'm not fully convinced either, more like "let's see". In that narrow
sense, ack. I don't see harm in trying, if you're also open to backing
off in case it does not pan out.


https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/

* We hardcode a CI farm setup into the kernel



These could be out of tree.

There is a version outside the kernel tree where you just point the CI
configuration to a url:
https://gitlab.freedesktop.org/gfx-ci/drm-ci/-/merge_requests/1

We were discussing it here 
https://www.linuxtv.org/cgi-bin/mailman/private/linuxtv-ci/2023-August/27.html


It looks like it's private


(I guess Sima's reply didn't get into the mailing list) but the argument of
not having out of tree repo is due to historical bad experience of having to
sync the kernel with the code and it can become messy.


My point is that even though the test strategy might be considered a
"property" of the kernel, how you execute it is definitely not and you
will have as many setups as you have CI farms. You can't put that into
the kernel, just like we don't put the kernel command line in it for
example.


* We cannot trust that the code being run is actually the one being
  pushed into gitlab



We can improve this if this is a requirement.
For DTS configuration we can work with overlays (which is the current
modification on that patchset). For other changes that are not suitable to
upstream (and should be rare) we can see if we work with the
`-external-fixes` approach or another approach, we can check it case by case
to understand why it is not suitable for upstream.


The existence of that branch in itself is an issue to me. Again, it's a
matter of trust. How can I trust a branch I barely know about, whose
development process is unclear and which isn't reviewed by any of the
maintainers of the code that might affect the test outcomes?

Or put another way, if I run the tests on my machine, it won't work. Why
should it work on the CI farm? The branch itself is broken. It might not
be due to any of the work I did, but it's broken still.



* IMO, and I know we disagree here, any IGT test we enable for a given
  platform should work, period. Allowing failures and flaky tests just
  sweeps whatever issue is there under the rug. If the test is at
  fault, we should fix the test, if the driver / kernel is at fault,
  then I certainly want to know about it.


I believe we need a baseline and understand the current status of tests. If
you check the xfails folder in the patch you can see that I had to add a few
tests to *-skips.txt since those tests crash the system and others to
*-fails.txt that are consistently not passing.


I agree that we need a baseline, but that baseline should be defined by
the tests' own merits, not their outcome on a particular platform.

In other words, I want all drivers to follow that baseline, and if they
don't it's a bug we should fix, and we should be vocal about it. We
shouldn't 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Maxime Ripard
On Wed, Aug 30, 2023 at 10:24:49AM -0300, Helen Koike wrote:
> Hi all,
> 
> Thanks for your comments.
> 
> On 30/08/2023 08:37, Maxime Ripard wrote:
> > On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:
> > > On Wed, 30 Aug 2023, Maxime Ripard  wrote:
> > > > On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
> > > > > On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
> > > > > > From: Tomeu Vizoso 
> > > > > > 
> > > > > > Developers can easily execute several tests on different devices
> > > > > > by just pushing their branch to their fork in a repository hosted
> > > > > > on gitlab.freedesktop.org which has an infrastructure to run jobs
> > > > > > in several runners and farms with different devices.
> > > > > > 
> > > > > > There are also other automated tools that uprev dependencies,
> > > > > > monitor the infra, and so on that are already used by the Mesa
> > > > > > project, and we can reuse them too.
> > > > > > 
> > > > > > Also, store expectations about what the DRM drivers are supposed
> > > > > > to pass in the IGT test suite. By storing the test expectations
> > > > > > along with the code, we can make sure both stay in sync with each
> > > > > > other so we can know when a code change breaks those expectations.
> > > > > > 
> > > > > > Also, include a configuration file that points to the out-of-tree
> > > > > > CI scripts.
> > > > > > 
> > > > > > This will allow all contributors to drm to reuse the infrastructure
> > > > > > already in gitlab.freedesktop.org to test the driver on several
> > > > > > generations of the hardware.
> > > > > > 
> > > > > > Signed-off-by: Tomeu Vizoso 
> > > > > > Signed-off-by: Helen Koike 
> > > > > > Acked-by: Daniel Stone 
> > > > > > Acked-by: Rob Clark 
> > > > > > Tested-by: Rob Clark 
> > > > > 
> > > > > Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr 
> > > > > to
> > > > > include that branch in linux-next.
> > > > > 
> > > > > But also I'd like to see a lot more acks here, we should be able to at
> > > > > least pile up a bunch of (driver) maintainers from drm-misc in 
> > > > > support of
> > > > > this. Also maybe media, at least I've heard noises that they're maybe
> > > > > interested too? Plus anyone else, the more the better.
> > > > 
> > > > I'm not really convinced by that approach at all, and most of the issues
> > > > I see are shown by the follow-up series here:
> > > 
> > > I'm not fully convinced either, more like "let's see". In that narrow
> > > sense, ack. I don't see harm in trying, if you're also open to backing
> > > off in case it does not pan out.
> > > 
> > > > https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/
> > > > 
> > > >* We hardcode a CI farm setup into the kernel
> 
> 
> These could be out of tree.
> 
> There is a version outside the kernel tree where you just point the CI
> configuration to a url:
> https://gitlab.freedesktop.org/gfx-ci/drm-ci/-/merge_requests/1
> 
> We were discussing it here 
> https://www.linuxtv.org/cgi-bin/mailman/private/linuxtv-ci/2023-August/27.html

It looks like it's private

> (I guess Sima's reply didn't get into the mailing list) but the argument of
> not having out of tree repo is due to historical bad experience of having to
> sync the kernel with the code and it can become messy.

My point is that even though the test strategy might be considered a
"property" of the kernel, how you execute it is definitely not and you
will have as many setups as you have CI farms. You can't put that into
the kernel, just like we don't put the kernel command line in it for
example.

> > > > 
> > > >* We cannot trust that the code being run is actually the one being
> > > >  pushed into gitlab
> 
> 
> We can improve this if this is a requirement.
> For DTS configuration we can work with overlays (which is the current
> modification on that patchset). For other changes that are not suitable to
> upstream (and should be rare) we can see if we work with the
> `-external-fixes` approach or another approach, we can check it case by case
> to understand why it is not suitable for upstream.

The existence of that branch in itself is an issue to me. Again, it's a
matter of trust. How can I trust a branch I barely know about, whose
development process is unclear and which isn't reviewed by any of the
maintainers of the code that might affect the test outcomes?

Or put another way, if I run the tests on my machine, it won't work. Why
should it work on the CI farm? The branch itself is broken. It might not
be due to any of the work I did, but it's broken still.

> > > >
> > > >* IMO, and I know we disagree here, any IGT test we enable for a 
> > > > given
> > > >  platform should work, period. Allowing failures and flaky tests 
> > > > just
> > > >  sweeps whatever issue is there under the rug. If the test is at
> > > >  fault, we should fix the test, if the driver / kernel is at fault,
> > > > 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Rob Clark
replying to a couple points on this thread

On Wed, Aug 30, 2023 at 6:25 AM Helen Koike  wrote:
>
> Hi all,
>
> Thanks for your comments.
>
> On 30/08/2023 08:37, Maxime Ripard wrote:
> > On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:
> >> On Wed, 30 Aug 2023, Maxime Ripard  wrote:
> >>> On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
>  On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
> > From: Tomeu Vizoso 
> >
> > Developers can easily execute several tests on different devices
> > by just pushing their branch to their fork in a repository hosted
> > on gitlab.freedesktop.org which has an infrastructure to run jobs
> > in several runners and farms with different devices.
> >
> > There are also other automated tools that uprev dependencies,
> > monitor the infra, and so on that are already used by the Mesa
> > project, and we can reuse them too.
> >
> > Also, store expectations about what the DRM drivers are supposed
> > to pass in the IGT test suite. By storing the test expectations
> > along with the code, we can make sure both stay in sync with each
> > other so we can know when a code change breaks those expectations.
> >
> > Also, include a configuration file that points to the out-of-tree
> > CI scripts.
> >
> > This will allow all contributors to drm to reuse the infrastructure
> > already in gitlab.freedesktop.org to test the driver on several
> > generations of the hardware.
> >
> > Signed-off-by: Tomeu Vizoso 
> > Signed-off-by: Helen Koike 
> > Acked-by: Daniel Stone 
> > Acked-by: Rob Clark 
> > Tested-by: Rob Clark 
> 
>  Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
>  include that branch in linux-next.
> 
>  But also I'd like to see a lot more acks here, we should be able to at
>  least pile up a bunch of (driver) maintainers from drm-misc in support of
>  this. Also maybe media, at least I've heard noises that they're maybe
>  interested too? Plus anyone else, the more the better.
> >>>
> >>> I'm not really convinced by that approach at all, and most of the issues
> >>> I see are shown by the follow-up series here:
> >>
> >> I'm not fully convinced either, more like "let's see". In that narrow
> >> sense, ack. I don't see harm in trying, if you're also open to backing
> >> off in case it does not pan out.
> >>
> >>> https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/
> >>>
> >>>* We hardcode a CI farm setup into the kernel

We do have farm status out of tree so we don't need to push a kernel
patch for outages.

Other than that, I can go either way, with the scripts and related yml
in tree or out of tree.  But the expectation files (ie. the patch you
are complaining about) absolutely need to be in-tree.  It must be
possible to update them in sync with driver changes that fix tests.
There may be a bit of related churn initially: based on our lengthy
experience at this point with mesa CI, we can't know about all the
flaky tests until people start using CI in anger ;-)
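
As a rough illustration of what updating expectations "in sync" can look
like mechanically: a fresh results run gets diffed against the committed
list so stale entries are flagged. The "test-name,Status" line format
and the file arguments below are assumptions for the example, not the
exact drm/ci layout.

import sys

def load_tests(path):
    with open(path) as f:
        return {line.split(",")[0].strip() for line in f if line.strip()}

def main(expected_fails, observed_fails):
    expected = load_tests(expected_fails)
    observed = load_tests(observed_fails)
    for test in sorted(expected - observed):
        print(f"now passing, drop from expectations: {test}")
    for test in sorted(observed - expected):
        print(f"new failure, needs triage or an expectations update: {test}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])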

>
>
> These could be out of tree.
>
> There is a version outside the kernel tree where you just point the CI
> configuration to a url:
> https://gitlab.freedesktop.org/gfx-ci/drm-ci/-/merge_requests/1
>
> We were discussing it here
> https://www.linuxtv.org/cgi-bin/mailman/private/linuxtv-ci/2023-August/27.html
>
> (I guess Sima's reply didn't get into the mailing list) but the argument
> of not having out of tree repo is due to historical bad experience of
> having to sync the kernel with the code and it can become messy.
>
>
> >>>
> >>>* We cannot trust that the code being run is actually the one being
> >>>  pushed into gitlab
>
>
> We can improve this if this is a requirement.
> For DTS configuration we can work with overlays (which is the current
> modification on that patchset). For other changes that are not suitable
> to upstream (and should be rare) we can see if we work with the
> `-external-fixes` approach or another approach, we can check it case by
> case to understand why it is not suitable for upstream.
>

IMHO the occasional need for extra patches is a fact of life with the
kernel development process, and we'll have to live with this until
_all_ kernel patches run thru pre-merge CI.  (I'm not holding my
breath, but who knows..)

If some particular board doesn't boot because of some early-rc
breakage elsewhere in the kernel, then we can't run CI.

>
> >>>
> >>>* IMO, and I know we disagree here, any IGT test we enable for a given
> >>>  platform should work, period. Allowing failures and flaky tests just
> >>>  sweeps whatever issue is there under the rug. If the test is at
> >>>  fault, we should fix the test, if the driver / kernel is at fault,
> >>>  then I certainly want to know about it.
>
>
> I believe we need a baseline 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Helen Koike

Hi all,

Thanks for your comments.

On 30/08/2023 08:37, Maxime Ripard wrote:

On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:

On Wed, 30 Aug 2023, Maxime Ripard  wrote:

On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:

On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:

From: Tomeu Vizoso 

Developers can easily execute several tests on different devices
by just pushing their branch to their fork in a repository hosted
on gitlab.freedesktop.org which has an infrastructure to run jobs
in several runners and farms with different devices.

There are also other automated tools that uprev dependencies,
monitor the infra, and so on that are already used by the Mesa
project, and we can reuse them too.

Also, store expectations about what the DRM drivers are supposed
to pass in the IGT test suite. By storing the test expectations
along with the code, we can make sure both stay in sync with each
other so we can know when a code change breaks those expectations.

Also, include a configuration file that points to the out-of-tree
CI scripts.

This will allow all contributors to drm to reuse the infrastructure
already in gitlab.freedesktop.org to test the driver on several
generations of the hardware.

Signed-off-by: Tomeu Vizoso 
Signed-off-by: Helen Koike 
Acked-by: Daniel Stone 
Acked-by: Rob Clark 
Tested-by: Rob Clark 


Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
include that branch in linux-next.

But also I'd like to see a lot more acks here, we should be able to at
least pile up a bunch of (driver) maintainers from drm-misc in support of
this. Also maybe media, at least I've heard noises that they're maybe
interested too? Plus anyone else, the more the better.


I'm not really convinced by that approach at all, and most of the issues
I see are shown by the follow-up series here:


I'm not fully convinced either, more like "let's see". In that narrow
sense, ack. I don't see harm in trying, if you're also open to backing
off in case it does not pan out.


https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/

   * We hardcode a CI farm setup into the kernel



These could be out of tree.

There is a version outside the kernel tree where you just point the CI
configuration to a url:
https://gitlab.freedesktop.org/gfx-ci/drm-ci/-/merge_requests/1

We were discussing it here 
https://www.linuxtv.org/cgi-bin/mailman/private/linuxtv-ci/2023-August/27.html


(I guess Sima's reply didn't get into the mailing list) but the argument 
of not having out of tree repo is due to historical bad experience of 
having to sync the kernel with the code and it can become messy.





   * We cannot trust that the code being run is actually the one being
 pushed into gitlab



We can improve this if this is a requirement.
For DTS configuration we can work with overlays (which is the current 
modification on that patchset). For other changes that are not suitable 
to upstream (and should be rare) we can see if we work with the 
`-external-fixes` approach or another approach, we can check it case by 
case to understand why it is not suitable for upstream.





   * IMO, and I know we disagree here, any IGT test we enable for a given
 platform should work, period. Allowing failures and flaky tests just
 sweeps whatever issue is there under the rug. If the test is at
 fault, we should fix the test, if the driver / kernel is at fault,
 then I certainly want to know about it.



I believe we need a baseline and understand the current status of tests. 
If you check the xfails folder in the patch you can see that I had to 
add a few tests to *-skips.txt since those tests crash the system and 
others to *-fails.txt that are consistently not passing.


Since the "any IGT test we enable for a given platform should work" is 
not a reality atm, we need to have a clear view about which tests are 
not corresponding to it, so we can start fixing. First we need to be 
aware of the issues so we can start fixing them, otherwise we will stay 
in the "no tests no failures" ground :)





At least for display, where this also depends on peripheral hardware,
it's not an easy problem, really.


Aside from the Chamelium tests, which tests actually rely on peripheral
hardware? On EDID and hotplug, sure, but that can easily be set up from
the userspace, or something like

https://www.lindy-international.com/HDMI-2-0-EDID-Emulator.htm?websale8=ld0101.ld021102=32115


How reliable do you need it to be? How many nines? Who is going to
debug the issues that need hundreds or thousands of runs to reproduce?
If a commit makes some test less reliable, how long is it going to
take to even see that or pinpoint that?


I mean, that's also true for failures or success then. How many times do
you need a test to run properly to qualify it as a meaningful test? How
do you know that it's not a flaky test?

Ultimately, it's about trust. If, for a given 

Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Maxime Ripard
On Wed, Aug 30, 2023 at 01:58:31PM +0300, Jani Nikula wrote:
> On Wed, 30 Aug 2023, Maxime Ripard  wrote:
> > On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
> >> On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
> >> > From: Tomeu Vizoso 
> >> > 
> >> > Developers can easily execute several tests on different devices
> >> > by just pushing their branch to their fork in a repository hosted
> >> > on gitlab.freedesktop.org which has an infrastructure to run jobs
> >> > in several runners and farms with different devices.
> >> > 
> >> > There are also other automated tools that uprev dependencies,
> >> > monitor the infra, and so on that are already used by the Mesa
> >> > project, and we can reuse them too.
> >> > 
> >> > Also, store expectations about what the DRM drivers are supposed
> >> > to pass in the IGT test suite. By storing the test expectations
> >> > along with the code, we can make sure both stay in sync with each
> >> > other so we can know when a code change breaks those expectations.
> >> > 
> >> > Also, include a configuration file that points to the out-of-tree
> >> > CI scripts.
> >> > 
> >> > This will allow all contributors to drm to reuse the infrastructure
> >> > already in gitlab.freedesktop.org to test the driver on several
> >> > generations of the hardware.
> >> > 
> >> > Signed-off-by: Tomeu Vizoso 
> >> > Signed-off-by: Helen Koike 
> >> > Acked-by: Daniel Stone 
> >> > Acked-by: Rob Clark 
> >> > Tested-by: Rob Clark 
> >> 
> >> Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
> >> include that branch in linux-next.
> >> 
> >> But also I'd like to see a lot more acks here, we should be able to at
> >> least pile up a bunch of (driver) maintainers from drm-misc in support of
> >> this. Also maybe media, at least I've heard noises that they're maybe
> >> interested too? Plus anyone else, the more the better.
> >
> > I'm not really convinced by that approach at all, and most of the issues
> > I see are shown by the follow-up series here:
> 
> I'm not fully convinced either, more like "let's see". In that narrow
> sense, ack. I don't see harm in trying, if you're also open to backing
> off in case it does not pan out.
> 
> > https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/
> >
> >   * We hardcode a CI farm setup into the kernel
> >
> >   * We cannot trust that the code being run is actually the one being
> > pushed into gitlab
> >
> >   * IMO, and I know we disagree here, any IGT test we enable for a given
> > platform should work, period. Allowing failures and flaky tests just
> > sweeps whatever issue is there under the rug. If the test is at
> > fault, we should fix the test, if the driver / kernel is at fault,
> > then I certainly want to know about it.
> 
> At least for display, where this also depends on peripheral hardware,
> it's not an easy problem, really.

Aside from the Chamelium tests, which tests actually rely on peripheral
hardware? On EDID and hotplug, sure, but that can easily be set up from
the userspace, or something like

https://www.lindy-international.com/HDMI-2-0-EDID-Emulator.htm?websale8=ld0101.ld021102=32115
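
(To be concrete about "set up from the userspace": something along the
lines of the sketch below, poking the per-connector DRM debugfs
controls. The card index, file names and accepted values are from
memory and vary per kernel, so treat them as assumptions to verify
rather than a reference.)

from pathlib import Path

DEBUGFS = Path("/sys/kernel/debug/dri/0")   # assumed card 0

def force_connector(connector: str, state: str = "on") -> None:
    """e.g. force_connector("HDMI-A-1", "on") to fake a connected display."""
    (DEBUGFS / connector / "force").write_text(state)

def override_edid(connector: str, edid_blob: bytes) -> None:
    """Feed a canned EDID so tests always see the same "monitor"."""
    (DEBUGFS / connector / "edid_override").write_bytes(edid_blob)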

> How reliable do you need it to be? How many nines? Who is going to
> debug the issues that need hundreds or thousands of runs to reproduce?
> If a commit makes some test less reliable, how long is it going to
> take to even see that or pinpoint that?

I mean, that's also true for failures or success then. How many times do
you need a test to run properly to qualify it as a meaningful test? How
do you know that it's not a flaky test?
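
(To put a rough number on "how many times", assuming independent runs,
which real flakes often are not: ruling out a failure rate of at least p
with confidence 1 - alpha takes on the order of log(alpha)/log(1 - p)
consecutive passes. A back-of-the-envelope sketch:)

import math

def runs_needed(p: float, alpha: float = 0.05) -> int:
    """Smallest N such that (1 - p)**N <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p))

print(runs_needed(0.01))    # roughly 300 runs to rule out a 1% failure rate
print(runs_needed(0.001))   # roughly 3000 runs for a 0.1% failure rate

So "it passed a couple of times" tells us very little about whether a
flake is actually fixed.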

Ultimately, it's about trust. If, for a given test that just failed, I
can't be certain that it's because of the branch I just submitted, I
will just ignore the tests results after a while.

This is already what plagues kernelci, and we should do better.

And I'm sorry, but if some part of the kernel or driver just isn't
reliable, then we shouldn't claim it is (except for all the times it
isn't). If no-one has the time to look into it, fine, but flagging it
under a flaky test doesn't help anyone.

Like, from that patch, how can I know what is the issue with
kms_hdmi_inject@inject-4k or kms_addfb_basic@addfb25-bad-modifier on
mt8173. I certainly can't. And neither of those have anything to do with
peripheral hardware.

Maxime




Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Jani Nikula
On Wed, 30 Aug 2023, Maxime Ripard  wrote:
> On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
>> On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
>> > From: Tomeu Vizoso 
>> > 
>> > Developers can easily execute several tests on different devices
>> > by just pushing their branch to their fork in a repository hosted
>> > on gitlab.freedesktop.org which has an infrastructure to run jobs
>> > in several runners and farms with different devices.
>> > 
>> > There are also other automated tools that uprev dependencies,
>> > monitor the infra, and so on that are already used by the Mesa
>> > project, and we can reuse them too.
>> > 
>> > Also, store expectations about what the DRM drivers are supposed
>> > to pass in the IGT test suite. By storing the test expectations
>> > along with the code, we can make sure both stay in sync with each
>> > other so we can know when a code change breaks those expectations.
>> > 
>> > Also, include a configuration file that points to the out-of-tree
>> > CI scripts.
>> > 
>> > This will allow all contributors to drm to reuse the infrastructure
>> > already in gitlab.freedesktop.org to test the driver on several
>> > generations of the hardware.
>> > 
>> > Signed-off-by: Tomeu Vizoso 
>> > Signed-off-by: Helen Koike 
>> > Acked-by: Daniel Stone 
>> > Acked-by: Rob Clark 
>> > Tested-by: Rob Clark 
>> 
>> Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
>> include that branch in linux-next.
>> 
>> But also I'd like to see a lot more acks here, we should be able to at
>> least pile up a bunch of (driver) maintainers from drm-misc in support of
>> this. Also maybe media, at least I've heard noises that they're maybe
>> interested too? Plus anyone else, the more the better.
>
> I'm not really convinced by that approach at all, and most of the issues
> I see are shown by the follow-up series here:

I'm not fully convinced either, more like "let's see". In that narrow
sense, ack. I don't see harm in trying, if you're also open to backing
off in case it does not pan out.

> https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/
>
>   * We hardcode a CI farm setup into the kernel
>
>   * We cannot trust that the code being run is actually the one being
> pushed into gitlab
>
>   * IMO, and I know we disagree here, any IGT test we enable for a given
> platform should work, period. Allowing failures and flaky tests just
> sweeps whatever issue is there under the rug. If the test is at
> fault, we should fix the test, if the driver / kernel is at fault,
> then I certainly want to know about it.

At least for display, where this also depends on peripheral hardware,
it's not an easy problem, really. How reliable do you need it to be?
How many nines? Who is going to debug the issues that need hundreds or
thousands of runs to reproduce? If a commit makes some test less
reliable, how long is it going to take to even see that or pinpoint
that?

It's a kind of cop out, but this is not filesystems. In many cases I
think we might be able to make things more robust by failing faster and
failing more, but the users probably want us to plunge forward despite
some errors to try to get something on screen.

(Come to think of it, perhaps we should classify tests based on whether
external hardware plays a role.)

So I'm not so concerned about the filter lists per se, but rather about
having them in kernel.

BR,
Jani.

>
>   * This then leads to patches like this one:
> 
> https://lore.kernel.org/dri-devel/20230825122435.316272-6-vignesh.ra...@collabora.com/
>
> Which (and it's definitely not the author's fault) are just plain
> unreadable, and neither reproducible nor auditable by anyone not
> heavily involved in the CI farm operations and the platforms being
> tested.
>
> That being said, I don't have anything better to suggest than what I
> already did, and it looks like I'm alone in thinking that those are
> problems, so feel free to add my ack if you want to.
>
> Maxime

-- 
Jani Nikula, Intel Open Source Graphics Center


Re: [PATCH v11] drm: Add initial ci/ subdirectory

2023-08-30 Thread Maxime Ripard
On Tue, Aug 22, 2023 at 04:26:06PM +0200, Daniel Vetter wrote:
> On Fri, Aug 11, 2023 at 02:19:53PM -0300, Helen Koike wrote:
> > From: Tomeu Vizoso 
> > 
> > Developers can easily execute several tests on different devices
> > by just pushing their branch to their fork in a repository hosted
> > on gitlab.freedesktop.org which has an infrastructure to run jobs
> > in several runners and farms with different devices.
> > 
> > There are also other automated tools that uprev dependencies,
> > monitor the infra, and so on that are already used by the Mesa
> > project, and we can reuse them too.
> > 
> > Also, store expectations about what the DRM drivers are supposed
> > to pass in the IGT test suite. By storing the test expectations
> > along with the code, we can make sure both stay in sync with each
> > other so we can know when a code change breaks those expectations.
> > 
> > Also, include a configuration file that points to the out-of-tree
> > CI scripts.
> > 
> > This will allow all contributors to drm to reuse the infrastructure
> > already in gitlab.freedesktop.org to test the driver on several
> > generations of the hardware.
> > 
> > Signed-off-by: Tomeu Vizoso 
> > Signed-off-by: Helen Koike 
> > Acked-by: Daniel Stone 
> > Acked-by: Rob Clark 
> > Tested-by: Rob Clark 
> 
> Ok I pushed this into a topic/drm-ci branch in drm.git and asked sfr to
> include that branch in linux-next.
> 
> But also I'd like to see a lot more acks here, we should be able to at
> least pile up a bunch of (driver) maintainers from drm-misc in support of
> this. Also maybe media, at least I've heard noises that they're maybe
> interested too? Plus anyone else, the more the better.

I'm not really convinced by that approach at all, and most of the issues
I see are shown by the follow-up series here:

https://lore.kernel.org/dri-devel/20230825122435.316272-1-vignesh.ra...@collabora.com/

  * We hardcode a CI farm setup into the kernel

  * We cannot trust that the code being run is actually the one being
pushed into gitlab

  * IMO, and I know we disagree here, any IGT test we enable for a given
platform should work, period. Allowing failures and flaky tests just
sweeps whatever issue is there under the rug. If the test is at
fault, we should fix the test, if the driver / kernel is at fault,
then I certainly want to know about it.

  * This then leads to patches like this one:

https://lore.kernel.org/dri-devel/20230825122435.316272-6-vignesh.ra...@collabora.com/

    Which (and it's definitely not the author's fault) are just plain
    unreadable, and neither reproducible nor auditable by anyone not
    heavily involved in the CI farm operations and the platforms being
    tested.

That being said, I don't have anything better to suggest than what I
already did, and it looks like I'm alone in thinking that those are
problems, so feel free to add my ack if you want to.

Maxime

