Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-07 Thread Diane Trout
On Tue, 2023-02-07 at 07:31 +, Rebecca N. Palmer wrote:
> On 07/02/2023 03:20, Diane Trout wrote:
> > What's your test environment like?
> 
> Salsa CI.
> 
> > I don't think head is hugely different from what was released in -
> > 1.
> > 
> > The diff looks like Andreas adjusted the dask dependency version,
> > configured a salsa CI run, and added some upstream metadata files
> 
> That sounds like you're not looking at my branch at all.  As
> previously 
> stated, that's
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/tree/fix1030096
> (It's in a fork because I can't push to debian-python.)
> 
> See earlier in this bug for which test failures still remain.

I merged in most of Rebecca's changes via git cherry-pick, though I
slightly edited the changelog (making most entries bullet points
instead of subheadings of the one line I left out).

I think I got the code to detect whether IPv6 is available working
correctly, so I could set the DISABLE_IPV6 environment variable that
dask.distributed supports.
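
For reference, a check along these lines can decide whether to export
DISABLE_IPV6 before the tests run. This is only a minimal sketch of the
idea, not necessarily the code actually used in the package:

# Minimal sketch: export DISABLE_IPV6 (which, per this thread, the
# dask.distributed test suite honours) when IPv6 is not actually usable.
import os
import socket

def has_working_ipv6():
    """Return True if an IPv6 socket can be bound on the loopback."""
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::1", 0))  # any free port on ::1
        return True
    except OSError:
        return False

if not has_working_ipv6():
    os.environ["DISABLE_IPV6"] = "1"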

I went with skipping the 32-bit tests instead of xfailing them, because I
don't think they can work as written: they make memory requests so large
that they can never succeed on 32-bit.
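
As a hypothetical illustration of the skip approach (the actual patch may
express it differently, and the test name below is made up), a pytest
marker keyed on the pointer size looks roughly like this:

# Hypothetical sketch of skipping large-memory tests on 32-bit builds.
# sys.maxsize is small on 32-bit Python, so this checks the address-space
# limit rather than how much RAM is installed.
import sys
import pytest

requires_64bit = pytest.mark.skipif(
    sys.maxsize <= 2**32,
    reason="needs allocations larger than a 32-bit address space allows",
)

@requires_64bit
def test_huge_allocation():  # illustrative name, not a real test
    ...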

You did a lot of work on getting the flaky tests to run more reliably,
and all of that is included, except for the change that applies a patch
and then reverts it.

All the merges are pushed to salsa debian/master. They also passed on
my local build host running i386.

Diane



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Rebecca N. Palmer

On 07/02/2023 03:20, Diane Trout wrote:
> What's your test environment like?

Salsa CI.

> I don't think head is hugely different from what was released in -1.
>
> The diff looks like Andreas adjusted the dask dependency version,
> configured a salsa CI run, and added some upstream metadata files


That sounds like you're not looking at my branch at all.  As previously 
stated, that's

https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/tree/fix1030096
(It's in a fork because I can't push to debian-python.)

See earlier in this bug for which test failures still remain.



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Diane Trout
On Mon, 2023-02-06 at 21:39 +, Rebecca N. Palmer wrote:
> I agree that xfailing the tests *may* be a reasonable solution.  I'm 
> only saying that it should be done by someone with more idea than me
> of 
> whether these particular tests are important, because blindly
> xfailing 
> everything that fails is effectively not having tests.
> 
> If we do choose that approach, at least test_balance_expensive_tasks 
> needs to be an outright xfail/skip not just a flaky, because when it 
> fails it fails repeatedly.

So my efforts at debugging are made harder by the fact that it works for
me. I'm using

a9771f68a28dfc65cae3ac6acf70451c264f3227

from Debian HEAD.

= 2745 passed, 93 skipped, 216 deselected, 18 xfailed, 8 xpassed in
1992.20s (0:33:12) =  

I looked at the last log on ci.debian.org for dask.distributed
https://ci.debian.net/data/autopkgtest/unstable/amd64/d/dask.distributed/31090863/log.gz

It looks like several of those errors are networking-related.

CI with the previously released 2022.12.1+ds.1-1 version is failing
with these tests:

test_defaults 
test_hostport 
test_file 
test_default_client_server_ipv6[tornado] 
test_default_client_server_ipv6[asyncio] 
test_tcp_client_server_ipv6[tornado] 
test_tcp_client_server_ipv6[asyncio] 
test_only_local_access 
test_remote_access 
test_adapt_then_manual 
test_local_tls[True] 
test_local_tls[False] 
test_run_spec 
test_balance_expensive_tasks[enough work to steal] 

I think several of those may depend on a proper network. The host I'm
using has both IPv4 and IPv6 working. I'm using sbuild to automatically
run autopkgtests on an oldish 8-core (2x4) Xeon server with ~24 GB of
RAM.

What's your test environment like?

I don't think head is hugely different from what was released in -1.

The diff looks like Andreas adjusted the dask dependency version,
configured a salsa CI run, and added some upstream metadata files

He had problems with a salsa build failure, but that was on i386. I'm
currently setting up i386 to see if I can replicate the salsa failure.

Diane



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Rebecca N. Palmer
I agree that xfailing the tests *may* be a reasonable solution.  I'm 
only saying that it should be done by someone with more idea than me of 
whether these particular tests are important, because blindly xfailing 
everything that fails is effectively not having tests.


If we do choose that approach, at least test_balance_expensive_tasks 
needs to be an outright xfail/skip not just a flaky, because when it 
fails it fails repeatedly.
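
For concreteness, the three options discussed here differ roughly as in
the sketch below. The test names are borrowed from this thread purely as
illustrations, and the flaky mark assumes a pytest-rerunfailures-style
@pytest.mark.flaky(reruns=N), which is what the "try 3 times" wording
elsewhere in this bug suggests; this is not the actual Debian patch.

# Sketch of skip vs. xfail vs. flaky (illustrative only).
import pytest

@pytest.mark.skip(reason="fails repeatedly in Debian CI, see #1030096")
def test_balance_expensive_tasks():
    ...  # never run at all

@pytest.mark.xfail(reason="known intermittent failure", strict=False)
def test_popen_timeout():
    ...  # still run; a failure is recorded but does not fail the suite

@pytest.mark.flaky(reruns=3)
def test_stress_scatter_death():
    ...  # rerun up to 3 times; only helps if failures are not persistent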


On 06/02/2023 19:38, Diane Trout wrote:
> The most important thing about dask / dask.distributed is they really
> should be at about the same upstream version.

I knew that, and was planning on 2022.12.1 of both when I decided to go 
ahead with pandas.  What went wrong was that I only tested a build, not 
an autopkgtest, and thought the failing tests were dask.distributed's 
(known) inability to run all its tests in a buildd environment.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027254#21

> I'm not 100% sure how to
> mark that in the d/control file.

Possibly
Depends: python3-dask (>= 2022.12.1~), python3-dask (<< 2022.12.2~)
but I haven't tested that.



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Diane Trout
On Mon, 2023-02-06 at 11:13 +0100, Andreas Tille wrote:
> Hi Rebecca,
> 
> Am Mon, Feb 06, 2023 at 07:59:17AM + schrieb Rebecca N. Palmer:
> > (Background: the pandas + dask transition broke dask.distributed
> > and it was
> > hence removed from testing; I didn't notice at the time that if we
> > don't get
> > it back in we lose Spyder.)
> 
> as far as I know Diane has put quite some effort into dask and I
> understood that dask and dask.distributed are closely interconnected.
>  

Hi,

My fragments of time were spent fighting with numba, and I didn't have
the energy to be thinking about dask.distributed.

Numba should be in a better place right now, so I can set my build
machine to try building it and see where we are with it now.

The most important thing about dask / dask.distributed is they really
should be at about the same upstream version. I'm not 100% sure how to
mark that in the d/control file. Also upstream might have some ability
to do minor releases independently.

But if we do a new upstream release of dask, it needs to be paired with
a new upstream release of dask.distributed. And in my experience
dask.distributed is the one that's harder to get to work right.

Diane



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Andreas Tille
Hi Rebecca,

Am Mon, Feb 06, 2023 at 07:59:17AM + schrieb Rebecca N. Palmer:
> (Background: the pandas + dask transition broke dask.distributed and it was
> hence removed from testing; I didn't notice at the time that if we don't get
> it back in we lose Spyder.)

as far as I know Diane has put quite some effort into dask and I
understood that dask and dask.distributed are closely interconnected.
 
> And now test_failing_task_increments_suspicious (once):
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3903956
> (We don't have to pass build-i386 (as this is an arch:all package) or
> reprotest, but if these are effectively-random failures, they might also be
> able to occur in build or autopkgtest.)
> 
> I'm probably the wrong person to be working on this - I don't know enough
> about this package to say whether ignoring this kind of intermittent failure
> (as my 'flaky' marks do) is appropriate, or to have much idea how to
> actually fix it.

In several cases we decided to ignore some tests.  While I like the idea
of marking a test flaky instead of ignoring it completely, given your
experience I think ignoring these tests is a valid way to proceed with
this package for the moment.
 
> We could also try upgrading dask + dask.distributed to 2023.1, but that's a
> risky move at this point.

I agree that it is risky.  We might discuss this with upstream and
possibly use an experimental branch to verify how it works.  It might be
that later versions work better with later Pandas / Python 3.11.
However, the window of opportunity to get something in before the freeze
is closing, and I'm afraid we do not have time for experiments.

Kind regards
   Andreas.

-- 
http://fam-tille.de



Bug#1030096: Any ideas Re: #1030096 dask.distributed intermittent autopkgtest fail ?

2023-02-06 Thread Rebecca N. Palmer
(Background: the pandas + dask transition broke dask.distributed and it 
was hence removed from testing; I didn't notice at the time that if we 
don't get it back in we lose Spyder.)


On 05/02/2023 21:44, Rebecca N. Palmer wrote:
> I currently have this in a state where it sometimes succeeds and
> sometimes doesn't:
>
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/tree/fix1030096
>
> Tests I've seen to fail multiple times (and don't have a fix for):
> test_balance_expensive_tasks[enough work to steal]
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3902376
> (Seems to be the most common problem.  Using @pytest.mark.flaky to try 3
> times doesn't seem to have helped, suggesting that if it fails once it
> keeps failing in that run.  Applying part of upstream pull 7253 seemed
> to make things worse, but I haven't tried applying the whole thing.)
>
> test_popen_timeout
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3902745
>
> Tests I've seen to fail once:
> test_stress_scatter_death
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3902040
> test_tcp_many_listeners[asyncio]
> https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3896327


And now test_failing_task_increments_suspicious (once):
https://salsa.debian.org/rnpalmer-guest/dask.distributed/-/jobs/3903956
(We don't have to pass build-i386 (as this is an arch:all package) or 
reprotest, but if these are effectively-random failures, they might also 
be able to occur in build or autopkgtest.)


I'm probably the wrong person to be working on this - I don't know 
enough about this package to say whether ignoring this kind of 
intermittent failure (as my 'flaky' marks do) is appropriate, or to have 
much idea how to actually fix it.


We could also try upgrading dask + dask.distributed to 2023.1, but 
that's a risky move at this point.