On Mon, Dec 9, 2019 at 5:36 PM Udi Meiri <eh...@google.com> wrote:

> I have given this some thought and honestly don't know if splitting into
> separate jobs will help.
> - I have seen race conditions with running setuptools in parallel, so more
> isolation is better.
>

What race conditions have you seen?  I think if we're doing things right,
this should not be happening, but I don't think we're doing things right.
One thing I've noticed is that we're building into the source directory,
and I think we're also doing weird things like trying to copy the source
directory beforehand.  I really think this system is tripping over many
non-standard choices that have been made along the way.  I have never seen
these sorts of problems in unit tests that use tox, even when many are
running in parallel.  I got pulled away from it, but I'm really hoping to
address these issues here: https://github.com/apache/beam/pull/10038.
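
To make the kind of isolation I mean a bit more concrete, here's a rough
sketch (not something from that PR, and the fixture name is made up):
pytest-xdist exposes a worker_id fixture, so each parallel worker can write
into its own scratch directory instead of everyone touching the shared
source checkout:

    # conftest.py -- illustrative sketch only
    # With pytest-xdist, the "worker_id" fixture is "gw0", "gw1", ...
    # (or "master" when xdist isn't running), so each worker can be pointed
    # at its own directory rather than writing into the source tree.
    import pytest

    @pytest.fixture
    def isolated_workdir(tmp_path_factory, worker_id):
        # tmp_path_factory is pytest's session-scoped temp dir factory
        return tmp_path_factory.mktemp("work_{}".format(worker_id))

Something along those lines is how I'd expect the parallel setuptools
collisions to go away, regardless of how we split the jenkins jobs.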

>
> What benefits do you see from splitting up the jobs?
>

The biggest problem is that the jobs are doing too much and take too long.
This simple fact compounds all of the other problems.  It seems pretty
obvious that we need to do less in each job, as long as the total runtime
of all of these smaller jobs is not substantially longer than the one
monolithic job.

Benefits:

- failures specific to a particular python version will be easier to spot
in the jenkins error summary, and cheaper to re-queue.  right now the
jenkins report mushes all of the failures together in a way that makes it
nearly impossible to tell which python version they correspond to.  only
the gradle scan gives you this insight, but it doesn't break the errors
down by test.
- failures common to all python versions will be reported to the user
earlier, at which point they can cancel the other jobs if desired.  *this
is by far the biggest benefit.*  why wait 2 hours to see the same failure
reported for 5 versions of python?  if that had run on one version of
python I could maybe see that error in 30 minutes (while the other python
versions potentially waited in the queue).  Repeat for each change pushed.
- flaky jobs will be cheaper to re-queue, since a flake only affects a
smaller/shorter job
- if xdist is giving us the parallel boost we're hoping for, we should get
under the 2 hour mark every time (rough sketch below)
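
On that last point, what I have in mind is just letting xdist do the
fan-out inside a single tox task, roughly like this (the target path and
worker count are illustrative, not what the PR actually wires up):

    # run_tests.py -- illustrative sketch only
    import sys
    import pytest

    if __name__ == "__main__":
        # with pytest-xdist installed, "-n auto" spawns one worker per CPU,
        # so a single tox environment still runs the suite in parallel
        sys.exit(pytest.main(["-n", "auto", "apache_beam"]))

so even a single per-python-version job should be able to use all the
cores jenkins gives it.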

Basically we're talking about getting feedback to users faster.

I really don't mind pasting a few more trigger phrases if it means faster
feedback.

-chad




>
> On Mon, Dec 9, 2019 at 4:17 PM Chad Dombrova <chad...@gmail.com> wrote:
>
>> After this PR goes in, should we revisit breaking up the python tests into
>> separate jenkins jobs by python version?  One of the problems with that
>> plan originally was that we lost the parallelism that gradle provides
>> because we were left with only one tox task per jenkins job, and so the
>> total time to complete all python jenkins jobs went up a lot.  With
>> pytest + xdist we should hopefully be able to keep the parallelism even
>> with just one tox task.  This could be a big win.  I feel like I'm spending
>> more time monitoring and re-queuing timed-out jenkins jobs lately than I am
>> writing code.
>>
>> On Mon, Dec 9, 2019 at 10:32 AM Udi Meiri <eh...@google.com> wrote:
>>
>>> This PR <https://github.com/apache/beam/pull/10322> (in review)
>>> migrates py27-gcp to using pytest.
>>> It reduces the testPy2Gcp task down to ~13m
>>> <https://scans.gradle.com/s/kj7ogemnd3toe/timeline?details=ancsbov425524>
>>> (from ~45m). This speedup will probably be lower once all 8 tasks are using
>>> pytest.
>>> It also adds 5 previously uncollected tests.
>>>
>>
