Hi Allen, all.

On Fri, Aug 23, 2019 at 2:25 PM Allen Wittenauer
<[email protected]> wrote:
>
>
> Something is not adding up here… or I’m not understanding the issue...
>
>
> > On Aug 22, 2019, at 6:41 AM, Christofer Dutz <[email protected]>
wrote:
> > we have now had one problem several times: our build is cancelled
because it is impossible to get an “ubuntu” node for deploying artifacts.
> > Right now I can see the Jenkins build log being flooded with Hadoop PR
jobs.
>
>
>         The master build queue will show EVERY job regardless of label
and will schedule the first job available for that label in the queue (see
below).  In fact, the hadoop jobs actually have a dedicated label that most
of the other big jobs (are supposed to) run on:
>
>                 https://builds.apache.org/label/Hadoop/
>
> Compare this to:
>
>                 https://builds.apache.org/label/ubuntu/
>
>         The nodes between these two are supposed to be distinct.  Of
course, there are some odd-ball labels out there that have a weird
cross-section:
>
>         https://builds.apache.org/label/xenial/
>
> Anyway ...
>
> > On Aug 23, 2019, at 5:22 AM, Christofer Dutz <[email protected]>
wrote:
> >
> > the problem is that we’re running our jobs on a dedicated node too …
>
>         Is the job running on a dedicated node or the shared ubuntu
label?
>
> > So our build runs smoothly: Doing Tests, Integration Tests, Sonar
Analysis, Website generation and then waits to get access to a node that
can deploy and here the job just times-out :-/
>
>         The job has multiple steps that run on multiple nodes? If so,
you’re going to have a bad time if you’ve put a timeout for the entire
job.  That’s just not realistic.  If it actually needs to run on multiple
nodes, why not just trigger a new job via a pipeline API call (buildJob)
that can sit in the queue and take the artifacts from the previously
successful run as input?  Then it won’t time out.
>
> > On August 22, 2019 at 10:41:36 AM UTC, Christofer Dutz <
[email protected]> wrote:
> > Would it be possible to enforce some sort of fair-use policy that one
project doesn’t block all the others?
>
>
> Side note: Jenkins' default queuing system is fairly primitive:  pretty
much a node-based FIFO queue w/a smattering of node affinity. Node has a
free slot, check first job in the queue. Does it have a label or node
property that matches?  Run it. If not, go to the next job in the queue.
It doesn’t really have any sort of real capacity tracking to prevent
starvation.
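That scheduling behaviour can be illustrated with a toy model (a sketch for
discussion, not Jenkins' actual code; job and label names are made up):

```python
# Toy model of Jenkins-style scheduling: one FIFO queue; when a node has a
# free slot, walk the queue and run the first job whose label matches that
# node. Nothing tracks per-project capacity, so a flood of jobs behind one
# popular label can starve everything queued behind it.
from collections import deque

def schedule(queue, node_label):
    """Return and remove the first queued job matching node_label, if any."""
    for job in list(queue):
        if job["label"] == node_label:
            queue.remove(job)
            return job
    return None

queue = deque([
    {"name": "hadoop-pr-1", "label": "Hadoop"},
    {"name": "hadoop-pr-2", "label": "Hadoop"},
    {"name": "project-deploy", "label": "ubuntu"},
])

# A free "Hadoop" node takes the first Hadoop job in the queue...
print(schedule(queue, "Hadoop")["name"])   # hadoop-pr-1
# ...while the "ubuntu" job only runs once an ubuntu node frees a slot.
print(schedule(queue, "ubuntu")["name"])   # project-deploy
```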


The issue, and I have seen this multiple times over the last few weeks, is
that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase flaky
tests and similar jobs are running on multiple nodes at the same time. It
seems that a single PR or commit triggers a job or jobs that split into part
jobs running on multiple nodes. Just yesterday I saw Hadoop and HBase taking
up nearly 45 of the 50 H* nodes. Some of these jobs take many hours, and
they can be triggered by a PR or commit as trivial as a typo fix. This is
unacceptable. HBase in particular is a Hadoop-related project and should be
limiting its jobs to the Hadoop-labelled nodes H0-H21, but its jobs are
running on any and all nodes.

It is all too familiar to see one job running on a dozen or more executors,
and the build queue is now constantly in the hundreds despite the fact that
we have nearly 100 nodes. This must stop.

Meanwhile, Chris informs me his single job to deploy to Nexus has been
waiting for 3 days.

Infra is looking into ways to make our Jenkins fairer to everybody, and
would like the community to be involved in this process.

Thanks

Gav... (ASF Infra)



