Hi Allen, all.

On Fri, Aug 23, 2019 at 2:25 PM Allen Wittenauer <[email protected]> wrote:
>
> Something is not adding up here… or I'm not understanding the issue...
>
> > On Aug 22, 2019, at 6:41 AM, Christofer Dutz <[email protected]> wrote:
> > we have now had one problem several times: our build is cancelled because it is impossible to get an "ubuntu" node for deploying artifacts.
> > Right now I can see the Jenkins build log being flooded with Hadoop PR jobs.
>
> The master build queue will show EVERY job regardless of label and will schedule the first job available for that label in the queue (see below). In fact, the Hadoop jobs actually have a dedicated label that most of the other big jobs (are supposed to) run on:
>
> https://builds.apache.org/label/Hadoop/
>
> Compare this to:
>
> https://builds.apache.org/label/ubuntu/
>
> The nodes between these two are supposed to be distinct. Of course, there are some odd-ball labels out there that have a weird cross-section:
>
> https://builds.apache.org/label/xenial/
>
> Anyway ...
>
> > On Aug 23, 2019, at 5:22 AM, Christofer Dutz <[email protected]> wrote:
> >
> > the problem is that we're running our jobs on a dedicated node too …
>
> Is the job running on a dedicated node or the shared ubuntu label?
>
> > So our build runs smoothly: Doing Tests, Integration Tests, Sonar Analysis, Website generation, and then it waits to get access to a node that can deploy, and here the job just times out :-/
>
> The job has multiple steps that run on multiple nodes? If so, you're going to have a bad time if you've put a timeout on the entire job. That's just not realistic. If it actually needs to run on multiple nodes, why not just trigger a new job via a pipeline API call (buildJob) that can sit in the queue and take the artifacts from the previously successful run as input? Then it won't time out.
>
> > Am August 22, 2019 10:41:36 AM UTC schrieb Christofer Dutz <[email protected]>:
> > Would it be possible to enforce some sort of fair-use policy so that one project doesn't block all the others?
>
> Side note: Jenkins' default queuing system is fairly primitive: pretty much a node-based FIFO queue with a smattering of node affinity. A node has a free slot; check the first job in the queue. Does it have a label or node property that matches? Run it. If not, go to the next job in the queue. It doesn't really have any sort of real capacity tracking to prevent starvation.
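(To make Allen's buildJob suggestion above a bit more concrete for Chris's deploy problem, here is a rough, untested sketch of splitting the deploy step into a separate downstream job. The job names, labels, parameters and shell steps below are placeholders, not anything that exists on builds.apache.org:)

  // Upstream Jenkinsfile (sketch): build and test as usual, archive the
  // artifacts, then queue a separate deploy job instead of holding a
  // second executor inside the same timeout-bounded run.
  pipeline {
      agent { label 'ubuntu' }                      // or whatever label the build itself needs
      options { timeout(time: 2, unit: 'HOURS') }   // timeout now only covers the build job
      stages {
          stage('Build & Test') {
              steps { sh 'mvn -B clean verify' }
          }
          stage('Archive') {
              steps { archiveArtifacts artifacts: 'target/**/*.jar', fingerprint: true }
          }
          stage('Queue deploy job') {
              steps {
                  // Returns immediately; the deploy job can sit in the queue
                  // for as long as it takes to get a free deploy node.
                  build job: 'example-deploy', wait: false,
                        parameters: [string(name: 'UPSTREAM_BUILD', value: env.BUILD_NUMBER)]
              }
          }
      }
  }

  // Downstream "example-deploy" Jenkinsfile (sketch): pulls the artifacts
  // from the upstream run (Copy Artifact plugin) and deploys them.
  pipeline {
      agent { label 'ubuntu' }
      parameters { string(name: 'UPSTREAM_BUILD', defaultValue: '') }
      stages {
          stage('Deploy') {
              steps {
                  copyArtifacts projectName: 'example-build',
                                selector: specific(params.UPSTREAM_BUILD)
                  sh './deploy-to-nexus.sh'           // placeholder for the real deploy step
              }
          }
      }
  }

With that split, only the build job needs a long-running executor; the deploy job can wait in the queue without tripping the upstream timeout.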
The issue, and I have seen this multiple times over the last few weeks, is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase flaky tests and similar are running on multiple nodes at the same time. It seems that one PR or one commit is triggering a job, or jobs, that split into part jobs running on multiple nodes. Just yesterday I saw Hadoop and HBase taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours, and they are triggered on a PR or a commit that could be something as trivial as a typo. This is unacceptable.

HBase in particular is a Hadoop-related project and should be limiting its jobs to the Hadoop-labelled nodes H0-H21, but they are running on any and all nodes. It is all too familiar to see one job running on a dozen or more executors, and the build queue is now constantly in the hundreds, despite the fact that we have nearly 100 nodes. This must stop. Meanwhile, Chris informs me his single job to deploy to Nexus has been waiting for 3 days.

Infra is looking into ways to make our Jenkins fairer to everybody, and would like the community to be involved in this process.

Thanks

Gav... (ASF Infra)

--
Gav...
