> Jenkins is just the 24/7 burn/smoker.

Ah, ok - this explains things perfectly! That was the piece I was missing - if we're intentionally filling our project-specific Jenkins agents, we'd make very bad candidates for the general pool.
Thanks Uwe

On Tue, Oct 15, 2024 at 8:52 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> On 15.10.2024 at 13:40, Jason Gerlowski wrote:
>
> Hi all,
>
> Appreciate the context Uwe! In poking around a bit I see there's a cleanup option in the job config called: "Clean up this jobs workspaces from other slave nodes". Seems like that might help a bit, though I'd have to do a little more digging to be sure. Doesn't appear to be enabled on either the Lucene or Solr jobs, fwiw.
>
> The problem with that build option is that it cleans the workspace after every build, so the startup time is high whenever a job changes nodes. It is better to keep the workspace intact. At the moment we use the option to clean the workspace's changes (git reset).
>
> Does anyone have a pointer to the discussion around getting these project-specific build nodes? I searched around in JIRA and on lists.apache.org, but couldn't find a request for the nodes or a discussion about creating them. Would love to understand the rationale a bit better: in talking to INFRA folks last week, they suggested the main (only?) reason folks use project-specific VMs is to avoid waiting in the general pool...but our average wait time these days at least looks much much longer than anything in the general pool.
>
> The reasons for this were the following:
>
> 1. We wanted to take the I/O system and the number of threads into account rather than hardcode them in the job config; hardcoding would only work if all nodes had the same number of threads and so on. The problem is that the current defaults target developer boxes only, so Gradle does not spin up many parallel threads while executing builds. If we added a Gradle option that enables full parallelism when the CI env var is set (test forks/processes via tests.jvms=XXX and Gradle tasks), this would become obsolete: the jobs would start with Gradle's CI option, and the Gradle config would then use full parallelism when running tasks and tests.
>
> At the moment we have a gradle.properties file per node (also on Policeman Jenkins), which is loaded automatically from the Jenkins user directory. Maintaining those files requires shell access.
>
> Let's open a PR to fix our Gradle build to autodetect CI builds, and let's tune the thread-count logic in the generator to use the system info together with other factors when the CI environment variable is set:
>
> https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62
>
> (This would also spare us from copying files around on ASF Jenkins; that setup is rather complex.)
>
> BUT: 2. The main reason we use a separate pool is the way our tests work. Because of randomized tests, we don't run the jobs only on commit, we run them 24/7. The job triggers are written to keep the work queue for the Lucene nodes always filled: we don't start jobs on commit, we queue a new one every hour or more often (for Lucene). The longer wait time is therefore "wanted": it simply ensures that the queue is always *full*. Jenkins only enqueues a new build if an identical one isn't already waiting, so if you enqueue a new job every 5 minutes, there will always be one job running and another one waiting for execution.
>
> If we flooded the general queue with jobs, other ASF projects would be very unhappy. So I'd suggest running the "burn and smoke the repo" tests on our own nodes 24/7 with full job queues. Maybe only the jobs which are not testing jobs (like publishing artifacts) should go to the common queue.
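(As an aside, here is a rough sketch of the CI autodetection idea Uwe describes above. The class name, the property names apart from tests.jvms, and the scaling factors are my own illustrative assumptions, not the actual GradlePropertiesGenerator logic:)

  // Illustrative sketch only: detect a CI environment and pick more aggressive
  // parallelism defaults than on a developer box. Names and factors are assumptions.
  import java.util.Locale;

  public class CiAwareDefaults {

      // Many CI systems export CI=true; a Jenkins job can set it explicitly in its environment.
      static boolean isCiBuild() {
          String ci = System.getenv("CI");
          return ci != null && !ci.isEmpty() && !"false".equalsIgnoreCase(ci);
      }

      public static void main(String[] args) {
          int cpus = Runtime.getRuntime().availableProcessors();

          // Developer box: stay conservative so the machine remains usable.
          // CI node: use (nearly) all hardware threads, as proposed above.
          int workers = isCiBuild() ? cpus : Math.max(1, cpus / 2);
          int testJvms = isCiBuild() ? Math.max(2, cpus / 2) : Math.max(1, cpus / 4);

          System.out.printf(Locale.ROOT,
              "org.gradle.workers.max=%d%ntests.jvms=%d%n", workers, testJvms);
      }
  }

(On the trigger side, the "keep the queue full" approach corresponds to a periodic Jenkins build trigger - e.g. a cron spec along the lines of "H * * * *" for roughly hourly - rather than an SCM-change trigger.)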
> From the "outside" looking in, it feels like we're taking on more maintenance/infra burden for worse results (at least as defined by 'uptime' and build-wait times).
>
> See above, the results are not worse; you're looking at it from the wrong perspective! The job queue is full and the waiting time long because we want the nodes occupied with jobs all the time. For normal ASF jobs people want low waiting times, because Jenkins should run after commits and inform people who break stuff. We use GitHub for that; Jenkins is just the 24/7 burn/smoker.
>
> Uwe
>
> Best,
>
> Jason
>
> On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> I have root access on both machines. I was not aware of any problems. The workspace-name problem is a known one: if a node is down while the job is renamed, or if multiple nodes have the workspace, Jenkins can't delete it. In the case of multiple nodes, it only deletes the workspace on the node that ran the job most recently.
>
> As a general rule that I always follow: before renaming a job, go to the job and prune the workspace from the web interface. But this has the same problem as described before: it only shows the workspace of the node that executed the job most recently.
>
> Uwe
>
> On 14.10.2024 at 22:01, Jason Gerlowski wrote:
>
> Of course, happy to help - glad you got some 'green' builds.
>
> Both agents should be back online now.
>
> The root of the problem appears to be that Jenkins jobs use a static workspace whose path is based on the name of the job. This would work great if job names never changed, I guess. But our job names *do* drift - both Lucene and Solr tend to include version strings (e.g. Solr-check-9.6, Lucene-check-9.12), which introduces some "drift" and orphans a few workspaces a year. That doesn't sound like much, but each workspace contains a full Solr or Lucene checkout+build, so they add up pretty quickly. Anyway, that root problem remains and will need to be addressed if our projects want to keep the specially tagged agents. But things are healthy for now!
>
> Best,
>
> Jason
>
> On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:
>
> Thanks a lot Jason,
> this helps a lot. I see that the newly added jobs for 10x and 10.0 have been built and it all looks pretty green now.
>
> Thanks
> Luca
>
> On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>
> Hi Luca,
>
> I suspect I'm chiming in here a little late to help with your release-related question, but...
>
> I stopped into the "#askinfra Office Hours" this afternoon at ApacheCon and asked for some help on this. Both workers seemed to have disk-space issues, seemingly due to orphaned workspaces. I've gotten one agent/worker back online (lucene-solr-2, I believe). The other one I'm hoping to get back online shortly, after a bit more cleanup.
>
> (Getting the right permissions to clean things up was a bit of a process; I'm hoping to document this and will share here when that's ready.)
>
> There are still nightly jobs that run on the ASF Jenkins (for both Lucene and Solr); on the Solr side at least these are quite useful.
>
> Best,
>
> Jason
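(Similarly, for the orphaned-workspace problem mentioned above, a small helper along these lines could report stale directories on an agent. The layout assumption - <agent root>/workspace/<job name> - and the helper itself are illustrative sketches, not an existing Lucene/Solr or Jenkins tool, and any deletion should stay a manual decision:)

  // Illustrative sketch only: list workspace directories that no longer match
  // any current job name, i.e. likely leftovers from renamed jobs.
  import java.io.IOException;
  import java.nio.file.*;
  import java.util.*;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class OrphanedWorkspaceReport {

      public static void main(String[] args) throws IOException {
          // args[0]: the agent's workspace root, e.g. <agent root>/workspace (assumed layout)
          // args[1]: a file with current job names, one per line, exported by an admin
          Path workspaceRoot = Paths.get(args[0]);
          Set<String> currentJobs = new HashSet<>(Files.readAllLines(Paths.get(args[1])));

          try (Stream<Path> dirs = Files.list(workspaceRoot)) {
              List<Path> orphans = dirs
                  .filter(Files::isDirectory)
                  // Workspace dirs are named after the job; a renamed job leaves the
                  // old directory behind with a full checkout+build inside.
                  .filter(dir -> !currentJobs.contains(dir.getFileName().toString()))
                  .collect(Collectors.toList());

              for (Path orphan : orphans) {
                  System.out.println("Orphan candidate (review before deleting): " + orphan);
              }
          }
      }
  }

(Reviewing the output by hand before removing anything would avoid deleting a workspace that a recently renamed job still needs.)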
> On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:
>
> Hi all,
> I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday to cover branch_10x and branch_10_0. Not a single build for them has started so far.
>
> Poking around, I noticed in the build history the message "Pending - all nodes of label Lucene are offline", which looked suspicious. Are we still using this Jenkins? I used it successfully for a release I did in the past, but that was already some months ago. The step of creating jobs is still part of the release wizard process anyway, so it felt right to do this step. I am not sure how to proceed from here; does anyone know? I also noticed a low-disk-space warning on one of the two agents.
>
> Thanks
> Luca
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org