> Jenkins is just the 24/7 burn/smoker.

Ah, ok - this explains things perfectly! That was the piece I was missing - if we're intentionally filling our project-specific Jenkins agents, we'd make very bad candidates for the general pool.
Thanks Uwe

On Tue, Oct 15, 2024 at 8:52 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> On 15.10.2024 at 13:40, Jason Gerlowski wrote:
>
> Hi all,
>
> Appreciate the context Uwe! In poking around a bit I see there's a cleanup option in the job config called: "Clean up this jobs workspaces from other slave nodes". Seems like that might help a bit, though I'd have to do a little more digging to be sure. Doesn't appear to be enabled on either the Lucene or Solr jobs, fwiw.
>
> The problem with that build option is that it cleans the workspace after every build, so the startup time is high whenever a job changes nodes. It is better to keep the workspace intact. At the moment we use the option to clean the workspace's changes (git reset).
>
> Does anyone have a pointer to the discussion around getting these project-specific build nodes? I searched around in JIRA and on lists.apache.org, but couldn't find a request for the nodes or a discussion about creating them. Would love to understand the rationale a bit better: in talking to INFRA folks last week, they suggested the main (only?) reason folks use project-specific VMs is to avoid waiting in the general pool...but our average wait time these days at least looks much much longer than anything in the general pool.
>
> The reasons for this were the following:
>
> 1. We wanted to take the I/O system and the number of threads into account rather than hardcode them in the job config; hardcoding would only work if all nodes had the same number of threads and so on. The problem is that the current defaults target developer boxes only, so Gradle does not spin up many parallel threads while executing builds. If we added a Gradle option that enables full parallelism when the CI env var is set (test forks/processes via tests.jvms=XXX and Gradle tasks), this would become obsolete: the jobs would start with Gradle's CI option, and the Gradle config would then use full parallelism when running tasks and tests.
>
> At the moment we have a gradle.properties file per node (also on Policeman Jenkins), which is loaded automatically from the Jenkins user directory. Maintaining those files requires shell access.
>
> Let's open a PR to fix our Gradle build to autodetect CI builds, and let's tune the thread-count logic in the generator to use the system info together with other factors when the CI environment variable is set:
>
> https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62
>
> (This would also spare us from copying files around on ASF Jenkins; that setup is rather complex.)
>
> BUT: 2. The main reason we use a separate pool is the way our tests work. Because of randomized tests, we don't run the jobs only on commit, we run them 24/7. The job triggers are written to keep the work queue for the Lucene nodes always filled: we don't start jobs on commit, we queue a new one every hour or more often (for Lucene). The longer wait time is therefore "wanted": it simply ensures that the queue is always *full*. Jenkins only enqueues a new build if an identical one isn't already waiting, so if you enqueue a new job every 5 minutes, there will always be one job running and another one waiting for execution.
>
> If we flooded the general queue with jobs, other ASF projects would be very unhappy. So I'd suggest running the "burn and smoke the repo" tests on our own nodes 24/7 with full job queues. Maybe only the jobs which are not testing jobs (like publishing artifacts) should go to the common queue.
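(As an aside, here is a rough sketch of the CI autodetection idea Uwe describes above. The class name, the property names apart from tests.jvms, and the scaling factors are my own illustrative assumptions, not the actual GradlePropertiesGenerator logic:)

  // Illustrative sketch only: detect a CI environment and pick more aggressive
  // parallelism defaults than on a developer box. Names and factors are assumptions.
  import java.util.Locale;

  public class CiAwareDefaults {

      // Many CI systems export CI=true; a Jenkins job can set it explicitly in its environment.
      static boolean isCiBuild() {
          String ci = System.getenv("CI");
          return ci != null && !ci.isEmpty() && !"false".equalsIgnoreCase(ci);
      }

      public static void main(String[] args) {
          int cpus = Runtime.getRuntime().availableProcessors();

          // Developer box: stay conservative so the machine remains usable.
          // CI node: use (nearly) all hardware threads, as proposed above.
          int workers = isCiBuild() ? cpus : Math.max(1, cpus / 2);
          int testJvms = isCiBuild() ? Math.max(2, cpus / 2) : Math.max(1, cpus / 4);

          System.out.printf(Locale.ROOT,
              "org.gradle.workers.max=%d%ntests.jvms=%d%n", workers, testJvms);
      }
  }

(On the trigger side, the "keep the queue full" approach corresponds to a periodic Jenkins build trigger - e.g. a cron spec along the lines of "H * * * *" for roughly hourly - rather than an SCM-change trigger.)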
> From the "outside" looking in, it feels like we're taking on more maintenance/infra burden for worse results (at least as defined by 'uptime' and build-wait times).
>
> See above, the results are not worse; you're looking at it from the wrong perspective! The job queue is full and the waiting time long because we want the nodes occupied with jobs all the time. For normal ASF jobs people want low waiting times, because Jenkins should run after commits and inform people who break stuff. We use GitHub for that; Jenkins is just the 24/7 burn/smoker.
>
> Uwe
>
> Best,
>
> Jason
>
> On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
> Hi,
>
> I have root access on both machines. I was not aware of any problems. The workspace-name problem is a known one: if a node is down while the job is renamed, or if multiple nodes have the workspace, Jenkins can't delete it. In the case of multiple nodes, it only deletes the workspace on the node that ran the job most recently.
>
> As a general rule that I always follow: before renaming a job, go to the job and prune the workspace from the web interface. But this has the same problem as described before: it only shows the workspace of the node that executed the job most recently.
>
> Uwe
>
> On 14.10.2024 at 22:01, Jason Gerlowski wrote:
>
> Of course, happy to help - glad you got some 'green' builds.
>
> Both agents should be back online now.
>
> The root of the problem appears to be that Jenkins jobs use a static workspace whose path is based on the name of the job. This would work great if job names never changed, I guess. But our job names *do* drift - both Lucene and Solr tend to include version strings (e.g. Solr-check-9.6, Lucene-check-9.12), which introduces some "drift" and orphans a few workspaces a year. That doesn't sound like much, but each workspace contains a full Solr or Lucene checkout+build, so they add up pretty quickly. Anyway, that root problem remains and will need to be addressed if our projects want to keep the specially tagged agents. But things are healthy for now!
>
> Best,
>
> Jason
>
> On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:
>
> Thanks a lot Jason,
> this helps a lot. I see that the newly added jobs for 10x and 10.0 have been built and it all looks pretty green now.
>
> Thanks
> Luca
>
> On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>
> Hi Luca,
>
> I suspect I'm chiming in here a little late to help with your release-related question, but...
>
> I stopped into the "#askinfra Office Hours" this afternoon at ApacheCon and asked for some help on this. Both workers seemed to have disk-space issues, seemingly due to orphaned workspaces. I've gotten one agent/worker back online (lucene-solr-2, I believe). The other one I'm hoping to get back online shortly, after a bit more cleanup.
>
> (Getting the right permissions to clean things up was a bit of a process; I'm hoping to document this and will share here when that's ready.)
>
> There are still nightly jobs that run on the ASF Jenkins (for both Lucene and Solr); on the Solr side at least these are quite useful.
>
> Best,
>
> Jason
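(Similarly, for the orphaned-workspace problem mentioned above, a small helper along these lines could report stale directories on an agent. The layout assumption - <agent root>/workspace/<job name> - and the helper itself are illustrative sketches, not an existing Lucene/Solr or Jenkins tool, and any deletion should stay a manual decision:)

  // Illustrative sketch only: list workspace directories that no longer match
  // any current job name, i.e. likely leftovers from renamed jobs.
  import java.io.IOException;
  import java.nio.file.*;
  import java.util.*;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class OrphanedWorkspaceReport {

      public static void main(String[] args) throws IOException {
          // args[0]: the agent's workspace root, e.g. <agent root>/workspace (assumed layout)
          // args[1]: a file with current job names, one per line, exported by an admin
          Path workspaceRoot = Paths.get(args[0]);
          Set<String> currentJobs = new HashSet<>(Files.readAllLines(Paths.get(args[1])));

          try (Stream<Path> dirs = Files.list(workspaceRoot)) {
              List<Path> orphans = dirs
                  .filter(Files::isDirectory)
                  // Workspace dirs are named after the job; a renamed job leaves the
                  // old directory behind with a full checkout+build inside.
                  .filter(dir -> !currentJobs.contains(dir.getFileName().toString()))
                  .collect(Collectors.toList());

              for (Path orphan : orphans) {
                  System.out.println("Orphan candidate (review before deleting): " + orphan);
              }
          }
      }
  }

(Reviewing the output by hand before removing anything would avoid deleting a workspace that a recently renamed job still needs.)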
> On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:
>
> Hi all,
> I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday to cover branch_10x and branch_10_0. Not a single build for them has started so far.
>
> Poking around, I noticed in the build history the message "Pending - all nodes of label Lucene are offline", which looked suspicious. Are we still using this Jenkins? I used it successfully for a release I did in the past, but that was already some months ago. The step of creating jobs is still part of the release wizard process anyway, so it felt right to do this step. I am not sure how to proceed from here; does anyone know? I also noticed a low-disk-space warning on one of the two agents.
>
> Thanks
> Luca
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org