+1 to only having one way to do things. The 'lite' option seems liable to cause more problems, since its changes can be blown away if a new image isn't prepared anyway. I don't think we change the images often enough to justify it. If we keep it at all, perhaps we should describe it as an option for testing changes?

On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <[email protected]> wrote:

> All workers were updated to use jenkins-slave-boot-image-20211011, which should have had a go command, but it appears slightly misconfigured. I reopened BEAM-13037 [1] and added some details there.
>
> I also added instructions to wiki [2] on how to perform an image swap and it is actually very straightforward. I think a lesson here is that making 'lite' upgrades is brittle as misconfigurations could resurface down the road when the context of the lite upgrade is no longer fresh in our memory.
>
> I suggest we revise the instructions to keep only image swap commands and remove the 'lite' update option. +Daniel Oliveira <[email protected]>, WDYT? In the meantime, we should also prepare an image that fixes the misconfiguration. Would you be able to help with that? Thank you.
>
> [1] https://issues.apache.org/jira/browse/BEAM-13037
> [2] https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>
> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <[email protected]> wrote:
>
>> FYI it looks like all the Go tests are now failing because it can't find the Go command at all.
>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>
>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <[email protected]> wrote:
>>
>>> Thanks Daniel,
>>>
>>> I can recreate the VMs on new disks.
>>>
>>> We currently have a set of stopped jenkins workers (named: apache-beam-jenkins-##) and running workers (named: apache-ci-beam-jenkins-##)
>>>
>>> Are there any concerns about deleting the stopped group of workers?
>>>
>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <[email protected]> wrote:
>>>
>>>> Thank you Daniel, Valentyn!
>>>>
>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <[email protected]> wrote:
>>>>
>>>>> I performed a light update of both Go and Python (from Valentyn's update) on each worker VM over the weekend. I also added additional instructions for the light update to Confluence (as an alternative to the current instructions).
>>>>>
>>>>> There is still reason to perform a full update at some point: Valentyn updated the VM image from 500 GB to 1000 GB of storage, which requires a full update to actually take effect.
>>>>>
>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <[email protected]> wrote:
>>>>>
>>>>>> > 3. SSH into the agent and perform the update.
>>>>>> So, this would be a 'lite' version of the update, where we make changes to the live worker without recreating the worker VM with a new image?
>>>>>> We could perhaps document both options, and also make it clear that producing a VM image that has the necessary updates is mandatory even if we perform 'lite' updates without recreating the worker.
>>>>>> Also, for a lite update, marking the Jenkins worker offline may be optional, as some updates might not be disruptive (such as installing some software that will not be used immediately).
>>>>>>
>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <[email protected]> wrote:
>>>>>>
>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>
>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thank you Daniel. Could you please update the wiki once you are done with the process?
>>>>>>>>
>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an approach for updating Go and did so and will be updating the image momentarily.
>>>>>>>>>
>>>>>>>>> I think a more important note is that I tried what Valentyn was considering, which is SSHing into workers and updating the dependency. I'll describe the process below, but the summary is that I did it on one worker with Go so far, saw no problems over the weekend, and would like to continue updating the rest of the workers if there are no objections.
>>>>>>>>>
>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with this approach, these instructions can be added to Confluence:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and click "Mark this node temporarily offline", leaving a reason such as "Updating X dependency."
>>>>>>>>> 2. Wait until there are no more tests running in that agent (under "Build Executor Status" on the left of the page).
>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>
>>>>>>>>> And these are some additional steps if you want to immediately run a test suite to check that the update worked correctly. For example in my case, I wanted to check against the Go Postcommit, and it was a good thing I did, because it actually failed the first time and I had to go back in to fix a small oversight I made. So doing this after you update your first worker is probably a good idea before updating the rest:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run" and change the restriction from "beam" to the specific name of the agent (ex. "apache-beam-jenkins-1").
>>>>>>>>> 4. Save and apply that change.
>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters" on the left menu.
>>>>>>>>> 6. Run the build on "master".
>>>>>>>>> 7. Once you're done checking the results, change the restriction for the job back to "beam". (This also gets reset once every 24 hours in case you forget.)
>>>>>>>>>
>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when it wasn't too busy, and got Go updated and working. I checked that agent's execution history again today just in case, and it was healthy over the weekend, with no Go-related problems as far as I could see. If there are no objections I'd like to go ahead and continue updating the rest of the workers (I'll do this late at night or over the weekend to avoid disrupting dev work).
>>>>>>>>>
>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>
>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I updated the image in [1], but did not change the workers to pick up the new image yet. We can do this once we add Go changes on top of it.
>>>>>>>>>>
>>>>>>>>>> I am also considering SSHing into every worker and running a one-line command that adds the dependency that was missing. It seems to be low risk, and there is a fall-back plan to re-start the worker using the saved image - both new and old images are saved and available in Cloud Console.
>>>>>>>>>>
>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Brian Hulette <[email protected]> That button seems like exactly what we'd need. Doing it manually would be a pain, but it's probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>
>>>>>>>>>>> @Valentyn Tymofieiev <[email protected]> Collaborating to do both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am also interested in updating the version of Python on the VMs; I need to install Python 3.9. Thanks for looking into this. We can coordinate to make one update instead of two.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks like you can manually "Mark node temporarily offline" when logged in (if you're a committer). According to [2] this will prevent it from picking up new jobs after it's finished the currently executing ones. Doing that manually for every worker could be a pain though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>> [2] https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I found these instructions on upgrading software on Jenkins <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on our cwiki.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering about the last few steps that involve stopping VMs, deleting boot disks, and restarting executors. Is there some best practice for that section to avoid causing interruptions in our automated testing? Should I be trying to do this outside of peak dev hours, or going one VM at a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Daniel Oliveira
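
For reference, the per-worker procedure quoted above (mark the node offline, wait for running tests to finish, apply the update, bring the node back online, repeat) and the suggestion to verify on the first worker before touching the rest could be sketched roughly as below. This is only an illustration of the control flow, not something anyone in the thread ran: the function name and all of its parameters are hypothetical, and the callables stand in for whatever mechanism is actually used to talk to Jenkins and the VMs, which this sketch deliberately does not assume.

```python
import time
from typing import Callable, Iterable, List, Optional


def rolling_update(
    workers: Iterable[str],
    mark_offline: Callable[[str, str], None],   # hypothetical: (worker, reason)
    is_idle: Callable[[str], bool],             # hypothetical: no tests running?
    run_update: Callable[[str], None],          # hypothetical: e.g. a command over SSH
    mark_online: Callable[[str], None],         # hypothetical
    verify_first: Optional[Callable[[str], bool]] = None,
    reason: str = "Updating X dependency.",
    poll_seconds: float = 30.0,
) -> List[str]:
    """Update workers one at a time and return the names that were updated."""
    updated: List[str] = []
    for index, worker in enumerate(workers):
        # Step 1: stop the worker from accepting new jobs.
        mark_offline(worker, reason)
        # Step 2: wait until the jobs already running have finished.
        while not is_idle(worker):
            time.sleep(poll_seconds)
        # Step 3: apply the update.
        run_update(worker)
        # Optional check on the first worker only, before updating the rest.
        if index == 0 and verify_first is not None and not verify_first(worker):
            # The worker is intentionally left offline for investigation.
            raise RuntimeError(f"verification failed on {worker}")
        # Step 4: bring the worker back online, then move to the next one.
        mark_online(worker)
        updated.append(worker)
    return updated
```

If any step raises, the loop stops and the affected worker stays offline, so a bad update is not rolled out to the remaining workers. Note that this only automates the 'lite' path; it does not replace preparing a new image.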
