Pierre,
As always, thanks for the reply and advice.
Note that I've clipped items that were addressed and that I have no more
comments on.
On 10/5/2017 3:45 AM, Pierre Tardy wrote:
On Wed, Oct 4, 2017 at 5:12 PM Neil Gilmore <[email protected]> wrote:
We have also been getting a lot of errors apparently tied to build
collapsing, which we have turned on globally. If you've been following
along with the anecdotes, you'll know that we've also slightly modified
the circumstances under which a build will be collapsed so that it
ignores revision (in our case, we always want to use the latest -- we
don't care about building anything 'intermediate'). We'd been getting a
lot of 'tried to complete N buildrequests, but only completed M' warnings.
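For context, a revision-ignoring collapse policy can be expressed through Buildbot's collapseRequests hook, which may be set to a callable that receives the two candidate requests. This is only a minimal sketch under assumptions (the 'sourcestamps', 'branch', 'project', and 'repository' field names are illustrative), not the actual modification described above:

```python
# Sketch of a collapseRequests callable that ignores revision.
# Buildbot calls the hook with (master, builder, req1, req2) and collapses
# the two requests when it returns True. The dictionary field names used
# here are assumptions for illustration, not Buildbot's exact shapes.
def collapse_ignoring_revision(master, builder, req1, req2):
    ss1 = req1.get('sourcestamps', [])
    ss2 = req2.get('sourcestamps', [])
    if len(ss1) != len(ss2):
        return False
    # Compare everything that identifies the source except the revision,
    # so two requests for different revisions of the same branch collapse.
    def key(ss):
        return (ss.get('branch'), ss.get('project'), ss.get('repository'))
    return all(key(a) == key(b) for a, b in zip(ss1, ss2))

# In master.cfg this would be wired up roughly as:
# c['collapseRequests'] = collapse_ignoring_revision
```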
We have seen other people hitting those issues as well. I made a fix in
0.9.10, but it looks like some people are still complaining about it,
without much of a clue as to what is wrong beyond what was fixed.
The known problem was that the N buildrequests were not actually unique;
the list contained duplicates.
So those warnings should be pretty harmless beyond the noise.
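If the root cause really was duplicates in the request list, the mismatch is just M unique requests counted against N list entries, and the fix amounts to de-duplicating before counting. An illustrative sketch (not the actual 0.9.10 patch):

```python
# Illustrative only: the 'tried to complete N, but only completed M'
# mismatch disappears when the request ids are de-duplicated before the
# completion count is taken. This is not the actual Buildbot patch.
def unique_requests(brids):
    seen = set()
    unique = []
    for brid in brids:
        if brid not in seen:
            seen.add(brid)
            unique.append(brid)
    return unique
```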
Even though the error involving those buildrequests cancels the
transaction, so that the original work of marking the requests doesn't
happen? Or would that just mean the requests don't get skipped?
It does seem like the incidence of this warning has been going down,
though we haven't done anything to fix it.
And I left some builders' pages up in my browser long enough to see that
every build (except forced builds) was getting marked as SKIPPED
eventually. Forced builds were never getting claimed. Nor were the
skipped builds marked as claimed, which is odd, because the collapsing
code claims builds before marking them skipped. And the comments indicate
that a prime suspect in getting that warning is builds that were already
claimed.
Normally the buildrequest collapser is not supposed to mark *builds*
skipped; it marks buildrequests as skipped.
So could that be another thing going on in your setup?
My mistake in using the wrong term here. The code appears to claim the
request then mark it as skipped. But in the UI, I never see a skipped
request marked as claimed.
The result of this is that our master is failing in its prime mission,
which is to run builds. I've been occasionally able to get a build to
happen by stopping the worker. When our process starts the worker back
up, and it connects, the master will look for a pending build and start
it. But any subsequent builds will not start. And if there aren't any
queued builds, a build that gets queued while the worker is running is
not started. And the builder we use to start workers, which is scheduled
every half hour, didn't run for 18 hours (though it seems to have just
started a build).
Not sure exactly how to answer that. This is not normal, but there are
many reasons that could lead to that situation.
In my experience, it is very often related to some customization code
that is failing.
Does the first build finish correctly? Is there a nextWorker that is
not behaving correctly? Do you have custom workers?
I've seen people get good results using Manhole to debug those freezes.
The only actual custom code we have is a pair of custom build steps that
produce logs useful to us, and the modification to collapsing to ignore
revision.
The builder we have to start workers does not use the custom steps,
though we have collapsing turned on globally. I have not seen that
builder having any skipped requests. It appears to be running normally
since yesterday.
For the builder that only wants to run once, the first build finishes
correctly.
We do not have custom workers.
https://docs.buildbot.net/current/manual/cfg-global.html#manhole
That could help you poke into the workers and workerforbuilders objects
to inspect their state.
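Enabling a manhole from master.cfg looks roughly like this. It is a sketch only: the constructor arguments (and whether an SSH host key directory is required) vary between Buildbot versions, so check the documentation page above before copying it.

```python
# Rough sketch of enabling a manhole in master.cfg.
# Constructor arguments differ across Buildbot versions; consult the
# manhole documentation before use. Credentials here are placeholders.
from buildbot import manhole

c['manhole'] = manhole.PasswordManhole(
    "tcp:1234:interface=127.0.0.1",  # listen on localhost only
    "admin",                         # username (placeholder)
    "passwd",                        # password (placeholder)
)
```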
I've used the manhole before, but not for this. I've had to use it in
the past to manually finish stuck builds, and to manually release locks
when necessary (though I haven't had to do that in a long time).
But we don't leave the manhole open, which means that I reconfig when
I'm going to use it (and since we use the same master.cfg for all the
masters, the manhole would try, and probably fail, to open for all of
them). Lately, that hasn't been a good option, because when we were
having the CPU spikes, the reconfig would never finish (it might run for
24 hours or more until we were going to restart the master anyway). It
might work now, though, since we seem to have solved the CPU problem.
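Since the same master.cfg is shared by all the masters, one way to avoid every master trying (and probably failing) to open the same manhole port is to gate the manhole on the hostname. A minimal sketch of the guard, assuming a made-up master hostname:

```python
import socket

# Only the designated debug master opens a manhole; the others skip it.
# 'buildmaster-1' is a placeholder name, not a real host from this thread.
DEBUG_MASTERS = {'buildmaster-1'}

def should_open_manhole(hostname, debug_masters=DEBUG_MASTERS):
    # Compare the short hostname so 'buildmaster-1.example.com' matches too.
    return hostname.split('.')[0] in debug_masters

# In master.cfg, something like:
# if should_open_manhole(socket.gethostname()):
#     c['manhole'] = ...  # manhole config for the one debug master
```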
Neil Gilmore
grammatech.com
_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users