Hello -- Curious to know if anyone has seen this behavior before.

GoCD version v20.2.0
Kubernetes Elastic Agent Plugin v3.2.0-187

We noticed in the console.log file for an actively running job that, midway 
through the run, there is a request to create a new pod for the job and the 
job starts over again:

[go] Received request to create a pod for job 
JobIdentifier{pipelineName='REDACTED', pipelineCounter=412, 
pipelineLabel='412', stageName='deployStage', stageCounter='1', 
jobName='deploy', jobId=182,399} in cluster 
https://REDACTED.gr7.us-east-1.eks.amazonaws.com at 2021-10-08 19:02:25 
+00:00

Looking through the go-server log, I do see an entry suggesting it thinks 
the job is hung:

2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:611 - Found hung job[id=JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it

2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:660 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}

2021-10-08 19:02:22,082 INFO [ThreadPoolTaskScheduler-1] ScheduleService:678 - [Job Reschedule] Scheduled new job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182399]. Replacing old job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391]

There's nothing in the job's console to indicate the job was hung. Turning 
on timestamps shows that there was a log entry about 2 minutes before the 
entry for receiving the new pod creation request.

However, I did notice that at the same time GoCD detected that hung job, it 
also detected another hung job, which in all likelihood really was hung, or 
at least was slow to report any console output (we have to tune that job 
better, as it's a bit of a resource hog).

What *seems* to have happened is that GoCD detected a legitimately hung job 
and rescheduled it, but in the process *also* reported detecting another hung 
job that is likely NOT actually hung, and ended up rescheduling that one as 
well.

Is this a plausible scenario? Has anyone experienced anything like this before?

I was looking through the release notes for the versions of GoCD and the 
Kubernetes elastic agent plugin beyond what we currently have deployed, and 
I'm not seeing any bug fixes that might address this.

In the meantime, I'll see if I can mine through our go-server log history 
and find other occurrences of this scenario.
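
In case it's useful, this is the rough Python sketch I plan to use to pull 
the "Found hung job" entries out of the logs and spot sweeps where more than 
one job got flagged at the same time. The log path/glob is an assumption 
about our layout (plain-text rotated go-server logs); adjust as needed:

#!/usr/bin/env python3
# Rough sketch: scan go-server logs for "Found hung job" entries and group
# them by minute, so sweeps that flagged more than one job at once stand out.
# The glob below is an assumption about our log layout -- adjust to taste.
import glob
import re
from collections import defaultdict

LOG_GLOB = "/var/log/go-server/go-server.log*"

# Matches e.g.:
# 2021-10-08 19:02:22,074 WARN [...] ScheduleService:611 - Found hung job[id=JobInstance{... name='deploy', ...
HUNG = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}.*Found hung job.*?name='([^']+)'")

by_minute = defaultdict(list)

for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as f:
        for line in f:
            m = HUNG.match(line)
            if m:
                minute, job = m.groups()
                by_minute[minute].append(job)

for minute in sorted(by_minute):
    jobs = by_minute[minute]
    flag = "  <-- multiple jobs flagged in the same sweep" if len(jobs) > 1 else ""
    print(f"{minute}  {len(jobs)} hung-job reschedule(s): {', '.join(jobs)}{flag}")

If that turns up other sweeps where multiple jobs were flagged together, I'll 
follow up here with what I find.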

Thanks.

Doug
