Re: [go-cd] Elastic agents plugin usage

2024-04-25 Thread Chad Wilson
It seems you are using the third party EC2 Elastic Agent Plugin. This plugin
does not implement the plugin API correctly and does not support environments
correctly, which is presumably why you have that user data hack.

It seems to me from a quick look at the code that this plugin only runs a
single job on an EC2 instance before terminating it, so you shouldn't expect
re-use across multiple jobs. If you want to know more about how it is
designed, you are better off asking on their GitHub repo.

I don't know for certain why your agents aren't shutting down correctly; you
probably need to look at the plugin logs (on both the server and the agent
itself) to investigate.

However, since it looks like you have an ec2_user_data hack in place to get
some environment support with the plugin, you need to manually make sure
that the environments in the config
agent.auto.register.environments=staging,sandbox *exactly match the possible
pipeline environments for all possible jobs* you assign to this elastic agent
profile ID.

I also think having multiple environments registered here will possibly
cause chaos, because that is not how elastic agents normally manage
environments: they normally register for only a single environment.
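
As a rough sketch (reusing your own user_data mechanism, with the profile and
environment names from your example; adjust to your actual setup), each
elastic profile would register exactly one environment:

```
# Hypothetical user_data for an "elastic_profile_staging" profile:
# register this agent for exactly one environment, matching the pipelines
# that use this profile.
echo "agent.auto.register.environments=staging" | sudo tee -a \
  /var/lib/go-agent/config/autoregister.properties > /dev/null

# A separate "elastic_profile_sandbox" profile would instead use:
#   echo "agent.auto.register.environments=sandbox" | sudo tee -a \
#     /var/lib/go-agent/config/autoregister.properties > /dev/null
```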

The problem will be that if *any single job* on your GoCD is assigned to,
say, profile "*elastic_profile_staging*" with the autoregister config like
you have below, but that job is configured for a pipeline inside a GoCD
environment called "*other_env*", an elastic agent will start but will never
get assigned the job. This is because it has registered only for
"*staging,sandbox*" via your hardcoded user_data, NOT "*other_env*".

This breaks the elastic agent plugin contract - GoCD thinks it has already
told the plugin to create an agent for "*other_env*", but the plugin never
registers one for that environment. Now GoCD is confused about what is
happening with the agents. Thus the job will likely never get assigned, the
plugin will never complete a job, and it will never shut down the EC2
instance. Perhaps this is what is happening to you? You might want to check
whether you have EC2 instances whose agent logs don't show them doing any
work, or whose environments are mismatched with their elastic profiles.
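
If it helps, here is a minimal way to inspect a suspect instance (paths
assume the standard Linux go-agent package layout your user_data already
uses; they may differ on your AMI):

```
# Which environments did this agent actually register for?
cat /var/lib/go-agent/config/autoregister.properties

# Is the agent picking up any work at all? Look for job assignment
# (or endless idle/ping) messages near the end of the agent log.
tail -n 200 /var/log/go-agent/go-agent.log
```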

With a correctly behaving plugin, GoCD tells the plugin a single environment
to register for (the one it needs a new agent to run a job on), and expects
the plugin to register for exactly that environment. This EC2 plugin breaks
that contract, which makes it very easy to misconfigure things and create all
sorts of problems. Personally, I wouldn't use it if I were using GoCD
environments, but that's your decision to make.

-Chad


Re: [go-cd] Elastic agents plugin usage

2024-04-25 Thread Satya Elipe
Thank you Sriram.
Please find my comments below.

>Do the various jobs have an elastic profile ID set?
Yes, I have two environments, staging and prod, so we have separate profiles
set for them.

Here is pretty much what each profile has:

   1. ec2_ami
   2. ec2_instance_profile
   3. ec2_subnets
   4. ec2_instance_type
   5. ec2_key
   6. ec2_user_data
   *echo "agent.auto.register.environments=staging,sandbox" | sudo tee -a
   /var/lib/go-agent/config/autoregister.properties > /dev/null*
   7. ec2_sg


>What is the error that you see due to the max count limit?
```
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:21:38 +00:00
[go] Successfully created new instance i-093b44f70992505cc in
subnet-555bba0d
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:23:38 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job,
brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job,
brxt-core-service-deploy-staging/86/verify-stage/1/verify-job,
brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job,
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:25:39 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job,
brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job,
brxt-core-service-deploy-staging/86/verify-stage/1/verify-job,
brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job,
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:27:39 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-core-service-deploy-staging/86/prepare-for-deploy-stage/1/prepare-for-deploy-job,
brxt-core-service-deploy-staging/86/deploy-stage/1/deploy-job,
brxt-core-service-deploy-staging/86/verify-stage/1/verify-job,
brxt-config-service-deploy-staging/18/deploy-stage/1/deploy-job,
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:39:58 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:41:56 +00:00
[go] Successfully created new instance i-0ca1b2dc4996c210b in
subnet-555bba0d
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:43:56 +00:00
[go] Successfully created new instance i-0bc0bf6e763b6ebf0 in
subnet-555bba0d
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:45:56 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
[go] Received request to create an instance for
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job
at 2024-04-09 11:47:56 +00:00
[go] The number of instances currently running is currently at the maximum
permissible limit, "2". Not creating more instances for jobs:
brxt-config-service-deploy-production/19/prepare-deploy-stage/1/prepare-deploy-job.
Go cancelled this job as it has not been assigned an agent for more than 10
minute(s)
```

As you can see in the log, that is exactly what happened: two instances ended
up running, but neither was assigned to the job, and the job eventually
failed.
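
For reference, here is roughly how I can check what GoCD thinks those two
agents are and which environments they registered with (a sketch only; the
Accept version header and the token auth are assumptions to adjust for our
GoCD release):

```
# List all agents known to the server, including their environments.
# "your-gocd-server" and $GOCD_TOKEN are placeholders.
curl -s -H "Accept: application/vnd.go.cd.v7+json" \
     -H "Authorization: Bearer $GOCD_TOKEN" \
     "https://your-gocd-server/go/api/agents"
```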

>When you say "staging job", do you have a stage in a pipeline called
"staging" with one job in it? Or do you have a stage in a pipeline with one
job called "staging" and the other called "prod"?
Attached is one of our pipelines: triggering the build job in turn triggers
the second, and the second triggers the third.

Re: [go-cd] Elastic agents plugin usage

2024-04-25 Thread Chad Wilson
Can you be specific about the type of elastic agents you are creating and
the plugin you are using? Kubernetes? Docker? Something else? There are
many elastic agent plugins.


> Here's where it gets tricky: when the staging job completes and triggers
> the production job, I expect one of the active agents to take over.
> Instead, the production job attempts to launch new agents, fails due to the
> max count limit, and runs without any agents, leading to failure.
>

I believe elastic agents are generally launched sequentially - i.e. a new
one won't be launched until there are no pending-launch ones - but this
depends on the specific elastic agent type.

If you are new to elastic agents, you'll want to be aware that in almost
all elastic agent plugin variants the elastic agents are
single-shot/single-job and are not re-used. The specific type of elastic
agent and its plugin implementation defines how it handles such things,
though, so we need to know the specifics to guess.

Look at the specific elastic agent plugin's log on the server to see what
it is doing. Perhaps your elastic agents are not shutting down
automatically for some reason due to a configuration issue or a problem
with the jobs you are running?
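
On a Linux package install the per-plugin logs usually sit alongside the
main server log (the exact path and file name depend on your install method
and the plugin id, so treat this as a sketch):

```
# Server-side per-plugin logs; the file name contains the plugin id.
ls /var/log/go-server/plugin-*.log

# Follow the elastic agent plugin's log while a job is waiting for an agent.
tail -f /var/log/go-server/plugin-*elastic*.log
```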

-Chad



Re: [go-cd] Elastic agents plugin usage

2024-04-25 Thread Sriram Narayanan
On Thu, Apr 25, 2024 at 10:01 PM Satya Elipe  wrote:

> Hi All
>
> I'm encountering some issues with the way Elastic agents are launched,
> assigned, and terminated. Despite setting the maximum agent count to two,
> both agents launch sequentially, with only the first being assigned to the
> job.
>

Do you want the job to run on both the agents? If so, then these
instructions will help you:
https://docs.gocd.org/current/advanced_usage/admin_spawn_multiple_jobs.html


>
> Here's where it gets tricky: when the staging job completes and triggers
> the production job, I expect one of the active agents to take over.
> Instead, the production job attempts to launch new agents, fails due to the
> max count limit, and runs without any agents, leading to failure.
>
>
>
Do the various jobs have an elastic profile ID set?

What is the error that you see due to the max count limit?

When you say "staging job", do you have a stage in a pipeline called
"staging" with one job in it? Or do you have a stage in a pipeline with one
job called "staging" and the other called "prod"?

Could you share how your pipelines are composed? I'm especially asking this
since many new users tend to adopt GoCD after using other tools and carry
over not only the terminology but also the constraints. If you share your
pipeline structure and what you want to achieve, then we can design
something together.


> Additionally, some agent instances remain active for an extended period,
> requiring manual termination. This disrupts the workflow significantly.
>
>
>
On our cluster, we see the pods being activated on demand; the relevant job
then runs in the pod, and the pod is then deactivated. We are sticking to
the default of "10 pods" right now, and will increase the limit after some
parallel-load reviews.

Could you share your Cluster Profile and the Elastic Profile? Please take
care to obfuscate any org-specific information such as IP addresses,
hostnames, AWS ARNs, URLs, etc.


> Has anyone experienced similar issues, or anyone has any suggestions for a
> workaround?
>
>
> Thanks in advance !
>



[go-cd] Elastic agents plugin usage

2024-04-25 Thread Satya Elipe
Hi All

I'm encountering some issues with the way Elastic agents are launched,
assigned, and terminated. Despite setting the maximum agent count to two,
both agents launch sequentially, with only the first being assigned to the
job.


Here's where it gets tricky: when the staging job completes and triggers
the production job, I expect one of the active agents to take over.
Instead, the production job attempts to launch new agents, fails due to the
max count limit, and runs without any agents, leading to failure.


Additionally, some agent instances remain active for an extended period,
requiring manual termination. This disrupts the workflow significantly.


Has anyone experienced similar issues, or does anyone have any suggestions
for a workaround?


Thanks in advance!
