Re: [go-cd] Agents going offline randomly on 22.3

2023-04-07 Thread Chad Wilson
FWIW, you can auto-register resources and environments along with an agent
just as with manual assignments:
https://docs.gocd.org/current/advanced_usage/agent_auto_register.html
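
As a rough illustration of what that could look like for the SQL / SQL-A
tags mentioned in this thread, here is a minimal sketch that renders an
autoregister.properties file for one agent. The property names are the ones
described on the docs page above; the helper name, the "prod" environment
and the key value are placeholders I made up, so check everything against
your own server config before using it.

```python
def render_autoregister(key, resources=(), environments=(), hostname=None):
    """Return the text of an autoregister.properties file for one agent.

    `key` must match the auto-register key configured on the GoCD server.
    Resources and environments are written as comma-separated lists.
    """
    lines = [f"agent.auto.register.key={key}"]
    if resources:
        lines.append("agent.auto.register.resources=" + ",".join(resources))
    if environments:
        lines.append("agent.auto.register.environments=" + ",".join(environments))
    if hostname:
        lines.append(f"agent.auto.register.hostname={hostname}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: replace the key and environment with your own.
    print(render_autoregister("replace-with-server-key",
                              ["SQL", "SQL-A"], ["prod"], "Go Agent 01"), end="")
```

The output would be written to the agent's config folder before the agent
first starts, so the tags arrive with the registration instead of being
added by hand afterwards.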

The agent UUID is in the database, but I believe the token is not.

No idea what might cause this on the server side, but if a database
interruption is involved I'd expect errors in the logs that line up with
the timing.

-Chad


RE: [go-cd] Agents going offline randomly on 22.3

2023-04-07 Thread chantryc
I’ll see if I can aggregate all the logs.

For the resources, we assign values to particular agents so that jobs are
routed to agents based on what a job step needs to do. For example, if a
step needs to run local to a SQL server, we tag that agent with a code and
something like SQL (or SQL-A if it’s in a cluster), and the job then
auto-assigns to the first valid agent. I wasn’t sure that auto-registration
could carry that level of detail, so we’ve just done it by hand.

I don’t think it’s crossover, because 4-5 of the agents I need to fix are
the only agent on their server, but I’ll check the logs and see what they
say. Would something interrupting the SQL instance cause this to occur? A
write to the agents table being missed or timing out, etc.? I’ve had some
that have both the token and guid, but this one just has a guid. I’m going
to make sure the SQL folders are being handled gently and that AV isn’t
interfering.

I was asking about the guid/token to see whether it is stored somewhere I
could retrieve it from, or whether it could be manually registered, so that
the agent reattaches to its existing assignments (environment and
resources) instead of us enabling a new agent and doing all the assignments
again. Some of these have quite a few resources to eyeball for comparisons.

I’ll share back as soon as I find either more questions or a solution
someone else might find useful.

Thanks!

 


Re: [go-cd] Agents going offline randomly on 22.3

2023-04-04 Thread Chad Wilson
Was the setup working at some point and then something changed?

It sounds to me like you have some problem with

   - agents' identities getting confused with one another (shared GUIDs),
   or
   - accidentally sharing working folders between two agent processes
   (double-starting an agent perhaps?) or
   - token getting removed after it is first issued (by something...)

Do you have any automated re-provisioning of the agents or other automation
here that could be interfering with the config/token or guid.txt files?

I can't really think of any other reason this would happen, and there's not
really much information here to debug. If the agents aren't getting
confused with one another, it looks like the agent still knows its GUID
but, assuming it was previously working, the token it was issued has been
lost from disk. To my knowledge the agent only actively deletes a token
when the server denies its registration with a 403 FORBIDDEN error after
you reject the registration. So if you have missing tokens for agents that
were previously OK, perhaps you want to see what could be deleting the
token?

You may also need to follow an agent's full log and timeline to see how
that could have happened, correlating it with other events, and search the
server log for the agent's GUID to see what might be happening - snippets
like the below aren't complete enough to be helpful. Or have a look through
https://github.com/gocd/gocd/issues/5170
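
To make the GUID search a bit more concrete, here is a minimal sketch that
pulls every timestamped line mentioning a given GUID out of one or more log
files and merges them into a single timeline. It assumes the timestamp
format shown in the agent log in this thread; the function names and the
command-line usage are my own invention, not anything GoCD ships.

```python
import re
import sys
from pathlib import Path

# Timestamp prefix as seen in the thread's log lines, e.g. "2023-04-03 18:46:28,930".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def lines_mentioning(guid, log_lines, source):
    """Yield (timestamp, source, line) for timestamped lines containing guid."""
    for line in log_lines:
        match = TIMESTAMP.match(line)
        if match and guid in line:
            yield (match.group(0), source, line.rstrip("\n"))


def timeline(guid, logs):
    """Merge hits from several logs, ordered by timestamp.

    `logs` maps a label (for example a file name) to an iterable of lines.
    This timestamp format sorts correctly as plain text.
    """
    hits = []
    for source, log_lines in logs.items():
        hits.extend(lines_mentioning(guid, log_lines, source))
    return sorted(hits)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: python guid_timeline.py <guid.txt> <log file> [<log file> ...]
    guid = Path(sys.argv[1]).read_text().strip()
    logs = {p: Path(p).read_text(errors="replace").splitlines() for p in sys.argv[2:]}
    for _, source, line in timeline(guid, logs):
        print(f"{source}: {line}")
```

Continuation lines without a timestamp (stack traces, for instance) are
skipped, so treat the result as an index into the full logs rather than a
replacement for reading them.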

And no, you can't recreate the GUID/token from PostgreSQL, though I'm not
sure what you mean here. Removing the GUID and token and restarting the
agent should be sufficient to get it to re-register reliably - as long as
the root problem affecting the agents is addressed.
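
A minimal sketch of that reset, assuming the identity files are the token
and guid.txt files in the agent's config folder as mentioned above. The
function name is made up, and the agent should be stopped before running
anything like this.

```python
from pathlib import Path

# File names as referred to in this thread; adjust if your layout differs.
IDENTITY_FILES = ("guid.txt", "token")


def reset_agent_identity(config_dir):
    """Delete the agent's GUID and token files, leaving other config alone.

    Returns the names of the files that were actually removed.
    """
    removed = []
    for name in IDENTITY_FILES:
        path = Path(config_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Unlike wiping the whole config folder, this leaves any other files in place.
The agent will still come back as a new registration on the server, so the
old entry's resources and environments are not carried over by this alone.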

As for the resource tags, is there a reason you're doing that manually? You
may be able to use auto registration of agents to automate that?
https://docs.gocd.org/current/advanced_usage/agent_auto_register.html

-Chad


[go-cd] Agents going offline randomly on 22.3

2023-04-04 Thread Funkycybermonk
Hello! I'm running 22.3 and I keep having agents go offline. For example, 
on a particular server (mirror setup to other environments) I have several 
agents running side-by-side on an admin server and then an agent on various 
individual servers. At the moment for this particular example, I have 12 of 
15 agents that are running perfectly fine. They all enabled and took their 
configs originally but now the two that are offline are just looping the 
below message. Generally I can go to each server, stop the agent, delete
the contents of the config folder and restart, and after one or more
tries it may create a new entry. The new entry is missing all the resource
tags, so we have to note all the tags from the abandoned agent registration
and add them to the new one.

We have a significant number of agents around in multiple environments but 
this happens to maybe 10-20% of them. All agents were provisioned in the 
same way, started and registered in the same way. 

Sometimes they have both a token and a guid file, but sometimes there is
only a guid while the error message loops. In this particular case, I have
two agents that just went offline after a clean install. Both showed up
initially and were enabled but are now showing offline. They are on the
same server but each has a different name ("Go Agent 01", "Go Agent 02",
etc.):

2023-04-03 18:46:28,930 INFO  [scheduler-3] SslInfrastructureService:78 - [Agent Registration] Starting to register agent.
2023-04-03 18:46:28,930 INFO  [scheduler-3] SslInfrastructureService:88 - [Agent Registration] Fetching token from server.
2023-04-03 18:46:28,932 ERROR [scheduler-3] TokenRequester:59 - Received status code from server 409
2023-04-03 18:46:28,933 ERROR [scheduler-3] TokenRequester:60 - Reason for failure A token has already been issued for this agent.
2023-04-03 18:46:28,933 ERROR [scheduler-3] SslInfrastructureService:106 - [Agent Registration] There was a problem registering with the GoCD server.
java.lang.RuntimeException: A token has already been issued for this agent.
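
For anyone trying to spot the same loop across many agents, a minimal
sketch of a scanner for it follows. It assumes each log entry sits on one
line, as in the agent's log file itself (mail wrapping would need undoing
first), and that the layout matches the lines above; the function name and
the regular expression are my own.

```python
import re

# timestamp, level, [thread], Logger:line - message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\] "
    r"(?P<logger>\w+):(?P<line>\d+) - (?P<msg>.*)$"
)


def registration_conflicts(log_lines):
    """Return timestamps of TokenRequester entries reporting HTTP 409."""
    hits = []
    for raw in log_lines:
        m = LOG_LINE.match(raw.rstrip())
        if m and m["logger"] == "TokenRequester" and m["msg"].endswith("server 409"):
            hits.append(m["ts"])
    return hits
```

A long run of closely spaced timestamps from this would suggest the agent
is stuck retrying registration rather than hitting a one-off error.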


I have tried to see if I could recreate the token and guid files but I 
can't seem to get them to be accepted when I think their values are 
correct. If there is a way to recreate the guid and token from the 
PostgreSQL server I can do that but I haven't found anything so far that 
seems to work for recreating those. 

Is there any reason that the agent would register and then lose its 
registration that we can try to avoid? Over the last month or two we've 
lost registration and set agents back up roughly 50-80 times across all 
areas.

Thanks in advance for any assistance!

-- 
You received this message because you are subscribed to the Google Groups 
"go-cd" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to go-cd+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/go-cd/b5b7ee4f-d21a-41a5-8162-3c883ae01542n%40googlegroups.com.