[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Title: Message Title Jeff Thompson closed an issue as Incomplete Jenkins / JENKINS-57831 Remoting susceptible to race between HTTP availability and JNLP availability during master initialization Change By: Jeff Thompson Status: Open Closed Resolution: Incomplete Add Comment This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d) -- You received this message because you are subscribed to the Google Groups "Jenkins Issues" group. To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-issues+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-issues/JIRA.199788.1559625798000.4756.1561055280319%40Atlassian.JIRA. For more options, visit https://groups.google.com/d/optout.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Jeff Thompson commented on JENKINS-57831: Great. It looks like the possible value of making any changes at this time has declined. I'll go ahead and close this issue and the related PR. We'll see if another scenario comes up that could be the driver for some init sequence changes.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Basil Crow commented on JENKINS-57831: I deployed Pedro's changes from jenkinsci/docker#805 on Monday, and since then I've done one Jenkins master restart which went off without a hitch. So far, use of the jenkins.model.Jenkins.slaveAgentPort Java property seems to have chased away the problem.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Jeff Thompson commented on JENKINS-57831: Basil Crow, have you gotten any further in validating that your needs are met by existing functionality?
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Jeff Thompson commented on JENKINS-57831: You've described the scenarios and status well. As you note, if the system property for JNLP port works there probably isn't much of a need to make a change at this point. I'll keep this in mind and see if any more information or needs come up. I think any change we might make would have to preserve the existing, default behavior. It's just too complicated to predict who might be relying on what. If Javier Delgado still has an identified need for the "Exit non-zero on failure" we could try again to implement that without disruption.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Basil Crow commented on JENKINS-57831:

> This seems to be somewhat opposite to what Javier Delgado was working with in #JENKINS-46515, Remoting PR#193 (https://github.com/jenkinsci/remoting/pull/193). In that case he was trying to exit the agent process more quickly to have things kick off again whereas you are trying to get it not exit so quickly.

Swarm has a number of retry-related options:

    -noRetryAfterConnected : Do not retry if a successful connection gets closed.
    -retry N : Number of retries before giving up. Unlimited if not specified.
    -retryBackOffStrategy RETRY_BACK_OFF_STRATEGY : The mode controlling retry wait time. Can be either 'none' (use same interval between retries) or 'linear' (increase wait time before each retry up to maxRetryInterval) or 'exponential' (double wait interval on each retry up to maxRetryInterval). Default is 'none'.
    -retryInterval N : Time to wait before retry in seconds. Default is 10 seconds.

When these options are being used, Swarm wants the process to keep running, so Remoting's use of System.exit is problematic. In this case, Swarm really wants Remoting to pass control back to Swarm on failure so that Swarm can retry. Today this takes place mostly by popping the stack back up to Remoting's main() so that Swarm can call Remoting's main again. But this interface is an implementation detail of Swarm/Remoting and we could always redefine that interface (for example, to use some specific exception type) if desired.

Javier's use case seems a bit different; that use case seems more aligned with a situation where you have a service manager (e.g. systemd) that is monitoring the process and restarting it on failure, e.g.:

    [Unit]
    Description=Swarm client
    Requires=network.target
    After=local-fs.target
    After=network.target

    [Service]
    Type=simple
    WorkingDirectory=/var/lib/swarm-client
    ExecStart=java -jar swarm-client.jar [...]
    Restart=on-failure
    RestartSec=5
    #
    # If the Swarm client loses its connection to the master and
    # needs to be restarted, we don't want to interrupt its child
    # processes, which the new Swarm client process will find when
    # it resumes its connection.
    #
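The three -retryBackOffStrategy modes quoted above can be sketched as a small wait-time calculation. This is only an illustration of the option descriptions, not Swarm's actual implementation: the enum, the method name waitSeconds, and the 0-based attempt counter are all hypothetical, and the linear step size (one extra interval per retry) is an assumption.

```java
// Hypothetical illustration of the three retry modes; not taken from Swarm's source.
enum RetryBackOff {
    NONE, LINEAR, EXPONENTIAL;

    /**
     * Seconds to wait before retry number `attempt` (0-based), given the base
     * interval and the cap (the role maxRetryInterval plays in the option text).
     */
    long waitSeconds(int attempt, long interval, long maxInterval) {
        long wait;
        switch (this) {
            case LINEAR:
                // Assumption: grow by one base interval per retry.
                wait = interval * (attempt + 1L);
                break;
            case EXPONENTIAL:
                // Double on each retry; the shift is capped to avoid overflow.
                wait = interval * (1L << Math.min(attempt, 30));
                break;
            default:
                // 'none': same interval between retries.
                wait = interval;
        }
        return Math.min(wait, maxInterval);
    }
}
```

With a base interval of 10 seconds and a cap of 60 seconds (the cap is an example value), the third retry would wait 10 seconds under 'none', 30 under 'linear', and 40 under 'exponential'.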
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Jeff Thompson commented on JENKINS-57831: It looks like that Jenkins system environment variable is the supported mechanism for specifying the JNLP port. Since it works right from the earliest init point, it should be reliable. I don't think we need to add anything additional to Jenkins. With that you shouldn't need your Groovy customization any longer. Hopefully that will work for you. I'm very interested in what you discover. If the system property doesn't solve your problems, I'm interested in working together to see if we can figure out what could – something related to your second suggestion, improving Remoting to be more reliable on connections. We would need to introduce it in such a way that changes would only be active when specified and not for everyone else, who might possibly be relying on existing behavior. Some flag that you could send into the agent to turn on the new behavior. This seems to be somewhat opposite to what Javier Delgado was working with in #JENKINS-46515, Remoting PR#193 (https://github.com/jenkinsci/remoting/pull/193). In that case he was trying to exit the agent process more quickly to have things kick off again whereas you are trying to get it not exit so quickly. It's difficult to get it to satisfy all of the different scenarios, but if they're distinguishable and flags can be set, we might be able to introduce specific sequences to help in certain scenarios. I'll look through it a little bit more, but I don't have any testing environments or configurations that lead into these interesting cases.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Basil Crow commented on JENKINS-57831: I just discovered jenkinsci/docker#805, which sounds similar to my scenario on the surface. There, Pedro Rodrigues found that the JNLP port number is first set to a random value and then a Groovy initialization script in the Docker image changes it to the desired value. This opens up a race where Jenkins is responding to the wrong JNLP port for a short amount of time during initialization. Pedro fixed this in the Docker image by using a Java system property rather than a Groovy initialization script to set the port. The Java system property is used when Jenkins first initializes JNLP, so this theoretically closes the race.

I wonder if this phenomenon could explain my observations above, where Jenkins was replying to HTTP requests but the JNLP port was not available. It seems like the following sequence of events is possible:

1. Jenkins advertises the old JNLP port
2. Swarm/Remoting picks this up
3. Jenkins executes the Groovy initialization script to change the port
4. Swarm/Remoting tries to connect to the old port

I'll be deploying Pedro's changes to the Docker image next week. I don't know for sure they will fix this problem, but I am hopeful that they might.
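To illustrate why a Java system property can close the window described above, here is a minimal sketch, assuming a listener that reads the property once, before it binds for the first time. The class and method names are hypothetical and this is not Jenkins's actual initialization code; only the property name jenkins.model.Jenkins.slaveAgentPort comes from the thread, and the fallback of 0 (let the OS pick a port) is an assumption.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical sketch; not Jenkins's real startup path.
class AgentPortSketch {
    // Property name as quoted in the thread.
    static final String PORT_PROPERTY = "jenkins.model.Jenkins.slaveAgentPort";

    /** Port to bind at startup; 0 (assumed default) means "any free port". */
    static int configuredPort() {
        return Integer.getInteger(PORT_PROPERTY, 0);
    }

    /**
     * Binds the listener using the value known at JVM start, so there is no
     * interval in which a different, temporary port is being advertised.
     */
    static ServerSocket openListener() throws IOException {
        return new ServerSocket(configuredPort());
    }
}
```

Because the value is available from the moment the JVM starts (for example via -Djenkins.model.Jenkins.slaveAgentPort=50000, where 50000 is only an example), the first bind already uses the intended port, whereas a script that runs later can only change a port that clients may already have seen.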
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Jeff Thompson commented on JENKINS-57831: You've done some impressive diagnosis and reporting on this one. That will make it a lot easier to understand and see if we can come up with any solutions. I'll need to find time tomorrow or the next day to study this more thoroughly. I doubt your first suggestion is feasible. The Jenkins init sequence has some cool capabilities but it's pretty limited. Disrupting it can cause lots of problems. It's probably worth investigating, but I'm not hopeful, based on my last attempt to get it to do exactly what I needed. Your second suggestion has me worried, mostly because of recent events involving #JENKINS-46515. @witokondoria made an attempt to improve the connection sequence but it broke other scenarios, #57713. I'd really like to see us make improvements in this area, but there may be too many different implementations and scenarios relying on the existing sequence. We may have to pursue your third suggestion. I'll look at it in more detail but I'm very interested in any further ideas or suggestions.
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Basil Crow commented on JENKINS-57831: I still haven't figured out a way to reproduce the problem with a real Jenkins master, but using the crude method in jenkinsci/remoting#325 I can reliably reproduce this on the Remoting side. There, I am simulating a server whose JNLP port isn't reachable by having isPortVisible return a fake value of false the first time, and the real value all other times. I built a Swarm client with these Remoting changes. With that Swarm client connecting to a regular production Jenkins master, I can reliably reproduce the above scenario that I ran into in production.

My analysis was almost completely correct, up to #9. Engine#innerRun does indeed catch the exception, but it never returns. Instead, it calls events.error(e), which calls CuiListener#error, which does this:

    LOGGER.log(Level.SEVERE, t.getMessage(), t);
    System.exit(-1);

Pretty harsh. (Also note the use of the negative exit code, which is technically invalid since only positive numbers are allowed.) You can step through it yourself to get a clearer picture of my analysis if the prose isn't convincing.
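The fault injection described above (report the port as not visible on the first check only, then defer to the real check) can be sketched as follows. This is a hypothetical stand-in, not the code in jenkinsci/remoting#325: the class name and the use of BooleanSupplier in place of the real isPortVisible method are assumptions.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical wrapper: fakes one "port not visible" result, then delegates.
class FirstCallFails implements BooleanSupplier {
    private final AtomicBoolean firstCall = new AtomicBoolean(true);
    private final BooleanSupplier realCheck;

    FirstCallFails(BooleanSupplier realCheck) {
        this.realCheck = realCheck;
    }

    @Override
    public boolean getAsBoolean() {
        // getAndSet returns true exactly once, even with concurrent callers.
        if (firstCall.getAndSet(false)) {
            return false; // simulated unreachable port
        }
        return realCheck.getAsBoolean();
    }
}
```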
If I make this simple change to Remoting to downgrade the IOException emanating from JnlpAgentEndpointResolver#resolve from "severe" (i.e., CuiListener#error kills the process) to "warning" (i.e., CuiListener#status, which just logs a message), then everything works properly:

    diff --git a/src/main/java/hudson/remoting/Engine.java b/src/main/java/hudson/remoting/Engine.java
    index 0f1b92ed..47bc3e7f 100644
    --- a/src/main/java/hudson/remoting/Engine.java
    +++ b/src/main/java/hudson/remoting/Engine.java
    @@ -522,7 +522,7 @@ public class Engine extends Thread {
                     try {
                         endpoint = resolver.resolve();
                     } catch (Exception e) {
    -                    events.error(e);
    +                    events.status(e.getMessage());
                         return;
                     }
                     if (endpoint == null) {

With the above patch, we pop the stack back up to Swarm, which retries the connection a second time, and succeeds the second time. This isn't necessarily a final fix, since I don't know if we'd want to downgrade all IOExceptions coming from JnlpAgentEndpointResolver#resolve from "severe" to "warning" (maybe just ones coming from JnlpAgentEndpointResolver#isPortVisible?), but it clearly illustrates the practical viability of my second proposed solution.
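The control flow the patch relies on (a failed resolve is logged and control returns to a caller that retries, instead of the process exiting) can be sketched in isolation. This is a simplified, hypothetical model, not Swarm or Remoting code: the Attempt interface, the method name connectWithRetry, and the return convention are all assumptions.

```java
import java.io.IOException;

// Hypothetical model of "log and return to the caller" instead of System.exit.
class ConnectRetrySketch {
    /** One connection attempt; throws IOException when the endpoint is unreachable. */
    interface Attempt {
        void run() throws IOException;
    }

    /**
     * Runs the attempt up to maxAttempts times, pausing waitMillis between
     * failures. Returns the number of attempts used on success, or -1 if every
     * attempt failed or the wait was interrupted.
     */
    static int connectWithRetry(Attempt attempt, int maxAttempts, long waitMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();
                return i;
            } catch (IOException e) {
                // Downgraded to a warning: the process keeps running.
                System.err.println("WARNING: attempt " + i + " failed: " + e.getMessage());
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

In this model, an attempt that fails once because the port is not yet reachable and then succeeds returns 2, which mirrors the "succeeds the second time" behavior reported with the patch.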
[JIRA] (JENKINS-57831) Remoting susceptible to race between HTTP availability and JNLP availability during master initialization
Basil Crow created an issue: Jenkins / JENKINS-57831 Remoting susceptible to race between HTTP availability and JNLP availability during master initialization

    Issue Type: Bug
    Assignee: Jeff Thompson
    Components: remoting
    Created: 2019-06-04 05:23
    Environment: Jenkins 2.150.1, Swarm client 3.17 (Remoting 3.30)
    Priority: Major
    Reporter: Basil Crow

Hi, I'm the new maintainer of the Swarm Plugin. I encountered an issue tonight after doing a routine restart of a Jenkins master (to perform a plugin update) that resulted in all my Swarm clients losing their connection to that master (but not my other masters). I explain the details below. I'd welcome your thoughts on my root cause analysis below, and I'd be happy to collaborate on a solution with you.

Problem

Typically, my Swarm clients reconnect just fine after a master restarts due to my use of the Swarm client -deleteExistingClients feature. In fact, I even have a unit test for this functionality. And tonight, Swarm clients successfully reconnected when all of my Jenkins masters were restarted, except for one. On that single master (but not the others), all the Swarm clients failed to reconnect. The Swarm client logs on all the failed clients showed messages like the following:

    2019-06-04 03:08:24 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://example.com/ to configure swarm client.
    2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
    2019-06-04 03:08:24 FINE hudson.plugins.swarm.SwarmClient createHttpClientContext createHttpClientCo