OpenShift Master: v3.9.0+ba7faec-1
Kubernetes Master: v1.9.1+a0ce1bc657
OpenShift Web Console: v3.9.0+b600d46-dirty

After working successfully for the past few months, my Jenkins deployment 
started failing to launch build agents for jobs. The event error was 
essentially "Failed to start transient scope unit: Argument list too long". The 
error was initially confusing because Jenkins is just running the same agents 
it has always been running. The agents are configured to live for a short time 
(15 minutes), after which they’re removed and another is created when necessary.

All this has been perfectly functional up until today.

The complete event error was: -

MountVolume.SetUp failed for volume "fs-input" : mount failed: exit status 1 
Mounting command: systemd-run Mounting arguments: --description=Kubernetes 
transient mount for 
/var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input
 --scope -- mount -t nfs -o ro bastion.novalocal:/data/fs-input 
/var/lib/origin/openshift.local.volumes/pods/4da0f883-aaa2-11e8-901a-c81f66c79dfc/volumes/kubernetes.io~nfs/fs-input
 Output: Failed to start transient scope unit: Argument list too long

I suspect it might be related to Kubernetes issue #57345 
<https://github.com/kubernetes/kubernetes/issues/57345> : Number of "loaded 
inactive dead" systemd transient mount units continues to grow.
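
To check whether a node is affected, a minimal sketch along the following 
lines could count the units that issue describes. The function name 
count_dead_mounts is hypothetical, and it assumes the state columns appear as 
"loaded inactive dead" in the systemctl listing, as in the issue title.

```shell
#!/bin/sh
# count_dead_mounts (hypothetical helper): read `systemctl list-units` output
# on stdin and print how many units are in the "loaded inactive dead" state.
count_dead_mounts() {
    # Assigning $1 to itself makes awk rebuild the line with single spaces,
    # so column padding in the systemctl output does not defeat the match.
    awk '{ $1 = $1 } /loaded inactive dead/ { n++ } END { print n + 0 }'
}

# Usage on a node (not run here):
#   systemctl list-units --all --type=mount --no-legend --no-pager | count_dead_mounts
```

A count that keeps rising between runs on the same node would be consistent 
with the behaviour reported in the issue.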

In an attempt to rectify the situation I tried the issue's suggestion, which 
was to run: -

        $ sudo systemctl daemon-reload

...on the affected node(s). It worked on all nodes except the one that was 
giving me problems. On the “broken” node the command took a few seconds to 
complete but failed, responding with: -

        Failed to execute operation: Connection timed out

I was unable to reboot the node from the command-line (clearly the system was 
polluted to the point that it was essentially unusable) and I was forced to 
resort to rebooting the node by other means.

When the node returned, Jenkins and its deployments eventually returned to an 
operational state.

So it looks like the issue may be right: the number of systemd transient 
mount units continues to grow unchecked on nodes.
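
If that is the cause, one possible stop-gap would be to reload systemd 
periodically, before a node gets into the state where the reload itself times 
out. The sketch below is only an illustration: the function name needs_reload 
and the threshold of 1000 are assumptions, since I don't know the count at 
which mounts actually start to fail.

```shell
#!/bin/sh
# Hypothetical threshold; the real count at which mounts fail is unknown.
THRESHOLD="${THRESHOLD:-1000}"

# needs_reload COUNT [THRESHOLD]: succeeds (exit 0) when COUNT >= THRESHOLD.
needs_reload() {
    [ "$1" -ge "${2:-$THRESHOLD}" ]
}

# Example wiring, e.g. from a root cron job on each node (not run here):
#   count=$(systemctl list-units --all --type=mount --no-legend --no-pager \
#             | grep -c 'inactive')
#   if needs_reload "$count"; then
#       timeout 60 systemctl daemon-reload
#   fi
```

The timeout wrapper is there so a reload that hangs does not block the job 
indefinitely.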

Although I’ve recovered the system and now believe I have a work-around for the 
underlying fault the next time I see it, has anyone else seen this in 3.9, and 
is there a long-term solution?

Alan Christie
achris...@informaticsmatters.com



_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users