Re: timeout expired waiting for volumes to attach/mount for pod
Andrew, YOU MADE MY DAY! The issue is GONE (sorry, I'm very excited and relieved at the same time). We tried to nuke docker completely on the node (also removing /var/lib/docker), but we hadn't removed /var/lib/origin/. So, for some obscure reason, we had a lot of old volumes from May, June and July. After removing these folders, our deploys now take less than 5s (time to run the deploy pod + actually start the services). We haven't seen our cluster run like that in a long time.

For the record, here's the command we've been using on all nodes:

find /var/lib/origin/openshift.local.volumes/pods/ -maxdepth 1 -type d -mtime +30 -exec rm -rf \{\} \;

It took more than 30s on some nodes, so I suspect some folders were completely full of sh... Anyway, that's a relief, thanks again for your pugnacity :)

___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
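As a sketch of that cleanup (prune_old_pod_dirs is a hypothetical helper name; the underlying find invocation is the one from the thread), it can help to list the candidate directories before deleting anything:

```shell
# prune_old_pod_dirs DIR DAYS [delete] -- list pod volume directories
# untouched for more than DAYS days; only delete them when "delete" is
# passed explicitly, so the default is a harmless dry run.
prune_old_pod_dirs() {
  dir=$1; days=$2; mode=${3:-list}
  if [ "$mode" = "delete" ]; then
    find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -exec rm -rf {} +
  else
    find "$dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -print
  fi
}

# Dry run first, then delete:
#   prune_old_pod_dirs /var/lib/origin/openshift.local.volumes/pods 30
#   prune_old_pod_dirs /var/lib/origin/openshift.local.volumes/pods 30 delete
```

Putting -maxdepth before -type also avoids the GNU find warning about global options appearing after tests.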
Re: timeout expired waiting for volumes to attach/mount for pod
On Tue, Jul 25, 2017 at 12:21 PM, Andrew Lau wrote:

> I think your issue may come from
> https://github.com/kubernetes/kubernetes/issues/38498
>
> Too many orphaned volumes causing the timeout. I guess the downgrade
> doesn't help with the increased number of volumes(?)

That's a good path to follow, indeed. We tried to downgrade, with the same result (timeout). I've been looking at a node, and we're seeing volumes from May in /var/lib/origin/openshift.local.volumes/pods/. These folders have ~5000 items, and we have ~300 per node, so maybe we're hitting a limit somewhere. I'll try to purge docker completely, and this folder too, to see if it helps. I'll keep you updated, and thanks again for pointing us to this.
Re: timeout expired waiting for volumes to attach/mount for pod
I think your issue may come from https://github.com/kubernetes/kubernetes/issues/38498

Too many orphaned volumes causing the timeout. I guess the downgrade doesn't help with the increased number of volumes(?)

On Wed, 19 Jul 2017 at 05:39 Philippe Lafoucrière <philippe.lafoucri...@tech-angels.com> wrote:

> I'm pretty sure it's not related, but I took a look at the git log from
> the day we started having issues, and noticed it was the first time we
> were using env vars set from secrets (env[x].valueFrom.secretKeyRef).
> Maybe it will ring a bell for someone.
Re: timeout expired waiting for volumes to attach/mount for pod
I'm pretty sure it's not related, but I took a look at the git log from the day we started having issues, and noticed it was the first time we were using env vars set from secrets (env[x].valueFrom.secretKeyRef). Maybe it will ring a bell for someone.
Re: timeout expired waiting for volumes to attach/mount for pod
We tried to roll back the master to 1.4, and it worked for a few moments. And now, again, we can't deploy anything unless we restart origin-node for every deploy. I guess now the solution will be to restart origin-node every 5 minutes on all nodes to make sure deploys are not blocked :(
Re: timeout expired waiting for volumes to attach/mount for pod
2017-07-17 17:20 GMT+02:00 Andrew Lau:

> I see this too. It only started happening after mixing 1.5 and 1.4 nodes.

Ok, thanks. We also have a 1.5.1 master and 1.4 nodes :(
Re: timeout expired waiting for volumes to attach/mount for pod
I see this too. It only started happening after mixing 1.5 and 1.4 nodes. I think you are also doing the same thing, since the SDN bug never made it into a release.

On Mon., 17 Jul. 2017, 10:08 pm Stéphane Klein wrote:

> 2017-07-17 17:03 GMT+02:00 Stéphane Klein:
>
>> 2017-07-17 17:01 GMT+02:00 Hemant Kumar:
>>
>>> Did you use openshift-ansible?
>>
>> Yes
>
> We use ovs-multitenant
Re: timeout expired waiting for volumes to attach/mount for pod
2017-07-17 17:03 GMT+02:00 Stéphane Klein:

> 2017-07-17 17:01 GMT+02:00 Hemant Kumar:
>
>> Did you use openshift-ansible?
>
> Yes

We use ovs-multitenant
Re: timeout expired waiting for volumes to attach/mount for pod
On Mon, Jul 17, 2017 at 11:03 AM, Stéphane Klein <cont...@stephane-klein.info> wrote:

> You mean "journalctl -u origin-node" on the node?

No. There should be an origin-master process on the master node. Depending on configuration, controller-manager and apiserver might be one process or separate processes. I am specifically looking for apiserver and controller logs. Look for systemd units on the master node.
Re: timeout expired waiting for volumes to attach/mount for pod
2017-07-17 17:01 GMT+02:00 Hemant Kumar:

> Is there anything in apiserver/controller logs?

You mean "journalctl -u origin-node" on the node?
Re: timeout expired waiting for volumes to attach/mount for pod
2017-07-17 17:01 GMT+02:00 Hemant Kumar:

> Did you use openshift-ansible?

Yes
Re: timeout expired waiting for volumes to attach/mount for pod
Is there anything in apiserver/controller logs? How did you deploy this cluster btw? Did you use openshift-ansible?

On Mon, Jul 17, 2017 at 10:44 AM, Stéphane Klein <cont...@stephane-klein.info> wrote:

> 2017-07-17 15:39 GMT+02:00 Hemant Kumar:
>
>> Phillippe - I have never seen a properly configured openshift server
>> time out while mounting secrets.
>
> We have these messages in the log (it is the same cluster as Philippe's):
>
> Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266 65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
> Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
> Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925 65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
> Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570 65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
> Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998 65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory
Re: timeout expired waiting for volumes to attach/mount for pod
2017-07-17 15:39 GMT+02:00 Hemant Kumar:

> Phillippe - I have never seen a properly configured openshift server
> time out while mounting secrets.

We have these messages in the log (it is the same cluster as Philippe's):

Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266 65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925 65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570 65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998 65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory
Re: timeout expired waiting for volumes to attach/mount for pod
And the problem occurs on all our nodes btw.
Re: timeout expired waiting for volumes to attach/mount for pod
Do you have access to the logs of the atomic-openshift-node process where secrets are failing to mount? If yes, can you post them in a bug or something [1]? There may be clues in that: is the API request that fetches the secret taking time to respond, or is something else amiss? Also, api-server metrics can easily be requested via curl, something like "curl http://api-server-url/metrics".

[1] https://bugzilla.redhat.com/enter_bug.cgi?product=OpenShift%20Origin

On Wed, Jul 12, 2017 at 3:06 PM, Philippe Lafoucrière <philippe.lafoucri...@tech-angels.com> wrote:

> Could it be related to this?
> https://github.com/openshift/origin/issues/11016
>
> Sounds definitely like our issue, I just don't understand why we would hit
> this suddenly.
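As a sketch of that metrics check: the Bearer-token curl and the apiserver_request_latencies metric name are assumptions about this Origin version, and latency_metrics is a hypothetical helper, not part of any tool mentioned here.

```shell
# Dump /metrics from the api-server first, e.g.:
#   curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
#     https://master.example.com:8443/metrics > metrics.txt
# then keep only the request-latency series, to see whether secret
# fetches are slow on the api-server side.
latency_metrics() {
  grep '^apiserver_request_latencies' "$1"
}
```

Usage: `latency_metrics metrics.txt` prints only the latency samples out of the full dump.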
Re: timeout expired waiting for volumes to attach/mount for pod
Could it be related to this? https://github.com/openshift/origin/issues/11016

It definitely sounds like our issue, I just don't understand why we would hit this suddenly.
Re: timeout expired waiting for volumes to attach/mount for pod
On the master, we're seeing this on a regular basis: https://gist.github.com/gravis/cae52e763cd5cdac19a8456f9208aa34

I don't know if it's related.
Re: timeout expired waiting for volumes to attach/mount for pod
Our nodes are already up to date, but we're not using docker-latest (1.13). I don't think that's an issue, since everything was fine with 1.12 last week. The only thing that has changed lately is PVs: we are migrating some datastores. I wonder if one of them could be an issue, with openshift waiting for a volume until the timeout.
Re: timeout expired waiting for volumes to attach/mount for pod
Hi,

We have this issue on Openshift 1.5 (with 1.4 nodes because of this crazy bug: https://github.com/openshift/origin/issues/14092). It started a few days ago, and nothing really changed in our cluster. We just added a bunch of secrets, and noticed longer and longer deploys. We have nothing fancy in the logs, and the only relevant event is:

Unable to mount volumes for pod "xx": timeout expired waiting for volumes to attach/mount for pod ""/"". list of unattached/unmounted volumes=[xxx-secrets -secrets -secrets ssl-certs -secrets default-token-n6pbo]

We see this event several times (it varies, let's say around 5 times), then the container starts as expected. It's an issue when it comes to a single DB pod: the application is down for 5 minutes if the pod needs to restart.

Thanks
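To see how often that event actually fires, a minimal sketch (count_mount_timeouts is a hypothetical helper; the origin-node unit name matches this setup):

```shell
# Count occurrences of the mount-timeout event in a saved node log, e.g.:
#   journalctl -u origin-node --since today > node.log
#   count_mount_timeouts node.log
count_mount_timeouts() {
  grep -c 'timeout expired waiting for volumes to attach/mount' "$1"
}
```

Comparing the count before and after a deploy shows whether the timeout repeats the "around 5 times" mentioned above or is getting worse.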
Re: timeout expired waiting for volumes to attach/mount for pod
You might also want to check if your nodes have any OS updates. I found NetworkManager 1.4.0-19.el7_3 has a memory leak which appears over time. There was a recent devicemapper update too, I believe.

On Wed, 12 Jul 2017 at 07:59 Andrew Lau wrote:

> Try restarting origin-node, it seemed to fix this issue for me.
>
> Also, sometimes those mount errors are actually harmless. It happens when
> one of the controllers has been restarted but didn't sync the status.
> There's a fix upstream, but I think it only landed in 1.7.
>
> The volume is already mounted, but the controller doesn't know.
>
> On Wed., 12 Jul. 2017, 4:19 am Philippe Lafoucrière <philippe.lafoucri...@tech-angels.com> wrote:
>
>> And... it's starting again.
>> Pods are getting stuck because volumes (secrets) can't be mounted, then
>> after a few minutes, everything starts.
>> I really don't get it :(
Re: timeout expired waiting for volumes to attach/mount for pod
Try restarting origin-node, it seemed to fix this issue for me.

Also, sometimes those mount errors are actually harmless. It happens when one of the controllers has been restarted but didn't sync the status. There's a fix upstream, but I think it only landed in 1.7.

The volume is already mounted, but the controller doesn't know.

On Wed., 12 Jul. 2017, 4:19 am Philippe Lafoucrière <philippe.lafoucri...@tech-angels.com> wrote:

> And... it's starting again.
> Pods are getting stuck because volumes (secrets) can't be mounted, then
> after a few minutes, everything starts.
> I really don't get it :(
Re: timeout expired waiting for volumes to attach/mount for pod
Hi Philippe,

On Tuesday, 11 July 2017 at 23:18 it was written:

> And... it's starting again.
> Pods are getting stuck because volumes (secrets) can't be mounted, then
> after a few minutes, everything starts.
> I really don't get it :(

Maybe it would help if you gave us some basic information. On which platform do you run openshift? Since when has this behavior been happening? What were the latest changes you made before it started?

oc version
oc project
oc export dc/
oc describe pod
oc get events

--
Best Regards
Aleks
Re: timeout expired waiting for volumes to attach/mount for pod
And... it's starting again.
Pods are getting stuck because volumes (secrets) can't be mounted, then after a few minutes, everything starts.
I really don't get it :(
Re: timeout expired waiting for volumes to attach/mount for pod
After a lot of tests, we discovered the pending pods were always on the same node. There were some (usual) "thin: Deletion of thin device" messages. After draining the node, nuking /var/lib/docker, and a hard reboot, everything went back to normal. I suspect devicemapper is the source of all our troubles, and we'll certainly try overlayfs instead when 3.6 is ready.
timeout expired waiting for volumes to attach/mount for pod
Hi,

For a few days now, we have had pods waiting for volumes to be mounted, getting stuck for several minutes: https://www.dropbox.com/s/9vuge2t9llr7u6h/Screenshot%202017-07-11%2011.29.19.png?dl=0

After 3-10 minutes, the pod eventually starts, with no obvious reason. Any idea what could cause this?

Thanks