Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-25 Thread Philippe Lafoucrière
Andrew, YOU MADE MY DAY!

The issue is GONE (sorry, I'm very excited and relieved at the same time).
We tried to nuke docker completely on the node (also removing
/var/lib/docker), but we hadn't removed /var/lib/origin/.

So, for some obscure reason, we had a lot of old volumes from May, June and
July.
After removing these folders, our deploys now take less than 5s (time to
run the deploy pod + actually starting the services). We haven't seen our
cluster running like this in a long time.

For the record, here's the command we've been using on all nodes:

find /var/lib/origin/openshift.local.volumes/pods/ -type d -maxdepth 1 -mtime +30 -exec rm -rf \{\} \;

It took more than 30s on some nodes, so I suspect some folders were
completely full of sh...
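
A more cautious variant of the same cleanup, as a minimal sketch only: it
lists what would be removed and deletes nothing. The function name
list_stale_pod_dirs is made up for illustration, and -mindepth/-maxdepth
assume GNU or BSD find. Adding -mindepth 1 keeps the pods/ directory itself
out of the match.

```shell
# list_stale_pod_dirs DIR [DAYS]
# Print the directories directly under DIR whose mtime is more than DAYS
# days old (default: 30). Nothing is deleted.
list_stale_pod_dirs() {
    # -mindepth 1 excludes DIR itself; -maxdepth 1 stops find from
    # descending into each pod directory.
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"${2:-30}" -print
}
```

Something like "list_stale_pod_dirs /var/lib/origin/openshift.local.volumes/pods/ 30"
can be reviewed before anything is passed to rm. Note that an old mtime does
not prove the pod is gone: the directory of a long-running pod can also be
older than 30 days.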

Anyway, that's a relief, thanks again for your pugnacity :)
___
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-25 Thread Philippe Lafoucrière
On Tue, Jul 25, 2017 at 12:21 PM, Andrew Lau wrote:

> I think your issue may come from
> https://github.com/kubernetes/kubernetes/issues/38498
>
> Too many orphaned volumes causing the timeout. I guess the downgrade
> doesn't help with the increased number of volumes(?)
>

That's a good path to follow, indeed.
We tried to downgrade, with the same result (timeout).

I've been looking at a node, and we're seeing volumes from May in
/var/lib/origin/openshift.local.volumes/pods/

These folders have ~5000 items, and we have ~300 per node. So maybe we're
hitting a limit somewhere.

I'll try to purge docker completely, and this folder too, to see if it helps.
I'll keep you updated, and thanks again for pointing us to this.
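
To put numbers on this per node, a small sketch. The helper names
count_pod_dirs and count_items are hypothetical, and -mindepth/-maxdepth
assume GNU or BSD find:

```shell
# count_pod_dirs DIR: number of directories directly under DIR.
count_pod_dirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# count_items DIR: number of files and directories anywhere below DIR.
# File names containing newlines would be counted more than once.
count_items() {
    find "$1" -mindepth 1 | wc -l | tr -d ' '
}
```

For example, "count_pod_dirs /var/lib/origin/openshift.local.volumes/pods/"
gives the number of pod folders on the node, and count_items on the same path
gives the total number of entries below it.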


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-25 Thread Andrew Lau
I think your issue may come from
https://github.com/kubernetes/kubernetes/issues/38498

Too many orphaned volumes causing the timeout. I guess the downgrade
doesn't help with the increased number of volumes(?)

On Wed, 19 Jul 2017 at 05:39 Philippe Lafoucrière <
philippe.lafoucri...@tech-angels.com> wrote:

> I'm pretty sure it's not related, but I took a look at the git log from
> the day we started to have issues, and noticed it was the first time we
> were using env vars set from secrets (env[x].valueFrom.secretKeyRef).
> Maybe it will ring a bell for someone.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-18 Thread Philippe Lafoucrière
I'm pretty sure it's not related, but I took a look at the git log from the
day we started to have issues, and noticed it was the first time we were
using env vars set from secrets (env[x].valueFrom.secretKeyRef).
Maybe it will ring a bell for someone.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Philippe Lafoucrière
We tried rolling the master back to 1.4, and it worked for a few
moments.
Now, again, we can't deploy anything unless we restart origin-node for
every deploy.

I guess the solution for now will be to restart origin-node every 5 minutes on
all nodes to make sure deploys are not blocked :(


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Stéphane Klein
2017-07-17 17:20 GMT+02:00 Andrew Lau:

> I see this too. It only started happening after mixing 1.5 and 1.4 nodes.
>

OK, thanks. We also have a 1.5.1 master and 1.4 nodes :(


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Andrew Lau
I see this too. It only started happening after mixing 1.5 and 1.4 nodes.

I think you are doing the same thing, since the fix for the SDN bug never
made it into a release.

On Mon., 17 Jul. 2017, 10:08 pm Stéphane Klein wrote:

>
> 2017-07-17 17:03 GMT+02:00 Stéphane Klein :
>
>>
>>
>> 2017-07-17 17:01 GMT+02:00 Hemant Kumar :
>>
>>> Did you use openshift-ansible?
>>>
>>>
>> Yes
>>
>
>
> We use ovs-multitenant


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Stéphane Klein
2017-07-17 17:03 GMT+02:00 Stéphane Klein :

>
>
> 2017-07-17 17:01 GMT+02:00 Hemant Kumar :
>
>> Did you use openshift-ansible?
>>
>>
> Yes
>


We use ovs-multitenant


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Hemant Kumar
On Mon, Jul 17, 2017 at 11:03 AM, Stéphane Klein <
cont...@stephane-klein.info> wrote:

> You mean "journalctl -u origin-node" on node ?
>

No. There should be an origin-master process on the master node. Depending on
configuration, controller-manager and apiserver might be one process or
separate processes. I am specifically looking for the apiserver and controller
logs. Look for the systemd units on the master node.
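
As a rough sketch of that lookup: the helper below (filter_master_units, a
made-up name) picks the unit names containing "master" out of
"systemctl list-units" output read from stdin. Which units actually exist
depends on how the cluster was installed, so nothing here assumes a
particular unit name.

```shell
# filter_master_units: read "systemctl list-units" style output on stdin
# (unit name in the first column) and print the names of the units whose
# name contains "master".
filter_master_units() {
    awk '$1 ~ /master/ { print $1 }'
}
```

On the master, something like
"systemctl list-units --type=service --no-legend --plain | filter_master_units"
should show the candidates, and "journalctl -u UNIT" (with UNIT replaced by
one of the printed names) then shows that unit's logs.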


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Stéphane Klein
2017-07-17 17:01 GMT+02:00 Hemant Kumar :

> Is there anything in apiserver/controller logs?
>

You mean "journalctl -u origin-node" on the node?


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Stéphane Klein
2017-07-17 17:01 GMT+02:00 Hemant Kumar :

> Did you use openshift-ansible?
>
>
Yes


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Hemant Kumar
Is there anything in the apiserver/controller logs? How did you deploy this
cluster, btw? Did you use openshift-ansible?



On Mon, Jul 17, 2017 at 10:44 AM, Stéphane Klein <
cont...@stephane-klein.info> wrote:

> 2017-07-17 15:39 GMT+02:00 Hemant Kumar :
>
>> Phillippe - I have never seen a properly configured openshift server to
>> timeout while mounting secrets.
>>
>
> We have this messages in log (it is the same cluster that Philippe) :
>
> Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717
> 10:34:15.197266   65220 docker_manager.go:357] NetworkPlugin cni failed on
> the status hook for pod 'test-secret-3-deploy' - Unexpected command output
> Device "eth0" does not exist.
> Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with
> error: exit status 1
> Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717
> 10:34:17.519925   65220 docker_manager.go:2177] Determined pod ip after
> infra change: 
> "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)":
> "10.1.3.9"
> Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717
> 10:34:18.314570   65220 docker_manager.go:761] Logging security options:
> {key:seccomp value:unconfined msg:}
> Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717
> 10:34:18.708998   65220 docker_manager.go:1711] Failed to create symbolic
> link to the log file of pod "test-secret-3-deploy_issue-29
> 059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment":
> symlink  /var/log/containers/test-secret-3-deploy_issue-29059_deploy
> ment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log:
> no such file or directory


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Stéphane Klein
2017-07-17 15:39 GMT+02:00 Hemant Kumar :

> Phillippe - I have never seen a properly configured openshift server to
> timeout while mounting secrets.
>

We have these messages in the log (it is the same cluster as Philippe's):

Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:15.197266   65220 docker_manager.go:357] NetworkPlugin cni failed on the status hook for pod 'test-secret-3-deploy' - Unexpected command output Device "eth0" does not exist.
Jul 17 10:34:15 prod-node-rbx-2.example.com origin-node[65154]: with error: exit status 1
Jul 17 10:34:17 prod-node-rbx-2.example.com origin-node[65154]: I0717 10:34:17.519925   65220 docker_manager.go:2177] Determined pod ip after infra change: "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)": "10.1.3.9"
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.314570   65220 docker_manager.go:761] Logging security options: {key:seccomp value:unconfined msg:}
Jul 17 10:34:18 prod-node-rbx-2.example.com origin-node[65154]: E0717 10:34:18.708998   65220 docker_manager.go:1711] Failed to create symbolic link to the log file of pod "test-secret-3-deploy_issue-29059(ebd9309d-6afb-11e7-9452-005056b1755a)" container "deployment": symlink  /var/log/containers/test-secret-3-deploy_issue-29059_deployment-ded8b25b6ad78a620d981292111a2f0a46da14b879f9e862f630228e07e8cd7c.log: no such file or directory
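
To pull only the error-level entries out of a longer node log, a minimal
sketch. It assumes the messages keep the header format seen above (a severity
letter, then MMDD, then a timestamp); glog_errors is a hypothetical helper
name.

```shell
# glog_errors: read log lines on stdin and print only those carrying an
# error-severity header such as "E0717 10:34:15.197266".
glog_errors() {
    grep -E '(^|[[:space:]])E[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+'
}
```

For example: "journalctl -u origin-node --since today | glog_errors".
Continuation lines without a header, like the "with error: exit status 1"
line above, are dropped by this filter.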


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-17 Thread Philippe Lafoucrière
And the problem occurs on all our nodes btw.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-12 Thread Hemant Kumar
Do you have access to the logs of the atomic-openshift-node process where
secrets are failing to mount? If yes, can you post them in a bug or
something [1]?

We may find clues in there: is the API request that fetches the secret taking
a long time to respond, or is something else amiss? Also, api-server metrics
can easily be requested via curl, with something like
"curl http://api-server-url/metrics".
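
A minimal sketch of narrowing that output down, assuming the endpoint returns
the Prometheus text format. metrics_for_resource is a hypothetical helper, and
api-server-url remains a placeholder (a real cluster will likely also need
https and credentials):

```shell
# metrics_for_resource RESOURCE: read Prometheus text-format metrics on
# stdin and print the sample lines that mention RESOURCE, skipping the
# "# HELP" and "# TYPE" comment lines.
metrics_for_resource() {
    grep -v '^#' | grep -F -- "$1"
}
```

For example: "curl -s http://api-server-url/metrics | metrics_for_resource secrets".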


[1] https://bugzilla.redhat.com/enter_bug.cgi?product=OpenShift%20Origin



On Wed, Jul 12, 2017 at 3:06 PM, Philippe Lafoucrière <
philippe.lafoucri...@tech-angels.com> wrote:

> Could it be related to this?
> https://github.com/openshift/origin/issues/11016
>
> Sounds definitely like our issue, I just don't understand why would we hit
> this suddenly.
>


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-12 Thread Philippe Lafoucrière
Could it be related to this?
https://github.com/openshift/origin/issues/11016

It definitely sounds like our issue; I just don't understand why we would hit
this suddenly.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-12 Thread Philippe Lafoucrière
On the master, we're seeing this on a regular basis:
https://gist.github.com/gravis/cae52e763cd5cdac19a8456f9208aa34

I don't know if it's related.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-12 Thread Philippe Lafoucrière
Our nodes are up-to-date already, but we're not using docker-latest (1.13).
I don't think that's an issue, since everything was fine with 1.12 last
week.

The only things that have changed lately are the PVs: we are migrating some
datastores. I wonder if one of them could be an issue, with openshift
waiting for a volume until the timeout.


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-12 Thread Philippe Lafoucrière
Hi,

We have this issue on Openshift 1.5 (with 1.4 nodes because of this crazy
bug https://github.com/openshift/origin/issues/14092).
It started a few days ago, and nothing really changed in our cluster. We
just added a bunch of secrets, and noticed longer and longer deploys.

We have nothing fancy in the logs, and the only relevant event is:

Unable to mount volumes for pod "xx": timeout expired waiting for
volumes to attach/mount for pod ""/"". list of unattached/unmounted
volumes=[xxx-secrets -secrets -secrets ssl-certs -secrets
default-token-n6pbo]

We get this event several times (it varies, let's say around 5 times),
then the container starts as expected. This is a problem for a single DB
pod: the application is down for 5 minutes if the pod needs to
restart.
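
To see how often the event shows up, a minimal sketch; count_mount_timeouts
is a made-up helper name. It counts matching lines, so if the event listing
aggregates repeats into one row with a count column, the real number of
occurrences is higher than what it prints.

```shell
# count_mount_timeouts: read event text on stdin and print the number of
# lines containing the volume mount timeout message.
count_mount_timeouts() {
    grep -c -F 'timeout expired waiting for volumes to attach/mount'
}
```

For example: "oc get events | count_mount_timeouts".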

Thanks


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Andrew Lau
You might also want to check whether your nodes have any OS updates. I found
that NetworkManager 1.4.0-19.el7_3 has a memory leak which appears over time.
There was a recent devicemapper update too, I believe.

On Wed, 12 Jul 2017 at 07:59 Andrew Lau wrote:

> Try restarting origin-node it seemed to fix this issue for me.
>
> Also sometimes those mount errors are actually harmless. It happens when
> one of the controllers had been restarted but didn't sync the status.
> There's a fix upstream but I think only landed in 1.7
>
> The volume is already mounted but the controller doesn't know.
>
>
> On Wed., 12 Jul. 2017, 4:19 am Philippe Lafoucrière, <
> philippe.lafoucri...@tech-angels.com> wrote:
>
>> And... it's starting again.
>> Pods are getting stuck because volumes (secrets) can't be mounted, then
>> after a few minutes, everything starts.
>> I really don't get it :(


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Andrew Lau
Try restarting origin-node; it seemed to fix this issue for me.

Also, sometimes those mount errors are actually harmless. It happens when
one of the controllers has been restarted but didn't sync the status.
There's a fix upstream, but I think it only landed in 1.7.

The volume is already mounted but the controller doesn't know.

On Wed., 12 Jul. 2017, 4:19 am Philippe Lafoucrière, <
philippe.lafoucri...@tech-angels.com> wrote:

> And... it's starting again.
> Pods are getting stuck because volumes (secrets) can't be mounted, then
> after a few minutes, everything starts.
> I really don't get it :(


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Aleksandar Lazic

Hi Philippe.

On Tuesday, 11 July 2017 at 23:18 was written:

> And... it's starting again.
> Pods are getting stuck because volumes (secrets) can't be mounted, then
> after a few minutes, everything starts.
> I really don't get it :(

Maybe it would help if you gave us some basic information.

On which platform do you run openshift?
Since when has this behavior been happening?
What were the latest changes you made before it started?

oc version
oc project
oc export dc/
oc describe pod 
oc get events

-- 
Best Regards
Aleks




Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Philippe Lafoucrière
And... it's starting again.
Pods are getting stuck because volumes (secrets) can't be mounted, then
after a few minutes, everything starts.
I really don't get it :(


Re: timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Philippe Lafoucrière
After a lot of tests, we discovered the pending pods were always on the
same node.
There were some (usual) "thin: Deletion of thin device" messages.
After draining the node, nuking /var/lib/docker, and a hard reboot, everything
went back to normal.

I suspect devicemapper to be the source of all our troubles, and we'll
certainly try overlayfs instead when 3.6 is ready.


timeout expired waiting for volumes to attach/mount for pod

2017-07-11 Thread Philippe Lafoucrière
Hi,

For a few days now, we have had pods waiting for volumes to be mounted and
getting stuck for several minutes.

https://www.dropbox.com/s/9vuge2t9llr7u6h/Screenshot%202017-07-11%2011.29.19.png?dl=0

After 3-10 minutes, the pod eventually starts, for no obvious reason. Any
idea what could cause this?

Thanks