Re: cluster stopped working - certificate problems

2020-03-31 Thread Brian Jarvis
The certificate expiration check playbook was recently updated to include
this check for the node client certificates.

[0] https://github.com/openshift/openshift-ansible/pull/11967
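For reference, running the expiry check with the updated playbook looks
something like the following; the playbook path assumes the standard
openshift-ansible RPM layout, so adjust it and the inventory path for your
install.

# Report certificate expiry across masters and nodes (writes HTML/JSON reports)
ansible-playbook -v -i [inventory_file] \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-checks/certificate_expiry/easy-mode.yaml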


Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

So thanks for the help in fixing this problem. Much appreciated.

Having looked at it after the event, I have two concerns.

1. Whilst this is documented [1], its significance is not spelled out.
Unless you do as described (either manually or automatically), your
cluster will stop working one year after being deployed!


2. The playbooks that check certificate expiry [2] do not catch this
problem (a manual check is sketched below).
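
(In the meantime, a rough way to check this particular certificate by hand
on each node is sketched below; the certificate path is the usual OKD 3.11
location, so it may differ on other installs.)

# On each node: show when the kubelet client certificate expires
openssl x509 -noout -enddate \
  -in /etc/origin/node/certificates/kubelet-client-current.pem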


Thanks
Tim

[1] 
https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs


[2] 
https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry




Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

Brian,

That's fixed it. THANK YOU.


Re: cluster stopped working - certificate problems

2020-03-31 Thread Brian Jarvis
Hello Tim,

Each node has a client certificate that expires after one year.
Run "oc get csr" and you should see many pending requests, possibly thousands.

To clear those, run "oc get csr -o name | xargs oc adm certificate approve".

One way to prevent this in the future is to deploy/enable the auto-approver
StatefulSet with the following command:

ansible-playbook -vvv -i [inventory_file] \
  /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml \
  -e openshift_master_bootstrap_auto_approve=true
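
If you only want to approve the CSRs that are still pending (rather than
every CSR object), a rough alternative is below; the grep/awk filter and the
openshift-infra namespace for the auto-approver are assumptions based on a
default 3.11 install, so adjust as needed.

# Approve only the CSRs whose condition is still Pending
oc get csr | grep Pending | awk '{print $1}' | xargs -r oc adm certificate approve

# After enabling the auto-approver, confirm it is running
oc get statefulset,pods -n openshift-infra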



Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon
Maybe an uncanny coincidence, but we think the cluster was created
almost EXACTLY one year before it failed.




Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

And yes, the clocks of all the nodes are correct and in sync.



Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

One more thing: this happened at or very close to noon UTC on 31st March.
Sounds significant?




Re: cluster stopped working - certificate problems

2020-03-31 Thread Ben Holmes
Hi Tim,

Can you verify that the hosts' clocks are being synced correctly as per
Simon's other suggestion?

Ben


-- 

Benjamin Holmes
Senior Solution Architect
Red Hat UKI Presales
bhol...@redhat.com | M: 07876-885388




Re: cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon

Hi Simon,

we've run those playbooks and all certs are reported as still being valid.

Tim




Re: cluster stopped working - certificate problems

2020-03-31 Thread Simon Krenger
Hi Tim,

Note that there are multiple sets of certificates, both external and
internal. So it would be worth checking the certificates again using
the Certificate Expiration Playbooks (see link below). The
documentation also has an overview of what can be done to renew
certain certificates:

- [ Redeploying Certificates ]
  https://docs.okd.io/3.11/install_config/redeploying_certificates.html

Apart from checking all certificates, I'd certainly review the time
synchronisation for the whole cluster, as we see the message "x509:
certificate has expired or is not yet valid".
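
A quick way to sanity-check time synchronisation on each host, assuming
chrony (the default on recent CentOS/RHEL), would be something like:

# On each master and node
timedatectl            # is the system clock synchronised?
chronyc tracking       # current offset from the NTP source
date -u                # compare UTC time across hosts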

I hope this helps.

Kind regards
Simon




-- 
Simon Krenger
Technical Account Manager
Red Hat





cluster stopped working - certificate problems

2020-03-31 Thread Tim Dudgeon
One of our OKD 3.11 clusters has suddenly stopped working without any 
obvious reason.


The origin-node service on the nodes does not start (times out).
The master-api pod is running on the master.
The nodes can access the master-api endpoints.
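
For completeness, the node-side failure can be inspected with the usual
systemd commands (unit name as above):

# On an affected node
systemctl status origin-node
journalctl -u origin-node --since "1 hour ago" --no-pager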

The logs of the master-api pod look mostly OK other than a huge number
of warnings about certificates that don't really make sense, as the
certificates are valid (we use named certificates from Let's Encrypt;
they were renewed about two weeks ago and all appear to be correct).


Examples of errors from the master-api pod are:

I0331 12:46:57.065147   1 establishing_controller.go:73] Starting EstablishingController
I0331 12:46:57.065561   1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
I0331 12:46:57.071932   1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
I0331 12:46:57.072036   1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
I0331 12:46:57.072141   1 logs.go:49] http: TLS handshake error from 192.168.160.17:58022: EOF


E0331 12:47:37.855023   1 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0331 12:47:37.856569   1 memcache.go:147] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
E0331 12:47:44.115290   1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
E0331 12:47:44.118976   1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]
E0331 12:47:44.122276   1 authentication.go:62] Unable to authenticate the request due to an error: [x509: certificate has expired or is not yet valid, x509: certificate has expired or is not yet valid]

There are a huge number of errors of this second sort.

Any ideas what is wrong?


