The certificate expiration check playbook was recently updated to include
this check for the nodes.
[0] https://github.com/openshift/openshift-ansible/pull/11967
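
For anyone reading this thread later: the error quoted further down, "x509: certificate has expired or is not yet valid", covers two different situations, which is why both certificate expiry and clock skew were suggested as causes. Below is a minimal sketch of that distinction. It is not the logic of the playbook or of the PR above; the function name, the returned labels and the warning_days parameter are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def classify_validity(not_before, not_after, now, warning_days=30):
    """Classify a certificate validity window relative to `now`.

    All three datetimes must be timezone-aware. `warning_days` is a
    hypothetical threshold for flagging certificates close to expiry.
    """
    if now < not_before:
        # Typical symptom of a host clock that is running behind.
        return "not yet valid"
    if now > not_after:
        return "expired"
    if not_after - now <= timedelta(days=warning_days):
        return "expiring soon"
    return "valid"


# Illustrative dates only: a client certificate issued for one year.
issued = datetime(2019, 3, 31, tzinfo=timezone.utc)
expires = datetime(2020, 3, 31, tzinfo=timezone.utc)
print(classify_validity(issued, expires, datetime(2020, 3, 15, tzinfo=timezone.utc)))
```

A check that only looks at some certificates (for example the named or master certificates) will report "valid" for all of them and still miss a node client certificate that falls into the "expired" branch.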
On Tue, Mar 31, 2020 at 1:12 PM Tim Dudgeon wrote:
> So thanks for the help in fixing this problem. Much appreciated.
>
> Having looked at it now after the event I have 2 concerns.
>
> 1. Whilst this is documented [1] the significance of this is not
> mentioned. Unless you do as described (either manually or automatically)
> your cluster will stop working 1 year after being deployed!
>
> 2. The playbooks that check certificate expiry [2] do not catch this
> problem.
>
> Thanks
> Tim
>
> [1]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs
>
> [2]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry
>
>
> On 31/03/2020 17:05, Brian Jarvis wrote:
>
> Hello Tim,
>
> Each node has a client certificate that expires after one year.
> Run "oc get csr"; you should see many pending requests, possibly
> thousands.
>
> To clear those run "oc get csr -o name | xargs oc adm certificate approve"
>
> One way to prevent this in the future is to deploy/enable the auto
> approver statefulset with the following command.
> ansible-playbook -vvv -i [inventory_file]
> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml
> -e openshift_master_bootstrap_auto_approve=true
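
If you would rather see which requests are actually pending before approving anything, the following is a rough sketch that filters the JSON form of the CSR list. It assumes the input is the text printed by "oc get csr -o json" and that a request with no status conditions has been neither approved nor denied. The function name is hypothetical.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return the names of CSRs that have no approval or denial recorded.

    `csr_list_json` is assumed to be the output of `oc get csr -o json`.
    """
    items = json.loads(csr_list_json).get("items", [])
    pending = []
    for item in items:
        # An undecided CSR carries an empty (or missing) status.conditions.
        conditions = (item.get("status") or {}).get("conditions") or []
        if not conditions:
            pending.append(item["metadata"]["name"])
    return pending
```

The returned names can then be reviewed and passed to "oc adm certificate approve", rather than approving every request in one go.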
>
> On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon wrote:
>
>> Maybe an uncanny coincidence, but we think the cluster was created
>> almost EXACTLY 1 year before it failed.
>> On 31/03/2020 16:17, Ben Holmes wrote:
>>
>> Hi Tim,
>>
>> Can you verify that the host's clocks are being synced correctly as per
>> Simon's other suggestion?
>>
>> Ben
>>
>> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon wrote:
>>
>>> Hi Simon,
>>>
>>> we've run those playbooks and all certs are reported as still being
>>> valid.
>>>
>>> Tim
>>>
>>> On 31/03/2020 15:59, Simon Krenger wrote:
>>> > Hi Tim,
>>> >
>>> > Note that there are multiple sets of certificates, both external and
>>> > internal. So it would be worth checking the certificates again using
>>> > the Certificate Expiration Playbooks (see link below). The
>>> > documentation also has an overview of what can be done to renew
>>> > certain certificates:
>>> >
>>> > - [ Redeploying Certificates ]
>>> >   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>>> >
>>> > Apart from checking all certificates, I'd certainly review the time
>>> > synchronisation for the whole cluster, as we see the message "x509:
>>> > certificate has expired or is not yet valid".
>>> >
>>> > I hope this helps.
>>> >
>>> > Kind regards
>>> > Simon
>>> >
>>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon wrote:
>>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>>> >> obvious reason.
>>> >>
>>> >> The origin-node service on the nodes does not start (times out).
>>> >> The master-api pod is running on the master.
>>> >> The nodes can access the master-api endpoints.
>>> >>
>>> >> The logs of the master-api pod look mostly OK other than a huge number
>>> >> of warnings about certificates that don't really make sense, as the
>>> >> certificates are valid (we use named certificates from Let's Encrypt;
>>> >> they were renewed about 2 weeks ago and all appear to be correct).
>>> >>
>>> >> Examples of errors from the master-api pod are:
>>> >>
>>> >> I0331 12:46:57.065147 1 establishing_controller.go:73] Starting EstablishingController
>>> >> I0331 12:46:57.065561 1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>>> >> I0331 12:46:57.071932 1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>>> >> I0331 12:46:57.072036 1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>>> >> I0331 12:46:57.072141 1 logs.go:49] http: TLS handshake error
>>&g