Re: cluster stopped working - certificate problems

2020-03-31 Thread Brian Jarvis
The certificate expiration check playbook was recently updated to include
this check for the nodes [0].

[0] https://github.com/openshift/openshift-ansible/pull/11967
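
For anyone scripting the manual approval quoted further down, here is a
minimal sketch that approves only the requests still marked Pending rather
than every CSR. It assumes CONDITION is the last column of the "oc get csr"
table; the function name pending_csr_names is made up for illustration.

```shell
# Print the names of CSRs whose last column (CONDITION) is exactly "Pending".
# Reads the table printed by "oc get csr" on stdin. The header row is
# skipped automatically because its last field is "CONDITION".
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Possible usage on a master (not run here; needs cluster-admin access,
# and "xargs -r" is a GNU extension that skips the call on empty input):
#   oc get csr | pending_csr_names | xargs -r oc adm certificate approve
```

Already approved or issued requests are left alone, which keeps the
approve call short when the list has grown to thousands of entries.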

On Tue, Mar 31, 2020 at 1:12 PM Tim Dudgeon  wrote:

> So thanks for the help in fixing this problem. Much appreciated.
>
> Having looked at it now after the event I have 2 concerns.
>
> 1. Whilst this is documented [1], its significance is not mentioned.
> Unless you approve the CSRs as described (either manually or
> automatically), your cluster will stop working 1 year after being deployed!
>
> 2. The playbooks that check certificate expiry [2] do not catch this
> problem.
>
> Thanks
> Tim
>
> [1]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#cert-expiry-managing-csrs
>
> [2]
> https://docs.okd.io/3.11/install_config/redeploying_certificates.html#install-config-cert-expiry
>
>
> On 31/03/2020 17:05, Brian Jarvis wrote:
>
> Hello Tim,
>
> Each node has a client certificate that expires after one year.
> Run "oc get csr"; you should see many pending requests, possibly
> thousands.
>
> To clear those run "oc get csr -o name | xargs oc adm certificate approve"
>
> One way to prevent this in the future is to deploy/enable the auto
> approver statefulset with the following command.
> ansible-playbook -vvv -i [inventory_file]
> /usr/share/ansible/openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml
> -e openshift_master_bootstrap_auto_approve=true
>
> On Tue, Mar 31, 2020 at 11:53 AM Tim Dudgeon wrote:
>
>> Maybe an uncanny coincidence, but we think the cluster was created
>> almost EXACTLY 1 year before it failed.
>> On 31/03/2020 16:17, Ben Holmes wrote:
>>
>> Hi Tim,
>>
>> Can you verify that the hosts' clocks are being synced correctly as per
>> Simon's other suggestion?
>>
>> Ben
>>
>> On Tue, 31 Mar 2020 at 16:05, Tim Dudgeon  wrote:
>>
>>> Hi Simon,
>>>
>>> we've run those playbooks and all certs are reported as still being
>>> valid.
>>>
>>> Tim
>>>
>>> On 31/03/2020 15:59, Simon Krenger wrote:
>>> > Hi Tim,
>>> >
>>> > Note that there are multiple sets of certificates, both external and
>>> > internal. So it would be worth checking the certificates again using
>>> > the Certificate Expiration Playbooks (see link below). The
>>> > documentation also has an overview of what can be done to renew
>>> > certain certificates:
>>> >
>>> > - [ Redeploying Certificates ]
>>> >   https://docs.okd.io/3.11/install_config/redeploying_certificates.html
>>> >
>>> > Apart from checking all certificates, I'd certainly review the time
>>> > synchronisation for the whole cluster, as we see the message "x509:
>>> > certificate has expired or is not yet valid".
>>> >
>>> > I hope this helps.
>>> >
>>> > Kind regards
>>> > Simon
>>> >
>>> > On Tue, Mar 31, 2020 at 4:33 PM Tim Dudgeon wrote:
>>> >> One of our OKD 3.11 clusters has suddenly stopped working without any
>>> >> obvious reason.
>>> >>
>>> >> The origin-node service on the nodes does not start (times out).
>>> >> The master-api pod is running on the master.
>>> >> The nodes can access the master-api endpoints.
>>> >>
>>> >> The logs of the master-api pod look mostly OK other than a huge number
>>> >> of warnings about certificates that don't really make sense as the
>>> >> certificates are valid (we use named certificates from Let's Encrypt and
>>> >> they were renewed about 2 weeks ago and all appear to be correct).
>>> >>
>>> >> Examples of errors from the master-api pod are:
>>> >>
>>> >> I0331 12:46:57.065147   1 establishing_controller.go:73] Starting EstablishingController
>>> >> I0331 12:46:57.065561   1 logs.go:49] http: TLS handshake error from 192.168.160.17:58024: EOF
>>> >> I0331 12:46:57.071932   1 logs.go:49] http: TLS handshake error from 192.168.160.19:48102: EOF
>>> >> I0331 12:46:57.072036   1 logs.go:49] http: TLS handshake error from 192.168.160.19:37178: EOF
>>> >> I0331 12:46:57.072141   1 logs.go:49] http: TLS handshake error
>>
>> >> Huge number of this second sort.
>> >>
>> >> Any ideas what is wrong?
>> >>
>> >>
>> >>
>> >> ___
>> >> users mailing list
>> >> users@lists.openshift.redhat.com
>> >> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>> >
>> >
>>
>>
>>
>
> --
>
> BENJAMIN HOLMES
>
> SENIOR Solution ARCHITECT
>
>


-- 


Brian Jarvis, RHCE

Technical Account Manager

Red Hat North America <https://www.redhat.com/>


Re: router stats

2019-10-16 Thread Brian Jarvis
Information on accessing the router metrics can be found in the documentation [0].

[0]
https://docs.okd.io/3.11/install_config/router/default_haproxy_router.html#exposing-the-router-metrics
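
As a rough illustration of working with that endpoint, the sketch below
totals one metric from Prometheus text-format output. Port 1936 is the one
mentioned in this thread. The STATS_USERNAME and STATS_PASSWORD variables
reflect my understanding of the router deployment's environment, while
ROUTER_HOST, METRIC_NAME and the function name sum_metric are placeholders.
Treat all of these as assumptions to check against your own router.

```shell
# Sum all samples of one metric from Prometheus text-format input on stdin.
# Assumes each sample line ends with its value (no trailing timestamp).
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)           # drop the {label="..."} part
      if (metric == name) total += $NF   # the value is the last field
    }
    END { print total + 0 }              # prints 0 if nothing matched
  '
}

# Possible usage against a router (not run here; values are placeholders):
#   curl -s -u "$STATS_USERNAME:$STATS_PASSWORD" \
#     "http://ROUTER_HOST:1936/metrics" | sum_metric METRIC_NAME
```

The metrics come from the controller process, as Clayton notes below, which
is why nothing about port 1936 shows up in haproxy.config.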


On Tue, Oct 15, 2019 at 6:09 AM Tim Dudgeon  wrote:

> So how do I access these?
>
> And are the docs here [1] wrong?
>
> [1] https://docs.okd.io/3.11/admin_guide/router.html
> On 14/10/2019 19:26, Clayton Coleman wrote:
>
> Metrics are exposed via the controller process in the pod (pid1), not the
> HAProxy process.
>
> On Mon, Oct 14, 2019 at 1:27 PM Tim Dudgeon  wrote:
>
>> I'm trying to see the router stats as described here:
>> https://docs.okd.io/3.11/admin_guide/router.html
>>
>> I can see this from within the container using the command:
>>
>> echo 'show stat' | socat - UNIX-CONNECT:/var/lib/haproxy/run/haproxy.sock
>>
>> But they do not seem to be exposed through the web listener as
>> described in that doc. In fact I can't see anything in the
>> haproxy.config file that suggests that haproxy is exposing stats on port
>> 1936 or any other port.
>>
>> The installation was a fairly standard openshift-ansible install so I'm
>> sure the defaults have not been changed.
>>
>> Are there any instructions for how to get this working?
>>
>> Thanks
>> Tim
>>


Re: Changing network MTU

2019-08-27 Thread Brian Jarvis
Tim,

You need to set the MTU of the OpenShift SDN to be lower than the MTU of
the NIC.

This is described in
https://docs.openshift.com/container-platform/3.11/scaling_performance/network_optimization.html#scaling-performance-optimizing-mtu
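
To make the arithmetic concrete, here is a small sketch. It assumes the
usual 50 bytes of VXLAN overhead for the SDN overlay, which is the minimum
gap the linked page asks for; the names SDN_OVERHEAD and sdn_mtu_for_nic
are made up for illustration.

```shell
# Overhead that the SDN overlay (VXLAN) adds to each packet, in bytes.
# 50 is an assumption taken from the documented minimum gap.
SDN_OVERHEAD=50

# Print the largest SDN MTU that still fits inside the given NIC MTU.
sdn_mtu_for_nic() {
  echo $(( $1 - SDN_OVERHEAD ))
}

sdn_mtu_for_nic 1500   # prints 1450
sdn_mtu_for_nic 9000   # prints 8950
```

This would also explain the breakage you saw: with the NICs dropped to 1500
while the SDN was presumably still configured for a jumbo-frame value, the
overlay MTU was larger than what the physical interfaces could carry.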




On Tue, Aug 27, 2019 at 12:02 PM Tim Dudgeon  wrote:

> In one of our OKD 3.11 environments the hosting provider wanted to
> change the network MTU from 9000 to 1500 and did that for all the
> physical network interfaces of all the nodes.
>
> This caused the OpenShift networking to break completely. Resetting back
> to 9000 restored things.
>
> Is there a way to allow for this to be done on a running OpenShift system?
>
> Thanks
> Tim
>