----- On Aug 12, 2019, at 7:47 PM, Chris Walker cwal...@cray.com wrote:

> When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for
> example,
>
> Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd: info: pcmk_quorum_notification: Quorum retained | membership=1320 members=1
>
> after ~20s (dc-deadtime parameter), ha-idg-2 is marked 'unclean' and STONITHed
> as part of startup fencing.
>
> There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw
> ha-idg-1 either, so it appears that there was no communication at all between
> the two nodes.
>
> I'm not sure exactly why the nodes did not see one another, but there are
> indications of network issues around this time
>
> 2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now running without any active interface!
>
> so perhaps that's related.

This is the initialization of bond1 on ha-idg-1 during boot. Three seconds later bond1 is fine:

2019-08-09T17:42:19.299886+02:00 ha-idg-2 kernel: [ 1232.117470] tg3 0000:03:04.0 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.299908+02:00 ha-idg-2 kernel: [ 1232.117482] tg3 0000:03:04.0 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.315756+02:00 ha-idg-2 kernel: [ 1232.131565] tg3 0000:03:04.1 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.315767+02:00 ha-idg-2 kernel: [ 1232.131568] tg3 0000:03:04.1 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.351781+02:00 ha-idg-2 kernel: [ 1232.169386] bond1: link status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.351792+02:00 ha-idg-2 kernel: [ 1232.169390] bond1: making interface eth2 the new active one
2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: [ 1232.169473] bond1: first active interface up!
2019-08-09T17:42:19.352532+02:00 ha-idg-2 kernel: [ 1232.169480] bond1: link status definitely up for interface eth3, 1000 Mbps full duplex

The same happens on ha-idg-1:

2019-08-09T17:42:19.168035+02:00 ha-idg-1 kernel: [ 110.164250] tg3 0000:02:00.3 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.168050+02:00 ha-idg-1 kernel: [ 110.164252] tg3 0000:02:00.3 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.168052+02:00 ha-idg-1 kernel: [ 110.164254] tg3 0000:02:00.3 eth3: EEE is disabled
2019-08-09T17:42:19.172020+02:00 ha-idg-1 kernel: [ 110.171378] tg3 0000:02:00.2 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.172028+02:00 ha-idg-1 kernel: [ 110.171380] tg3 0000:02:00.2 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.172029+02:00 ha-idg-1 kernel: [ 110.171382] tg3 0000:02:00.2 eth2: EEE is disabled
...
2019-08-09T17:42:19.244066+02:00 ha-idg-1 kernel: [ 110.240310] bond1: link status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.244083+02:00 ha-idg-1 kernel: [ 110.240311] bond1: making interface eth2 the new active one
2019-08-09T17:42:19.244085+02:00 ha-idg-1 kernel: [ 110.240353] bond1: first active interface up!
2019-08-09T17:42:19.244087+02:00 ha-idg-1 kernel: [ 110.240356] bond1: link status definitely up for interface eth3, 1000 Mbps full duplex

The cluster on ha-idg-1 was started after that, at 17:43:04, and I can't find any further log entries pointing to problems with bond1. So I don't think it's related.
Time is synchronized by NTP.

Bernd

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
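
For anyone who wants to repeat the check above against their own syslog, here is a minimal sketch in Python. It assumes kernel messages in the same ISO-timestamp format as the excerpts in this mail; the function names `bond_events` and `bond_state_at` are made up for illustration and are not part of any cluster tool. It only looks at the two bonding messages quoted in this thread ("now running without any active interface" and "first active interface up" / "making interface ... the new active one"), so treat the result as a hint, not as proof that the link was usable.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
# 2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: [ 1232.169473] bond1: first active interface up!
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"(?P<bond>bond\d+):\s+(?P<msg>.*)$"
)

def bond_events(lines, bond="bond1"):
    """Return (timestamp, host, message) tuples for one bond, sorted by time."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("bond") == bond:
            ts = datetime.fromisoformat(m.group("ts"))
            events.append((ts, m.group("host"), m.group("msg")))
    return sorted(events, key=lambda e: e[0])

def bond_state_at(events, host, when):
    """Return 'up', 'down' or None (no data) for `host` at time `when`.

    The state is taken from the last matching bonding message logged
    at or before `when`.
    """
    state = None
    for ts, ev_host, msg in events:
        if ev_host != host or ts > when:
            continue
        if "without any active interface" in msg:
            state = "down"
        elif "first active interface up" in msg or "making interface" in msg:
            state = "up"
    return state

# Three ha-idg-2 lines taken from the excerpts in this thread.
SAMPLE = [
    "2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now running without any active interface!",
    "2019-08-09T17:42:19.351792+02:00 ha-idg-2 kernel: [ 1232.169390] bond1: making interface eth2 the new active one",
    "2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: [ 1232.169473] bond1: first active interface up!",
]

if __name__ == "__main__":
    # Pacemaker start time on ha-idg-1 as stated above.
    cluster_start = datetime.fromisoformat("2019-08-09T17:43:04+02:00")
    print(bond_state_at(bond_events(SAMPLE), "ha-idg-2", cluster_start))
```

In practice you would feed it the full log, for example `bond_events(open("/var/log/messages"))`; the path is an assumption and depends on the distribution.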