Re: [Veritas-ha] LLT errors - delayed and lost hb ticks

Philippe Belliard Mon, 18 Sep 2006 01:54:25 -0700


Hi Winston,

You can fix your problem by using the SAP (Service Access Point) setting in the LLT configuration file (/etc/llttab).

I have attached an explanation of it and of when you need to configure it.

Regards,

Philippe

############################################################################################################

Details:

A service access point (SAP), also known as "ethertype," is a network layer 2 method of differentiating between various communications protocols. All layer 2 packets have an assigned SAP: there is a SAP for NetBIOS, another for SNA, another for NetWare, and so on. The default value for an LLT packet is 0xCAFE. Under certain circumstances, you may want to assign a unique SAP to each LLT link to eliminate any confusion in identifying packets at the receiving node.

The LLT heartbeat communications are broadcast to all nodes, while other messages, such as VERITAS Cluster Server (tm) state information, are relayed point to point. Occasionally, if LLT links share a hub (or a VLAN), you may see messages referring to lost heartbeats on the console or in the /var/adm/messages file (see below). This usually happens because both links can "see" each other's traffic and, on Sun hardware, the EEPROM variable local-mac-address? is "false". By default, Sun uses a variation of the hostID as the MAC address for *all network interfaces* in the box; you can change this variable to "true" and reboot to obtain a unique MAC address for each interface.

Here is an example of the lost heartbeat messages found in the messages file:

Dec 11 09:39:35 testserver llt: LLT:10023: lost 6 hb seq 5879 from 0 link 0
Dec 11 09:39:35 testserver llt: LLT:10023: lost 6 hb seq 5879 from 0 link 1
Dec 11 09:39:40 testserver llt: LLT:10019: delayed hb 1150 ticks from 1 link 1
Dec 11 09:39:40 testserver llt: LLT:10023: lost 22 hb seq 7138 from 1 link 1
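
To see quickly which peer nodes and links are affected, messages like these can be tallied with a short script. This is only a minimal sketch, not a Veritas tool: the regular expression is inferred from the sample lines in this thread rather than from any format specification, and the function name tally_llt_messages is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample messages in this thread (an assumption,
# not an official description of the LLT message format).
LLT_MSG = re.compile(
    r"(?:delayed hb (?P<ticks>\d+) ticks|lost (?P<lost>-?\d+) hb seq (?P<seq>\d+))"
    r" from (?P<node>\d+) link (?P<link>\d+)"
)


def tally_llt_messages(lines):
    """Count 'lost' and 'delayed' heartbeat messages per (node, link)."""
    lost = Counter()
    delayed = Counter()
    for line in lines:
        m = LLT_MSG.search(line)
        if m is None:
            continue  # not an LLT heartbeat message
        key = (int(m.group("node")), int(m.group("link")))
        if m.group("lost") is not None:
            lost[key] += 1
        else:
            delayed[key] += 1
    return lost, delayed


sample = [
    "Dec 11 09:39:35 testserver llt: LLT:10023: lost 6 hb seq 5879 from 0 link 0",
    "Dec 11 09:39:35 testserver llt: LLT:10023: lost 6 hb seq 5879 from 0 link 1",
    "Dec 11 09:39:40 testserver llt: LLT:10019: delayed hb 1150 ticks from 1 link 1",
    "Dec 11 09:39:40 testserver llt: LLT:10023: lost 22 hb seq 7138 from 1 link 1",
]

lost, delayed = tally_llt_messages(sample)
for (node, link), count in sorted(lost.items()):
    print(f"node {node} link {link}: {count} lost-heartbeat message(s)")
```

In practice the list of lines would come from /var/adm/messages. If the same peer node shows up on every link at the same timestamps, that matches the shared-hub or shared-VLAN situation described above.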

The manpage for llttab states it this way:

"Also, if there is more than one lowpri link configured on a system and sharing the same network, then each must use a unique SAP or else LLT will generate messages referring to lost heartbeats on the console, since links are normally required to be completely isolated."

Note that this statement applies only to configurations with more than one lowpri link across a single public network, where separating the links is impossible (it is the same public network). It appears only in the manpage and is not specifically addressed in other documentation, because missed heartbeats are most often corrected by examining the physical cabling of the private networks and resolving cable mismatches. A typical mismatch is a port listed in llttab as a normal "high priority" or "private" link that is in fact connected to the public network instead of the properly cabled private network.

Multiple heartbeat links in a given network broadcast domain (the same hub or switch VLAN) are supported only when you have at least one additional private heartbeat link in a separate broadcast domain (hub or VLAN) on separate electrical power, to eliminate the single point of failure. For example, the public networks use VLAN1 and each private network uses its own VLAN, with VLAN2 hosted on a physically separate switch and power circuit from VLAN3. See the sample llttab below.

If you experience the "lost heartbeats" messages described above when using multiple lowpri links, you may assign unique SAP tags to each link in llttab:

set-node 0
# omitting the set-cluster directive causes LLT to assume a default ID of 0
set-cluster 0
link-lowpri mylink0 /dev/hme:0 - ether 0xcaf0 -
link-lowpri mylink1 /dev/hme:1 - ether 0xcaf1 -
link qfe:2 /dev/qfe:2 - ether - -
link mylink3 /dev/qfe:3 - ether - -
# exclude the following node IDs
exclude 2-31
start

# lltstat -l
LLT link information:
Link  Tag      State  Type   Pri     SAP     MTU   Addrlen
      Xmit     Recv   Err    LateHB
      Broadcast
0     mylink0  on     ether  lowpri  0xCAF0  1500  6        ## note "lowpri" in output
      250705   241197 0      0
      FF:FF:FF:FF:FF:FF

1     mylink1  on     ether  lowpri  0xCAF1  1500  6
      77456    222957 0      0
      FF:FF:FF:FF:FF:FF

2     qfe:2    on     ether  hipri   0xCAFE  1500  6        ## note "hipri" in output
      22846    241197 0      0
      FF:FF:FF:FF:FF:FF

3     mylink3  on     ether  hipri   0xCAFE  1500  6
      34564    241197 0      0
      FF:FF:FF:FF:FF:FF
#

With this hypothetical configuration, cable the interfaces to your switches as follows:
/dev/hme:0   SWITCH-A   VLAN1 (public)
/dev/hme:1   SWITCH-B   VLAN1 (public)
/dev/qfe:2   SWITCH-A   VLAN2   # the qfe:2 interface of every node uses SWITCH-A
/dev/qfe:3   SWITCH-B   VLAN3   # the qfe:3 interface of every node uses SWITCH-B

It is recommended that the VLANs used for private communications not be trunked across switches.
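
The unique-SAP rule for lowpri links in the sample llttab above can be checked mechanically. The following is a minimal sketch under stated assumptions: the field order (directive, tag, device, node range, link type, SAP, MTU) is read off the sample lines rather than taken from the manpage, a "-" in the SAP field is treated as the default 0xCAFE mentioned earlier, and the function name duplicate_lowpri_saps is hypothetical.

```python
from collections import defaultdict

DEFAULT_SAP = 0xCAFE  # default LLT SAP, as stated in the text above


def duplicate_lowpri_saps(llttab_text):
    """Return {sap: [tags]} for every SAP shared by more than one link-lowpri line."""
    by_sap = defaultdict(list)
    for raw in llttab_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        fields = line.split()
        # Assumed field order, taken from the sample:
        # directive tag device node-range link-type SAP MTU
        if len(fields) < 6 or fields[0] != "link-lowpri":
            continue
        tag, sap_field = fields[1], fields[5]
        sap = DEFAULT_SAP if sap_field == "-" else int(sap_field, 16)
        by_sap[sap].append(tag)
    return {sap: tags for sap, tags in by_sap.items() if len(tags) > 1}


good = """\
set-node 0
set-cluster 0
link-lowpri mylink0 /dev/hme:0 - ether 0xcaf0 -
link-lowpri mylink1 /dev/hme:1 - ether 0xcaf1 -
link qfe:2 /dev/qfe:2 - ether - -
link mylink3 /dev/qfe:3 - ether - -
# exclude the following node IDs
exclude 2-31
start
"""

# Same file with both lowpri links left on the default SAP.
bad = good.replace("0xcaf0", "-").replace("0xcaf1", "-")

for sap, tags in duplicate_lowpri_saps(bad).items():
    print(f"SAP {sap:#06x} is shared by lowpri links: {', '.join(tags)}")
```

The check looks only at link-lowpri lines, because the manpage statement quoted above concerns lowpri links sharing one network. The two hipri links in the sample both keep the default SAP and rely on being in separate VLANs instead.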

############################################################################################################

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Senicka
Sent: Friday, 15 September 2006 17:43
To: Kawaley Winston; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] LLT errors - delayed and lost hb ticks

You have two LLT streams sharing common infrastructure/switch/VLAN.

Each LLT link must be completely independent and neither stream should see packets from the other.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kawaley Winston
Sent: Friday, September 15, 2006 11:18 AM
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] LLT errors - delayed and lost hb ticks

Hi all,

We are running VCS 4.1 on two Solaris 9 systems and have configured a local cluster for our configuration management software, ClearCase. Recently I have been receiving a lot of the following LLT latency errors:

Sep 14 17:24:18 ncfbvcs01 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 18561 ticks from 1 link 0 (bge1)
Sep 14 17:24:18 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 373 hb seq 30608288 from 1 link 0 (bge1)
Sep 14 17:24:18 ncfbvcs01 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 18561 ticks from 1 link 1 (bge2)
Sep 14 17:24:18 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 373 hb seq 30608288 from 1 link 1 (bge2)
Sep 14 17:24:18 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost -4 hb seq 30608285 from 1 link 1 (bge2)
Sep 14 17:24:18 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost -4 hb seq 30608285 from 1 link 0 (bge1)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 2955 ticks from 1 link 1 (bge2)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 62 hb seq 30608348 from 1 link 1 (bge2)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 794702 kern.notice] LLT INFO V-14-1-10019 delayed hb 2955 ticks from 1 link 0 (bge1)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost 62 hb seq 30608348 from 1 link 0 (bge1)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost -4 hb seq 30608345 from 1 link 0 (bge1)
Sep 14 17:24:48 ncfbvcs01 llt: [ID 602713 kern.notice] LLT INFO V-14-1-10023 lost -4 hb seq 30608345 from 1 link 1 (bge2)

Does anyone know what exactly is causing these delayed and lost ticks and how they can be corrected?

Thanks,

Winston Kawaley

_______________________________________________
Veritas-ha maillist  -  Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha
