On 4/5/19 8:18 PM, Rohit Saini wrote:
*Further update on this:*
This issue is resolved now. The ILO was discarding the "Neighbor Advertisement" (NA) because the Solicited flag was set in the NA message, and hence it was not updating its local neighbor table. Per RFC 4861, the Solicited flag should be set in an NA message only when it is a response to a Neighbor Solicitation. After clearing the Solicited flag in our NA messages, the ILO started updating its local neighbor cache.
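
For anyone checking the same thing: the NA flags can be verified on the wire with tcpdump, along these lines (eth0 is just an example interface, and the filter assumes no IPv6 extension headers):

# tcpdump -i eth0 -vv 'icmp6 and ip6[40] == 136'    (ICMPv6 type 136 = NA; -vv prints Flags [solicited, override])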

Hi Rohit,

Sounds great that after the change you get consistent behaviour. As I have not worked with IPv6 for quite some time, I wonder how you disabled the 'Solicited flag'. Was this done on the OS (cluster node) or on the iLO? My guess is the OS, but I have no idea how that can be accomplished.
Can you share which setting you changed to accomplish this? :)
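
(A side note while asking: if the unsolicited NA were generated by the kernel itself rather than by an application, the Solicited flag would already be clear. Linux can emit such an NA when an interface comes up or its MAC address changes, if the per-interface ndisc_notify sysctl is enabled; eth0 below is just an assumed interface name:

# sysctl -w net.ipv6.conf.eth0.ndisc_notify=1

But that is only a guess about where your NA comes from.)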

One additional note: the observation here is that you are using a "floating IP" that relocates to the other machine, while the cluster configuration does not seem to contain any IPaddr2 resource representing this address. I would guess that a cluster without the floating address would not have this issue, as it would use the addresses assigned to the nodes, so the mapping between IP address and MAC address would not change even when the fence_ilo4 resources move between nodes. If the intention is to use the floating address in this cluster, I would suggest checking whether the issue also disappears when the floating address is not used or is disabled, to see how fence_ilo4 communicates. There might also be a way in the routing tables to set which IPv6 address should communicate with the iLO IPv6 address, so you get consistent behaviour instead of using the floating IP address.
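
If you want to experiment with that, a minimal sketch would be a host route towards the iLO with an explicit preferred source address. Here fd00:1061:37:10::1 is a hypothetical static node address and eth0 an assumed interface; adjust both, and add a gateway if the iLO is not on-link:

# ip -6 route add fd00:1061:37:9002::/128 dev eth0 src fd00:1061:37:10::1

With this in place the node should source its IPMI traffic from the static address regardless of where the floating IP currently lives.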

Anyway, I'm glad that the mystery is resolved.

--
Ondrej


On Fri, Apr 5, 2019 at 2:23 PM Rohit Saini <rohitsaini111.fo...@gmail.com> wrote:

    Hi Ondrej,
    Finally found some lead on this.. We started tcpdump on my machine
    to understand the IPMI traffic. Attaching the capture for your
    reference.
    fd00:1061:37:9021:: is my floating IP and fd00:1061:37:9002:: is my
    ILO IP.
    When resource movement happens, we are initiating the "Neighbor
    Advertisement" for fd00:1061:37:9021:: (which is on new machine now)
    so that peers can update their neighbor table and starts
    communication with new MAC address.
    Looks like ILO is not updating its neighbor table, as it is still
    sending responding to older MAC.
    After sometime, "Neighbor Solicitation" happens and ILO updates the
    neighbor table. Now this ILO becomes reachable and starts responding
    towards new MAC address.

    My ILO firmware is 2.60. We will retry the scenario after upgrading
    the firmware.

    To verify this theory, I flushed the local neighbor table after the
    resource movement, due to which the "Neighbor Solicitation" was
    initiated early, and the delay in getting the ILO response was not
    seen. This fixed the issue.
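
    For reference, the check was done with commands along these lines
    (eth0 assumed; run on the node that took over the floating IP):

    # ip -6 neigh show dev eth0     (inspect the cached entries and their state)
    # ip -6 neigh flush dev eth0    (drop the cache so the next packet triggers a fresh Neighbor Solicitation)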

    We are now more interested in understanding why the ILO could not
    update its neighbor table on receiving the "Neighbor Advertisement".
    FYI, the Override flag in the "Neighbor Advertisement" is already
    set.

    Thanks,
    Rohit

    On Thu, Apr 4, 2019 at 8:37 AM Ondrej <ondrej-clusterl...@famera.cz> wrote:

        On 4/3/19 6:10 PM, Rohit Saini wrote:
         > Hi Ondrej,
         > Please find my reply below:
         >
         > 1.
         > *Stonith configuration:*
         > [root@orana ~]# pcs config
         >  Resource: fence-uc-orana (class=stonith type=fence_ilo4)
         >   Attributes: delay=0 ipaddr=fd00:1061:37:9002:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=orana pcmk_reboot_action=off
         >   Meta Attrs: failure-timeout=3s
         >   Operations: monitor interval=5s on-fail=ignore (fence-uc-orana-monitor-interval-5s)
         >               start interval=0s on-fail=restart (fence-uc-orana-start-interval-0s)
         >  Resource: fence-uc-tigana (class=stonith type=fence_ilo4)
         >   Attributes: delay=10 ipaddr=fd00:1061:37:9001:: lanplus=1 login=xyz passwd=xyz pcmk_host_list=tigana pcmk_reboot_action=off
         >   Meta Attrs: failure-timeout=3s
         >   Operations: monitor interval=5s on-fail=ignore (fence-uc-tigana-monitor-interval-5s)
         >               start interval=0s on-fail=restart (fence-uc-tigana-start-interval-0s)
         >
         > Fencing Levels:
         >
         > Location Constraints:
         > Ordering Constraints:
         >    start fence-uc-orana then promote unicloud-master (kind:Mandatory)
         >    start fence-uc-tigana then promote unicloud-master (kind:Mandatory)
         > Colocation Constraints:
         >    fence-uc-orana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
         >    fence-uc-tigana with unicloud-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
         >
         >
         > 2. This is seen randomly. Since I am using colocation,
         > stonith resources are stopped and started on the new master.
         > At that time, starting of stonith takes a variable amount of
         > time.
         > No other IPv6 issues are seen in the cluster nodes.
         >
         > 3. fence_agent version
         >
         > [root@orana ~]#  rpm -qa|grep  fence-agents-ipmilan
         > fence-agents-ipmilan-4.0.11-66.el7.x86_64
         >
         >
         > *NOTE:*
         > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
         > Applications use IPv6 first" turned on.
         > Attaching corosync logs also.
         >
         > Thanks, increasing the timeout to 60 worked. But that's not
         > exactly what I am looking for. I need to know the exact
         > reason behind the delay in starting these IPv6 stonith
         > resources.
         >
         > Regards,
         > Rohit

        Hi Rohit,

        Thank you for the response.

        From the configuration it is clear that you are using IP
        addresses directly, so a DNS resolution issue can be ruled out.
        There are no messages from fence_ilo4 that would indicate why it
        timed out, so we cannot yet tell what caused the issue. I see
        that you have most probably enabled PCMK_debug=stonith-ng (or
        PCMK_debug=yes).

        It is nice that increasing the timeout worked, but as said in
        the previous email, it may just mask the real reason why the
        monitor/start operation takes longer.

          > Both IPv4 and IPv6 are configured on my ILO, with "iLO Client
          > Applications use IPv6 first" turned on.
        This seems to me to be more related to SNMP communication, which
        we don't use with fence_ilo4 as far as I know. We use ipmitool
        on port 623/udp.
        
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00026111en_us&docLocale=en_US#N104B2
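
        To rule out the cluster layer, you can run the same query by
        hand with ipmitool; this mirrors the command that fence_ilo4
        executes (the xyz credentials are the placeholders from your
        configuration):

        # ipmitool -I lanplus -H fd00:1061:37:9002:: -p 623 -U xyz -P xyz chassis power status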

          > 2. This is seen randomly. Since I am using colocation,
          > stonith resources are stopped and started on the new master.
          > At that time, starting of stonith takes a variable amount of
          > time.
        This is a good observation, which leads me to the question of
        whether the iLO has any kind of session limit set for the user
        that is used here. If there is a session limit, it may be worth
        increasing it and testing whether the same delay can still be
        observed. One situation where this can happen is when one node
        is communicating with the iLO and, during that time,
        communication from the other node needs to happen while the
        limit is 1 connection. The relocation of the resource from one
        node to another might fit this, but this is just speculation;
        the fastest way to prove or reject it would be to increase the
        limit, if there is one, and test it.
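
        A rough way to probe for such a limit, assuming the manual
        ipmitool query above works at all, is to fire two of them at the
        same time (ideally one from each cluster node) and watch whether
        the second one stalls:

        # for i in 1 2; do ipmitool -I lanplus -H fd00:1061:37:9002:: -p 623 -U xyz -P xyz chassis power status & done; wait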

        # What more can be done to figure out what is causing the delay?

        1. fence_ilo4 can be configured with the attribute 'verbose=1'
        to print additional information when it runs. The data look
        similar to the example below and seem to provide timestamps,
        which is great, as we should be able to see when each command
        was run. I don't have a testing machine on which to run
        fence_ilo4, so the example below just shows how it looks when it
        fails with a timeout while connecting.

        Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: notice:
        stonith_action_async_done: Child process 4252 performing action
        'monitor' timed out with signal 15
        Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:
        log_action: fence_ilo4[4252] stderr: [ 2019-04-03 12:33:51,193 INFO:
        Executing: /usr/bin/ipmitool -I lanplus -H
        fe80::f6bd:8a67:7eb5:214f -p
        623 -U xyz -P [set] -L ADMINISTRATOR chassis power status ]
        Apr 03 12:34:11 [4025] fastvm-centos-7-6-31 stonith-ng: warning:
        log_action: fence_ilo4[4252] stderr: [ ]

        # pcs stonith update fence-uc-orana verbose=1

        Note: the above shows that some private data are included in the
        logs, so in case you have something there that is interesting to
        share, make sure to strip out the sensitive data.

        2. The version of fence-agents-ipmilan is not the latest when
        compared to my CentOS 7.6 system
        (fence-agents-ipmilan-4.2.1-11.el7_6.7.x86_64), so you may
        consider upgrading the package, if that is possible, and see
        whether the latest version provided in your distribution helps
        in any way.
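
        On CentOS/RHEL 7 that would simply be:

        # yum update fence-agents-ipmilan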

        3. You may check whether there is any update for the iLO devices
        and see whether the updated version exhibits the same behavior
        with the timeouts. From the logs I cannot tell what version or
        device fence_ilo4 is communicating with.
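
        If it helps, the IPMI-level firmware revision can usually be
        read with ipmitool as well (same placeholders as above):

        # ipmitool -I lanplus -H fd00:1061:37:9002:: -p 623 -U xyz -P xyz mc info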

        4. If there is a more reliable way of triggering the situation
        in which the timeout with the default 20s is observed, you can
        set up a network packet capture with tcpdump (see the example
        command after the two cases below) to see what kind of
        communication is happening during that time. This can help
        establish whether there is any response from the iLO device
        while we wait, or whether the data arrives fast and fence_ilo4
        doesn't do anything with it.
        - In the first case, that would point more to a network or iLO
        communication issue
        - In the second case, that would more likely be an issue with
        fence_ilo4 or the ipmitool that is used for communication
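
        For the capture itself, something along these lines should
        cover both the IPMI traffic and the IPv6 neighbor discovery
        (eth0 assumed):

        # tcpdump -i eth0 -w /tmp/fence.pcap 'udp port 623 or icmp6'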

        NOTE: In case you happen to have a subscription for your
        systems, you can also try reaching technical support to look
        deeper into the collected data. That way you can save the time
        of figuring out how to strip the private parts from the data
        before sharing them here.

        ========================================================================

        --
        Ondrej

