On 8/4/23 19:58, Mike Pattrick wrote:
> Diagnosing connectivity issues involving a bond can be complicated by a
> lack of logging in LACP. It is difficult to determine the health of
> sending and receving LACP packets. This is further complicated by the
> tendency of some switches to toggle the carrier on interfaces that
> experience issues with LACP, which can cause confusion about why an
> interface is flapping down and up again.
> 
> With this patch, OVS will log when LACP packets aren't sent or recieved
> on time.
> 
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2223306
> Signed-off-by: Mike Pattrick <[email protected]>

Hi, Mike.  Thanks for the improvements!

Not a full review, but I have a few comments inline.

> ---
> v2: Added ratelimits to logs
> ---
>  lib/lacp.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/lib/lacp.c b/lib/lacp.c
> index 3252f17eb..22001a39d 100644
> --- a/lib/lacp.c
> +++ b/lib/lacp.c
> @@ -143,6 +143,8 @@ struct member {
>      uint32_t count_link_expired;    /* Num of times link expired */
>      uint32_t count_link_defaulted;  /* Num of times link defaulted */
>      uint32_t count_carrier_changed; /* Num of times link status changed */
> +    long long int last_tx;
> +    long long int last_rx;

Some simple comments would be nice here.

>  };
> 
>  static struct ovs_mutex mutex;
> @@ -387,6 +389,13 @@ lacp_process_packet(struct lacp *lacp, const void 
> *member_,
>          goto out;
>      }
> 
> +    if (member->last_rx && member->status != LACP_CURRENT) {
> +        long long int delay = time_msec() - member->last_rx;
> +        VLOG_DBG_RL(&rl, "%s: %s recieved PDU after expiry, delayed by 
> %lldms "
> +                    "seconds.", lacp->name, member->name, delay);

"delayed by X ms seconds" ?
The "seconds" part should not be there.

> +    }
> +
> +    member->last_rx = time_msec();
>      member->status = LACP_CURRENT;
>      tx_rate = lacp->fast ? LACP_FAST_TIME_TX : LACP_SLOW_TIME_TX;
>      timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * tx_rate);
> @@ -511,6 +520,7 @@ lacp_member_carrier_changed(const struct lacp *lacp, 
> const void *member_,
>                              bool carrier_up)
>      OVS_EXCLUDED(mutex)
>  {
> +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>      struct member *member;
>      if (!lacp) {
>          return;
> @@ -524,6 +534,11 @@ lacp_member_carrier_changed(const struct lacp *lacp, 
> const void *member_,
> 
>      if (member->status == LACP_CURRENT || member->lacp->active) {
>          member_set_expired(member);
> +        VLOG_DBG_RL(&rl, "%s: Expiring LACP due to %s carrier change.",
> +                    lacp->name, member->name);
> +        /* Do not warn about long LACP RX/TX interval if interface was down 
> */

Period at the end.

> +        member->last_rx = 0;
> +        member->last_tx = 0;
>      }
> 
>      if (member->carrier_up != carrier_up) {
> @@ -594,6 +609,7 @@ lacp_member_is_current(const struct lacp *lacp, const 
> void *member_)
>  void
>  lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) OVS_EXCLUDED(mutex)
>  {
> +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>      struct member *member;
> 
>      lacp_lock();
> @@ -603,6 +619,8 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) 
> OVS_EXCLUDED(mutex)
> 
>              if (member->status == LACP_CURRENT) {
>                  member_set_expired(member);
> +                VLOG_DBG_RL(&rl, "%s: Expired member %s because LACP PDU was 
> "
> +                            "not received on time.", lacp->name, 
> member->name);
>                  member->count_link_expired++;
>              } else if (member->status == LACP_EXPIRED) {
>                  member_set_defaulted(member);
> @@ -642,6 +660,13 @@ lacp_run(struct lacp *lacp, lacp_send_pdu *send_pdu) 
> OVS_EXCLUDED(mutex)
>                          ? LACP_FAST_TIME_TX
>                          : LACP_SLOW_TIME_TX);
> 
> +            /* Log if we exceed the tx timer by double the tx rate. */
> +            if (member->last_tx &&
> +                (time_msec() - member->last_tx) > (duration * 2)) {
> +                VLOG_INFO_RL(&rl, "%s: Member %s failed to send LACP PCU "
> +                             "on time.", lacp->name, member->name);
> +            }
> +            member->last_tx = time_msec();
>              timer_set_duration(&member->tx, duration);
>              seq_change(connectivity_seq_get());
>          }
> @@ -800,6 +825,7 @@ member_set_expired(struct member *member) 
> OVS_REQUIRES(mutex)
>      member->partner.state &= ~LACP_STATE_SYNC;
> 
>      timer_set_duration(&member->rx, LACP_RX_MULTIPLIER * LACP_FAST_TIME_TX);
> +    member->last_tx = 0;

Why we clearing this for expired ports?  Shouldn't we still transmit even
if we do not receive anything from the other side?

Not sure how to correctly achieve that, but is seems that last_tx shoudl
be cleared whenever member_may_tx() turns false.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to