>> the attached patch

Converted to Gerrit: [1].

Vratko.

[1] https://gerrit.fd.io/r/c/vpp/+/23849

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Juraj Linkeš
Sent: Thursday, December 5, 2019 11:11 AM
To: Lijian Zhang (Arm Technology China) <lijian.zh...@arm.com>; Peter Mikus -X 
(pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Benoit Ganne (bganne) 
<bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; 
vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; 
Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: Re: [vpp-dev] CSIT - performance tests failing on Taishan

Hi Lijian,

The patch helped; I can't reproduce the issue now.

Thanks,
Juraj

-----Original Message-----
From: Lijian Zhang (Arm Technology China) <lijian.zh...@arm.com>
Sent: Thursday, December 5, 2019 7:16 AM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>; Peter Mikus -X (pmikus - 
PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Benoit Ganne (bganne) 
<bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; 
vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; 
Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: RE: CSIT - performance tests failing on Taishan

Hi Juraj,
Could you please try the attached patch?
Thanks.
-----Original Message-----
From: Juraj Linkeš <juraj.lin...@pantheon.tech>
Sent: Wednesday, December 4, 2019 6:12 PM
To: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; 
Benoit Ganne (bganne) <bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) 
<mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; 
Lijian Zhang (Arm Technology China) <lijian.zh...@arm.com>; Honnappa 
Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: RE: CSIT - performance tests failing on Taishan

Hi Ben, Lijian, Honnappa,

The issue reproduces on the second invocation of show pci:
DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver          Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                           EC: A1
                                                                                           SN: MT1745K13032
                                                                                           V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                           RV: 0x ba
0002:f9:00.1   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28   PN: MCX4121A-ACAT_C12
                                                                                           EC: A1
                                                                                           SN: MT1745K13032
                                                                                           V0: 0x 50 43 49 65 47 65 6e 33 ...
                                                                                           RV: 0x ba
DBGvpp# show pci
Address      Sock VID:PID     Link Speed   Driver          Product Name                    Vital Product Data
0000:11:00.0   2  8086:10fb   5.0 GT/s x8  ixgbe
0000:11:00.1   2  8086:10fb   5.0 GT/s x8  ixgbe
Aborted
Makefile:546: recipe for target 'run' failed
make: *** [run] Error 134

I've tried to do some debugging with a debug build:
(gdb) bt
...
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
...
(gdb) frame 5
#5  0x0000ffffbe775000 in format_vlib_pci_vpd (s=0xffff7efa9e80 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=0xffff7ef729b0) at /home/testuser/vpp/src/vlib/pci/pci.c:230
230           else if (*(u16 *) & data[p] == *(u16 *) id)
(gdb) info locals
data = 0xffff7efa9cd0 "PN\025MCX4121A-ACAT_C12    EC\002A1SN\030MT1745K13032", ' ' <repeats 12 times>, "V0\023PCIeGen3 x8        RV\001\272"
id = 0xaaa8000000000000 <error: Cannot access memory at address 0xaaa8000000000000>
indent = 91
string_types = {0xffffbe7b7950 "PN", 0xffffbe7b7958 "EC", 0xffffbe7b7960 "SN", 0xffffbe7b7968 "MN", 0x0}
p = 0
first_line = 1

Looks like something went wrong with the 'id' variable. More is attached.
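
For readers following the backtrace: the 'data' buffer in the gdb output is a sequence of VPD fields, each made of a two-byte keyword, a one-byte length, and the value. The sketch below is only an illustration of looking up one keyword in that layout. It is not the VPP code and not the patch discussed in this thread. The names vpd_find_field and vpd_sample are made up for this example; the sample bytes are copied from the gdb 'data' value above. The sketch treats a NULL id as "not found" and compares keywords with memcmp instead of a u16 pointer cast. It would not by itself protect against the id seen above, which is a non-NULL garbage pointer; why id ends up with that value is not established here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sample bytes copied from the gdb `data` value (82 bytes, no trailing NUL
 * counted). Hypothetical name, used only for this illustration. */
static const char vpd_sample[] =
  "PN\025MCX4121A-ACAT_C12    "
  "EC\002A1"
  "SN\030MT1745K13032            "
  "V0\023PCIeGen3 x8        "
  "RV\001\272";

/*
 * Hypothetical helper, not part of VPP: find the field whose two-byte
 * keyword equals the first two bytes of `id` in a buffer laid out as
 * [2-byte keyword][1-byte length][value bytes] repeated.
 * Returns a pointer to the value and stores its length in *len.
 * Returns NULL if any argument is NULL, the keyword is absent, or a
 * field would extend past the end of the buffer.
 */
static const uint8_t *
vpd_find_field (const uint8_t *data, size_t data_len, const char *id,
                size_t *len)
{
  size_t p = 0;

  if (data == NULL || id == NULL || len == NULL)
    return NULL;

  /* Every field needs at least its 3-byte header. */
  while (p + 3 <= data_len)
    {
      size_t field_len = data[p + 2];

      /* Stop on a truncated field rather than reading out of bounds. */
      if (field_len > data_len - (p + 3))
        return NULL;

      /* Byte-wise compare: no alignment or type-punning assumptions. */
      if (memcmp (&data[p], id, 2) == 0)
        {
          *len = field_len;
          return &data[p + 3];
        }

      p += 3 + field_len;
    }

  return NULL;
}
```

With the sample buffer, looking up "EC" returns a pointer to the 2-byte value "A1", and looking up "MN" (absent from this card's data) returns NULL.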

As a temporary workaround (until we fix this), we're going to replace show pci 
with something else in CSIT: https://gerrit.fd.io/r/c/csit/+/23785

Juraj

-----Original Message-----
From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>
Sent: Tuesday, December 3, 2019 3:58 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>; Benoit Ganne (bganne) 
<bga...@cisco.com>; Maciek Konstantynowicz (mkonstan) <mkons...@cisco.com>; 
vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io
Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpo...@cisco.com>; 
lijian.zh...@arm.com; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>
Subject: RE: CSIT - performance tests failing on Taishan

The latest update is that Benoit has no access over VPN, so he tried to 
reproduce the issue in his local lab (x86, I assume).
As a quick fix in CSIT, I will disable the MLX driver on Taishan.

Peter Mikus
Engineer - Software
Cisco Systems Limited

> -----Original Message-----
> From: Juraj Linkeš <juraj.lin...@pantheon.tech>
> Sent: Tuesday, December 3, 2019 3:09 PM
> To: Benoit Ganne (bganne) <bga...@cisco.com>; Peter Mikus -X (pmikus - 
> PANTHEON TECH SRO at Cisco) <pmi...@cisco.com>; Maciek Konstantynowicz
> (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit- 
> d...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli 
> <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
>
> Hi Benoit,
>
> Do you have access to FD.io lab? The Taishan servers are in it.
>
> Juraj
>
> -----Original Message-----
> From: Benoit Ganne (bganne) <bga...@cisco.com>
> Sent: Friday, November 29, 2019 4:03 PM
> To: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) 
> <pmi...@cisco.com>; Juraj Linkeš <juraj.lin...@pantheon.tech>; Maciek 
> Konstantynowicz (mkonstan) <mkons...@cisco.com>; vpp-dev <vpp- 
> d...@lists.fd.io>; csit-...@lists.fd.io
> Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> <vrpo...@cisco.com>; lijian.zh...@arm.com; Honnappa Nagarahalli 
> <honnappa.nagaraha...@arm.com>
> Subject: RE: CSIT - performance tests failing on Taishan
>
> Hi Peter, can I get access to the setup to investigate?
>
> Best
> ben
>
> > -----Original Message-----
> > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco) 
> > <pmi...@cisco.com>
> > Sent: Friday, November 29, 2019 11:08 AM
> > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš 
> > <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) 
> > <mkons...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; 
> > csit-...@lists.fd.io
> > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> > <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; 
> > lijian.zh...@arm.com; Honnappa Nagarahalli 
> > <honnappa.nagaraha...@arm.com>
> > Subject: RE: CSIT - performance tests failing on Taishan
> >
> > +dev lists
> >
> > Peter Mikus
> > Engineer - Software
> > Cisco Systems Limited
> >
> > > -----Original Message-----
> > > From: Peter Mikus -X (pmikus - PANTHEON TECH SRO at Cisco)
> > > Sent: Friday, November 29, 2019 11:06 AM
> > > To: Benoit Ganne (bganne) <bga...@cisco.com>; Juraj Linkeš 
> > > <juraj.lin...@pantheon.tech>; Maciek Konstantynowicz (mkonstan) 
> > > <mkons...@cisco.com>
> > > Cc: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) 
> > > <vrpo...@cisco.com>; Benoit Ganne (bganne) <bga...@cisco.com>; 
> > > lijian.zh...@arm.com; Honnappa Nagarahalli
> > <honnappa.nagaraha...@arm.com>
> > > Subject: CSIT - performance tests failing on Taishan
> > >
> > > Hello all,
> > >
> > > In CSIT we are observing an issue with the Taishan boxes where 
> > > performance tests are failing.
> > > There has been a long and misleading discussion about the potential 
> > > issue, its root cause, and which workaround to apply.
> > >
> > > Issue
> > > =====
> > > VPP restarts after "show pci" is read in a loop over the CLI 
> > > socket '/run/vpp/cli.sock'. CSIT runs this loop against VPP with 
> > > the default startup configuration, using the command below, to 
> > > check that VPP is really up and responding.
> > >
> > > How to reproduce
> > > ================
> > > for i in $(seq 1 120); do echo "show pci" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > The same can be reproduced using vppctl:
> > >
> > > for i in $(seq 1 120); do echo "show pci" | sudo vppctl; sudo netstat -ap | grep vpp; done
> > >
> > > To rule out a problem with the test itself, I used "show version":
> > > for i in $(seq 1 120); do echo "show version" | sudo socat - UNIX-CONNECT:/run/vpp/cli.sock; sudo netstat -ap | grep vpp; done
> > >
> > > This test is passing with "show version" and VPP is not restarted.
> > >
> > >
> > > Root cause
> > > ==========
> > > The root cause seems to be:
> > >
> > > Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
> > > 0x0000ffffbeb4f3d0 in format_vlib_pci_vpd (
> > >     s=0xffff7fabe830 "0002:f9:00.0   0  15b3:1015   8.0 GT/s x8  mlx5_core       CX4121A - ConnectX-4 LX SFP28", args=<optimized out>)
> > >     at /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c:230
> > > 230     /w/workspace/vpp-arm-merge-master-ubuntu1804/src/vlib/pci/pci.c: No such file or directory.
> > > (gdb)
> > > Continuing.
> > >
> > > Thread 1 "vpp_main" received signal SIGABRT, Aborted.
> > > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > > 51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > > (gdb)
> > >
> > >
> > > The issue started after MLX was installed in Taishan.
> > >
> > >
> > > @Benoit Ganne (bganne) can you please help fix the root cause?
> > >
> > > Thank you.
> > >
> > > Peter Mikus
> > > Engineer - Software
> > > Cisco Systems Limited
