> -----Original Message-----
> From: George Cherian
> Sent: Tuesday, December 1, 2020 10:49 AM
> To: 'Jakub Kicinski' <k...@kernel.org>
> Cc: 'net...@vger.kernel.org' <net...@vger.kernel.org>; 'linux-
> ker...@vger.kernel.org' <linux-kernel@vger.kernel.org>;
> 'da...@davemloft.net' <da...@davemloft.net>; Sunil Kovvuri Goutham
> <sgout...@marvell.com>; Linu Cherian <lcher...@marvell.com>;
> Geethasowjanya Akula <gak...@marvell.com>; 'masahi...@kernel.org'
> <masahi...@kernel.org>; 'willemdebruijn.ker...@gmail.com'
> <willemdebruijn.ker...@gmail.com>; 'sa...@kernel.org'
> <sa...@kernel.org>; 'j...@resnulli.us' <j...@resnulli.us>
> Subject: RE: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> reporters for NPA
> 
> Jakub,
> 
> > -----Original Message-----
> > From: George Cherian
> > Sent: Tuesday, December 1, 2020 9:06 AM
> > To: Jakub Kicinski <k...@kernel.org>
> > Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > da...@davemloft.net; Sunil Kovvuri Goutham
> <sgout...@marvell.com>;
> > Linu Cherian <lcher...@marvell.com>; Geethasowjanya Akula
> > <gak...@marvell.com>; masahi...@kernel.org;
> > willemdebruijn.ker...@gmail.com; sa...@kernel.org; j...@resnulli.us
> > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > reporters for NPA
> >
> > Hi Jakub,
> >
> > > -----Original Message-----
> > > From: Jakub Kicinski <k...@kernel.org>
> > > Sent: Tuesday, December 1, 2020 7:59 AM
> > > To: George Cherian <gcher...@marvell.com>
> > > Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > da...@davemloft.net; Sunil Kovvuri Goutham
> > <sgout...@marvell.com>;
> > > Linu Cherian <lcher...@marvell.com>; Geethasowjanya Akula
> > > <gak...@marvell.com>; masahi...@kernel.org;
> > > willemdebruijn.ker...@gmail.com; sa...@kernel.org; j...@resnulli.us
> > > Subject: Re: [PATCHv5 net-next 2/3] octeontx2-af: Add devlink health
> > > reporters for NPA
> > >
> > > On Thu, 26 Nov 2020 19:32:50 +0530 George Cherian wrote:
> > > > Add health reporters for RVU NPA block.
> > > > NPA Health reporters handle following HW event groups
> > > >  - GENERAL events
> > > >  - ERROR events
> > > >  - RAS events
> > > >  - RVU event
> > > > An event counter per event is maintained in SW.
> > > >
> > > > Output:
> > > >  # devlink health
> > > >  pci/0002:01:00.0:
> > > >    reporter hw_npa
> > > >      state healthy error 0 recover 0  # devlink  health dump show
> > > > pci/0002:01:00.0 reporter hw_npa
> > > >  NPA_AF_GENERAL:
> > > >         Unmap PF Error: 0
> > > >         NIX:
> > > >         0: free disabled RX: 0 free disabled TX: 0
> > > >         1: free disabled RX: 0 free disabled TX: 0
> > > >         Free Disabled for SSO: 0
> > > >         Free Disabled for TIM: 0
> > > >         Free Disabled for DPI: 0
> > > >         Free Disabled for AURA: 0
> > > >         Alloc Disabled for Resvd: 0
> > > >   NPA_AF_ERR:
> > > >         Memory Fault on NPA_AQ_INST_S read: 0
> > > >         Memory Fault on NPA_AQ_RES_S write: 0
> > > >         AQ Doorbell Error: 0
> > > >         Poisoned data on NPA_AQ_INST_S read: 0
> > > >         Poisoned data on NPA_AQ_RES_S write: 0
> > > >         Poisoned data on HW context read: 0
> > > >   NPA_AF_RVU:
> > > >         Unmap Slot Error: 0
> > >
> > > You seem to have missed the feedback Saeed and I gave you on v2.
> > >
> > > Did you test this with the errors actually triggering? Devlink
> > > should store only
> > Yes, this was tested by injecting errors through the devlink health
> > test interface.
> > The dump is generated automatically, and the counters do get out of
> > sync under continuous errors.
> > That shouldn't be much of an issue, as the user can manually clear the
> > dump and re-dump to get the exact state of the counters at any point
> > in time.
> 
> Now that the recover op is added, the devlink error counter and recover
> counter will be correct. The internal counter for each event is needed only
> to show, within a specific reporter, how many times that event occurred.
> 
> Following is a log snippet of the devlink health test run on the hw_nix
> reporter.
> # for i in `seq 1 33` ; do devlink health test pci/0002:01:00.0 reporter hw_nix; done
> //Inject 33 errors (16 of NIX_AF_RVU and 17 of NIX_AF_RAS and NIX_AF_GENERAL errors)
> # devlink health
> pci/0002:01:00.0:
>   reporter hw_npa
>     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
>   reporter hw_nix
>     state healthy error 250 recover 250 last_dump_date 1970-01-01 last_dump_time 00:04:16 grace_period 0 auto_recover true auto_dump true
Oops, there was a log copy-paste error above: the count is not 250 (that was
from a run in which 250 errors were injected). The output for the
33-injection run is:
# devlink health
pci/0002:01:00.0:
  reporter hw_npa
    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
  reporter hw_nix
    state healthy error 33 recover 33 last_dump_date 1970-01-01 last_dump_time 00:02:16 grace_period 0 auto_recover true auto_dump true

> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 1
>         Memory Fault on NIX_AQ_RES_S write: 1
>         AQ Doorbell error: 1
>         Rx on unmapped PF_FUNC: 1
>         Rx multicast replication error: 1
>         Memory fault on NIX_RX_MCE_S read: 1
>         Memory fault on multicast WQE read: 1
>         Memory fault on mirror WQE read: 1
>         Memory fault on mirror pkt write: 1
>         Memory fault on multicast pkt write: 1
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 1
>         Poisoned data on NIX_AQ_RES_S write: 1
>         Poisoned data on HW context read: 1
>         Poisoned data on packet read from mirror buffer: 1
>         Poisoned data on packet read from mcast buffer: 1
>         Poisoned data on WQE read from mirror buffer: 1
>         Poisoned data on WQE read from multicast buffer: 1
>         Poisoned data on NIX_RX_MCE_S read: 1
>   NIX_AF_RVU:
>         Unmap Slot Error: 0
> # devlink health dump clear pci/0002:01:00.0 reporter hw_nix
> # devlink health dump show pci/0002:01:00.0 reporter hw_nix
> NIX_AF_GENERAL:
>         Memory Fault on NIX_AQ_INST_S read: 17
>         Memory Fault on NIX_AQ_RES_S write: 17
>         AQ Doorbell error: 17
>         Rx on unmapped PF_FUNC: 17
>         Rx multicast replication error: 17
>         Memory fault on NIX_RX_MCE_S read: 17
>         Memory fault on multicast WQE read: 17
>         Memory fault on mirror WQE read: 17
>         Memory fault on mirror pkt write: 17
>         Memory fault on multicast pkt write: 17
>   NIX_AF_RAS:
>         Poisoned data on NIX_AQ_INST_S read: 17
>         Poisoned data on NIX_AQ_RES_S write: 17
>         Poisoned data on HW context read: 17
>         Poisoned data on packet read from mirror buffer: 17
>         Poisoned data on packet read from mcast buffer: 17
>         Poisoned data on WQE read from mirror buffer: 17
>         Poisoned data on WQE read from multicast buffer: 17
>         Poisoned data on NIX_RX_MCE_S read: 17
>   NIX_AF_RVU:
>         Unmap Slot Error: 16
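
To cross-check dumps like the two above, the per-event counters can be summed per group. What follows is only a rough sketch and not part of the patch: the function name sum_dump_counters is made up for illustration, and it assumes the plain-text layout shown in this thread (a group header such as "NIX_AF_RAS:" on its own line, followed by event lines ending in ": <count>"). Lines that carry more than one counter, such as the per-NIX lines in the hw_npa dump, would need extra handling.

```shell
#!/bin/sh
# Hypothetical helper: read a `devlink health dump show` text dump on stdin
# and print one "<group> <sum of its event counters>" line per group.
sum_dump_counters() {
    awk '
        # Group header: a single token followed by ":" and nothing else.
        /^[ \t]*[A-Za-z0-9_]+:[ \t]*$/ {
            group = $1
            sub(/:$/, "", group)
            if (!(group in total)) {
                order[++n] = group      # remember first-seen order
                total[group] = 0
            }
            next
        }
        # Event line: free-form description ending in ": <count>".
        /:[ \t]*[0-9]+[ \t]*$/ {
            if (group != "")
                total[group] += $NF     # last field is the counter
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s %d\n", order[i], total[order[i]]
        }
    '
}
```

It would be used by piping a dump into it, for example: devlink health dump show pci/0002:01:00.0 reporter hw_nix | sum_dump_counters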
> >
> > > one dump, are the counters not going to get out of sync unless
> > > something clears the dump every time it triggers?
> Also, note that auto_dump can be turned off by the user:
> # devlink health set pci/0002:01:00.0 reporter hw_nix auto_dump false
> The user can then dump whenever required, and the dump will always return
> the correct counter values.
> 
> >
> > Regards,
> > -George
