On Sun, Apr 26, 2026 at 10:15:55AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 23, 2026 at 09:16:28AM -0700, Dipayaan Roy wrote:
> > During Function Level Reset recovery, the MANA driver reads
> > hardware BAR0 registers that may temporarily contain garbage values.
> > The SHM (Shared Memory) offset read from GDMA_REG_SHM_OFFSET is used
> > to compute gc->shm_base, which is later dereferenced via readl() in
> > mana_smc_poll_register(). If the hardware returns an unaligned or
> > out-of-range value, the driver must not blindly use it, as this would
> > propagate the hardware error into a kernel crash.
> 
> It is not what we are calling "hardening" if you are hitting actual
> crashes in actual real systems.
> 
> "hardening" is the driver defending against actively malicious
> hardware, operating in ways that will never be seen in real systems,
> attempting to compromise the kernel.
> 
> Drivers working around real world broken/buggy/malfunctioning HW is
> just entirely normal stuff.
>
Hi Jason,

sure, I will drop the hardening label, in v2. 
> > @@ -73,10 +74,25 @@ static int mana_gd_init_pf_regs(struct pci_dev *pdev)
> >     gc->phys_db_page_base = gc->bar0_pa + gc->db_page_off;
> >  
> >     sriov_base_off = mana_gd_r64(gc, GDMA_SRIOV_REG_CFG_BASE_OFF);
> > +   if (sriov_base_off >= gc->bar0_size ||
> > +       !IS_ALIGNED(sriov_base_off, sizeof(u32))) {
> > +           dev_err(gc->dev,
> > +                   "SRIOV base offset 0x%llx out of range or unaligned 
> > (BAR0 size 0x%llx)\n",
> > +                   sriov_base_off, (u64)gc->bar0_size);
> > +           return -EPROTO;
> > +   }
> 
> .. and if it is entirely normal and something that happens is EPROTO
> really the right way to deal with this race, or should the driver be
> looping somehow until the device stabilizes??
This is the current flow of the driver:
mana_serv_reset()
        mana_gd_suspend()
        msleep(MANA_SERVICE_PERIOD * 1000); -> 10s
        mana_gd_resume()
                mana_gd_init_registers();  -> read the garbage followed by fault
        mana_serv_rescan(); -> On mana_gd_resume err(EPROTO) full PCI remove + 
rescan

The 10-second sleep in mana_serv_reset() already happens before
mana_gd_resume() is called, so by the time we read the registers
hardware should have stabilized. If we still see garbage after 10
seconds, it suggests deeper hardware issue where PCI rescan is
recomended from HW. This patch returns -EPROTO on detection
of unaligned / out of range offset and that err code triggers the
mana_serv_rescan().

> 
> Jason

Thanks Jason for the review comments, will post a v2 to drop the
hardening label.

Regards
Dipayaan Roy


Reply via email to