Please correct me if I am wrong. After a bit more digging, I found that a corrupted command_id is indeed what causes this problem. Although the tag and command_id range is checked as you said, the elements in rqs are not guaranteed to be non-NULL, so blk_mq_tag_to_rq() can still return NULL even when the range check passes. The current sanitization is clearly not enough, and there is a further implication: when all rqs are populated, a corrupted command_id may silently corrupt data that does not belong to the current command.
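
To illustrate both cases, here is a minimal userspace sketch, not the actual driver code. The struct layouts, NR_TAGS, tag_to_rq(), handle_cqe() and the cqe_status values are hypothetical stand-ins I made up for this example; tag_to_rq() is only modeled on the range check in blk_mq_tag_to_rq().

```c
#include <assert.h>
#include <stddef.h>

#define NR_TAGS 4 /* hypothetical queue depth */

/* Simplified stand-ins for the kernel structures, not the real definitions. */
struct request {
	int completed;
};

struct tags {
	unsigned int nr_tags;
	struct request *rqs[NR_TAGS]; /* NULL when nothing is in flight on a tag */
};

/* Modeled on blk_mq_tag_to_rq(): only the range is checked. */
static struct request *tag_to_rq(const struct tags *tags, unsigned int tag)
{
	if (tag < tags->nr_tags)
		return tags->rqs[tag];
	return NULL;
}

enum cqe_status { CQE_OK, CQE_OUT_OF_RANGE, CQE_NO_REQUEST };

/* Hypothetical completion handler that also rejects an empty slot. */
static enum cqe_status handle_cqe(struct tags *tags, unsigned int command_id)
{
	struct request *rq;

	if (command_id >= tags->nr_tags)
		return CQE_OUT_OF_RANGE;

	rq = tag_to_rq(tags, command_id);
	if (!rq)
		return CQE_NO_REQUEST; /* in range, but nothing in flight here */

	rq->completed = 1;
	tags->rqs[command_id] = NULL;
	return CQE_OK;
}

/* Walks through the two cases described above. */
void demo(void)
{
	struct request a = { 0 }, b = { 0 };
	struct tags tags = { .nr_tags = NR_TAGS, .rqs = { &a, &b } };

	/* Corrupted id lands on an empty slot: range check passes, rq is NULL. */
	assert(handle_cqe(&tags, 2) == CQE_NO_REQUEST);

	/* Controller meant tag 1 but reports 0: the wrong request completes. */
	assert(handle_cqe(&tags, 0) == CQE_OK);
	assert(a.completed == 1 && b.completed == 0);
}
```

The NULL check only covers the first case. In the second case the corrupted id points at a live request, so neither check can tell that the wrong request was completed.
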

- Tong

On Thu, Sep 17, 2020 at 8:44 PM Tong Zhang <[email protected]> wrote:
>
> Hmm..Yeah.. I see your point.
> I was naively thinking the command_id was the culprit.
>
> On Thu, Sep 17, 2020 at 1:14 PM Keith Busch <[email protected]> wrote:
> >
> > On Thu, Sep 17, 2020 at 12:56:59PM -0400, Tong Zhang wrote:
> > > The command_id in CQE is writable by NVMe controller, driver should
> > > check its sanity before using it.
> >
> > We already do that.

