Re: [PATCH 2/3 v2] xhci: Fix race related to abort operation
Mathias Nymanwrites: > The retry was a workaround for a case triggered on v4.1.4 by > Vincent Pelletier (added to CC) > >http://marc.info/?l=linux-usb=144031185222108=2 > > The race could very well explain that issue. >From logs of that thread, it would has possibility about this race. >> @@ -320,15 +367,29 @@ static int xhci_abort_cmd_ring(struct xh >> udelay(1000); >> ret = xhci_handshake(>op_regs->cmd_ring, >> CMD_RING_RUNNING, 0, 3 * 1000 * 1000); >> -if (ret == 0) >> -return 0; >> +if (ret < 0) { >> +xhci_err(xhci, "Stopped the command ring failed, " >> + "maybe the host is dead\n"); >> +xhci->xhc_state |= XHCI_STATE_DYING; >> +xhci_halt(xhci); >> +return -ESHUTDOWN; >> +} >> +} > > So if we can verify that the race fix solves the old issue for Vincent > Pelletier then we could get rid of the abort retry above as well. Right. However, I'm not sure (and maybe you neither). So I didn't remove it in this patchset. (After this patchset, I guess the retry will hit only when actual chip internal confusion. I.e. retry will fail, then will chip halted.) Well, anyway, if you decide to try to remove the retry, I also think it would worth to try. However, it would be [PATCH 4/4] or as another patchset. >> +/* >> + * Writing the CMD_RING_ABORT bit should cause a cmd completion event, >> + * however on some host hw the CMD_RING_RUNNING bit is correctly cleared >> + * but the completion event in never sent. Wait 2 secs (arbitrary >> + * number) to handle those cases after negation of CMD_RING_RUNNING. >> + */ > > I'm speculating a bit here, but as we now can sleep, and if we could > get rid of the abort retry couldn't we swap the order of the > xhci_handshake(CMD_RING_RUNNING) busywait and > wait_for_completion_timeout(). We could also use a standard command > timeout time while waiting for the completion > > something like: > > hci_write_64(CMD_RING_ABORT) > timed_out = wait_for_completion_timeout(XHCI_CMD_DEFAULT_TIMEOUT) > xhci_handshake(CMD_RING_RUNNING) > if (handshake fail) { > xhci_halt(), etc.. >return -ESHUTDOWN > } > if (timed out) { >xhci_cleanup_command_queue(xhci); >return > } > > It would reduce the time we spend in the xhci_handshake() busyloop BTW, busyloop removal was done by "[PATCH 3/3] ...". By [PATCH 3/3], we can simply use straight forward coding without busyloop. >> diff -puN drivers/usb/host/xhci.h~xhci-fix-abort-race2 >> drivers/usb/host/xhci.h >> --- xhci/drivers/usb/host/xhci.h~xhci-fix-abort-race22016-11-16 >> 18:38:52.552146764 +0900 >> +++ xhci-hirofumi/drivers/usb/host/xhci.h2016-11-16 18:38:52.554146762 >> +0900 >> @@ -1574,6 +1574,7 @@ struct xhci_hcd { >> struct list_headcmd_list; >> unsigned intcmd_ring_reserved_trbs; >> struct delayed_work cmd_timer; >> +struct completion stop_completion; > > Tiny thing, but maybe change stop_completion to > cmd_ring_stop_completion. to avoid mixing it with stopping host > controller, stop endpoint, stop device etc OK. -- OGAWA Hirofumi -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3 v2] xhci: Fix race related to abort operation
On 16.11.2016 11:50, OGAWA Hirofumi wrote: Current abort operation has race. xhci_handle_command_timeout() xhci_abort_cmd_ring() xhci_write_64(CMD_RING_ABORT) xhci_handshake(5s) do { check CMD_RING_RUNNING udelay(1) ... COMP_CMD_ABORT event COMP_CMD_STOP event xhci_handle_stopped_cmd_ring() restart cmd_ring CMD_RING_RUNNING become 1 again } while () return -ETIMEDOUT xhci_write_64(CMD_RING_ABORT) /* can abort random command */ To do abort operation correctly, we have to wait both of COMP_CMD_STOP event and negation of CMD_RING_RUNNING. But like above, while timeout handler is waiting negation of CMD_RING_RUNNING, event handler can restart cmd_ring. So timeout handler never be notice negation of CMD_RING_RUNNING, and retry of CMD_RING_ABORT can abort random command (BTW, I guess retry of CMD_RING_ABORT was workaround of this race). Nice catch Didn't see that race, with the possible abortion of a random command. The retry was a workaround for a case triggered on v4.1.4 by Vincent Pelletier (added to CC) http://marc.info/?l=linux-usb=144031185222108=2 The race could very well explain that issue. To fix this race, this moves xhci_handle_stopped_cmd_ring() to xhci_abort_cmd_ring(). And timeout handler waits COMP_CMD_STOP event. Sounds reasonable At this point, timeout handler is owner of cmd_ring, and safely restart cmd_ring by using xhci_handle_stopped_cmd_ring(). [FWIW, as bonus, this way would be easily extend to add CMD_RING_PAUSE operation] Signed-off-by: OGAWA Hirofumi--- drivers/usb/host/xhci-mem.c |1 drivers/usb/host/xhci-ring.c | 162 ++ drivers/usb/host/xhci.h |1 3 files changed, 87 insertions(+), 77 deletions(-) diff -puN drivers/usb/host/xhci-mem.c~xhci-fix-abort-race2 drivers/usb/host/xhci-mem.c --- xhci/drivers/usb/host/xhci-mem.c~xhci-fix-abort-race2 2016-11-16 18:38:52.551146764 +0900 +++ xhci-hirofumi/drivers/usb/host/xhci-mem.c 2016-11-16 18:38:52.553146763 +0900 @@ -2344,6 +2344,7 @@ int xhci_mem_init(struct xhci_hcd *xhci, /* init command timeout work */ INIT_DELAYED_WORK(>cmd_timer, xhci_handle_command_timeout); + init_completion(>stop_completion); page_size = readl(>op_regs->page_size); xhci_dbg_trace(xhci, trace_xhci_dbg_init, diff -puN drivers/usb/host/xhci-ring.c~xhci-fix-abort-race2 drivers/usb/host/xhci-ring.c --- xhci/drivers/usb/host/xhci-ring.c~xhci-fix-abort-race2 2016-11-16 18:38:52.552146764 +0900 +++ xhci-hirofumi/drivers/usb/host/xhci-ring.c 2016-11-16 18:39:07.027136278 +0900 @@ -284,6 +284,60 @@ static bool xhci_mod_cmd_timer(struct xh return mod_delayed_work(system_wq, >cmd_timer, delay); } +static struct xhci_command *xhci_next_queued_cmd(struct xhci_hcd *xhci) +{ + return list_first_entry_or_null(>cmd_list, struct xhci_command, + cmd_list); +} + +/* + * Turn all commands on command ring with status set to "aborted" to no-op trbs. + * If there are other commands waiting then restart the ring and kick the timer. + * This must be called with command ring stopped and xhci->lock held. + */ +static void xhci_handle_stopped_cmd_ring(struct xhci_hcd *xhci, +struct xhci_command *cur_cmd) +{ + struct xhci_command *i_cmd; + u32 cycle_state; + + /* Turn all aborted commands in list to no-ops, then restart */ + list_for_each_entry(i_cmd, >cmd_list, cmd_list) { + + if (i_cmd->status != COMP_CMD_ABORT) + continue; + + i_cmd->status = COMP_CMD_STOP; + + xhci_dbg(xhci, "Turn aborted command %p to no-op\n", +i_cmd->command_trb); + /* get cycle state from the original cmd trb */ + cycle_state = le32_to_cpu( + i_cmd->command_trb->generic.field[3]) & TRB_CYCLE; + /* modify the command trb to no-op command */ + i_cmd->command_trb->generic.field[0] = 0; + i_cmd->command_trb->generic.field[1] = 0; + i_cmd->command_trb->generic.field[2] = 0; + i_cmd->command_trb->generic.field[3] = cpu_to_le32( + TRB_TYPE(TRB_CMD_NOOP) | cycle_state); + + /* +* caller waiting for completion is called when command +* completion event is received for these no-op commands +*/ + } + + xhci->cmd_ring_state = CMD_RING_STATE_RUNNING; + + /* ring command
[PATCH 2/3 v2] xhci: Fix race related to abort operation
Current abort operation has race. xhci_handle_command_timeout() xhci_abort_cmd_ring() xhci_write_64(CMD_RING_ABORT) xhci_handshake(5s) do { check CMD_RING_RUNNING udelay(1) ... COMP_CMD_ABORT event COMP_CMD_STOP event xhci_handle_stopped_cmd_ring() restart cmd_ring CMD_RING_RUNNING become 1 again } while () return -ETIMEDOUT xhci_write_64(CMD_RING_ABORT) /* can abort random command */ To do abort operation correctly, we have to wait both of COMP_CMD_STOP event and negation of CMD_RING_RUNNING. But like above, while timeout handler is waiting negation of CMD_RING_RUNNING, event handler can restart cmd_ring. So timeout handler never be notice negation of CMD_RING_RUNNING, and retry of CMD_RING_ABORT can abort random command (BTW, I guess retry of CMD_RING_ABORT was workaround of this race). To fix this race, this moves xhci_handle_stopped_cmd_ring() to xhci_abort_cmd_ring(). And timeout handler waits COMP_CMD_STOP event. At this point, timeout handler is owner of cmd_ring, and safely restart cmd_ring by using xhci_handle_stopped_cmd_ring(). [FWIW, as bonus, this way would be easily extend to add CMD_RING_PAUSE operation] Signed-off-by: OGAWA Hirofumi--- drivers/usb/host/xhci-mem.c |1 drivers/usb/host/xhci-ring.c | 162 ++ drivers/usb/host/xhci.h |1 3 files changed, 87 insertions(+), 77 deletions(-) diff -puN drivers/usb/host/xhci-mem.c~xhci-fix-abort-race2 drivers/usb/host/xhci-mem.c --- xhci/drivers/usb/host/xhci-mem.c~xhci-fix-abort-race2 2016-11-16 18:38:52.551146764 +0900 +++ xhci-hirofumi/drivers/usb/host/xhci-mem.c 2016-11-16 18:38:52.553146763 +0900 @@ -2344,6 +2344,7 @@ int xhci_mem_init(struct xhci_hcd *xhci, /* init command timeout work */ INIT_DELAYED_WORK(>cmd_timer, xhci_handle_command_timeout); + init_completion(>stop_completion); page_size = readl(>op_regs->page_size); xhci_dbg_trace(xhci, trace_xhci_dbg_init, diff -puN drivers/usb/host/xhci-ring.c~xhci-fix-abort-race2 drivers/usb/host/xhci-ring.c --- xhci/drivers/usb/host/xhci-ring.c~xhci-fix-abort-race2 2016-11-16 18:38:52.552146764 +0900 +++ xhci-hirofumi/drivers/usb/host/xhci-ring.c 2016-11-16 18:39:07.027136278 +0900 @@ -284,6 +284,60 @@ static bool xhci_mod_cmd_timer(struct xh return mod_delayed_work(system_wq, >cmd_timer, delay); } +static struct xhci_command *xhci_next_queued_cmd(struct xhci_hcd *xhci) +{ + return list_first_entry_or_null(>cmd_list, struct xhci_command, + cmd_list); +} + +/* + * Turn all commands on command ring with status set to "aborted" to no-op trbs. + * If there are other commands waiting then restart the ring and kick the timer. + * This must be called with command ring stopped and xhci->lock held. + */ +static void xhci_handle_stopped_cmd_ring(struct xhci_hcd *xhci, +struct xhci_command *cur_cmd) +{ + struct xhci_command *i_cmd; + u32 cycle_state; + + /* Turn all aborted commands in list to no-ops, then restart */ + list_for_each_entry(i_cmd, >cmd_list, cmd_list) { + + if (i_cmd->status != COMP_CMD_ABORT) + continue; + + i_cmd->status = COMP_CMD_STOP; + + xhci_dbg(xhci, "Turn aborted command %p to no-op\n", +i_cmd->command_trb); + /* get cycle state from the original cmd trb */ + cycle_state = le32_to_cpu( + i_cmd->command_trb->generic.field[3]) & TRB_CYCLE; + /* modify the command trb to no-op command */ + i_cmd->command_trb->generic.field[0] = 0; + i_cmd->command_trb->generic.field[1] = 0; + i_cmd->command_trb->generic.field[2] = 0; + i_cmd->command_trb->generic.field[3] = cpu_to_le32( + TRB_TYPE(TRB_CMD_NOOP) | cycle_state); + + /* +* caller waiting for completion is called when command +* completion event is received for these no-op commands +*/ + } + + xhci->cmd_ring_state = CMD_RING_STATE_RUNNING; + + /* ring command ring doorbell to restart the command ring */ + if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) && + !(xhci->xhc_state & XHCI_STATE_DYING)) { + xhci->current_cmd = cur_cmd; + xhci_mod_cmd_timer(xhci, XHCI_CMD_DEFAULT_TIMEOUT); + xhci_ring_cmd_db(xhci); + } +} + static int