Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable

2018-02-08 Thread Keith Busch
On Thu, Feb 08, 2018 at 05:56:49PM +0200, Sagi Grimberg wrote:
> Given the discussion on this set, you plan to respin again
> for 4.16?

With the exception of maybe patch 1, this needs more consideration than
I'd feel okay with for the 4.16 release.


Re: [PATCH V2 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable

2018-02-09 Thread Keith Busch
On Fri, Feb 09, 2018 at 09:50:58AM +0800, jianchao.wang wrote:
> 
> if we set NVME_REQ_CANCELLED and return BLK_EH_HANDLED as the RESETTING case,
> nvme_reset_work will hang forever, because no one could complete the entered 
> requests.

Except it's no longer in the "RESETTING" case since you added the
"CONNECTING" state, so that's already broken for other reasons...


Re: [PATCH RESENT] nvme-pci: suspend queues based on online_queues

2018-02-13 Thread Keith Busch
On Mon, Feb 12, 2018 at 09:05:13PM +0800, Jianchao Wang wrote:
> @@ -1315,9 +1315,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
>   nvmeq->cq_vector = -1;
>   spin_unlock_irq(&nvmeq->q_lock);
>  
> - if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
> - blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
> -

This doesn't look related to this patch.

>   pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
>  
>   return 0;
> @@ -1461,13 +1458,14 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>   nvme_init_queue(nvmeq, qid);
>   result = queue_request_irq(nvmeq);
>   if (result < 0)
> - goto release_sq;
> + goto offline;
>  
>   return result;
>  
> - release_sq:
> +offline:
> + dev->online_queues--;
>   adapter_delete_sq(dev, qid);

In addition to the above, set nvmeq->cq_vector to -1.
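Roughly (untested sketch):

---
offline:
	dev->online_queues--;
	nvmeq->cq_vector = -1;
	adapter_delete_sq(dev, qid);
--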

Everything else that follows doesn't appear to be necessary.

> @@ -2167,6 +2167,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   int i;
>   bool dead = true;
>   struct pci_dev *pdev = to_pci_dev(dev->dev);
> + int onlines;
>  
>   mutex_lock(&dev->shutdown_lock);
>   if (pci_is_enabled(pdev)) {
> @@ -2175,8 +2176,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   if (dev->ctrl.state == NVME_CTRL_LIVE ||
>   dev->ctrl.state == NVME_CTRL_RESETTING)
>   nvme_start_freeze(&dev->ctrl);
> - dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
> - pdev->error_state  != pci_channel_io_normal);
> +
> + dead = !!((csts & NVME_CSTS_CFS) ||
> + !(csts & NVME_CSTS_RDY) ||
> + (pdev->error_state  != pci_channel_io_normal) ||
> + (dev->online_queues == 0));
>   }
>  
>   /*
> @@ -2200,9 +2204,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   nvme_disable_io_queues(dev);
>   nvme_disable_admin_queue(dev, shutdown);
>   }
> - for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
> +
> + onlines = dev->online_queues;
> + for (i = onlines - 1; i >= 0; i--)
>   nvme_suspend_queue(&dev->queues[i]);
>  
> + if (dev->ctrl.admin_q)
> + blk_mq_quiesce_queue(dev->ctrl.admin_q);
> +
>   nvme_pci_disable(dev);
>  
>   blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
> @@ -2341,16 +2350,18 @@ static void nvme_reset_work(struct work_struct *work)
>   if (result)
>   goto out;
>  
> - /*
> -  * Keep the controller around but remove all namespaces if we don't have
> -  * any working I/O queue.
> -  */
> - if (dev->online_queues < 2) {
> +
> + /* In case of online_queues is zero, it has gone to out */
> + if (dev->online_queues == 1) {
> + /*
> +  * Keep the controller around but remove all namespaces if we
> +  * don't have any working I/O queue.
> +  */
>   dev_warn(dev->ctrl.device, "IO queues not created\n");
>   nvme_kill_queues(&dev->ctrl);
>   nvme_remove_namespaces(&dev->ctrl);
>   new_state = NVME_CTRL_ADMIN_ONLY;
> - } else {
> + } else if (dev->online_queues > 1) {
>   nvme_start_queues(&dev->ctrl);
>   nvme_wait_freeze(&dev->ctrl);
>   /* hit this only when allocate tagset fails */
> -- 


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> 
> > That difference has been there since the beginning of DPC, so it has
> > nothing to do with *this* series EXCEPT for the fact that it really
> > complicates the logic you're adding to reset_link() and
> > broadcast_error_message().
> > 
> > We ought to be able to simplify that somehow because the only real
> > difference between AER and DPC should be that DPC automatically
> > disables the link and AER does it in software.
> 
> I agree this should be possible. Code execution path should be almost
> identical to fatal error case.
> 
> Is there any reason why you went to stop driver path, Keith?

The fact is the link is truly down during a DPC event. When the link
is enabled again, you don't know at that point if the device(s) on the
other side have changed. Calling a driver's error handler for the wrong
device in an unknown state may have undefined results. Enumerating the
slot from scratch should be safe, and will assign resources, tune bus
settings, and bind to the matching driver.

Per spec, DPC is the recommended way of handling surprise removal
events, and the spec even recommends that DPC-capable slots *not* set
'Surprise' in Slot Capabilities so that removals are always handled by
DPC. This service driver was developed with that use in mind.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 08:16:38PM +0530, p...@codeaurora.org wrote:
> On 2018-03-12 19:55, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > >
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed. Calling a driver's error handler for the wrong
> > device in an unknown state may have undefined results. Enumerating the
> > slot from scratch should be safe, and will assign resources, tune bus
> > settings, and bind to the matching driver.
> > 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Now it begs the question, that
> 
> after DPC trigger
> 
> should we enumerate the devices, ?
> or
> error handling callbacks, followed by stop devices followed by enumeration ?
> or
> error handling callbacks, followed by enumeration ? (no stop devices)

I'm not sure I understand. The link is disabled while DPC is triggered,
so if anything, you'd want to un-enumerate everything below the contained
port (that's what it does today).

After releasing a slot from DPC, the link is allowed to retrain. If there
is a working device on the other side, a link up event occurs. That
event is handled by the pciehp driver, and that schedules enumeration
no matter what you do to the DPC driver.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 09:04:47PM +0530, p...@codeaurora.org wrote:
> On 2018-03-12 20:28, Keith Busch wrote:
> > I'm not sure I understand. The link is disabled while DPC is triggered,
> > so if anything, you'd want to un-enumerate everything below the
> > contained
> > port (that's what it does today).
> > 
> > After releasing a slot from DPC, the link is allowed to retrain. If
> > there
> > is a working device on the other side, a link up event occurs. That
> > event is handled by the pciehp driver, and that schedules enumeration
> > no matter what you do to the DPC driver.
> 
> yes, that is what currently happens, but this patch-set makes DPC aware of error
> handling driver callbacks.

I've been questioning the utility of doing that since the very first
version of this patch set.

> besides, in absence of pciehp there is nobody to do enumeration.

If you configure your kernel to not have a feature, you don't get to
expect the feature works.

You can still manually rescan through sysfs, /sys/bus/pci/rescan.
 
> And, I was talking about pci_stop_and_remove_bus_device() in dpc.
> if DPC calls driver's error callbacks, is it required to stop the devices  ?

DPC is PCIe's preferred surprise removal mechanism. If you don't want
the DPC driver to handle removing the devices downstream of the containment,
how do you want those devices to get removed? Just calling the drivers' error
callbacks leaves the kernel's view of the topology in the wrong state.


Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 024a1beda008..9cab9d0d51dc 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
>  #else

I recommend stubbing 'pci_sriov_configure_simple' or defining it to
NULL in the '#else' section here so you don't need to repeat the "#ifdef
CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
looks fine to me.


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 01:41:07PM -0400, Sinan Kaya wrote:
> I was just writing a reply to you. You acted first :)
> 
> On 3/12/2018 1:33 PM, Keith Busch wrote:
> >>> After releasing a slot from DPC, the link is allowed to retrain. If
> >>> there
> >>> is a working device on the other side, a link up event occurs. That
> >>> event is handled by the pciehp driver, and that schedules enumeration
> >>> no matter what you do to the DPC driver.
> >> yes, that is what currently happens, but this patch-set makes DPC aware of error
> >> handling driver callbacks.
> > I've been questioning the utility of doing that since the very first
> > version of this patch set.
> > 
> 
> I think we should all agree that shutting down the device drivers with active
> work is not safe. There could be outstanding work that the endpoint driver
> needs to take care of. 
> 
> That was the motivation for this change so that we give endpoint drivers an 
> error callback when something goes wrong. 
> 
> The rest is implementation detail that we can all figure out.

I'm not sure if I agree here. All Linux device drivers are supposed to
cope with sudden/unexpected loss of communication at any time. This
includes cleaning up appropriately when requested to unbind from an
inaccessible device with active outstanding work.


Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 11:09:34AM -0700, Alexander Duyck wrote:
> On Mon, Mar 12, 2018 at 10:40 AM, Keith Busch  wrote:
> > On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 024a1beda008..9cab9d0d51dc 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
> >>  int pci_vfs_assigned(struct pci_dev *dev);
> >>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
> >>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> >> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
> >>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
> >>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
> >>  #else
> >
> > I recommend stubbing 'pci_sriov_configure_simple' or defining it to
> > NULL in the '#else' section here so you don't need to repeat the "#ifdef
> > CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
> > looks fine to me.
> 
> My concern with defining it as NULL is that somebody may end up
> calling it in the future directly and that may end up causing issues.
> One thought I have been debating is moving it to a different file. I
> am just not sure where the best place to put something like this would
> be. I could move this function to drivers/pci/pci.c if everyone is
> okay with it and then I could just strip the contents out by wrapping
> them in a #ifdef instead.

Okay, instead of NULL, a stub implementation in the header file may
suffice when CONFIG_PCI_IOV is not defined:

static inline int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
{
	return -ENOSYS;
}

See pci_iov_virtfn_bus, pci_iov_virtfn_devfn, pci_iov_add_virtfn, or
pci_enable_sriov for other examples.


Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-12 Thread Keith Busch
Hi Jianchao,

The patch tests fine on all hardware I had. I'd like to queue this up
for the next 4.16-rc. Could you send a v3 with the cleanup changes Andy
suggested and a changelog aligned with Ming's insights?

Thanks,
Keith


Re: [PATCH v12 0/6] Address error and recovery for AER and DPC

2018-03-12 Thread Keith Busch
On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote:
> [+cc Alex]
> 
> On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > > 
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed.
> 
> When DPC is triggered, the port takes the link down.  When we handle
> an uncorrectable (nonfatal or fatal) AER event, software takes the
> link down.
> 
> In both cases, devices on the other side are at least reset.  Whenever
> the link goes down, it's conceivable the device could be replaced with
> a different one before the link comes back up.  Is this why you remove
> and re-enumerate?  (See tangent [1] below.)

Yes. Truthfully, DPC events due to changing topologies were the motivating
use case when we initially developed this. We were also going for
simplicity (at least initially), and remove + re-enumerate seemed
safe without concerning this driver with other capability registers, or
coordinating with/depending on other drivers. For example, a successful
reset may depend on any particular driver calling pci_restore_state from
a good saved state.

> The point is that from the device's hardware perspective, these
> scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message
> and it sees the link go down).  I think we should make them the same
> on the software side, too: the driver should see the same callbacks,
> in the same order, whether we're doing AER or DPC.
> 
> If removing and re-enumerating is the right thing for DPC, I think
> that means it's also the right thing for AER.
> 
> Along this line, we had a question a while back about logging AER
> information after a DPC trigger.  Obviously we can't collect any
> logged information from the downstream devices while link is down, but
> I noticed the AER status bits are RW1CS, which means they're sticky
> and are not modified by hot reset or FLR.
> 
> So we may be able to read and log the AER information after the DPC
> driver brings the link back up.  We may want to do the same with AER,
> i.e., reset the downstream devices first, then collect the AER status
> bits after the link comes back up.

I totally agree. We could consult Slot and AER status to guide a
smarter approach.

> > Calling a driver's error handler for the wrong device in an unknown
> > state may have undefined results. Enumerating the slot from scratch
> > should be safe, and will assign resources, tune bus settings, and
> > bind to the matching driver.
> 
> I agree with this; I think this is heading toward doing
> remove/re-enumerate on AER errors as well because it has also reset
> the device.
> 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Thanks for this tip.  The only thing I've found so far is the mention
> of Surprise Down triggering DPC in the last paragraph of sec 6.7.5.
> Are there other references I should look at?  I haven't found the
> recommendation about not setting 'Surprise' in Slot Capabilities yet.

No problem, it's in the "IMPLEMENTATION NOTE" at the end of 6.2.10.4,
"Avoid Disabled Link and Hot-Plug Surprise Use with DPC".

Outside the spec, Microsemi as one of the PCI-SIG contributors and early
adopters of the capability published a white paper "Hot and Surprise
Plug Recommendations for Enterprise PCIe" providing guidance for OS
drivers using DPC. We originally developed to that guidance.

Re: [PATCH V2] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-03-09 Thread Keith Busch
On Thu, Mar 08, 2018 at 08:42:20AM +0100, Christoph Hellwig wrote:
> 
> So I suspect we'll need to go with a patch like this, just with a way
> better changelog.

I have to agree this is required for that use case. I'll run some
quick tests and propose an alternate changelog.

Longer term, the current way we're including present-but-offline CPUs either
(a) has the driver allocate resources it can't use or (b) spreads the
ones it can use thinner than they need to be. Why don't we rerun the
irq spread under a hot cpu notifier for only online CPUs?


Re: [PATCH v2 1/1] nvme: implement log page low/high offset and dwords

2018-03-02 Thread Keith Busch
Thanks, applied for 4.17.

A side note for those interested, some bug fixing commits introduced in
the nvme-4.17 branch were applied to 4.16-rc. I've rebased these on top
of linux-block/for-next so we don't have the duplicate commits for the
4.17 merge window. The nvme branch currently may be viewed here:

  http://git.infradead.org/nvme.git/shortlog/refs/heads/for-next


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Keith Busch
On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe  wrote:
> > @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
> >  {
> > u16 tail = nvmeq->sq_tail;
> 
> > -   if (nvmeq->sq_cmds_io)
> > -   memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
> > -   else
> > -   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> > +   memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> 
> Hmm, how safe is replacing memcpy_toio() with regular memcpy()? On PPC
> the _toio() variant enforces alignment, does the copy with 4 byte
> stores, and has a full barrier after the copy. In comparison our
> regular memcpy() does none of those things and may use unaligned and
> vector load/stores. For normal (cacheable) memory that is perfectly
> fine, but they can cause alignment faults when targeted at MMIO
> (cache-inhibited) memory.
> 
> I think in this particular case it might be ok since we know SQEs are
> aligned to 64 byte boundaries and the copy is too small to use our
> vectorised memcpy(). I'll assume we don't need explicit ordering
> between writes of SQEs since the existing code doesn't seem to care
> unless the doorbell is being rung, so you're probably fine there too.
> That said, I still think this is a little bit sketchy and at the very
> least you should add a comment explaining what's going on when the CMB
> is being used. If someone more familiar with the NVMe driver could
> chime in I would appreciate it.

I may not be understanding the concern, but I'll give it a shot.

You're right, the start of any SQE is always 64-byte aligned, so that
should satisfy alignment requirements.

The order when writing multiple/successive SQEs in a submission queue
does matter, and this is currently serialized through the q_lock.

The order in which the bytes of a single SQE is written doesn't really
matter as long as the entire SQE is written into the CMB prior to writing
that SQ's doorbell register.

The doorbell register is written immediately after copying a command
entry into the submission queue (ignore "shadow buffer" features),
so the ratio of doorbell writes to commands submitted is 1:1.

If a CMB SQE and DB order is not enforced with the memcpy, then we do
need a barrier after the SQE's memcpy and before the doorbell's writel.
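
If that ordering does turn out to be unenforced, the submission path would
look roughly like this (sketch only; the nvmeq field names mirror the
driver's, and the wmb() is the barrier in question):

---
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	if (++tail == nvmeq->q_depth)
		tail = 0;

	/* make the whole SQE visible in the CMB before ringing the doorbell */
	wmb();
	writel(tail, nvmeq->q_db);
	nvmeq->sq_tail = tail;
--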


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Keith Busch
On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> So when reading the above mlx code, we see the first wmb() being used
> to ensure that CPU stores to cachable memory are visible to the DMA
> triggered by the doorbell ring.

IIUC, we don't need a similar barrier for NVMe to ensure memory is
visible to DMA since the SQE memory is allocated DMA-coherent when the
SQ is not within a CMB.
 
> The mmiowb() is used to ensure that DB writes are not combined and not
> issued in any order other than implied by the lock that encloses the
> whole thing. This is needed because uar_map is WC memory.
> 
> We don't have ordering with respect to two writel's here, so if ARM
> performance was a concern the writel could be switched to
> writel_relaxed().
> 
> Presumably nvme has similar requirments, although I guess the DB
> register is mapped UC not WC?

Yep, the NVMe DB register is required by the spec to be mapped UC.
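
For reference, the difference is just in which mapping each region gets;
roughly (illustrative names, not the driver's exact code):

---
	/* BAR0 registers, including the doorbells: uncached mapping */
	dev->bar = ioremap(pci_resource_start(pdev, 0), bar_size);

	/* the CMB gets a write-combining mapping, hence the ordering questions */
	cmb = ioremap_wc(cmb_addr, cmb_size);
--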


Re: [PATCH] nvme: enforce 64bit offset for nvme_get_log_ext fn

2018-03-27 Thread Keith Busch
On Tue, Mar 27, 2018 at 08:00:33PM +0200, Matias Bjørling wrote:
> Compiling on 32 bits system produces a warning for the shift width
> when shifting 32 bit integer with 64bit integer.
> 
> Make sure that offset always is 64bit, and use macros for retrieving
> lower and upper bits of the offset.

Thanks, applied to nvme.


Re: [PATCH] nvme: use upper_32_bits() instead of bit shift

2018-03-28 Thread Keith Busch
On Wed, Mar 28, 2018 at 03:57:47PM +0200, Arnd Bergmann wrote:
> @@ -2233,8 +2233,8 @@ int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>   c.get_log_page.lid = log_page;
>   c.get_log_page.numdl = cpu_to_le16(dwlen & ((1 << 16) - 1));
>   c.get_log_page.numdu = cpu_to_le16(dwlen >> 16);
> - c.get_log_page.lpol = cpu_to_le32(offset & ((1ULL << 32) - 1));
> - c.get_log_page.lpou = cpu_to_le32(offset >> 32ULL);
> + c.get_log_page.lpol = cpu_to_le32(lower_32_bits(offset));
> + c.get_log_page.lpou = cpu_to_le32(upper_32_bits(offset));
>  
>   return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
>  }

Right, Matias posted the same fix here:

 http://lists.infradead.org/pipermail/linux-nvme/2018-March/016474.html

In addition to the type safe shifting, a 64-bit type was used to match
the spec.
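
For the archives, the warning came from shifting what may be a 32-bit type
by 32; the helpers do the split on an unsigned 64-bit value. A quick
userspace equivalent of what they compute (stand-in names, not the kernel
macros themselves):

---
#include <stdint.h>
#include <stdio.h>

/* userspace stand-ins for the kernel's lower_32_bits()/upper_32_bits() */
static inline uint32_t lower_32(uint64_t v) { return (uint32_t)v; }
static inline uint32_t upper_32(uint64_t v) { return (uint32_t)(v >> 32); }

int main(void)
{
	uint64_t offset = 0x0000000100000800ULL;	/* a log page offset past 4GB */

	/* prints lpol=0x00000800 lpou=0x00000001 */
	printf("lpol=0x%08x lpou=0x%08x\n", lower_32(offset), upper_32(offset));
	return 0;
}
--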


Re: [PATCH] nvme: target: fix buffer overflow

2018-03-28 Thread Keith Busch
Thanks, applied.


Re: [PATCH] nvme: unexport nvme_start_keep_alive

2018-03-28 Thread Keith Busch
Thanks, applied.


Re: [PATCH] nvme: don't send keep-alives to the discovery controller

2018-03-28 Thread Keith Busch
Thanks, applied.


Re: [PATCH] nvme-multipath: implement active-active round-robin path selector

2018-03-28 Thread Keith Busch
On Wed, Mar 28, 2018 at 10:06:46AM +0200, Christoph Hellwig wrote:
> For PCIe devices the right policy is not a round robin but to use
> the pcie device closer to the node.  I did a prototype for that
> long ago and the concept can work.  Can you look into that and
> also make that policy used automatically for PCIe devices?

Yeah, that is especially true if you've multiple storage accessing
threads scheduled on different nodes. On the other hand, round-robin
may still benefit if both paths are connected to different root ports
on the same node (who would do that?!).

But I wasn't aware people use dual-ported PCIe NVMe connected to a
single host (single path from two hosts seems more common). If that's a
thing, we should get some numa awareness. I couldn't find your prototype,
though. I had one stashed locally from a while back and hope it resembles
what you had in mind:
---
struct nvme_ns *nvme_find_path_numa(struct nvme_ns_head *head)
{
	/* prefer a live path local to this node, else the topologically nearest one */
	int distance, min_distance = INT_MAX;
	int node = cpu_to_node(smp_processor_id());
	struct nvme_ns *ns, *path = NULL;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (ns->ctrl->state != NVME_CTRL_LIVE)
			continue;
		if (ns->disk->node_id == node)
			return ns;

		distance = node_distance(node, ns->disk->node_id);
		if (distance < min_distance) {
			min_distance = distance;
			path = ns;
		}
	}
	return path;
}
--


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > 
> > I think the scenario you are describing is two systems that are
> > identical except that in the first, the endpoint is below a hotplug
> > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > no physical hotplug (no drive removed or inserted), and DPC is
> > triggered in both systems.
> > 
> > I suggest that DPC should be handled identically in both systems:
> > 
> >   - The PCI core should have the same view of the endpoint: it should
> > be removed and re-added in both cases (or in neither case).
> > 
> >   - The endpoint itself should not be able to tell the difference: it
> > should see a link down event, followed by a link retrain, followed
> > by the same sequence of config accesses, etc.
> > 
> >   - The endpoint driver should not be able to tell the difference,
> > i.e., we should be calling the same pci_error_handlers callbacks
> > in both cases.
> > 
> > It's true that in the non-hotplug system, pciehp probably won't start
> > re-enumeration, so we might need an alternate path to trigger that.
> > 
> > But that's not what we're doing in this patch.  In this patch we're
> > adding a much bigger difference: for hotplug bridges, we stop and
> > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > the AER-style flow of calling pci_error_handlers callbacks.
> 
> Our approach on V12 was to go to AER style recovery for all DPC events
> regardless of hotplug support or not. 
> 
> Keith was not comfortable with this approach. That's why, we special cased
> hotplug.
> 
> If we drop 6/6 on this patch on v13, we achieve this. We still have to
> take care of Keith's inputs on individual patches.
> 
> we have been struggling with the direction for a while.
> 
> Keith, what do you think?

My only concern was for existing production environments that use DPC
for handling surprise removal, and I don't wish to break the existing
uses.


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 08:39:54AM -0600, Keith Busch wrote:
> On Thu, Apr 12, 2018 at 10:34:37AM -0400, Sinan Kaya wrote:
> > On 4/12/2018 10:06 AM, Bjorn Helgaas wrote:
> > > 
> > > I think the scenario you are describing is two systems that are
> > > identical except that in the first, the endpoint is below a hotplug
> > > bridge, while in the second, it's below a non-hotplug bridge.  There's
> > > no physical hotplug (no drive removed or inserted), and DPC is
> > > triggered in both systems.
> > > 
> > > I suggest that DPC should be handled identically in both systems:
> > > 
> > >   - The PCI core should have the same view of the endpoint: it should
> > > be removed and re-added in both cases (or in neither case).
> > > 
> > >   - The endpoint itself should not be able to tell the difference: it
> > > should see a link down event, followed by a link retrain, followed
> > > by the same sequence of config accesses, etc.
> > > 
> > >   - The endpoint driver should not be able to tell the difference,
> > > i.e., we should be calling the same pci_error_handlers callbacks
> > > in both cases.
> > > 
> > > It's true that in the non-hotplug system, pciehp probably won't start
> > > re-enumeration, so we might need an alternate path to trigger that.
> > > 
> > > But that's not what we're doing in this patch.  In this patch we're
> > > adding a much bigger difference: for hotplug bridges, we stop and
> > > remove the hierarchy below the bridge; for non-hotplug bridges, we do
> > > the AER-style flow of calling pci_error_handlers callbacks.
> > 
> > Our approach on V12 was to go to AER style recovery for all DPC events
> > regardless of hotplug support or not. 
> > 
> > Keith was not comfortable with this approach. That's why, we special cased
> > hotplug.
> > 
> > If we drop 6/6 on this patch on v13, we achieve this. We still have to
> > take care of Keith's inputs on individual patches.
> > 
> > we have been struggling with the direction for a while.
> > 
> > Keith, what do you think?
> 
> My only concern was for existing production environments that use DPC
> for handling surprise removal, and I don't wish to break the existing
> uses.

Also, I thought the plan was to keep hotplug and non-hotplug the same,
except for the very end: if not a hotplug bridge, initiate the rescan
automatically after releasing from containment, otherwise let pciehp
handle it when the link reactivates.


Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system

2018-04-12 Thread Keith Busch
On Thu, Apr 12, 2018 at 12:27:20PM -0400, Sinan Kaya wrote:
> On 4/12/2018 11:02 AM, Keith Busch wrote:
> > 
> > Also, I thought the plan was to keep hotplug and non-hotplug the same,
> > except for the very end: if not a hotplug bridge, initiate the rescan
> > automatically after releasing from containment, otherwise let pciehp
> > handle it when the link reactivates.
> > 
> 
> Hmm...
> 
> AER driver doesn't do stop and rescan approach for fatal errors. AER driver
> makes an error callback followed by secondary bus reset and finally driver
> the resume callback on the endpoint only if link recovery is successful.
> Otherwise, AER driver bails out with recovery unsuccessful message.

I'm not sure if that's necessarily true. People have reported that AER
handling triggers PCIe hotplug events and creates some interesting race
conditions:

https://marc.info/?l=linux-pci&m=152336615707640&w=2

https://www.spinics.net/lists/linux-pci/msg70614.html

> Why do we need an additional rescan in the DPC driver if the link is up
> and driver resumes operation?

I thought the plan was to have DPC always go through the removal path
to ensure all devices are properly configured when containment is
released. In order to reconfigure those, you'll need to initiate the
rescan from somewhere.


Re: [PATCH] NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage

2018-04-12 Thread Keith Busch
Thanks, applied for 4.17-rc1.

I was a little surprised git was able to apply this since the patch
format is off, but it worked!


Re: [PATCH] nvmet: fix nvmet_execute_write_zeroes function

2018-04-02 Thread Keith Busch
On Fri, Mar 30, 2018 at 06:18:50PM -0300, Rodrigo R. Galvao wrote:
>   sector = le64_to_cpu(write_zeroes->slba) <<
>   (req->ns->blksize_shift - 9);
>   nr_sector = (((sector_t)le16_to_cpu(write_zeroes->length)) <<
> - (req->ns->blksize_shift - 9)) + 1;
> + (req->ns->blksize_shift - 9));

I see what's wrong here. The +1 needs to be on the native format prior
to converting to 512b, so the fix needs to be:

---
sector = le64_to_cpu(write_zeroes->slba) <<
(req->ns->blksize_shift - 9);
-   nr_sector = (((sector_t)le16_to_cpu(write_zeroes->length)) <<
-   (req->ns->blksize_shift - 9)) + 1;
+   nr_sector = (((sector_t)le16_to_cpu(write_zeroes->length + 1)) <<
+   (req->ns->blksize_shift - 9));
 
if (__blkdev_issue_zeroout(req->ns->bdev, sector, nr_sector,
GFP_KERNEL, &bio, 0))
--


Re: [PATCH] nvmet: fix nvmet_execute_write_zeroes function

2018-04-02 Thread Keith Busch
On Mon, Apr 02, 2018 at 10:47:10AM -0300, Rodrigo Rosatti Galvao wrote:
> One thing that I just forgot to explain previously, but I think its
> relevant:
> 
> 1. The command is failing with 4k logical block size, but works with 512B
> 
> 2. With the patch, the command is working for both 512B and 4K.

While you're not getting errors with your patch, you're not zeroing out
the requested blocks, so that's a data corruption.

The issue is the +1 is in the wrong place. It needs to be added to the
native format prior to converting it to a 512b sector count. Do you want
to resend with that change?


Re: [PATCH v1] PCI/DPC: Rename from pcie-dpc.c to dpc.c.

2018-04-02 Thread Keith Busch
On Sat, Mar 31, 2018 at 05:34:26PM -0500, Bjorn Helgaas wrote:
> From: Bjorn Helgaas 
> 
> Rename from pcie-dpc.c to dpc.c.  The path "drivers/pci/pcie/pcie-dpc.c"
> has more occurrences of "pci" than necessary.
> 
> Signed-off-by: Bjorn Helgaas 

Looks good.

Acked-by: Keith Busch 


Re: [PATCH v2] nvmet: fix nvmet_execute_write_zeroes function

2018-04-02 Thread Keith Busch
On Mon, Apr 02, 2018 at 11:49:41AM -0300, Rodrigo R. Galvao wrote:
> When trying to issue write_zeroes command against TARGET with a 4K block
> size, it ends up hitting the following condition at __blkdev_issue_zeroout:
> 
>  if ((sector | nr_sects) & bs_mask)
> return -EINVAL;
> 
> Causing the command to always fail.
> Considering we need to add 1 to get the correct block count, that addition
> needs to be performed in the native format, so we moved the +1 to within
> le16_to_cpu prior to converting to 512b.
> 
> Signed-off-by: Rodrigo R. Galvao 
> ---
>  drivers/nvme/target/io-cmd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c
> index 28bbdff..5292bc3 100644
> --- a/drivers/nvme/target/io-cmd.c
> +++ b/drivers/nvme/target/io-cmd.c
> @@ -173,8 +173,8 @@ static void nvmet_execute_write_zeroes(struct nvmet_req *req)
>  
>   sector = le64_to_cpu(write_zeroes->slba) <<
>   (req->ns->blksize_shift - 9);
> - nr_sector = (((sector_t)le16_to_cpu(write_zeroes->length)) <<
> - (req->ns->blksize_shift - 9)) + 1;
> + nr_sector = (((sector_t)le16_to_cpu(write_zeroes->length + 1)) <<
> + (req->ns->blksize_shift - 9));

I'm terribly sorry, but the +1 actually needs to be outside the
le16_to_cpu. The above will work on little-endian machines, but not big.
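
For the archives, a quick userspace sketch of that order of operations,
with le16toh/htole16 from <endian.h> standing in for the kernel helpers and
a 4K-formatted namespace assumed:

---
#include <endian.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int blksize_shift = 12;	/* 4K logical blocks */
	uint16_t wire_nlb = htole16(7);		/* 0's based on the wire: 8 blocks */

	/* CPU order first, then the 0's based +1, then scale to 512B sectors */
	uint64_t nr_sector = ((uint64_t)le16toh(wire_nlb) + 1) << (blksize_shift - 9);

	printf("%llu 512-byte sectors\n", (unsigned long long)nr_sector);	/* 64 */
	return 0;
}
--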


Re: [PATCH v3] nvmet: fix nvmet_execute_write_zeroes function

2018-04-02 Thread Keith Busch
Thanks, I've applied the patch with a simpler changelog explaining
the bug.


Re: [PATCH] nvme-multipath: implement active-active round-robin path selector

2018-04-04 Thread Keith Busch
On Fri, Mar 30, 2018 at 09:04:46AM +, Eric H. Chang wrote:
> We internally call PCIe-retimer as HBA. It's not a real Host Bus Adapter that 
> translates the interface from PCIe to SATA or SAS. Sorry for the confusion.

Please don't call a PCIe retimer an "HBA"! :)

While your experiment is setup to benefit from round-robin, my only
concern is it has odd performance in a real world scenario with
IO threads executing in different nodes. Christoph's proposal will
naturally utilize both paths optimally there, where round-robin will
saturate node interlinks.

Not that I'm against having the choice; your setup probably does represent
real use. But if we're going to have multiple choices, user documentation
on nvme path selectors will be useful here.


Re: [PATCH 08/12] lightnvm: implement get log report chunk helpers

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 03:06:05AM -0700, Matias Bjørling wrote:
> > outside of nvme core so that we can use it form lightnvm.
> > 
> > Signed-off-by: Javier González 
> > ---
> >   drivers/lightnvm/core.c  | 11 +++
> >   drivers/nvme/host/core.c |  6 ++--
> >   drivers/nvme/host/lightnvm.c | 74 
> > 
> >   drivers/nvme/host/nvme.h |  3 ++
> >   include/linux/lightnvm.h | 24 ++
> >   5 files changed, 115 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 2e9e9f973a75..af642ce6ba69 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -2127,9 +2127,9 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
> > return ret;
> >   }
> >   
> > -static int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > -   u8 log_page, void *log,
> > -   size_t size, size_t offset)
> > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > +u8 log_page, void *log,
> > +size_t size, size_t offset)
> >   {
> > struct nvme_command c = { };
> > unsigned long dwlen = size / 4 - 1;
> > diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
> > index 08f0f6b5bc06..ffd64a83c8c3 100644
> > --- a/drivers/nvme/host/lightnvm.c
> > +++ b/drivers/nvme/host/lightnvm.c
> > @@ -35,6 +35,10 @@ enum nvme_nvm_admin_opcode {
> > nvme_nvm_admin_set_bb_tbl   = 0xf1,
> >   };
> >   
> 
> 
> 
> >   
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index 1ca08f4993ba..505f797f8c6c 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -396,6 +396,9 @@ int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
> >   int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
> >   int nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
> >   
> > +int nvme_get_log_ext(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > +u8 log_page, void *log, size_t size, size_t offset);
> > +
> >   extern const struct attribute_group nvme_ns_id_attr_group;
> >   extern const struct block_device_operations nvme_ns_head_ops;
> >   
> 
> 
> Keith, Christoph, Sagi, Is it okay that these two changes that exposes 
> the nvme_get_log_ext fn are carried through Jens' tree after the nvme 
> tree for 4.17 has been pulled?

That's okay with me. Alternatively, if you want to split the generic nvme
part out, I can apply that immediately and the API will be in the first
nvme-4.17 pull request.


Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >> 
> > >> This causes a loop of enabling of all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> > >> including __pci_set_master and architecture-specific functions that
> > >> call ones like and pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> > >> and change it. This is done as read/modify/write.
> > >> 
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch -  C - D
> > >>\- E - F
> > >> 
> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> > >> mastering is not set. If their reset work are scheduled in parallel the 
> > >> two
> > >> modifications of PCI_COMMAND may happen in parallel without locking and 
> > >> the
> > >> system may end up with the part of PCIe tree not enabled.
> > > 
> > > Then looks serialized reset should be used, and I did see the commit
> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > > to mark controller state' in reset stress test.
> > > 
> > > But that commit only covers case of PCI reset from sysfs attribute, and
> > > maybe other cases need to be dealt with in similar way too.
> > > 
> > 
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different 
> > devices
> > that share the same PCIe tree.
> 
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with
probe, but that had bigger problems with certain init systems when
you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);

-   if (pci_enable_device_mem(pdev))
-   return result;
+   pci_lock_rescan_remove();
+   result = pci_enable_device_mem(pdev);
+   pci_unlock_rescan_remove();
+   if (result)
+   return -ENODEV;

pci_set_master(pdev);

--


Re: [PATCH] nvme: make nvme_get_log_ext non-static

2018-03-21 Thread Keith Busch
On Wed, Mar 21, 2018 at 08:27:07PM +0100, Matias Bjørling wrote:
> Enable the lightnvm integration to use the nvme_get_log_ext()
> function.
> 
> Signed-off-by: Matias Bjørling 

Thanks, applied to nvme-4.17.


Re: [PATCH v13 1/6] PCI/AER: Rename error recovery to generic PCI naming

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:49AM -0400, Oza Pawandeep wrote:
> This patch renames error recovery to generic name with pcie prefix
> 
> Signed-off-by: Oza Pawandeep 

Looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH v13 2/6] PCI/AER: Factor out error reporting from AER

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:50AM -0400, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.
> 
> DPC should be able to register callbacks and attempt recovery when DPC
> trigger event occurs.
> 
> Signed-off-by: Oza Pawandeep 

Looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH v13 3/6] PCI/PORTDRV: Implement generic find service

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:51AM -0400, Oza Pawandeep wrote:
> This patch implements generic pcie_port_find_service() routine.
> 
> Signed-off-by: Oza Pawandeep 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH v13 5/6] PCI: Unify wait for link active into generic PCI

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:53AM -0400, Oza Pawandeep wrote:
> +/**
> + * pcie_wait_for_link - Wait for link till it's active/inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> + int timeout = 1000;
> + bool ret;
> + u16 lnk_status;
> +
> + for (;;) {
> + pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> + ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
> + if (ret == active)
> + return true;
> + if (timeout <= 0)
> + break;
> + timeout -= 10;
> + }

This is missing an msleep(10) at each iteration.

> +
> + pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
> +  active ? "set" : "cleared");
> +
> + return false;
> +}
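
To be explicit about the msleep comment above, the loop would need to look
something like this (same variables as the quoted function; untested sketch):

---
	for (;;) {
		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
		if (ret == active)
			return true;
		if (timeout <= 0)
			break;
		msleep(10);
		timeout -= 10;
	}
--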


Re: [PATCH v13 4/6] PCI/DPC: Unify and plumb error handling into DPC

2018-04-09 Thread Keith Busch
On Mon, Apr 09, 2018 at 10:41:52AM -0400, Oza Pawandeep wrote:
> +static int find_dpc_dev_iter(struct device *device, void *data)
> +{
> + struct pcie_port_service_driver *service_driver;
> + struct device **dev;
> +
> + dev = (struct device **) data;
> +
> + if (device->bus == &pcie_port_bus_type && device->driver) {
> + service_driver = to_service_driver(device->driver);
> + if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> + *dev = device;
> + return 1;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
> +{
> + struct device *dev = NULL;
> +
> + device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
> +
> + return dev;
> +}

The only caller of this doesn't seem to care to use struct device. This
should probably just extract struct dpc_dev directly from in here.
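
Something along these lines (untested sketch; assumes the dpc driver stashes
its dpc_dev with set_service_data() at probe time):

---
static int find_dpc_dev_iter(struct device *device, void *data)
{
	struct dpc_dev **dpc = data;
	struct pcie_port_service_driver *drv;

	if (device->bus == &pcie_port_bus_type && device->driver) {
		drv = to_service_driver(device->driver);
		if (drv->service == PCIE_PORT_SERVICE_DPC) {
			*dpc = get_service_data(to_pcie_device(device));
			return 1;
		}
	}

	return 0;
}
--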


Re: [PATCH] nvmet: fix nvmet_execute_write_zeroes function

2018-03-30 Thread Keith Busch
On Fri, Mar 30, 2018 at 06:18:50PM -0300, Rodrigo R. Galvao wrote:
> When trying to issue write_zeroes command against TARGET the nr_sector is
> being incremented by 1, which ends up hitting the following condition at
> __blkdev_issue_zeroout:
> 
>  if ((sector | nr_sects) & bs_mask)
> return -EINVAL;
> 
> Causing the command to always fail. Removing the increment makes the
> command to work properly.

Doesn't that mean your host is using this command wrong? The NLB is a
0's based value, we're supposed to +1 to get the correct block count.


Re: [PATCH v2] nvme: explicitly disable APST on quirked devices

2017-06-26 Thread Keith Busch
On Mon, Jun 26, 2017 at 12:01:29AM -0700, Kai-Heng Feng wrote:
> A user reports APST is enabled, even when the NVMe is quirked or with
> option "default_ps_max_latency_us=0".
> 
> The current logic will not set APST if the device is quirked. But the
> NVMe in question will enable APST automatically.
> 
> Separate the logic "apst is supported" and "to enable apst", so we can
> use the latter one to explicitly disable APST at initialiaztion.
> 
> BugLink: https://bugs.launchpad.net/bugs/1699004
> Signed-off-by: Kai-Heng Feng 

There's something off about the format with this patch such that it is
getting mangled when I save and apply it.

I'm not sure if the content type and encoding are the culprit, but that
just stood out as different. Normally patches I receive have this header:

  Content-Type: text/plain; charset="us-ascii"
  Content-Transfer-Encoding: 7bit

But this one has:

  Content-Type: text/plain; charset="iso-8859-1"
  Content-Transfer-Encoding: quoted-printable

In any case, I'll hand apply it with Andy's review.


Re: [PATCH] fs: System memory leak when running HTX with T10 DIF enabled

2017-06-28 Thread Keith Busch
On Wed, Jun 28, 2017 at 11:32:51AM -0500, wenxi...@linux.vnet.ibm.com wrote:
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 519599d..e871444 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -264,6 +264,10 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>  
>   if (unlikely(bio.bi_error))
>   return bio.bi_error;
> +
> + if (bio_integrity(&bio))
> + bio_integrity_free(&bio);
> +
>   return ret;
>  }

We don't want to leak the integrity payload in case of bi_error either.
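i.e., move the free above the early error return, something like (sketch):

---
	if (bio_integrity(&bio))
		bio_integrity_free(&bio);

	if (unlikely(bio.bi_error))
		return bio.bi_error;

	return ret;
--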


Re: [PATCH] nvme: Makefile: remove dead build rule

2017-06-29 Thread Keith Busch
On Thu, Jun 29, 2017 at 08:59:07AM +0200, Valentin Rothberg wrote:
> Remove dead build rule for drivers/nvme/host/scsi.c which has been
> removed by commit ("nvme: Remove SCSI translations").
> 
> Signed-off-by: Valentin Rothberg 

Oops, thanks for the fix.

Reviewed-by: Keith Busch 


Re: [PATCH] nvme-pci: Use PCI bus address for data/queues in CMB

2017-09-29 Thread Keith Busch
On Fri, Sep 29, 2017 at 10:59:26AM +0530, Abhishek Shah wrote:
> Currently, NVMe PCI host driver is programming CMB dma address as
> I/O SQs addresses. This results in failures on systems where 1:1
> outbound mapping is not used (example Broadcom iProc SOCs) because
> CMB BAR will be progammed with PCI bus address but NVMe PCI EP will
> try to access CMB using dma address.
> 
> To have CMB working on systems without 1:1 outbound mapping, we
> program PCI bus address for I/O SQs instead of dma address. This
> approach will work on systems with/without 1:1 outbound mapping.
> 
> The patch is tested on Broadcom Stingray platform(arm64), which
> does not have 1:1 outbound mapping, as well as on x86 platform,
> which has 1:1 outbound mapping.
> 
> Fixes: 8ffaadf7 ("NVMe: Use CMB for the IO SQes if available")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Abhishek Shah 
> Reviewed-by: Anup Patel 
> Reviewed-by: Ray Jui 
> Reviewed-by: Scott Branden 

Thanks for the patch.

On a similar note, we also break CMB usage in virtualization with
direct-assigned devices: the guest doesn't know the host physical bus address,
so it sets the CMB queue address incorrectly there, too. I don't know of
a way to fix that other than disabling CMB.


>  static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
>  {
> + int rc;
>   u64 szu, size, offset;
>   resource_size_t bar_size;
>   struct pci_dev *pdev = to_pci_dev(dev->dev);
> @@ -1553,6 +1574,13 @@ static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
>  
>   dev->cmb_dma_addr = dma_addr;
>   dev->cmb_size = size;
> +
> + rc = nvme_find_cmb_bus_addr(pdev, dma_addr, size, &dev->cmb_bus_addr);
> + if (rc) {
> + iounmap(cmb);
> + return NULL;
> + }
> +
>   return cmb;
>  }

Minor suggestion: it's a little simpler if you find the bus address
before ioremap:

---
@@ -1554,6 +1554,10 @@ static void __iomem *nvme_map_cmb(struct nvme_dev *dev)
size = bar_size - offset;
 
dma_addr = pci_resource_start(pdev, NVME_CMB_BIR(dev->cmbloc)) + offset;
+
+   if (nvme_find_cmb_bus_addr(pdev, dma_addr, size, &dev->cmb_bus_addr))
+   return NULL;
+
cmb = ioremap_wc(dma_addr, size);
if (!cmb)
return NULL;
--


Re: [PATCH 2/2] nvme: use device_add_disk_with_groups()

2017-09-29 Thread Keith Busch
On Thu, Sep 28, 2017 at 09:36:37PM +0200, Martin Wilck wrote:
> By using device_add_disk_with_groups(), we can avoid the race
> condition with udev rule processing, because no udev event will
> be triggered before all attributes are available.
> 
> Signed-off-by: Martin Wilck 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 1/2] block: genhd: add device_add_disk_with_groups

2017-09-29 Thread Keith Busch
On Thu, Sep 28, 2017 at 09:36:36PM +0200, Martin Wilck wrote:
> In the NVME subsystem, we're seeing a race condition with udev where
> device_add_disk() is called (which triggers an "add" uevent), and a
> sysfs attribute group is added to the disk device afterwards.
> If udev rules access these attributes before they are created,
> udev processing of the device is incomplete, in particular, device
> WWIDs may not be determined correctly.
> 
> To fix this, this patch introduces a new function
> device_add_disk_with_groups(), which takes a list of attribute groups
> and adds them to the device before sending out uevents.
> 
> Signed-off-by: Martin Wilck 

Is NVMe the only one having this problem? Was putting our attributes in
the disk's kobj a bad choice?

Anyway, looks fine to me.

Reviewed-by: Keith Busch 


Re: [PATCH] NVMe: Added another device ID with stripe quirk

2017-07-06 Thread Keith Busch
On Fri, Jul 07, 2017 at 12:17:54AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 06, 2017 at 04:12:31PM -0600, David Wayne Fugate wrote:
> > Adds a fourth Intel controller which has the "stripe" quirk.
> 
> NVMe has stadardized a way to communicate this information through
> the Namespace Optimal IO Boundary (NOIOB) field in the Identify
> Namespace structure, and Keith and Amber at Intel helped to define
> this, so please actually implement it in your controllers.

That's true for all Intel controllers going forward, but this
is actually an older controller that pre-dates NOIOB. It's the exact
same as the 8086:0A54 model, but a particular vendor decided their
rebranded device needs to be made special with a different DID.

We all agree that's a terrible way to go about this for multiple reasons,
but we can't go back in time to tell the decision makers of this folly. So
I think we need to let this last one go through with the quirk.

Acked-by: Keith Busch 


Re: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Keith Busch
On Mon, Aug 07, 2017 at 08:45:25AM -0700, James Bottomley wrote:
> On Mon, 2017-08-07 at 20:01 +0530, Kashyap Desai wrote:
> > 
> > We have to attempt this use case and see how it behaves. I have not
> > tried this, so not sure if things are really bad or just some tuning
> > may be helpful. I will revert back to you on this.
> > 
> I understood the request as - we need some udev rules to be working well
> for the *same* NVMe drives whether behind mpt3sas or the native nvme driver.
> Example - If the user has the OS installed on an NVMe drive which is behind the
> mpt3sas driver as a SCSI disk, it should be able to boot if he/she hooks the
> same NVMe drive up so it is detected by the native nvme driver (and vice
> versa.)
> 
> It's not just the udev rules, it's the tools as well; possibly things
> like that nvme-cli toolkit Intel is doing.

It looks like they can make existing nvme tooling work with little
effort if they have the driver implement NVME_IOCTL_ADMIN_COMMAND, and
then have their driver build the MPI NVMe Encapsulated Request from that.
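
For anyone unfamiliar, the tooling mostly boils down to admin passthrough
ioctls like the one below (the uapi name is NVME_IOCTL_ADMIN_CMD; a rough
userspace sketch doing an Identify Controller, assuming a /dev/nvme0 node):

---
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	uint8_t id[4096];
	struct nvme_admin_cmd cmd;
	int fd = open("/dev/nvme0", O_RDWR);

	if (fd < 0)
		return 1;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x06;			/* Identify */
	cmd.addr = (uint64_t)(uintptr_t)id;
	cmd.data_len = sizeof(id);
	cmd.cdw10 = 1;				/* CNS 1: Identify Controller */

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd))
		perror("NVME_IOCTL_ADMIN_CMD");
	else
		printf("model: %.40s\n", (char *)&id[24]);	/* MN, bytes 24-63 */

	close(fd);
	return 0;
}
--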


Re: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-08 Thread Keith Busch
On Tue, Aug 08, 2017 at 12:33:40PM +0530, Sreekanth Reddy wrote:
> On Tue, Aug 8, 2017 at 9:34 AM, Keith Busch  wrote:
> >
> > It looks like they can make existing nvme tooling work with little
> > effort if they have the driver implement NVME_IOCTL_ADMIN_COMMAND, and
> 
> Precisely, I was thinking on the same line and you clarified. I will
> spend sometime on this and get back to you.

Sounds good. Just note that while the majority of NVMe management should
be reachable with just that IOCTL, some tooling features may not work
correctly since they look for /dev/nvmeX, and I assume this driver will
create /dev/sdX instead.


Re: [patch 15/55] PCI: vmd: Create named irq domain

2017-06-20 Thread Keith Busch
On Tue, Jun 20, 2017 at 01:37:15AM +0200, Thomas Gleixner wrote:
>  static int vmd_enable_domain(struct vmd_dev *vmd)
>  {
>   struct pci_sysdata *sd = &vmd->sysdata;
> + struct fwnode_handle *fn;
>   struct resource *res;
>   u32 upper_bits;
>   unsigned long flags;
> @@ -617,8 +618,13 @@ static int vmd_enable_domain(struct vmd_
>  
>   sd->node = pcibus_to_node(vmd->dev->bus);
>  
> - vmd->irq_domain = pci_msi_create_irq_domain(NULL, &vmd_msi_domain_info,
> + fn = irq_domain_alloc_named_id_fwnode("VMD-MSI", vmd->sysdata.domain);
> + if (!fn)
> + return -ENODEV;
> +
> + vmd->irq_domain = pci_msi_create_irq_domain(fn, &vmd_msi_domain_info,
>   x86_vector_domain);
> + kfree(fn);

If I'm following all this correctly, it looks like we need to use
irq_domain_free_fwnode with irq_domain_alloc_named_id_fwnode instead of
freeing 'fn' directly, otherwise we leak 'fwid->name'.


Re: [patch 32/55] x86/irq: Restructure fixup_irqs()

2017-06-20 Thread Keith Busch
On Tue, Jun 20, 2017 at 01:37:32AM +0200, Thomas Gleixner wrote:
> @@ -441,18 +440,27 @@ void fixup_irqs(void)
>  
>   for_each_irq_desc(irq, desc) {
>   const struct cpumask *affinity;
> - int break_affinity = 0;
> - int set_affinity = 1;
> + bool break_affinity = false;
>  
>   if (!desc)
>   continue;
> - if (irq == 2)
> - continue;
>  
>   /* interrupt's are disabled at this point */
>   raw_spin_lock(&desc->lock);
>  
>   data = irq_desc_get_irq_data(desc);
> + chip = irq_data_get_irq_chip(data);
> + /*
> +  * The interrupt descriptor might have been cleaned up
> +  * already, but it is not yet removed from the radix
> +  * tree. If the chip does not have an affinity setter,
> +  * nothing to do here.
> +  */
> + if (!chip !chip->irq_set_affinity) {
> + raw_spin_unlock(&desc->lock);
> + continue;
> + }

A bit of a moot point since the very next patch deletes all of this,
but I found this broken 'if' condition when compiling the patches one at
a time; it's missing the '&&'.


Re: [PATCH V4] PCI: handle CRS returned by device after FLR

2017-07-13 Thread Keith Busch
On Thu, Jul 13, 2017 at 07:17:58AM -0500, Bjorn Helgaas wrote:
> On Thu, Jul 06, 2017 at 05:07:14PM -0400, Sinan Kaya wrote:
> > An endpoint is allowed to issue Configuration Request Retry Status (CRS)
> > following a Function Level Reset (FLR) request to indicate that it is not
> > ready to accept new requests.
> > 
> > Seen a timeout message with Intel 750 NVMe drive and FLR reset.
> > 
> > Kernel enables CRS visibility in pci_enable_crs() function for each bridge
> > it discovers. The OS observes a special vendor ID read value of 0x0001
> > in this case. We need to keep polling until this special read value
> > disappears. pci_bus_read_dev_vendor_id() takes care of CRS handling for a
> > given vendor id read request under the covers.
> > 
> > Adding a vendor ID read if this is a physical function before attempting
> > to read any other registers on the endpoint. A CRS indication will only
> > be given if the address to be read is vendor ID register.
> > 
> > Note that virtual functions report their vendor ID through another
> > mechanism.
> > 
> > The spec is calling to wait up to 1 seconds if the device is sending CRS.
> > The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> 
> Can you add a pointer to the "1 second" requirement in the spec here?
> We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> there a basis in the spec for the 60 second timeout?

I also don't see anywhere that says CRS is limited to only 1 second. It
looks to me that the spec allows a device to return CRS for as long as it
takes to complete initialization.

From PCIe Base Spec, Section 2.3.1 CRS Implementation note:

  A device in receipt of a Configuration Request following a valid reset
  condition may respond with a CRS Completion Status to terminate the
  Request, and thus effectively stall the Configuration Request until
  such time that the subsystem has completed local initialization and
  is ready to communicate with the host.

No time limit is specified here, or anywhere else for that matter AFAICT.
Where is the 1 second requirement coming from?


Re: [PATCH V4] PCI: handle CRS returned by device after FLR

2017-07-13 Thread Keith Busch
On Thu, Jul 13, 2017 at 11:44:12AM -0400, Sinan Kaya wrote:
> On 7/13/2017 8:17 AM, Bjorn Helgaas wrote:
> >> The spec is calling to wait up to 1 seconds if the device is sending CRS.
> >> The NVMe device seems to be requiring more. Relax this up to 60 seconds.
> > Can you add a pointer to the "1 second" requirement in the spec here?
> > We use 60 seconds in pci_scan_device() and acpiphp_add_context().  Is
> > there a basis in the spec for the 60 second timeout?
> 
> This does not specify a hard limit above on how long SW need to wait. 
> 
> "6.6.2 Function Level Reset
> After an FLR has been initiated by writing a 1b to the Initiate Function 
> Level Reset bit, 
> the Function must complete the FLR within 100 ms.
> 
> While a Function is required to complete the FLR operation within the time 
> limit described above,
> the subsequent Function-specific initialization sequence may require 
> additional time. 
> If additional time is required, the Function must return a Configuration 
> Request Retry Status (CRS) 
> Completion Status when a Configuration Request is received after the time 
> limit above. 
> After the Function responds to a Configuration Request with a Completion 
> status other than CRS, 
> it is not permitted to return CRS until it is reset again."
> 
> However, another indirect reference here tells us it is capped by 1 second 
> below.
> 
> "6.23. Readiness Notifications (RN)
> Readiness Notifications (RN) is intended to reduce the time software needs to
> wait before issuing Configuration Requests to a Device or Function following 
> DRS
> Events or FRS Events. RN includes both the Device Readiness Status (DRS) and
> Function Readiness Status (FRS) mechanisms. These mechanisms provide a direct
> indication of Configuration-Readiness (see 5 Terms and Acronyms entry for 
> “Configuration-Ready”). 
> 
> When used, DRS and FRS allow an improved behavior over the CRS mechanism, and 
> eliminate
> its associated periodic polling time of up to 1 second following a reset."

That wording is just confusing. It looks to me the 1 second polling is
to be used following a reset if CRS is not implemented.

  
https://pcisig.com/sites/default/files/specification_documents/ECN_RN_29_Aug_2013.pdf

"
  Through the mechanisms defined by this ECR, we can avoid the long,
  architected, fixed delays following various forms of reset before
  software is permitted to perform its first Configuration Request. These
  delays are very large:

  1 second if Configuration Retry Status (CRS) is not used
"

It goes on to say CRS is usually much lower, but doesn't specify an
upper bound either.


Re: [PATCH V4] PCI: handle CRS returned by device after FLR

2017-07-13 Thread Keith Busch
On Thu, Jul 13, 2017 at 12:42:44PM -0400, Sinan Kaya wrote:
> On 7/13/2017 12:29 PM, Keith Busch wrote:
> > That wording is just confusing. It looks to me the 1 second polling is
> > to be used following a reset if CRS is not implemented.
> > 
> >   
> > https://pcisig.com/sites/default/files/specification_documents/ECN_RN_29_Aug_2013.pdf
> > 
> > "
> >   Through the mechanisms defined by this ECR, we can avoid the long,
> >   architected, fixed delays following various forms of reset before
> >   software is permitted to perform its first Configuration Request. These
> >   delays are very large:
> > 
> >   1 second if Configuration Retry Status (CRS) is not used
> > "
> > 
> > It goes on to say CRS is usually much lower, but doesn't specify an
> > upper bound either.
> > 
> 
> I see, we got caught on spec language where we don't know what 'its' is. 

Well, I don't know for certain if your original interpretation is
incorrect. Just saying the CRS intention doesn't explicitly stand out to me.

On a side note, I'll also see if I can get clarification on what
expectations the hardware people have for this particular product. Your
observation seems a little high to me, but I don't know if that's
outside the product's limits.


Re: [PATCH 1/3] MAINTAINERS: Add Jonathan Derrick as VMD maintainer

2017-08-11 Thread Keith Busch
On Mon, Aug 07, 2017 at 01:57:11PM -0600, Jon Derrick wrote:
> Add myself as VMD maintainer
> 
> Signed-off-by: Jon Derrick 

Thanks for adding. 

Acked-by: Keith Busch 


> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f66488d..3ec39df 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10090,6 +10090,7 @@ F:drivers/pci/dwc/*imx6*
>  
>  PCI DRIVER FOR INTEL VOLUME MANAGEMENT DEVICE (VMD)
>  M:   Keith Busch 
> +M:   Jonathan Derrick 
>  L:   linux-...@vger.kernel.org
>  S:   Supported
>  F:   drivers/pci/host/vmd.c
> -- 
> 2.9.4
> 


Re: [PATCH] pciehp: Fix infinite interupt handler loop

2017-08-14 Thread Keith Busch
On Mon, Aug 14, 2017 at 03:59:48PM -0500, Bjorn Helgaas wrote:
> On Tue, Aug 01, 2017 at 03:11:52AM -0400, Keith Busch wrote:
> > We've encountered a particular platform that under some circumstances
> > always has the power fault detected status raised. The pciehp irq handler
> > would loop forever because it thinks it is handling new events when in
> > fact the power fault is not new. This patch fixes that by masking off
> > the power fault status from new events if the driver hasn't seen the
> > power fault clear from the previous handling attempt.
> 
> Can you say which platform this is?  If this is a hardware defect,
> it'd be interesting to know where it happens.
> 
> But I'm not sure we handle PCI_EXP_SLTSTA correctly.  We basically
> have this:
> 
>   pciehp_isr()
>   {
> pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status);
> events = status & ();
> pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
> 
>   }
> 
> The write to PCI_EXP_SLTSTA clears PCI_EXP_SLTSTA_PFD because it's
> RW1C.  But we haven't done anything that would actually change the
> situation that caused a power fault, so I don't think it would be
> surprising if the hardware immediately reasserted it.
> 
> So maybe this continual assertion of power fault is really a software
> bug, not a hardware problem?

I *think* it's a software bug for the exact reason you provided, but I'm
sure it must be isolated to certain conditions with certain hardware. We'd
have heard about this regression during 4.9 if it was more wide-spread.

This is on a PEX8733 bridge, and it reports power fault detected status as
long as the power fault exists. While we can write 1 to clear the event,
that just rearms the port to retrigger power fault detected status for as
long as the power controller detects it is faulted. The status is cleared
for good only when the power fault no longer exists rather than when
it is acknowledged. The spec seems to support that view (Table 7-21,
Slot Status Register):

  Power Fault Detected – If a Power Controller that supports power fault
  detection is implemented, this bit is Set when the Power Controller
  detects a power fault at this slot.


Re: [PATCH v3 3/3] nvme: wwid_show: strip trailing 0-bytes

2017-07-20 Thread Keith Busch
On Thu, Jul 20, 2017 at 06:34:02PM +0200, Martin Wilck wrote:
> Some broken targets (such as the current Linux target) pad
> model or serial fields with 0-bytes rather than spaces. The
> NVME spec disallows 0 bytes in "ASCII" fields.
> Thus strip trailing 0-bytes, too. Also make sure that we get no
> underflow for pathological input.
> 
> Signed-off-by: Martin Wilck 
> Reviewed-by: Hannes Reinecke 
> Acked-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [BUG? NVME Linux-4.15] Dracut loops indefinitely with 4.15

2018-02-15 Thread Keith Busch
On Thu, Feb 15, 2018 at 02:49:56PM +0100, Julien Durillon wrote:
> I opened an issue here:
> https://github.com/dracutdevs/dracut/issues/373 for dracut. You can
> read there how dracuts enters an infinite loop.
> 
> TL;DR: in linux-4.14, trying to find the last "slave" of /dev/dm-0
> ends with a maj:min of "249:0" which does not exist in /sys/dev/block.
> In linux-4.15, the same thing loops as "249:0" exists.

The problem is dracut assumes the 'dev' file for an nvme namespace's
parent device contains a maj:min for a block device. It's actually a
character device, and its maj:min happens to also be the same as the
device mapper's block device.
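
A standalone illustration of why the type check matters (the paths below
are examples; block and char majors are independent namespaces, so an
equal maj:min does not mean the same device):

---
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Print maj:min and device type for a node; the /dev paths are examples. */
static void show(const char *path)
{
	struct stat st;

	if (stat(path, &st))
		return;
	printf("%s: %u:%u (%s)\n", path,
	       major(st.st_rdev), minor(st.st_rdev),
	       S_ISBLK(st.st_mode) ? "block" :
	       S_ISCHR(st.st_mode) ? "char" : "other");
}

int main(void)
{
	show("/dev/dm-0");	/* device mapper block device */
	show("/dev/nvme0");	/* nvme controller character device */
	return 0;
}
---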


Re: [PATCH] nvme-pci: drain the entered requests after ctrl is shutdown

2018-02-12 Thread Keith Busch
On Mon, Feb 12, 2018 at 08:43:58PM +0200, Sagi Grimberg wrote:
> 
> > Currently, we will unquiesce the queues after the controller is
> > shutdown to avoid residual requests to be stuck. In fact, we can
> > do it more cleanly, just wait freeze and drain the queue in
> > nvme_dev_disable and finally leave the queues quiesced.
> 
> Does this fix a bug? What is the benefit of leaving the queues
> quiesced in shutdown?

This doesn't appear to fix anything. The things this patch does do are
either unnecessary (quiesce), or already done elsewhere (wait freeze).


Re: [PATCH 2/3] nvme: fix the deadlock in nvme_update_formats

2018-02-12 Thread Keith Busch
This looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 1/3] nvme: fix the dangerous reference of namespaces list

2018-02-12 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 3/3] nvme: change namespaces_mutext to namespaces_rwsem

2018-02-12 Thread Keith Busch
On Mon, Feb 12, 2018 at 08:47:47PM +0200, Sagi Grimberg wrote:
> This looks fine to me, but I really want Keith and/or Christoph to have
> a look as well.

This looks fine to me as well.

Reviewed-by: Keith Busch 


Re: [PATCH 2/3] nvme: fix the deadlock in nvme_update_formats

2018-02-12 Thread Keith Busch
Hi Sagi,

This one is fixing a deadlock in namespace detach. It is still not a
widely supported operation, but becoming more common. While the other two
patches in this series look good for 4.17, I would really recommend this
one for 4.16-rc, and add a Cc to linux-stable for 4.15 too. Sound okay?

Thanks,
Keith


Re: [PATCH v2 2/4] PCI/DPC/AER: Address Concurrency between AER and DPC

2018-01-02 Thread Keith Busch
On Tue, Jan 02, 2018 at 08:25:08AM -0500, Sinan Kaya wrote:
> > 2. A DPC event suppresses the error message required for the Linux
> > AER driver to run. How can AER and DPC run concurrently?
> > 
> 
> As we briefly discussed in previous email exchanges, I think you are
> looking at a use case with a switch that supports DPC functionality. 

No, I'm interested in DPC in a general.

> Oza and I are looking at a root port functionality with DPC feature. 
> 
> As you already know, AER errors are logged to AER capability register
> independent of the DPC driver presence.

The error is noted in the Uncorrectable Error Status Register if that's
what triggered the DPC event. This register has nothing to do with the
Root Error Status Register, which is required to have received an error
Message in order to have a status for the AER driver.

> A root port is also allowed to share the MSI interrupts across DPC and
> AER. 
> 
> Therefore, when a DPC interrupt fires; both AER driver and DPC driver
> starts recovery work. This is the issue we are trying to deal with. 

If DPC is implemented correctly, the AER Root Status can't have an
uncorrectable status for the driver to deal with. The only thing the AER
driver could possibly see is a correctable error if DPC ERR_COR Enable
is set.

> In the end, the driver needs to work for both root port and switches.
> I think you verified it against a switch. We are doing the same for a
> root port and submitting the plumbing code. 

I think we need to consider the possibility you are enabling a platform
that implemented DPC incorrectly. There's nothing in the specification
that says a DPC-enabled root port is to forward the error Message upstream
when it came from downstream, or to signal a Message for errors the root
port itself detected.


Re: [PATCH v2 0/4] Address error and recovery for AER and DPC

2018-01-02 Thread Keith Busch
On Tue, Jan 02, 2018 at 01:02:15PM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 29, 2017 at 12:54:15PM +0530, Oza Pawandeep wrote:
> > This patch set brings in support for DPC and AER to co-exist and not to
> > race for recovery.
> > 
> > The current implementation of AER and error message broadcasting to the
> > EP driver is tightly coupled and limited to AER service driver.
> > It is important to factor out broadcasting and other link handling
> > callbacks. So that not only when AER gets triggered, but also when DPC get
> > triggered, or both get triggered simultaneously (for e.g. ERR_FATAL),
> > callbacks are handled appropriately.
> > having modularized the code, the race between AER and DPC is handled
> > gracefully.
> > for e.g. when DPC is active and kicked in, AER should not attempt to do
> > recovery, because DPC takes care of it.
> 
> High-level question:
> 
> We have some convoluted code in negotiate_os_control() and
> aer_service_init() that (I think) essentially disables AER unless the
> platform firmware grants us permission to use it.
> 
> The last implementation note in PCIe r3.1, sec 6.2.10 says
> 
>   DPC may be controlled in some configurations by platform firmware
>   and in other configurations by the operating system. DPC
>   functionality is strongly linked with the functionality in Advanced
>   Error Reporting. To avoid conflicts over whether platform firmware
>   or the operating system have control of DPC, it is recommended that
>   platform firmware and operating systems always link the control of
>   DPC to the control of Advanced Error Reporting.
> 
> I read that as suggesting that we should enable DPC support in Linux
> if and only if we also enable AER.  But I don't see anything in DPC
> that looks like that.  Should there be something there?  Should DPC be
> restructured so it's enabled and handled inside the AER driver instead
> of being a separate driver?

Yes, I agree the two should be linked. I submitted a patch for that here,
though driver responsibilities are still separate in this series:

  https://marc.info/?l=linux-pci&m=151371742225111&w=2



Re: [PATCH] nvme: initialize hostid uuid in nvmf_host_default to not leak kernel memory

2018-01-09 Thread Keith Busch
On Tue, Jan 09, 2018 at 04:20:43PM +0100, Johannes Thumshirn wrote:
> Alexander reports:
>   according to KMSAN (and common sense as well) the following code in
>   drivers/nvme/host/fabrics.c
>   
> (http://elixir.free-electrons.com/linux/latest/source/drivers/nvme/host/fabrics.c#L68):
> 
> 72 host = kmalloc(sizeof(*host), GFP_KERNEL);
> 73 if (!host)
> 74 return NULL;
> 75
> 76 kref_init(&host->ref);
> 77 snprintf(host->nqn, NVMF_NQN_SIZE,
> 78 "nqn.2014-08.org.nvmexpress:uuid:%pUb", &host->id);
> 
>   uses uninitialized heap memory to generate the unique id for the NVMF host.
>   If I'm understanding correctly, it can be then passed to the
>   userspace, so the contents of the uninitialized chunk may potentially
>   leak.
>   If the specification doesn't rely on this UID to be random or unique,
>   I suggest using kzalloc() here, otherwise it might be a good idea to
>   use a real RNG.
> 
> this assumption is correct so initialize the host->id using uuid_gen() as
> it was done before commit 6bfe04255d5e ("nvme: add hostid token to fabric
> options").
> 
> Fixes: 6bfe04255d5e ("nvme: add hostid token to fabric options")
> Reported-by: Alexander Potapenko 
> Signed-off-by: Johannes Thumshirn 

Thanks for the report and the fix. It'd still be good to use the kzalloc
variant in addition to this.

Reviewed-by: Keith Busch 


Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case

2018-02-02 Thread Keith Busch
On Fri, Feb 02, 2018 at 03:00:45PM +0800, Jianchao Wang wrote:
> Currently, request queue will be frozen and quiesced for both reset
> and shutdown case. This will trigger ioq requests in RECONNECTING
> state which should be avoided to prepare for following patch.
> Just freeze request queue for shutdown case and drain all the resudual
> entered requests after controller has been shutdown.

Freezing is not just for shutdown. It's also used so
blk_mq_update_nr_hw_queues will work if the queue count changes across
resets.


Re: [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable

2018-02-02 Thread Keith Busch
On Fri, Feb 02, 2018 at 03:00:47PM +0800, Jianchao Wang wrote:
> Currently, the complicated relationship between nvme_dev_disable
> and nvme_timeout has become a devil that will introduce many
> circular pattern which may trigger deadlock or IO hang. Let's
> enumerate the tangles between them:
>  - nvme_timeout has to invoke nvme_dev_disable to stop the
>controller doing DMA access before free the request.
>  - nvme_dev_disable has to depend on nvme_timeout to complete
>adminq requests to set HMB or delete sq/cq when the controller
>has no response.
>  - nvme_dev_disable will race with nvme_timeout when cancels the
>outstanding requests.

Your patch is releasing a command back to the OS with the
PCI controller bus master still enabled. This could lead to data or
memory corruption.

In any case, it's not as complicated as you're making it out to
be. It'd be easier to just enforce the existing rule that commands
issued in the disabling path not depend on completions or timeout
handling. All of the commands issued in this path already do this except
for HMB disabling. Let's just fix that command, right?


Re: [PATCH 1/6] nvme-pci: move clearing host mem behind stopping queues

2018-02-02 Thread Keith Busch
On Fri, Feb 02, 2018 at 03:00:44PM +0800, Jianchao Wang wrote:
> Move clearing host mem behind stopping queues. Prepare for
> following patch which will grab all the outstanding requests.
> 
> Signed-off-by: Jianchao Wang 

This one makes sense, though I would alter the change log to something
like:

  This patch quiesces new IO prior to disabling device HMB access.
  A controller using HMB may be relying on it to efficiently complete
  IO commands.

Reviewed-by: Keith Busch 

> ---
>  drivers/nvme/host/pci.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..00cffed 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown)
>   if (!dead) {
>   if (shutdown)
>   nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> + }
> + nvme_stop_queues(&dev->ctrl);
>  
> + if (!dead) {
>   /*
>* If the controller is still alive tell it to stop using the
>* host memory buffer.  In theory the shutdown / reset should
> @@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown)
>*/
>   if (dev->host_mem_descs)
>   nvme_set_host_mem(dev, 0);
> -
> - }
> - nvme_stop_queues(&dev->ctrl);
> -
> - if (!dead) {
>   nvme_disable_io_queues(dev);
>   nvme_disable_admin_queue(dev, shutdown);
>   }
> -- 
> 2.7.4


Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case

2018-02-05 Thread Keith Busch
On Mon, Feb 05, 2018 at 10:26:03AM +0800, jianchao.wang wrote:
> > Freezing is not just for shutdown. It's also used so
> > blk_mq_update_nr_hw_queues will work if the queue count changes across
> > resets.
> blk_mq_update_nr_hw_queues will freeze the queue itself. Please refer to.
> static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,

Yes, but how many requests are you letting enter to their demise by
freezing on the wrong side of the reset? We don't use the nuclear option
out of convenience.


Re: [PATCH] nvme-pci: Fix incorrect use of CMB size to calculate q_depth

2018-02-06 Thread Keith Busch
On Mon, Feb 05, 2018 at 03:32:23PM -0700, sba...@raithlin.com wrote:
>  
> - if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> + if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {

Is this a prep patch for something coming later? dev->cmb is already
NULL if use_cmb_sqes is false.


Re: [PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case

2018-02-06 Thread Keith Busch
On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your kindly response.
> 
> On 02/05/2018 11:13 PM, Keith Busch wrote:
> >  but how many requests are you letting enter to their demise by
> > freezing on the wrong side of the reset?
> 
> There are only two difference with this patch from the original one.
> 1. Don't freeze the queue for the reset case. At the moment, the outstanding 
> requests will be requeued back to blk-mq queues.
>The new entered requests during reset will also stay in blk-mq queues. All 
> this requests will not enter into nvme driver layer
>due to quiescent request_queues. And they will be issued after the reset 
> is completed successfully.
> 2. Drain the request queue before nvme_dev_disable. This is nearly same with 
> the previous rule which will also unquiesce the queue
>and let the requests be able to be drained. The only difference is this 
> patch will invoke wait_freeze in nvme_dev_disable instead
>of nvme_reset_work.
> 
> We don't sacrifice any request. This patch do the same thing with the 
> previous one and make things clearer.

No, what you're proposing is quite different.

By "enter", I'm referring to blk_queue_enter. Once a request enters
into an hctx, it can not be backed out to re-enter a new hctx if the
original one is invalidated.

Prior to a reset, all requests that have entered the queue are committed
to that hctx, and we can't do anything about that. The only thing we can
do is prevent new requests from entering until we're sure that hctx is
valid on the other side of the reset.
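
To make that ordering concrete, a simplified sketch of the freeze handling
across a reset (not the literal driver code; error handling omitted):

---
	nvme_start_freeze(&dev->ctrl);	/* new requests wait in blk_queue_enter */
	nvme_dev_disable(dev, false);	/* outstanding requests get requeued */

	/* ... re-enable the controller, discover the new IO queue count ... */

	nvme_wait_freeze(&dev->ctrl);
	blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
	nvme_unfreeze(&dev->ctrl);	/* frozen requests now enter valid hctxs */
---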


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-15 Thread Keith Busch
This is all way over my head, but the part that obviously shows
something's gone wrong:

  kworker/u674:3-1421  [028] d...   335.307051: irq_matrix_reserve_managed: 
bit=56 cpu=0 online=1 avl=86 alloc=116 managed=3 online_maps=112 
global_avl=22084, global_rsvd=157, total_alloc=570
  kworker/u674:3-1421  [028] d...   335.307053: irq_matrix_remove_managed: 
bit=56 cpu=0 online=1 avl=87 alloc=116 managed=2 online_maps=112 
global_avl=22085, global_rsvd=157, total_alloc=570
  kworker/u674:3-1421  [028]    335.307054: vector_reserve_managed: irq=45 
ret=-28
  kworker/u674:3-1421  [028]    335.307054: vector_setup: irq=45 
is_legacy=0 ret=-28
  kworker/u674:3-1421  [028] d...   335.307055: vector_teardown: irq=45 
is_managed=1 has_reserved=0

Which leads me to x86_vector_alloc_irqs goto error:

error:
x86_vector_free_irqs(domain, virq, i + 1);

The last parameter looks weird. It's the nr_irqs, and since we failed and
bailed, I would think we'd need to subtract 1 rather than add 1. Adding
1 would doubly remove the failed one, and remove the next one that
was never set up, right?

Or maybe irq_matrix_reserve_managed wasn't expected to fail in the
first place?
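
A standalone toy illustrating the cleanup off-by-one (setup_one/free_one
are made-up helpers, not kernel APIs; free_range(n) mirrors
x86_vector_free_irqs() in that it frees indices 0..n-1):

---
#include <stdbool.h>
#include <stdio.h>

#define NR 8
static bool allocated[NR];

/* Made-up helpers: setup_one() partially sets up entry i, failing at 5. */
static bool setup_one(int i) { allocated[i] = true; return i != 5; }
static void free_one(int i)  { allocated[i] = false; }

static void free_range(int nr)
{
	for (int i = 0; i < nr; i++)
		free_one(i);
}

int main(void)
{
	int i;

	for (i = 0; i < NR; i++)
		if (!setup_one(i))
			break;
	/*
	 * Entry 'i' failed part-way: undo it directly, then free only the
	 * 'i' entries that fully succeeded. Passing i + 1 to free_range()
	 * would have the range path free the failed entry a second time.
	 */
	free_one(i);
	free_range(i);

	for (i = 0; i < NR; i++)
		printf("%d: %s\n", i, allocated[i] ? "leaked" : "free");
	return 0;
}
---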


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> What we want is s/i + 1/i/
> 
> That's correct because x86_vector_free_irqs() does:
> 
>for (i = 0; i < nr; i++)
>  
> 
> So if we fail at the first irq, then the loop will do nothing. Failing on
> the second will free the first 
> 
> Fix below.
> 
> Thanks,
> 
>   tglx

Thanks! This looks much better. I'll try to verify by tomorrow, though
the hardware I was using to recreate is not available to me at the
moment. I may be able to synth the conditions on something else.

 
> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> *domain, unsigned int virq,
>  
>   err = assign_irq_vector_policy(irqd, info);
>   trace_vector_setup(virq + i, false, err);
> - if (err)
> + if (err) {
> + irqd->chip_data = NULL;
> + free_apic_chip_data(apicd);
>   goto error;
> + }
>   }
>  
>   return 0;
>  
>  error:
> - x86_vector_free_irqs(domain, virq, i + 1);
> + x86_vector_free_irqs(domain, virq, i);
>   return err;
>  }
>  


Re: [PATCH v2 0/2] add tracepoints for nvme command submission and completion

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 03:28:19PM +0100, Johannes Thumshirn wrote:
> Add tracepoints for nvme command submission and completion. The tracepoints
> are modeled after SCSI's trace_scsi_dispatch_cmd_start() and
> trace_scsi_dispatch_cmd_done() tracepoints and fulfil a similar purpose,
> namely a fast way to check which command is going to be queued into the HW or
> Fabric driver and which command is completed again.

I like this very much, thanks for doing this. I think you could make the
submission trace point tighter for PCI as Hannes was suggesting since
an MMIO write can't fail, but doesn't look as doable for FC and RDMA.


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Tue, Jan 16, 2018 at 12:20:18PM +0100, Thomas Gleixner wrote:
> 8<--
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index f8b03bb8e725..3cc471beb50b 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -542,14 +542,17 @@ static int x86_vector_alloc_irqs(struct irq_domain 
> *domain, unsigned int virq,
>  
>   err = assign_irq_vector_policy(irqd, info);
>   trace_vector_setup(virq + i, false, err);
> - if (err)
> + if (err) {
> + irqd->chip_data = NULL;
> + free_apic_chip_data(apicd);
>   goto error;
> + }
>   }
>  
>   return 0;
>  
>  error:
> - x86_vector_free_irqs(domain, virq, i + 1);
> + x86_vector_free_irqs(domain, virq, i);
>   return err;
>  }
>  

The patch does indeed fix all the warnings and allows device binding to
succeed, albeit in a degraded performance mode. Despite that, this is
a good fix, and looks applicable to 4.4-stable, so:

Tested-by: Keith Busch 

I'm still concerned assign_irq_vector_policy is failing. That has
interrupt allocation abandon MSI-x and fall back to legacy IRQ. 

Your patch does address my main concern, though. Are you comfortable
enough to queue this up for 4.15?

Thanks,
Keith


Re: [BUG 4.15-rc7] IRQ matrix management errors

2018-01-16 Thread Keith Busch
On Wed, Jan 17, 2018 at 08:34:22AM +0100, Thomas Gleixner wrote:
> Can you trace the matrix allocations from the very beginning or tell me how
> to reproduce. I'd like to figure out why this is happening.

Sure, I'll get the irq_matrix events.

I reproduce this on a machine with 112 CPUs and 3 NVMe controllers. The
first two NVMe want 112 MSI-x vectors, and the last only 31 vectors. The
test runs 'modprobe nvme' and 'modprobe -r nvme' in a loop with 10
second delay between each step. Repro occurs within a few iterations,
sometimes already broken after the initial boot.


Re: [PATCH V2] nvme-pci: set cq_vector to -1 if io queue setup fails

2018-02-26 Thread Keith Busch
On Thu, Feb 15, 2018 at 07:13:41PM +0800, Jianchao Wang wrote:
> nvme cq irq is freed based on queue_count. When the sq/cq creation
> fails, irq will not be setup. free_irq will warn 'Try to free
> already-free irq'.
> 
> To fix it, set the nvmeq->cq_vector to -1, then nvme_suspend_queue
> will ignore it.

Applied, with updates to the change log and with the miscellaneous
whitespace changes removed.


Re: [PATCH] nvme-multipath: fix sysfs dangerously created links

2018-02-26 Thread Keith Busch
On Mon, Feb 26, 2018 at 05:51:23PM +0900, baeg...@gmail.com wrote:
> From: Baegjae Sung 
> 
> If multipathing is enabled, each NVMe subsystem creates a head
> namespace (e.g., nvme0n1) and multiple private namespaces
> (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
> private namespaces, links of head namespace are used, so the
> namespace creation order must be followed (e.g., nvme0n1 ->
> nvme0c1n1). If the order is not followed, links of sysfs will be
> incomplete or kernel panic will occur.
> 
> The kernel panic was:
>   kernel BUG at fs/sysfs/symlink.c:27!
>   Call Trace:
> nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
> nvme_validate_ns+0x5c2/0x850 [nvme_core]
> nvme_scan_work+0x1af/0x2d0 [nvme_core]
> 
> Correct order
> Context A Context B
> nvme0n1
> nvme0c0n1 nvme0c1n1
> 
> Incorrect order
> Context A Context B
>   nvme0c1n1
> nvme0n1
> nvme0c0n1
> 
> The function of a head namespace creation is moved to maintain the
> correct order. We verified the code with or without multipathing
> using three vendors of dual-port NVMe SSDs.
> 
> Signed-off-by: Baegjae Sung 

Thanks, I see what you mean on the potential ordering problem here.
Calling nvme_mpath_add_disk before the 'head' has any namespace paths
available, though, looks like it will produce a lot of 'no path available'
warnings during the bring-up. It should resolve itself shortly after,
but the warnings will be a bit alarming, right?

> ---
>  drivers/nvme/host/core.c | 12 +++-
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0fe7ea35c221..28777b7352a5 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2844,7 +2844,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct 
> nvme_ctrl *ctrl,
>  }
>  
>  static int nvme_init_ns_head(struct nvme_ns *ns, unsigned nsid,
> - struct nvme_id_ns *id, bool *new)
> + struct nvme_id_ns *id)
>  {
>   struct nvme_ctrl *ctrl = ns->ctrl;
>   bool is_shared = id->nmic & (1 << 0);
> @@ -2860,8 +2860,7 @@ static int nvme_init_ns_head(struct nvme_ns *ns, 
> unsigned nsid,
>   ret = PTR_ERR(head);
>   goto out_unlock;
>   }
> -
> - *new = true;
> + nvme_mpath_add_disk(head);
>   } else {
>   struct nvme_ns_ids ids;
>  
> @@ -2873,8 +2872,6 @@ static int nvme_init_ns_head(struct nvme_ns *ns, 
> unsigned nsid,
>   ret = -EINVAL;
>   goto out_unlock;
>   }
> -
> - *new = false;
>   }
>  
>   list_add_tail(&ns->siblings, &head->list);
> @@ -2945,7 +2942,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid)
>   struct nvme_id_ns *id;
>   char disk_name[DISK_NAME_LEN];
>   int node = dev_to_node(ctrl->dev), flags = GENHD_FL_EXT_DEVT;
> - bool new = true;
>  
>   ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
>   if (!ns)
> @@ -2971,7 +2967,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid)
>   if (id->ncap == 0)
>   goto out_free_id;
>  
> - if (nvme_init_ns_head(ns, nsid, id, &new))
> + if (nvme_init_ns_head(ns, nsid, id))
>   goto out_free_id;
>   nvme_setup_streams_ns(ctrl, ns);
>   
> @@ -3037,8 +3033,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid)
>   pr_warn("%s: failed to register lightnvm sysfs group for 
> identification\n",
>   ns->disk->disk_name);
>  
> - if (new)
> - nvme_mpath_add_disk(ns->head);
>   nvme_mpath_add_disk_links(ns);
>   return;
>   out_unlink_ns:
> -- 
> 2.16.2
> 


Re: [PATCH] nvme-pci: assign separate irq vectors for adminq and ioq0

2018-02-27 Thread Keith Busch
On Tue, Feb 27, 2018 at 04:46:17PM +0800, Jianchao Wang wrote:
> Currently, adminq and ioq0 share the same irq vector. This is
> unfair for both amdinq and ioq0.
>  - For adminq, its completion irq has to be bound on cpu0.
>  - For ioq0, when the irq fires for io completion, the adminq irq
>action has to be checked also.

This change log could use some improvement. Why is it bad if the admin
interrupt's affinity is with cpu0?

Are you able to measure _any_ performance difference on IO queue 1 vs IO
queue 2 that you can attribute to IO queue 1's sharing vector 0?
 
> @@ -1945,11 +1947,11 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>* setting up the full range we need.
>*/
>   pci_free_irq_vectors(pdev);
> - nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
> - PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
> - if (nr_io_queues <= 0)
> + ret = pci_alloc_irq_vectors_affinity(pdev, 1, (nr_io_queues + 1),
> + PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
> + if (ret <= 0)
>   return -EIO;
> - dev->max_qid = nr_io_queues;
> + dev->max_qid = ret - 1;

So controllers that have only legacy or single-message MSI don't get any
IO queues?


[PATCH] pciehp: Fix infinite interupt handler loop

2017-08-01 Thread Keith Busch
We've encountered a particular platform that under some circumstances
always has the power fault detected status raised. The pciehp irq handler
would loop forever because it thinks it is handling new events when in
fact the power fault is not new. This patch fixes that by masking off
the power fault status from new events if the driver hasn't seen the
power fault clear from the previous handling attempt.

Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before looking 
for new ones")

Cc:  # 4.9+
Cc: Mayurkumar Patel 
Signed-off-by: Keith Busch 
---
Resending due to send-email setup error; this patch may appear twice
for some.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 026830a..8ecbc13 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -583,7 +583,9 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
 * Slot Status contains plain status bits as well as event
 * notification bits; right now we only want the event bits.
 */
-   events = status & (PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
+   events = status & (PCI_EXP_SLTSTA_ABP |
+ (ctrl->power_fault_detected ?
+   0 : PCI_EXP_SLTSTA_PFD) |
   PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_CC |
   PCI_EXP_SLTSTA_DLLSC);
if (!events)
-- 
2.5.5



Re: [PATCH] nvme-pci: Fix an error handling path in 'nvme_probe()'

2017-07-17 Thread Keith Busch
On Sun, Jul 16, 2017 at 10:39:03AM +0200, Christophe JAILLET wrote:
> Release resources in the correct order in order not to miss a
> 'put_device()' if 'nvme_dev_map()' fails.
> 
> Fixes: b00a726a9fd8 ("NVMe: Don't unmap controller registers on reset")
> Signed-off-by: Christophe JAILLET 

Indeed, thanks for the fix. Alternatively this can be fixed by relocating
nvme_dev_map prior to the 'get_device' a few lines up. This patch is
okay, too.

Reviewed-by: Keith Busch 

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d10d2f279d19..005beffd005d 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2282,7 +2282,7 @@ static int nvme_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   result = nvme_dev_map(dev);
>   if (result)
> - goto free;
> + goto put_pci;
>  
>   INIT_WORK(&dev->ctrl.reset_work, nvme_reset_work);
>   INIT_WORK(&dev->remove_work, nvme_remove_dead_ctrl_work);
> @@ -2291,7 +2291,7 @@ static int nvme_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   result = nvme_setup_prp_pools(dev);
>   if (result)
> - goto put_pci;
> + goto unmap;
>  
>   quirks |= check_dell_samsung_bug(pdev);
>  
> @@ -2308,9 +2308,10 @@ static int nvme_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>  
>   release_pools:
>   nvme_release_prp_pools(dev);
> + unmap:
> + nvme_dev_unmap(dev);
>   put_pci:
>   put_device(dev->dev);
> - nvme_dev_unmap(dev);
>   free:
>   kfree(dev->queues);
>   kfree(dev);


Re: [PATCH] nvme: Acknowledge completion queue on each iteration

2017-07-17 Thread Keith Busch
On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> Code is moving the completion queue doorbell after processing all completed
> events and sending callbacks to the block layer on each iteration.
> 
> This is causing a performance drop when a lot of jobs are queued towards
> the HW. Move the completion queue doorbell on each loop instead and allow new
> jobs to be queued by the HW.

That doesn't make sense. Aggregating doorbell writes should be much more
efficient for high depth workloads.


Re: [PATCH] nvme: Acknowledge completion queue on each iteration

2017-07-17 Thread Keith Busch
On Mon, Jul 17, 2017 at 06:46:11PM -0400, Sinan Kaya wrote:
> Hi Keith,
> 
> On 7/17/2017 6:45 PM, Keith Busch wrote:
> > On Mon, Jul 17, 2017 at 06:36:23PM -0400, Sinan Kaya wrote:
> >> Code is moving the completion queue doorbell after processing all completed
> >> events and sending callbacks to the block layer on each iteration.
> >>
> >> This is causing a performance drop when a lot of jobs are queued towards
> >> the HW. Move the completion queue doorbell on each loop instead and allow 
> >> new
> >> jobs to be queued by the HW.
> > 
> > That doesn't make sense. Aggregating doorbell writes should be much more
> > efficient for high depth workloads.
> > 
> 
> Problem is that code is throttling the HW as HW cannot queue more completions 
> until
> SW get a chance to clear it. 
> 
> As an example:
> 
> for each in N
> (
>   blk_layer()
> )
> ring door bell
> 
> HW cannot queue new job until N x blk_layer operations are processed and queue
> element ownership is passed to the HW after the loop. HW is just sitting idle
> there if no queue entries are available.

If no completion queue entries are available, then there can't possibly
be any submission queue entries for the HW to work on either.
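
For illustration, a self-contained toy of the aggregated approach (the
struct and helpers below are invented, not the driver's code): drain every
valid CQ entry first, then write the doorbell once for the whole batch
rather than once per entry:

---
#include <stdio.h>
#include <stdint.h>

#define CQ_DEPTH 8

struct cq {
	uint16_t entries[CQ_DEPTH];	/* nonzero == valid completion */
	uint16_t head;
	uint32_t doorbell;		/* stands in for the MMIO register */
	int doorbell_writes;
};

static void ring_doorbell(struct cq *cq)
{
	cq->doorbell = cq->head;	/* one MMIO write in real hardware */
	cq->doorbell_writes++;
}

static void process_cq(struct cq *cq)
{
	uint16_t start = cq->head;

	while (cq->entries[cq->head]) {
		cq->entries[cq->head] = 0;	/* complete the command */
		cq->head = (cq->head + 1) % CQ_DEPTH;
	}
	if (cq->head != start)
		ring_doorbell(cq);		/* once per drained batch */
}

int main(void)
{
	struct cq cq = { .entries = { 1, 1, 1, 1, 1, 0, 0, 0 } };

	process_cq(&cq);
	printf("drained to head %u with %d doorbell write(s)\n",
	       cq.head, cq.doorbell_writes);
	return 0;
}
---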


Re: [PATCH 02/13] mpt3sas: SGL to PRP Translation for I/Os to NVMe devices

2017-07-11 Thread Keith Busch
On Tue, Jul 11, 2017 at 01:55:02AM -0700, Suganath Prabu S wrote:
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
> b/drivers/scsi/mpt3sas/mpt3sas_base.h
> index 60fa7b6..cebdd8e 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.h
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
> @@ -54,6 +54,7 @@
>  #include "mpi/mpi2_raid.h"
>  #include "mpi/mpi2_tool.h"
>  #include "mpi/mpi2_sas.h"
> +#include "mpi/mpi2_pci.h"

Could you adjust the patch order for this series so each patch can compile?
Here in patch 2 you're including a header that isn't added until patch 12.


Re: [PATCH 02/13] mpt3sas: SGL to PRP Translation for I/Os to NVMe devices

2017-07-11 Thread Keith Busch
On Tue, Jul 11, 2017 at 01:55:02AM -0700, Suganath Prabu S wrote:
> +/**
> + * _base_check_pcie_native_sgl - This function is called for PCIe end 
> devices to
> + * determine if the driver needs to build a native SGL.  If so, that native
> + * SGL is built in the special contiguous buffers allocated especially for
> + * PCIe SGL creation.  If the driver will not build a native SGL, return
> + * TRUE and a normal IEEE SGL will be built.  Currently this routine
> + * supports NVMe.
> + * @ioc: per adapter object
> + * @mpi_request: mf request pointer
> + * @smid: system request message index
> + * @scmd: scsi command
> + * @pcie_device: points to the PCIe device's info
> + *
> + * Returns 0 if native SGL was built, 1 if no SGL was built
> + */
> +static int
> +_base_check_pcie_native_sgl(struct MPT3SAS_ADAPTER *ioc,
> + Mpi25SCSIIORequest_t *mpi_request, u16 smid, struct scsi_cmnd *scmd,
> + struct _pcie_device *pcie_device)
> +{



> + /* Return 0, indicating we built a native SGL. */
> + return 1;
> +}

This function never returns 0. Not sure why it's here.

Curious about your device, though: if an NVMe native SGL can *not* be
built, does the HBA firmware then buffer it in its local memory before
sending/receiving to/from the host?

And if a native SGL can be built, does the NVMe target DMA directly
to/from host memory, giving a performance boost?


Re: [PATCH] nvme: Acknowledge completion queue on each iteration

2017-07-18 Thread Keith Busch
On Mon, Jul 17, 2017 at 07:07:00PM -0400, ok...@codeaurora.org wrote:
> Maybe, I need to understand the design better. I was curious why completion
> and submission queues were protected by a single lock causing lock
> contention.

Ideally the queues are tied to CPUs, so you couldn't have one thread
submitting to a particular queue-pair while another thread is reaping
completions from it. Such a setup wouldn't get lock contention.

Some machines have so many CPUs, though, that sharing hardware queues
is required. We've experimented with separate submission and completion
locks for such cases, but I've never seen improved performance as a
result.


Re: [PATCH] nvme: Acknowledge completion queue on each iteration

2017-07-18 Thread Keith Busch
On Tue, Jul 18, 2017 at 02:52:26PM -0400, Sinan Kaya wrote:
> On 7/18/2017 10:36 AM, Keith Busch wrote:
> 
> I do see that the NVMe driver is creating a completion interrupt on
> each CPU core for the completions. No problems with that. 
> 
> However, I don't think you can guarantee that there will always be a single
> CPU core targeting one submission queue especially with asynchronous IO.
>
> Lock contention counters from CONFIG_LOCK_STAT are pointing to nvmeq->lock
> in my FIO tests.
> 
> Did I miss something?

I think that must mean your machine has many more CPUs than your nvme
controller has IO queues.


Re: [PATCH] PCI/DPC: Fix shared interrupt handling

2017-12-14 Thread Keith Busch
On Wed, Dec 13, 2017 at 05:01:58PM -0700, Alex Williamson wrote:
> @@ -109,6 +109,7 @@ static void interrupt_event_handler(struct work_struct 
> *work)
>   struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>   struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>   struct pci_bus *parent = pdev->subordinate;
> + u16 ctl;
>  
>   pci_lock_rescan_remove();
>   list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> @@ -135,6 +136,10 @@ static void interrupt_event_handler(struct work_struct 
> *work)
>  
>   pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
>   PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
> +
> + pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, &ctl);
> + pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL,
> +   ctl & ~PCI_EXP_DPC_CTL_INT_EN);

Did you mean to re-enable the interrupt here rather than mask it off?


Re: [PATCH v2] PCI/DPC: Fix shared interrupt handling

2017-12-14 Thread Keith Busch
On Thu, Dec 14, 2017 at 08:20:18AM -0700, Alex Williamson wrote:
> DPC supports shared interrupts, but it plays very loosely with testing
> whether the interrupt is generated by DPC before generating spurious
> log messages, such as:
> 
>  dpc :10:01.2:pcie010: DPC containment event, status:0x1f00 source:0x
> 
> Testing the status register for zero or -1 is not sufficient when the
> device supports the RP PIO First Error Pointer register.  Change this
> to test whether the interrupt is enabled in the control register,
> retaining the device present test, and that the status reports the
> interrupt as signaled and DPC is triggered, clearing as a spurious
> interrupt otherwise.
> 
> Additionally, since the interrupt is actually serviced by a workqueue,
> disable the interrupt in the control register until that completes or
> else we may never see it execute due to further incoming interrupts.
> A software generated DPC floods the system otherwise.
> 
> Signed-off-by: Alex Williamson 

Thanks, looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 0/4] Address error and recovery for AER and DPC

2017-12-28 Thread Keith Busch
On Wed, Dec 27, 2017 at 02:20:18AM -0800, Oza Pawandeep wrote:
> DPC should enumerate the devices after recovering the link, which is
> achieved by implementing error_resume callback.

Wouldn't that race with the link-up event that pciehp currently handles?


Re: [PATCH v2 2/4] PCI/DPC/AER: Address Concurrency between AER and DPC

2017-12-29 Thread Keith Busch
On Fri, Dec 29, 2017 at 12:54:17PM +0530, Oza Pawandeep wrote:
> This patch addresses the race condition between AER and DPC for recovery.
> 
> Current DPC driver does not do recovery, e.g. calling end-point's driver's
> callbacks, which sanitize the device.
> DPC driver implements link_reset callback, and calls pci_do_recovery.

I'm not sure I see why any of this is necessary for two reasons:

1. A downstream port containment event disables the link. How can a driver
sanitize an end device when all the end devices below the containment are
physically inaccessible? Any attempt to access such devices will just
end with either CA or UR (depending on DPC control settings). Since we
already know the failed outcome from attempting to access such devices,
why do you want the drivers to do anything?

2. A DPC event suppresses the error message required for the Linux
AER driver to run. How can AER and DPC run concurrently?


Re: [PATCH v2 2/4] PCI/DPC/AER: Address Concurrency between AER and DPC

2017-12-29 Thread Keith Busch
On Fri, Dec 29, 2017 at 11:30:02PM +0530, p...@codeaurora.org wrote:
> On 2017-12-29 22:53, Keith Busch wrote:
> 
> > 2. A DPC event suppresses the error message required for the Linux
> > AER driver to run. How can AER and DPC run concurrently?
> 
> I afraid I could not grasp the first line completely.

A DPC capable and enabled port discards error messages; the ERR_FATAL
or ERR_NONFATAL message required to trigger AER handling won't exist in
such a setup.

This behavior is defined in the specification 6.2.10 for Downstream
Port Containment:

  When DPC is triggered due to receipt of an uncorrectable error Message,
  the Requester ID from the Message is recorded in the DPC Error
  Source ID register and that Message is discarded and not forwarded
  Upstream. When DPC is triggered by an unmasked uncorrectable error,
  that error will not be signaled with an uncorrectable error Message,
  even if otherwise enabled.


Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only

2017-12-15 Thread Keith Busch
On Thu, Dec 14, 2017 at 06:21:55PM -0600, Bjorn Helgaas wrote:
> [+cc Rajat, Keith, linux-kernel]
> 
> On Thu, Dec 14, 2017 at 07:47:01PM +0100, Maik Broemme wrote:
> > I have a Samsung 960 PRO NVMe SSD (Non-Volatile memory controller:
> > Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961). It
> > works fine until I enable powersupersave via
> > /sys/module/pcie_aspm/parameters/policy
> > 
> > ASPM is enabled in BIOS and works fine for all devices and in
> > powersave mode. I'm able to reproduce this always at any time while
> > the system is up and running via:
> > 
> > $> echo powersupersave > /sys/module/pcie_aspm/parameters/policy
> > 
> > The Linux kernel is 4.14.4 and APST for my device is working with
> > powersave. As soon as I enable powersupersave I get:
> > 
> > [11535.142755] dpc :00:10.0:pcie010: DPC containment event, 
> > status:0x1f09 source:0x
> > [11535.142760] dpc :00:10.0:pcie010: DPC unmasked uncorrectable error 
> > detected, remove downstream devices
> > [11535.15] nvme0n1: detected capacity change from 1024209543168 to 0
> > ...
> 
> Can you start by opening a bug report at https://bugzilla.kernel.org,
> category Drivers/PCI, and attaching the complete "lspci -vv" output
> (as root) and the complete dmesg log?  Make sure you have a new enough
> lspci to decode the ASPM L1 Substates capability and the LTR bits.
> Source is at git://git.kernel.org/pub/scm/utils/pciutils/pciutils.git
> 
> powersupersave enables ASPM L1 Substates.  Rajat, do you have any
> ideas about this or how we might debug it?
>
> Keith, is this really all the information about the event that we can
> get out of DPC?  Is there some AER logging we might be able to get via
> "lspci -vv"?  Sounds like this is the boot disk, so Maik may not be
> able to run lspci after the DPC event.  If there *is* any AER info,
> can we connect up the DPC event so we can print the AER info from the
> kernel?

There should be information in the AER registers. The base spec section
6.2.5 ("Sequence of Device Error Signaling and Logging") says the
corresponding bit in the AER Uncorrectable Error Status register should
be set before triggering DPC. The sequence ends with the DPC trigger,
so the Linux AER service was never notified to handle the event.

As an enhancement to the DPC driver, we may be able to enqueue an AER
event to see if that may provide additional details about the error. I can
implement that enhancement, and should have something for consideration
sometime in the next week.

On a side note, now that root ports are implementing DPC, we should
probably check whether the platform handles AER firmware-first. The PCIe
specification strongly recommends linking DPC control to that of AER, so
I'll try to add that check as well.


Re: [PATCH v2 2/2] dm unstripe: Add documentation for unstripe target

2017-12-11 Thread Keith Busch
On Mon, Dec 11, 2017 at 09:00:19AM -0700, Scott Bauer wrote:
> +Example scripts:
> +
> +
> +dmsetup create nvmset1 --table '0 1 dm-unstripe /dev/nvme0n1 1 2 0'
> +dmsetup create nvmset0 --table '0 1 dm-unstripe /dev/nvme0n1 0 2 0'
> +
> +There will now be two mappers:
> +/dev/mapper/nvmset1
> +/dev/mapper/nvmset0
> +
> +that will expose core 0 and core 1.
> +
> +
> +In a Raid 0 with 4 drives of stripe size 128K:
> +dmsetup create raid_disk0 --table '0 1 dm-unstripe /dev/nvme0n1 0 4 256'
> +dmsetup create raid_disk1 --table '0 1 dm-unstripe /dev/nvme0n1 1 4 256'
> +dmsetup create raid_disk2 --table '0 1 dm-unstripe /dev/nvme0n1 2 4 256'
> +dmsetup create raid_disk3 --table '0 1 dm-unstripe /dev/nvme0n1 3 4 256'

While this device mapper is intended for H/W RAID where the member disks
are hidden, we can test it using DM software striping so we don't need
any particular hardware.

Here's a little test script I wrote for that. It sets up a striped
device backed by files, unstripes it into different sets, then compares
each to its original backing file after writing random data to it,
cleaning up the test artifacts before exiting. The parameters at the
top can be modified to test different striping scenarios.

---
#!/bin/bash

MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4
SEQ_END=$((${NUM}-1))
CHUNK=256
BS=4096

RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS}))

for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
  losetup /dev/loop${i} member-${i}
  DM_PARMS+=" /dev/loop${i} 0"
done

echo $DM_PARMS | dmsetup create raid0
for i in $(seq 0 ${SEQ_END}); do
  echo "0 1 dm-unstripe /dev/mapper/raid0 ${i} ${NUM} ${CHUNK}" | dmsetup 
create set-${i}
done;

for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} 
oflag=direct
  diff /dev/mapper/set-${i} member-${i}
done;

for i in $(seq 0 ${SEQ_END}); do
  dmsetup remove set-${i}
done
dmsetup remove raid0

for i in $(seq 0 ${SEQ_END}); do
  losetup -d /dev/loop${i}
  rm -f member-${i}
done
--


Re: [PATCH v2 1/2] dm-unstripe: unstripe of IO across RAID 0

2017-12-11 Thread Keith Busch
On Mon, Dec 11, 2017 at 09:00:18AM -0700, Scott Bauer wrote:
> +
> + dm_set_target_max_io_len(ti, target->max_hw_sectors);

The return for this function is marked "__must_check", so it's currently
throwing a compiler warning.

Otherwise, this looks like it's doing what you want, and tests
successfully on my synthetic workloads.

Acked-by: Keith Busch 


Re: [PATCH v2 2/2] dm unstripe: Add documentation for unstripe target

2017-12-12 Thread Keith Busch
On Tue, Dec 12, 2017 at 01:35:13PM +0200, Nikolay Borisov wrote:
> On 11.12.2017 18:00, Scott Bauer wrote:
> > +As an example:
> > +
> > +Intel NVMe drives contain two cores on the physical device.
> > +Each core of the drive has segregated access to its LBA range.
> > +The current LBA model has a RAID 0 128k stripe across the two cores:
> > +
> > +   Core 0:Core 1:
> > +  ____
> > +  | LBA 511|| LBA 768|
> > +  | LBA 0  || LBA 256|
> > +  ⎻⎻⎻⎻
> 
> If it's 128k stripe shouldn't it be LBAs 0/256 on core0 and LBAs 128/511
> on core1?

Ah, this device's makers use "stripe" for what should be called the
"chunk" size. This device has a 128k chunk per core with two cores, so the
full stripe is 256k. The above should have core 0 owning LBA 512 rather
than 511 (assuming 512b LBA format).
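
For reference, the mapping works out as below (a standalone sketch using
this example's geometry: 2 cores, 128k chunk, 512B LBAs, i.e. 256 LBAs
per chunk):

---
#include <stdio.h>

int main(void)
{
	const unsigned long chunk_lbas = 256;	/* 128k / 512B LBA */
	const unsigned long cores = 2;
	const unsigned long lbas[] = { 0, 255, 256, 511, 512, 768 };

	/*
	 * core = (lba / chunk) % cores: LBAs 0..255 -> core 0,
	 * 256..511 -> core 1, 512..767 -> core 0 again, matching
	 * "core 0 owning LBA 512" above.
	 */
	for (unsigned int i = 0; i < sizeof(lbas) / sizeof(lbas[0]); i++)
		printf("LBA %4lu -> core %lu\n", lbas[i],
		       (lbas[i] / chunk_lbas) % cores);
	return 0;
}
---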


Re: [PATCH 1/6] genirq: allow assigning affinity to present but not online CPUs

2017-02-06 Thread Keith Busch
On Sun, Feb 05, 2017 at 05:40:23PM +0100, Christoph Hellwig wrote:
> Hi Joe,
> 
> On Fri, Feb 03, 2017 at 08:58:09PM -0500, Joe Korty wrote:
> > IIRC, some years ago I ran across a customer system where
> > the #cpus_present was twice as big as #cpus_possible.
> > 
> > Hyperthreading was turned off in the BIOS so it was not
> > entirely out of line for the extra cpus to be declared
> > present, even though none of them would ever be available
> > for use.
> 
> This sounds like a system we should quirk around instead of optimizing
> for it.  Unless I totally misunderstand the idea behind cpu_possible
> and cpu_present.

Can we use the online CPUs and add a new hot-cpu notifier in the nvme
driver to free/reallocate as needed? We were doing that before blk-mq. Now
blk-mq can change the number of hardware contexts on a live queue, so we
can reintroduce that behavior in nvme and only allocate what we need.


Re: [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems

2019-07-31 Thread Keith Busch
On Wed, Jul 31, 2019 at 11:25:51PM +0200, Rafael J. Wysocki wrote:
> 
> A couple of remarks if you will.
> 
> First, we don't know which case is the majority at this point.  For
> now, there is one example of each, but it may very well turn out that
> the SK Hynix BC501 above needs to be quirked.
> 
> Second, the reference here really is 5.2, so if there are any systems
> that are not better off with 5.3-rc than they were with 5.2, well, we
> have not made progress.  However, if there are systems that are worse
> off with 5.3, that's bad.  In the face of the latest findings the only
> way to avoid that is to be backwards compatible with 5.2 and that's
> where my patch is going.  That cannot be achieved by quirking all
> cases that are reported as "bad", because there still may be
> unreported ones.

I have to agree. I think your proposal may allow PCI D3cold, in which
case we do need to reintroduce the HMB handling.


Re: [Regression] Commit "nvme/pci: Use host managed power state for suspend" has problems

2019-08-01 Thread Keith Busch
On Thu, Aug 01, 2019 at 02:05:54AM -0700, Kai-Heng Feng wrote:
> at 06:33, Rafael J. Wysocki  wrote:
> > On Thu, Aug 1, 2019 at 12:22 AM Keith Busch  wrote:
> >
> >> In which case we do need to reintroduce the HMB handling.
> >
> > Right.
> 
> The patch alone doesn’t break HMB Toshiba NVMe I tested. But I think it’s  
> still safer to do proper HMB handling.

The spec requires the host to request that the controller release the HMB
before D3cold. I suspect you're only getting to D3hot.

