Re: [Qemu-block] [PATCH v4] block/nvme: add support for discard

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:50 +0200, Max Reitz wrote:
> On 03.07.19 18:07, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c   | 81 ++
> >  block/trace-events |  2 ++
> >  2 files changed, 83 insertions(+)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 02e0846643..96a715dcc1 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -460,6 +461,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >s->page_size / sizeof(uint64_t) * s->page_size);
> >  
> >  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> > +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
> 
> Shouldn’t this be le16_to_cpu(idctrl->oncs)?  Same in the previous
> patch, now that I think about it.

This reminds me of how I basically scrubbed through nvme-mdev looking for
endianness bugs, manually checking every reference to a hardware-controlled
structure.
Thank you very much!!
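
For v5 I plan something along these lines (a sketch):

    uint16_t oncs = le16_to_cpu(idctrl->oncs);

    s->supports_write_zeros = !!(oncs & NVME_ONCS_WRITE_ZEROS);
    s->supports_discard = !!(oncs & NVME_ONCS_DSM);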


> 
> >  
> >  memset(resp, 0, 4096);
> >  
> > @@ -1149,6 +1151,84 @@ static coroutine_fn int 
> > nvme_co_pwrite_zeroes(BlockDriverState *bs,
> >  }
> >  
> >  
> > +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> > + int64_t offset,
> > + int bytes)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +NvmeDsmRange *buf;
> > +QEMUIOVector local_qiov;
> > +int r;
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_DSM,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = 0, /*number of ranges - 0 based*/
> 
> I’d make this cpu_to_le32(0).  Sure, there is no effect for 0, but in
> theory this is a variable value, so...
OK, let it be cpu_to_le32(0).

> 
> > +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> > +
> > +if (!s->supports_discard) {
> > +return -ENOTSUP;
> > +}
> > +
> > +assert(s->nr_queues > 1);
> > +
> > +buf = qemu_try_blockalign0(bs, 4096);
> 
> I’m not sure whether this needs to be 4096 or whether 16 would suffice,
>  but I suppose this gets us the least trouble.
Exactly. Now that I think about it, even better would be to use
's->page_size', the device page size, which is at least 4K (the spec minimum).

Speaking of which, there is a theoretical bug there - the device can in theory
indicate that its minimal page size is larger than 4K.
The kernel currently rejects such devices, but here the driver just forces a
4K page size in the CC register.
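
A sketch of what I mean (assuming s->page_size has already been set up by the
time nvme_co_pdiscard() runs):

    buf = qemu_try_blockalign0(bs, s->page_size);
    if (!buf) {
        return -ENOMEM;
    }
    /* ... */
    qemu_iovec_init(&local_qiov, 1);
    qemu_iovec_add(&local_qiov, buf, s->page_size);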

> 
> > +if (!buf) {
> > +return -ENOMEM;
> 
> Indentation is off.
True!

> 
> > +}
> > +
> > +buf->nlb = cpu_to_le32(bytes >> s->blkshift);
> > +buf->slba = cpu_to_le64(offset >> s->blkshift);
> > +buf->cattr = 0;
> > +
> > +qemu_iovec_init(&local_qiov, 1);
> > +qemu_iovec_add(&local_qiov, buf, 4096);
> > +
> > +req = nvme_get_free_req(ioq);
> > +assert(req);
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +
> > +if (r) {
> > +req->busy = false;
> > +return r;
> 
> Leaking buf and local_qiov here.
True, fixed.

> 
> > +}
> > +
> > +trace_nvme_dsm(s, offset, bytes);
> > +
> > +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> > +
> > +data.co = qemu_coroutine_self();
> > +while (data.ret == -EINPROGRESS) {
> > +qemu_coroutine_yield();
> > +}
> > +
> > +qemu_co_mutex_lock(&s->dma_map_lock);
> > +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> > +qemu_co_mutex_unlock(&s->dma_map_lock);
> > +if (r) {
> > +return r;
> 
> Leaking buf and local_qiov here, too.
True, fixed - next time I will check the error paths more carefully.
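
One way to cover both error paths with a single cleanup label (a sketch of
what I have in mind, not necessarily the final v5 code):

    qemu_co_mutex_lock(&s->dma_map_lock);
    r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
    qemu_co_mutex_unlock(&s->dma_map_lock);
    if (r) {
        req->busy = false;
        goto out;                    /* buf and local_qiov freed below */
    }

    /* ... submit the command and wait for completion as before ... */

    qemu_co_mutex_lock(&s->dma_map_lock);
    r = nvme_cmd_unmap_qiov(bs, &local_qiov);
    qemu_co_mutex_unlock(&s->dma_map_lock);
    if (!r) {
        r = data.ret;                /* unmap OK: report the command status */
    }
out:
    qemu_iovec_destroy(&local_qiov);
    qemu_vfree(buf);
    return r;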

> 
> Max
> 
> > +}
> > +
> > +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> > +
> > +qemu_iovec_destroy(&local_qiov);
> > +qemu_vfree(buf);
> > +return data.ret;
> > +
> > +}
> > +
> > +
> >  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> > BlockReopenQueue *queue, Error **errp)
> >  {
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH v3 5/6] block/nvme: add support for write zeros

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 15:33 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 69 +++-
> >  block/trace-events   |  1 +
> >  include/block/nvme.h | 19 +++-
> >  3 files changed, 87 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 152d27b07f..02e0846643 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -469,6 +473,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  s->nsze = le64_to_cpu(idns->nsze);
> > lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> >  
> > +if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
> > +NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
> > +bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
> > +
> 
> This violates the coding style, there should be curly brackets here.
100% agree. I also need to see whether we can update checkpatch.pl to catch this.


> 
> >  if (lbaf->ms) {
> >  error_setg(errp, "Namespaces with metadata are not yet supported");
> >  goto out;
> > @@ -763,6 +772,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  int ret;
> >  BDRVNVMeState *s = bs->opaque;
> >  
> > +bs->supported_write_flags = BDRV_REQ_FUA;
> > +
> >  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> >  qemu_opts_absorb_qdict(opts, options, &error_abort);
> >  device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> > @@ -791,7 +802,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> > *options, int flags,
> >  goto fail;
> >  }
> >  }
> > -bs->supported_write_flags = BDRV_REQ_FUA;
> 
> Any reason for this movement?

This is because nvme_identify checks whether the underlying namespace
supports 'discarded data reads back as zeros', in which case it sets
BDRV_REQ_MAY_UNMAP in bs->supported_write_flags. That flag later allows me to
set the 'deallocate' bit in the write zeros command, which hints the
controller to discard the area.

The assignment was moved to avoid overwriting that value. I could have instead
just ORed the value in, but I think this way is a bit cleaner.
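
Roughly, in nvme_co_pwrite_zeroes() the request flags end up as CDW12 bits of
the Write Zeroes command, along these lines (sketch):

    if (flags & BDRV_REQ_FUA) {
        cdw12 |= (1 << 30);   /* FUA - force unit access */
    }
    if (flags & BDRV_REQ_MAY_UNMAP) {
        cdw12 |= (1 << 25);   /* Deallocate - hint the controller to discard */
    }
    cmd.cdw12 = cpu_to_le32(cdw12);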



> 
> >  return 0;
> >  fail:
> >  nvme_close(bs);
> > @@ -1085,6 +1095,60 @@ static coroutine_fn int 
> > nvme_co_flush(BlockDriverState *bs)
> >  }
> >  
> >  
> > +static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
> > +  int64_t offset,
> > +  int bytes,
> > +  BdrvRequestFlags flags)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +NVMeQueuePair *ioq = s->queues[1];
> > +NVMeRequest *req;
> > +
> > +if (!s->supports_write_zeros) {
> > +return -ENOTSUP;
> > +}
> > +
> > +uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
> 
> Another coding style violation: Variable declarations and other code may
> not be mixed.
Another bug in checkpatch.pl :-) Will fix.
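
The fix itself is simple enough - hoist the computation up to the other
declarations, e.g. (sketch):

    BDRVNVMeState *s = bs->opaque;
    NVMeQueuePair *ioq = s->queues[1];
    NVMeRequest *req;
    uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;

    if (!s->supports_write_zeros) {
        return -ENOTSUP;
    }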

> 
> > +
> > +NvmeCmd cmd = {
> > +.opcode = NVME_CMD_WRITE_ZEROS,
> > +.nsid = cpu_to_le32(s->nsid),
> > +.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
> > +.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
> > +};
> > +
> > +NVMeCoData data = {
> > +.ctx = bdrv_get_aio_context(bs),
> > +.ret = -EINPROGRESS,
> > +};
> 
> [...]
> 
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 3ec8efcc43..65eb65c740 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -653,12 +653,29 @@ typedef struct NvmeIdNs {
> >  uint8_t mc;
> >  uint8_t dpc;
> >  uint8_t dps;
> > -uint8_t res30[98];
> > +
> > +uint8_t nmic;
> > +uint8_t rescap;
> > +uint8_t fpi;
> > +uint8_t dlfeat;
> > +
> > +uint8_t res30[94];
> >  NvmeLBAF    lbaf[16];
> >  uint8_t res192[192];
> >  uint8_t vs[3712];
> >  } NvmeIdNs;
> >  
> > +
> > +/*Deallocate Logical Block Features*/
> > +#define NVME_ID_NS_DLFEAT_GUARD_CRC(dlfeat)   ((dlfeat) & 0x10)
> > +#define NVME_ID_NS_DLFEAT_WRITE_ZEROS(dlfeat) ((dlfeat) & 0x04)
> 
> Isn’t it bit 3, i.e. 0x08?
Oops, I hadn't noticed that the 'read behavior' field is 3 bits and not 2!
Thank you very much. I didn't catch this since the device I tested on doesn't
support this anyway (dlfeat == 0).
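
So, assuming I read the spec right this time (bits 2:0 are the read behavior,
bit 3 is the write-zeroes deallocate support), the macros should become:

    #define NVME_ID_NS_DLFEAT_WRITE_ZEROS(dlfeat)   ((dlfeat) & 0x08)
    #define NVME_ID_NS_DLFEAT_READ_BEHAVIOR(dlfeat) ((dlfeat) & 0x07)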

> 
> Max
> 
> > +
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR(dlfeat) ((dlfeat) & 0x3)
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_UNDEFINED   0
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS   1
> > +#define NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ONES2
> > +
> > +
> >  #define NVME_ID_NS_NSFEAT_THIN(nsfeat)  ((nsfeat & 0x1))
> >  

Re: [Qemu-block] [PATCH v3 4/6] block/nvme: add support for image creation

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 14:09 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Tested on an NVMe device like that:
> > 
> > # create preallocated qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o 
> > preallocation=metadata
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=metadata lazy_refcounts=off 
> > refcount_bits=16
> > 
> > # create an empty qcow2 image
> > $ qemu-img create -f qcow2 nvme://:06:00.0/1 10G -o preallocation=off
> > Formatting 'nvme://:06:00.0/1', fmt=qcow2 size=10737418240 
> > cluster_size=65536 preallocation=off lazy_refcounts=off refcount_bits=16
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 108 +++
> >  1 file changed, 108 insertions(+)
> 
> Hm.  I’m not quite sure I like this, because this is not image creation.

I fully agree with you, and the whole thing did feel kind of wrong.
I think that bdrv_co_create_opts is somewhat outdated for this purpose,
especially for the nvme driver.
It would be better if bdrv_file_open just supported something like 'O_CREAT'.

I did this mostly the same way as file-posix does it for block devices,
including that 'hack' of zeroing the first sector, for which I really don't
know if it is the right solution.
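
For reference, the 'formatting' step boils down to roughly this (a sketch of
the idea, not the exact patch code):

    /* Clear the first sector so stale format signatures are not picked up on
     * the next open - the same trick file-posix uses for host block devices. */
    uint8_t *buf = g_malloc0(BDRV_SECTOR_SIZE);
    ret = blk_pwrite(blk, 0, buf, BDRV_SECTOR_SIZE, 0);
    g_free(buf);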



> 
> What we need is a general interface for formatting existing files.  I
> mean, we have that in QMP (blockdev-create), but the problem is that
> this doesn’t really translate to qemu-img create.
> 
> I wonder whether it’s best to hack something up that makes
> bdrv_create_file() a no-op, or whether we should expose blockdev-create
> over qemu-img.  I’ll see how difficult the latter is, it sounds fun
> (famous last words).
For existing images, 'bdrv_create_file' is already mostly a no-op, other than
zeroing the first sector, which kind of makes sense, but is probably best done
at a higher level rather than in each driver.

So these are my thoughts about this, thanks for the review!

Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v3 3/6] block/nvme: support larger that 512 bytes sector devices

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:58 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Currently the driver hardcodes the sector size to 512,
> > and doesn't check the underlying device. Fix that.
> > 
> > Also fail if underlying nvme device is formatted with metadata
> > as this needs special support.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 45 -
> >  1 file changed, 40 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 52798081b2..1f0d09349f 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> 
> [...]
> 
> > @@ -463,7 +467,22 @@ static void nvme_identify(BlockDriverState *bs, int 
> > namespace, Error **errp)
> >  }
> >  
> >  s->nsze = le64_to_cpu(idns->nsze);
> > +lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
> > +
> > +if (lbaf->ms) {
> > +error_setg(errp, "Namespaces with metadata are not yet supported");
> > +goto out;
> > +}
> > +
> > +hwsect_size = 1 << lbaf->ds;
> > +
> > +if (hwsect_size < BDRV_SECTOR_BITS || hwsect_size > s->page_size) {
> 
> s/BDRV_SECTOR_BITS/BDRV_SECTOR_SIZE/
Oops.

> 
> > +error_setg(errp, "Namespace has unsupported block size (%d)",
> > +hwsect_size);
> > +goto out;
> > +}
> >  
> > +s->blkshift = lbaf->ds;
> >  out:
> >  qemu_vfio_dma_unmap(s->vfio, resp);
> >  qemu_vfree(resp);
> > @@ -782,8 +801,22 @@ fail:
> >  static int64_t nvme_getlength(BlockDriverState *bs)
> >  {
> >  BDRVNVMeState *s = bs->opaque;
> > +return s->nsze << s->blkshift;
> > +}
> >  
> > -return s->nsze << BDRV_SECTOR_BITS;
> > +static int64_t nvme_get_blocksize(BlockDriverState *bs)
> > +{
> > +BDRVNVMeState *s = bs->opaque;
> > +assert(s->blkshift >= 9);
> 
> I think BDRV_SECTOR_BITS is more correct here (this is about what the
> general block layer code expects).  Also, there’s no pain in doing so,
> as you did check against BDRV_SECTOR_SIZE in nvme_identify().
> Max
Of course, thanks!!
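I.e. (sketch):

    assert(s->blkshift >= BDRV_SECTOR_BITS);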

> 
> > +return 1 << s->blkshift;
> > +}
> > +
> > +static int nvme_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> > +{
> > +int64_t blocksize = nvme_get_blocksize(bs);
> > +bsz->phys = blocksize;
> > +bsz->log = blocksize;
> > +return 0;
> >  }
> >  
> >  /* Called with s->dma_map_lock */
> 
> 

Thanks for the review,
Best regards,
Maxim Levitsky




Re: [Qemu-block] [PATCH v3 2/6] block/nvme: fix doorbell stride

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:10 +0200, Max Reitz wrote:
> On 05.07.19 13:09, Max Reitz wrote:
> > On 03.07.19 17:59, Maxim Levitsky wrote:
> > > Fix the math involving non standard doorbell stride
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  block/nvme.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 6d4e7f3d83..52798081b2 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -217,7 +217,7 @@ static NVMeQueuePair 
> > > *nvme_create_queue_pair(BlockDriverState *bs,
> > >  error_propagate(errp, local_err);
> > >  goto fail;
> > >  }
> > > -q->cq.doorbell = &s->regs->doorbells[idx * 2 * s->doorbell_scale + 1];
> > > +q->cq.doorbell = &s->regs->doorbells[(idx * 2 + 1) * s->doorbell_scale];
> > >  
> > >  return q;
> > >  fail:
> > 
> > Hm.  How has this ever worked?
> 
> (Ah, because CAP.DSTRD has probably been 0 in most devices.)
> 
Exactly. I used a cache-line doorbell stride in my nvme-mdev, which broke
this, and I spent an evening figuring out what was going on. I was sure there
was some memory ordering bug or something even weirder before (as usual)
finding that it was a very simple bug.
I tested nvme-mdev with pretty much everything I could get my hands on,
including this driver.


Best regards,
Maxim Levitsky





Re: [Qemu-block] [PATCH v3 1/6] block/nvme: don't touch the completion entries

2019-07-07 Thread Maxim Levitsky
On Fri, 2019-07-05 at 13:03 +0200, Max Reitz wrote:
> On 03.07.19 17:59, Maxim Levitsky wrote:
> > Completion entries are meant to be only read by the host and written by the 
> > device.
> > The driver is supposed to scan the completions from the last point where it 
> > left,
> > and until it sees a completion with non flipped phase bit.
> 
> (Disclaimer: This is the first time I read the nvme driver, or really
> something in the nvme spec.)
> 
> Well, no, completion entries are also meant to be initialized by the
> host.  To me it looks like this is the place where that happens:
> Everything that has been processed by the device is immediately being
> re-initialized.
> 
> Maybe we shouldn’t do that here but in nvme_submit_command().  But
> currently we don’t, and I don’t see any other place where we currently
> initialize the CQ entries.

Hi!
I couldn't find any place in the spec that says that completion entries should
be initialized by the host.
It is probably wise to initialize that area to 0 on driver initialization, but
nothing beyond that.
In particular, that is what the kernel nvme driver does: other than allocating
zeroed memory (and even that I am not sure it does), it doesn't write to the
completion entries.
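
In other words, conceptually all the host should need is something like (sketch):

    /* once, when the queue pair is created */
    memset(q->cq.queue, 0, NVME_QUEUE_SIZE * NVME_CQ_ENTRY_BYTES);

    /* when polling: only look at the phase bit, never write the entry back */
    c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
    if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
        break;   /* not yet written by the device */
    }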

Thanks for the very, very good review, by the way. I will now go over all the
patches and fix things.

Best regards,
Maxim Levitsky

> 
> Max
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/nvme.c | 5 +
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 73ed5fa75f..6d4e7f3d83 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -315,7 +315,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  while (q->inflight) {
> >  int16_t cid;
> > c = (NvmeCqe *)&q->cq.queue[q->cq.head * NVME_CQ_ENTRY_BYTES];
> > -if (!c->cid || (le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> > +if ((le16_to_cpu(c->status) & 0x1) == q->cq_phase) {
> >  break;
> >  }
> >  q->cq.head = (q->cq.head + 1) % NVME_QUEUE_SIZE;
> > @@ -339,10 +339,7 @@ static bool nvme_process_completion(BDRVNVMeState *s, 
> > NVMeQueuePair *q)
> >  qemu_mutex_unlock(&q->lock);
> >  req.cb(req.opaque, nvme_translate_error(c));
> >  qemu_mutex_lock(&q->lock);
> > -c->cid = cpu_to_le16(0);
> >  q->inflight--;
> > -/* Flip Phase Tag bit. */
> > -c->status = cpu_to_le16(le16_to_cpu(c->status) ^ 0x1);
> >  progress = true;
> >  }
> >  if (progress) {
> > 
> 
>