RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Matias Bjorling


> -Original Message-
> From: Klaus Jensen 
> Sent: Tuesday, 29 September 2020 20.00
> To: Keith Busch 
> Cc: Damien Le Moal ; Fam Zheng
> ; Kevin Wolf ; qemu-
> bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> ; qemu-de...@nongnu.org; Alistair Francis
> ; Philippe Mathieu-Daudé ;
> Matias Bjorling 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments
> > > about portability and endianness issues by suggesting that we just
> > > remove persistent state and deal with it later, but persistence is
> > > the killer feature that sets the QEMU emulated device apart from
> > > other emulation options. It is not about using emulation in
> > > production (because yeah, why would you?), but persistence is what
> > > makes it possible to develop and test "zoned FTLs" or something that
> requires recovery at power up.
> > > This is what allows testing of how your host software deals with
> > > opened zones being transitioned to FULL on power up and the
> > > persistent tracking of LBA allocation (in my series) can be used to
> > > properly test error recovery if you lost state in the app.
> >
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support
> > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > looking at this part of the spec, these implicit transitions seem a
> > bit less clear than I expected. I'm not sure it's clear enough to
> > evaluate qemu's compliance right now.
> >
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> >
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
> An NVM Subsystem Reset is initiated when:
>   * Main power is applied to the NVM subsystem;
>   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> field;
>   * Requested using a method defined in the NVMe Management
> Interface specification; or
>   * A vendor specific event occurs.
> 
> In the context of QEMU, "Main power" is tearing down QEMU and starting it
> from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> shutting down the controller does not cause a subsystem reset (and does not
> cause the zones to change state). And since the device does not indicate
> support for the optional NSSR.NSSRC register, that way to initiate a subsystem
> cannot be used.
> 
> The reason for moving to FULL is that write pointer updates are not persisted
> on each advancement, only when the zone state changes. So zones that were
> opened might have valid data, but invalid write pointer.
> So the device transitions them to FULL as it is allowed to.
> 

How about when one must also recover from intermediate states (i.e., 
open/closed upon power loss). For example, I don't hope a real SSD 
implementation transition zones to full when it has thousands of open 
simultaneously. That could be a disaster for the PE cycles, and a lot of media 
going to waste. One would want applications to support that kind of failure 
mode as well. 


RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Matias Bjorling


> -Original Message-
> From: Klaus Jensen 
> Sent: Tuesday, 29 September 2020 20.36
> To: Matias Bjorling 
> Cc: Keith Busch ; Damien Le Moal
> ; Fam Zheng ; Kevin Wolf
> ; qemu-block@nongnu.org; Niklas Cassel
> ; Klaus Jensen ; qemu-
> de...@nongnu.org; Alistair Francis ; Philippe
> Mathieu-Daudé 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> Zoned Namespace Command Set
> 
> On Sep 29 18:17, Matias Bjorling wrote:
> >
> >
> > > -Original Message-
> > > From: Klaus Jensen 
> > > Sent: Tuesday, 29 September 2020 20.00
> > > To: Keith Busch 
> > > Cc: Damien Le Moal ; Fam Zheng
> > > ; Kevin Wolf ; qemu-
> > > bl...@nongnu.org; Niklas Cassel ; Klaus
> > > Jensen ; qemu-de...@nongnu.org; Alistair
> > > Francis ; Philippe Mathieu-Daudé
> > > ; Matias Bjorling 
> > > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > > and Zoned Namespace Command Set
> > >
> > > On Sep 29 10:29, Keith Busch wrote:
> > > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > > It is unmistakably clear that you are invalidating my arguments
> > > > > about portability and endianness issues by suggesting that we
> > > > > just remove persistent state and deal with it later, but
> > > > > persistence is the killer feature that sets the QEMU emulated
> > > > > device apart from other emulation options. It is not about using
> > > > > emulation in production (because yeah, why would you?), but
> > > > > persistence is what makes it possible to develop and test "zoned
> > > > > FTLs" or something that
> > > requires recovery at power up.
> > > > > This is what allows testing of how your host software deals with
> > > > > opened zones being transitioned to FULL on power up and the
> > > > > persistent tracking of LBA allocation (in my series) can be used
> > > > > to properly test error recovery if you lost state in the app.
> > > >
> > > > Hold up -- why does an OPEN zone transition to FULL on power up?
> > > > The spec suggests it should be CLOSED. The spec does appear to
> > > > support going to FULL on a NVM Subsystem Reset, though. Actually,
> > > > now that I'm looking at this part of the spec, these implicit
> > > > transitions seem a bit less clear than I expected. I'm not sure
> > > > it's clear enough to evaluate qemu's compliance right now.
> > > >
> > > > But I don't see what testing these transitions has to do with
> > > > having a persistent state. You can reboot your VM without tearing
> > > > down the running QEMU instance. You can also unbind the driver or
> > > > shutdown the controller within the running operating system. That
> > > > should make those implicit state transitions reachable in order to
> > > > exercise your FTL's recovery.
> > > >
> > >
> > > Oh dear - don't "spec" with me ;)
> > >
> > > NVMe v1.4 Section 7.3.1:
> > >
> > > An NVM Subsystem Reset is initiated when:
> > >   * Main power is applied to the NVM subsystem;
> > >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > > field;
> > >   * Requested using a method defined in the NVMe Management
> > > Interface specification; or
> > >   * A vendor specific event occurs.
> > >
> > > In the context of QEMU, "Main power" is tearing down QEMU and
> > > starting it from scratch. Just like on a "real" host, unbinding the
> > > driver, rebooting or shutting down the controller does not cause a
> > > subsystem reset (and does not cause the zones to change state). And
> > > since the device does not indicate support for the optional
> > > NSSR.NSSRC register, that way to initiate a subsystem cannot be used.
> > >
> > > The reason for moving to FULL is that write pointer updates are not
> > > persisted on each advancement, only when the zone state changes. So
> > > zones that were opened might have valid data, but invalid write pointer.
> > > So the device transitions them to FULL as it is allowed to.
> > >
> >
> > How about when one must also recover from intermediate states (i.e.,
> > open/closed upon power loss). For example, I don't h

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
All,

Let's de-escalate this, please. There's no reason to doubt Klaus wants
to see this to work well, just as everyone else does. We unfortunately
have conflicting proposals posted, and everyone is passionate enough
about their work, but please simmer down.

As I mentioned earlier, I'd like to refocus on the basic implementation
and save the persistent state discussion once the core is solid. After
going through it all, I feel there's enough to discuss there to keep us
busy for little while longer. Additional comments on the code will be
coming from me later today.



RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev



> -Original Message-
> From: Keith Busch 
> Sent: Tuesday, September 29, 2020 3:22 PM
> To: Klaus Jensen 
> Cc: Dmitry Fomichev ; Kevin Wolf
> ; Fam Zheng ; Damien Le Moal
> ; qemu-block@nongnu.org; Niklas Cassel
> ; Klaus Jensen ; qemu-
> de...@nongnu.org; Alistair Francis ; Philippe
> Mathieu-Daudé ; Matias Bjorling
> 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> All,
> 
> Let's de-escalate this, please. There's no reason to doubt Klaus wants
> to see this to work well, just as everyone else does. We unfortunately
> have conflicting proposals posted, and everyone is passionate enough
> about their work, but please simmer down.
> 
> As I mentioned earlier, I'd like to refocus on the basic implementation
> and save the persistent state discussion once the core is solid. After
> going through it all, I feel there's enough to discuss there to keep us
> busy for little while longer. Additional comments on the code will be
> coming from me later today.

OK, I agree with this and I will not be replying to the email prior to this
one it the thread. Let's calm down so we will be able to have a beer at a
conference one day :)

The only one thing that I would like to cover is lack of response to Klaus'
ZNS patchset. Klaus, you are right to complain about it. Since discovering
about the large backlog of NVMe patches that you had pending
(something that we were not aware at the time of publishing our patches),
we made the decision to rebase our series on top of the patches that you
had posted before the publication time of WDC ZNS patchset. Since then,
I got caught in the constant cycle of rebasing our patches on top of your
series and that prevented me from doing much in terms of reviewing of
your commits. Now, once we seem to catch up with the current head of
development, I should be able to do more of this. There is absolutely no
ill will involved :)

Dmitry



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 15:42, Dmitry Fomichev wrote:
> > -Original Message-
> > From: Klaus Jensen 
> > Sent: Monday, September 28, 2020 2:37 AM
> > To: Dmitry Fomichev 
> > Cc: Keith Busch ; Damien Le Moal
> > ; Klaus Jensen ; Kevin
> > Wolf ; Philippe Mathieu-Daudé ;
> > Maxim Levitsky ; Fam Zheng ;
> > Niklas Cassel ; qemu-block@nongnu.org; qemu-
> > de...@nongnu.org; Alistair Francis ; Matias
> > Bjorling 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 02:33, Dmitry Fomichev wrote:
> > > > -Original Message-
> > > > From: Klaus Jensen 
> > > >
> > > > If it really needs to be memory mapped, then I think a hostmem-based
> > > > approach similar to what Andrzej did for PMR is needed (I think that
> > > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > > slightly tricky to get it to work on all platforms AFAIK).
> > >
> > > Ok, it looks that using the HostMemoryBackendFile backend will be
> > > more appropriate. This will remove the need for conditional compile.
> > >
> > > The mmap() portability is pretty decent across software platforms.
> > > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > > able to do so, just without having zone state persistency. Considering
> > > how specialized this stuff is in first place, I estimate the number of 
> > > users
> > > affected by this "limitation" to be exactly zero.
> > >
> > 
> > QEMU is a cross platform project - we should strive for portability.
> > 
> > Alienating developers that use a Windows platform and calling them out
> > as "poor" is not exactly good for the zoned ecosystem.
> > 
> 
> Wow. By bringing up political correctness here you are basically admitting
> the fact that you have no real technical argument here.

I prefer that we support all platforms if and when we can. That's a
technical argument, not a personal one like you those you start using
now.

> The whole Windows issue is red herring that you are using to attack
> the code that is absolutely legit, but comes from a competitor.

I can't even...

> Your initial complaint was that it doesn't compile in mingw and that
> it uses "wrong" API. You have even suggested the API to use. Now, the
> code uses that API and builds fine, but now it's still not good simply
> because you "do not like it". It's a disgrace.
> 

I answered this in a previous reply.

> > > > But really,
> > > > since we do not require memory semantics for this, then I think the
> > > > abstraction is fundamentally wrong.
> > > >
> > >
> > > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > > similar applications, for example -
> > > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> > >
> > 
> > There is nothing fundamentally wrong with mmap. I just think it is the
> > wrong abstraction here (and it limits portability for no good reason).
> > For PMR there is a good reason - it requires memory semantics.
> > 
> 
> We are trying to emulate NVMEe controller NVRAM.  The best abstraction
> for emulating NVRAM would be... NVRAM!
> 

You never brought that up before and sure it could be a fair argument,
except it is not true.

PMR is emulating NVRAM (and requires memory semantics). Persistent state
is not emulating anything. It is an implementation detail.

> > > > I am, of course, blowing my own horn, since my implementation uses a
> > > > portable blockdev for this.
> > > >
> > >
> > > You are making it sound like the entire WDC series relies on this 
> > > approach.
> > > Actually, the persistency is introduced in the second to last patch in the
> > > series and it only adds a couple of lines of code in the i/o path to mark
> > > zones dirty. This is possible because of using mmap() and I find the way
> > > it is done to be quite elegant, not ugly :)
> > >
> > 
> > No, I understand that your implementation works fine without
> > persistance, but persistance is key. That is why my series adds it in
> > the first patch. Without persistence it is just a toy. And the QEMU
> > device is not just an "NVMe-version" of null_blk.
> > 
> > And I don't think I ever called the use of mmap ugly. I called out the
> > physical memory API shenanigans as a hack.
> 

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 18:17, Matias Bjorling wrote:
> 
> 
> > -Original Message-
> > From: Klaus Jensen 
> > Sent: Tuesday, 29 September 2020 20.00
> > To: Keith Busch 
> > Cc: Damien Le Moal ; Fam Zheng
> > ; Kevin Wolf ; qemu-
> > bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> > ; qemu-de...@nongnu.org; Alistair Francis
> > ; Philippe Mathieu-Daudé ;
> > Matias Bjorling 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and
> > Zoned Namespace Command Set
> > 
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments
> > > > about portability and endianness issues by suggesting that we just
> > > > remove persistent state and deal with it later, but persistence is
> > > > the killer feature that sets the QEMU emulated device apart from
> > > > other emulation options. It is not about using emulation in
> > > > production (because yeah, why would you?), but persistence is what
> > > > makes it possible to develop and test "zoned FTLs" or something that
> > requires recovery at power up.
> > > > This is what allows testing of how your host software deals with
> > > > opened zones being transitioned to FULL on power up and the
> > > > persistent tracking of LBA allocation (in my series) can be used to
> > > > properly test error recovery if you lost state in the app.
> > >
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support
> > > going to FULL on a NVM Subsystem Reset, though. Actually, now that I'm
> > > looking at this part of the spec, these implicit transitions seem a
> > > bit less clear than I expected. I'm not sure it's clear enough to
> > > evaluate qemu's compliance right now.
> > >
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > >
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> > An NVM Subsystem Reset is initiated when:
> >   * Main power is applied to the NVM subsystem;
> >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > field;
> >   * Requested using a method defined in the NVMe Management
> > Interface specification; or
> >   * A vendor specific event occurs.
> > 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting it
> > from scratch. Just like on a "real" host, unbinding the driver, rebooting or
> > shutting down the controller does not cause a subsystem reset (and does not
> > cause the zones to change state). And since the device does not indicate
> > support for the optional NSSR.NSSRC register, that way to initiate a 
> > subsystem
> > cannot be used.
> > 
> > The reason for moving to FULL is that write pointer updates are not 
> > persisted
> > on each advancement, only when the zone state changes. So zones that were
> > opened might have valid data, but invalid write pointer.
> > So the device transitions them to FULL as it is allowed to.
> > 
> 
> How about when one must also recover from intermediate states (i.e.,
> open/closed upon power loss). For example, I don't hope a real SSD
> implementation transition zones to full when it has thousands of open
> simultaneously. That could be a disaster for the PE cycles, and a lot
> of media going to waste. One would want applications to support that
> kind of failure mode as well. 

Christ. The WDC Strike Force is really jumping out of lightspeed here.
I'm afraid I don't have an opposing force to engage with. So I'll be
your only boxing bag for the evening.

As Keith just said, "Opened" is not a valid intial state. Didn't you
write the spec? ;) As for Closed, they will be brought up as is.

With that in mind, I'm not sure what you specifically refer to? I'll
gently remind you that the QEMU nvme device is not a real SSD and does
not deal with NAND so it does not really do any "recovering" of
intermediate states on power on if that is what you refer to?


signature.asc
Description: PGP signature


Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 11:15, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> > On Sep 29 10:29, Keith Busch wrote:
> > > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > > It is unmistakably clear that you are invalidating my arguments about
> > > > portability and endianness issues by suggesting that we just remove
> > > > persistent state and deal with it later, but persistence is the killer
> > > > feature that sets the QEMU emulated device apart from other emulation
> > > > options. It is not about using emulation in production (because yeah,
> > > > why would you?), but persistence is what makes it possible to develop
> > > > and test "zoned FTLs" or something that requires recovery at power up.
> > > > This is what allows testing of how your host software deals with opened
> > > > zones being transitioned to FULL on power up and the persistent tracking
> > > > of LBA allocation (in my series) can be used to properly test error
> > > > recovery if you lost state in the app.
> > > 
> > > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > > spec suggests it should be CLOSED. The spec does appear to support going
> > > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > > at this part of the spec, these implicit transitions seem a bit less
> > > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > > compliance right now.
> > > 
> > > But I don't see what testing these transitions has to do with having a
> > > persistent state. You can reboot your VM without tearing down the
> > > running QEMU instance. You can also unbind the driver or shutdown the
> > > controller within the running operating system. That should make those
> > > implicit state transitions reachable in order to exercise your FTL's
> > > recovery.
> > > 
> > 
> > Oh dear - don't "spec" with me ;)
> > 
> > NVMe v1.4 Section 7.3.1:
> > 
> > An NVM Subsystem Reset is initiated when:
> >   * Main power is applied to the NVM subsystem;
> >   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> > field;
> >   * Requested using a method defined in the NVMe Management
> > Interface specification; or
> >   * A vendor specific event occurs.
>  
> Okay. I wish the nvme twg would strip the changelog from the published
> TPs. We have unhelpful statements like this in the ZNS spec:
> 
>   "Default active zones to transition to Closed state on power/controller 
> reset."
> 
> > In the context of QEMU, "Main power" is tearing down QEMU and starting
> > it from scratch. Just like on a "real" host, unbinding the driver,
> > rebooting or shutting down the controller does not cause a subsystem
> > reset (and does not cause the zones to change state). 
> 
> That can't be right. The ZNS spec says:
> 
>   The initial state of a zone state machine is set as a result of:
> a) an NVM Subsystem Reset; or
> b) all controllers in the NVM subsystem reporting Shutdown
>processing complete ((i.e., 10b in the Shutdown Status (SHST)
>register)
> 
> So a CC.SHN had better cause an implicit transition of open zones to
> their "initial" state since 'open' is not a valid initial state.

Oh snap; true, you got me there.


signature.asc
Description: PGP signature


Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 08:00:04PM +0200, Klaus Jensen wrote:
> On Sep 29 10:29, Keith Busch wrote:
> > On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > > It is unmistakably clear that you are invalidating my arguments about
> > > portability and endianness issues by suggesting that we just remove
> > > persistent state and deal with it later, but persistence is the killer
> > > feature that sets the QEMU emulated device apart from other emulation
> > > options. It is not about using emulation in production (because yeah,
> > > why would you?), but persistence is what makes it possible to develop
> > > and test "zoned FTLs" or something that requires recovery at power up.
> > > This is what allows testing of how your host software deals with opened
> > > zones being transitioned to FULL on power up and the persistent tracking
> > > of LBA allocation (in my series) can be used to properly test error
> > > recovery if you lost state in the app.
> > 
> > Hold up -- why does an OPEN zone transition to FULL on power up? The
> > spec suggests it should be CLOSED. The spec does appear to support going
> > to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> > at this part of the spec, these implicit transitions seem a bit less
> > clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> > compliance right now.
> > 
> > But I don't see what testing these transitions has to do with having a
> > persistent state. You can reboot your VM without tearing down the
> > running QEMU instance. You can also unbind the driver or shutdown the
> > controller within the running operating system. That should make those
> > implicit state transitions reachable in order to exercise your FTL's
> > recovery.
> > 
> 
> Oh dear - don't "spec" with me ;)
> 
> NVMe v1.4 Section 7.3.1:
> 
> An NVM Subsystem Reset is initiated when:
>   * Main power is applied to the NVM subsystem;
>   * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
> field;
>   * Requested using a method defined in the NVMe Management
> Interface specification; or
>   * A vendor specific event occurs.
 
Okay. I wish the nvme twg would strip the changelog from the published
TPs. We have unhelpful statements like this in the ZNS spec:

  "Default active zones to transition to Closed state on power/controller 
reset."

> In the context of QEMU, "Main power" is tearing down QEMU and starting
> it from scratch. Just like on a "real" host, unbinding the driver,
> rebooting or shutting down the controller does not cause a subsystem
> reset (and does not cause the zones to change state). 

That can't be right. The ZNS spec says:

  The initial state of a zone state machine is set as a result of:
a) an NVM Subsystem Reset; or
b) all controllers in the NVM subsystem reporting Shutdown
   processing complete ((i.e., 10b in the Shutdown Status (SHST)
   register)

So a CC.SHN had better cause an implicit transition of open zones to
their "initial" state since 'open' is not a valid initial state.



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 10:29, Keith Busch wrote:
> On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> 
> Hold up -- why does an OPEN zone transition to FULL on power up? The
> spec suggests it should be CLOSED. The spec does appear to support going
> to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
> at this part of the spec, these implicit transitions seem a bit less
> clear than I expected. I'm not sure it's clear enough to evaluate qemu's
> compliance right now.
> 
> But I don't see what testing these transitions has to do with having a
> persistent state. You can reboot your VM without tearing down the
> running QEMU instance. You can also unbind the driver or shutdown the
> controller within the running operating system. That should make those
> implicit state transitions reachable in order to exercise your FTL's
> recovery.
> 

Oh dear - don't "spec" with me ;)

NVMe v1.4 Section 7.3.1:

An NVM Subsystem Reset is initiated when:
  * Main power is applied to the NVM subsystem;
  * A value of 4E564D64h ("NVMe") is written to the NSSR.NSSRC
field;
  * Requested using a method defined in the NVMe Management
Interface specification; or
  * A vendor specific event occurs.

In the context of QEMU, "Main power" is tearing down QEMU and starting
it from scratch. Just like on a "real" host, unbinding the driver,
rebooting or shutting down the controller does not cause a subsystem
reset (and does not cause the zones to change state). And since the
device does not indicate support for the optional NSSR.NSSRC register,
that way to initiate a subsystem cannot be used.

The reason for moving to FULL is that write pointer updates are not
persisted on each advancement, only when the zone state changes. So
zones that were opened might have valid data, but invalid write pointer.
So the device transitions them to FULL as it is allowed to.

QED.

> I agree the persistent state provides conveniences for developers. I
> just don't want to gate ZNS enabling on it either since the core design
> doesn't depend on it.

I just don't see why we cant have the icing on the cake when it is
already there :)


signature.asc
Description: PGP signature


Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 11:13:51AM +, Damien Le Moal wrote:
> OK. Then let's move the persistence implementation as the last patch in the
> series. This way, if it is still controversial, it will not block the rest.
> 
> Here is what I propose:
> Dmitry: remove persistence stuff from your patches, address comments and 
> resend.
> Klaus: Rebase your persistence patch(es) with reworked format on top of Dmitry
> series and send.
> 
> That creates a pipeline for reviews and persistence is not a blocker. And I
> agree that other ZNS feature can come after we get all of that done first.
> 
> Thoughts ? Keith ? Would that work for you ?

That works for me. I will have comments for Dmitry's v5, though, so
please wait one more day before considering a respin.



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Keith Busch
On Tue, Sep 29, 2020 at 12:46:33PM +0200, Klaus Jensen wrote:
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

Hold up -- why does an OPEN zone transition to FULL on power up? The
spec suggests it should be CLOSED. The spec does appear to support going
to FULL on a NVM Subsystem Reset, though. Actually, now that I'm looking
at this part of the spec, these implicit transitions seem a bit less
clear than I expected. I'm not sure it's clear enough to evaluate qemu's
compliance right now.

But I don't see what testing these transitions has to do with having a
persistent state. You can reboot your VM without tearing down the
running QEMU instance. You can also unbind the driver or shutdown the
controller within the running operating system. That should make those
implicit state transitions reachable in order to exercise your FTL's
recovery.

I agree the persistent state provides conveniences for developers. I
just don't want to gate ZNS enabling on it either since the core design
doesn't depend on it.



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 29 15:43, Dmitry Fomichev wrote:
> > -Original Message-
> > From: Qemu-block  > bounces+dmitry.fomichev=wdc@nongnu.org> On Behalf Of Klaus
> > Jensen
> > Sent: Tuesday, September 29, 2020 6:47 AM
> > To: Damien Le Moal 
> > Cc: Fam Zheng ; Kevin Wolf ; qemu-
> > bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> > ; qemu-de...@nongnu.org; Alistair Francis
> > ; Keith Busch ; Philippe
> > Mathieu-Daudé ; Matias Bjorling
> > 
> > Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> > and Zoned Namespace Command Set
> > 
> > On Sep 28 22:54, Damien Le Moal wrote:
> > > On 2020/09/29 6:25, Keith Busch wrote:
> > > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > >>> You are making it sound like the entire WDC series relies on this
> > approach.
> > > >>> Actually, the persistency is introduced in the second to last patch 
> > > >>> in the
> > > >>> series and it only adds a couple of lines of code in the i/o path to 
> > > >>> mark
> > > >>> zones dirty. This is possible because of using mmap() and I find the 
> > > >>> way
> > > >>> it is done to be quite elegant, not ugly :)
> > > >>>
> > > >>
> > > >> No, I understand that your implementation works fine without
> > > >> persistance, but persistance is key. That is why my series adds it in
> > > >> the first patch. Without persistence it is just a toy. And the QEMU
> > > >> device is not just an "NVMe-version" of null_blk.
> > > >
> > > > I really think we should be a bit more cautious of commiting to an
> > > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > > looks too easy to have out-of-sync states, or just not being able to
> > > > boot at all if a qemu versions have different on-disk formats.
> > > >
> > > > Is anyone really considering zone emulation for production level stuff
> > > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > > through that: you are just giving yourself all the downsides of a zoned
> > > > block device and none of the benefits. AFAIK, this is provided as a
> > > > development vehicle, closer to a "toy".
> > > >
> > > > I think we should consider trimming this down to a more minimal set that
> > > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > > bells & whistles and flush out the meta data's data marshalling scheme
> > > > for persistence later.
> > >
> > > +1 on this. Removing the persistence also removes the debate on
> > endianess. With
> > > that out of the way, it should be straightforward to get agreement on a
> > series
> > > that can be merged quickly to get developers started with testing ZNS
> > software
> > > with QEMU. That is the most important goal here. 5.9 is around the corner,
> > we
> > > need something for people to get started with ZNS quickly.
> > >
> > 
> > Wait. What. No. Stop!
> > 
> > It is unmistakably clear that you are invalidating my arguments about
> > portability and endianness issues by suggesting that we just remove
> > persistent state and deal with it later, but persistence is the killer
> > feature that sets the QEMU emulated device apart from other emulation
> > options. It is not about using emulation in production (because yeah,
> > why would you?), but persistence is what makes it possible to develop
> > and test "zoned FTLs" or something that requires recovery at power up.
> > This is what allows testing of how your host software deals with opened
> > zones being transitioned to FULL on power up and the persistent tracking
> > of LBA allocation (in my series) can be used to properly test error
> > recovery if you lost state in the app.
> > 
> > Please, work with me on this instead of just removing such an essential
> > feature. Since persistence seems to be the only thing we are really
> > discussing, we should have plenty of time until the soft-freeze to come
> > up with a proper solution on that.
> > 
> > I agree that my version had a format that was pretty ad-hoc and that
> > won&

RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev
> -Original Message-
> From: Qemu-block  bounces+dmitry.fomichev=wdc@nongnu.org> On Behalf Of Klaus
> Jensen
> Sent: Tuesday, September 29, 2020 6:47 AM
> To: Damien Le Moal 
> Cc: Fam Zheng ; Kevin Wolf ; qemu-
> bl...@nongnu.org; Niklas Cassel ; Klaus Jensen
> ; qemu-de...@nongnu.org; Alistair Francis
> ; Keith Busch ; Philippe
> Mathieu-Daudé ; Matias Bjorling
> 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 22:54, Damien Le Moal wrote:
> > On 2020/09/29 6:25, Keith Busch wrote:
> > > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> > >> On Sep 28 02:33, Dmitry Fomichev wrote:
> > >>> You are making it sound like the entire WDC series relies on this
> approach.
> > >>> Actually, the persistency is introduced in the second to last patch in 
> > >>> the
> > >>> series and it only adds a couple of lines of code in the i/o path to 
> > >>> mark
> > >>> zones dirty. This is possible because of using mmap() and I find the way
> > >>> it is done to be quite elegant, not ugly :)
> > >>>
> > >>
> > >> No, I understand that your implementation works fine without
> > >> persistance, but persistance is key. That is why my series adds it in
> > >> the first patch. Without persistence it is just a toy. And the QEMU
> > >> device is not just an "NVMe-version" of null_blk.
> > >
> > > I really think we should be a bit more cautious of commiting to an
> > > on-disk format for the persistent state. Both this and Klaus' persistent
> > > state feels a bit ad-hoc, and with all the other knobs provided, it
> > > looks too easy to have out-of-sync states, or just not being able to
> > > boot at all if a qemu versions have different on-disk formats.
> > >
> > > Is anyone really considering zone emulation for production level stuff
> > > anyway? I can't imagine a real scenario where you'd want put yourself
> > > through that: you are just giving yourself all the downsides of a zoned
> > > block device and none of the benefits. AFAIK, this is provided as a
> > > development vehicle, closer to a "toy".
> > >
> > > I think we should consider trimming this down to a more minimal set that
> > > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > > bells & whistles and flush out the meta data's data marshalling scheme
> > > for persistence later.
> >
> > +1 on this. Removing the persistence also removes the debate on
> endianess. With
> > that out of the way, it should be straightforward to get agreement on a
> series
> > that can be merged quickly to get developers started with testing ZNS
> software
> > with QEMU. That is the most important goal here. 5.9 is around the corner,
> we
> > need something for people to get started with ZNS quickly.
> >
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.
> 
> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentially looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.

Are you insinuating that I somehow took stuff from OCSSD code and try
to claim priority this way? I am not at all that familiar with that code.
And I've already sent you the link to tcmu-runner code that served me
as an inspiration for implementing persistence in WDC patchs

RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Dmitry Fomichev
> -Original Message-
> From: Klaus Jensen 
> Sent: Monday, September 28, 2020 2:37 AM
> To: Dmitry Fomichev 
> Cc: Keith Busch ; Damien Le Moal
> ; Klaus Jensen ; Kevin
> Wolf ; Philippe Mathieu-Daudé ;
> Maxim Levitsky ; Fam Zheng ;
> Niklas Cassel ; qemu-block@nongnu.org; qemu-
> de...@nongnu.org; Alistair Francis ; Matias
> Bjorling 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > > -Original Message-
> > > From: Klaus Jensen 
> > >
> > > If it really needs to be memory mapped, then I think a hostmem-based
> > > approach similar to what Andrzej did for PMR is needed (I think that
> > > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > > slightly tricky to get it to work on all platforms AFAIK).
> >
> > Ok, it looks that using the HostMemoryBackendFile backend will be
> > more appropriate. This will remove the need for conditional compile.
> >
> > The mmap() portability is pretty decent across software platforms.
> > Any poor Windows user who is forced to emulate ZNS on mingw will be
> > able to do so, just without having zone state persistency. Considering
> > how specialized this stuff is in first place, I estimate the number of users
> > affected by this "limitation" to be exactly zero.
> >
> 
> QEMU is a cross platform project - we should strive for portability.
> 
> Alienating developers that use a Windows platform and calling them out
> as "poor" is not exactly good for the zoned ecosystem.
> 

Wow. By bringing up political correctness here you are basically admitting
the fact that you have no real technical argument here. The whole Windows
issue is red herring that you are using to attack the code that is absolutely
legit, but comes from a competitor. Your initial complaint was that it
doesn't compile in mingw and that it uses "wrong" API. You have even
suggested the API to use. Now, the code uses that API and builds fine, but
now it's still not good simply because you "do not like it". It's a disgrace.

> > > But really,
> > > since we do not require memory semantics for this, then I think the
> > > abstraction is fundamentally wrong.
> > >
> >
> > Seriously, what is wrong with using mmap :) ? It is used successfully for
> > similar applications, for example -
> > https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> >
> 
> There is nothing fundamentally wrong with mmap. I just think it is the
> wrong abstraction here (and it limits portability for no good reason).
> For PMR there is a good reason - it requires memory semantics.
> 

We are trying to emulate NVMEe controller NVRAM.  The best abstraction
for emulating NVRAM would be... NVRAM!

> > > I am, of course, blowing my own horn, since my implementation uses a
> > > portable blockdev for this.
> > >
> >
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> >
> 
> No, I understand that your implementation works fine without
> persistance, but persistance is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.
> 
> And I don't think I ever called the use of mmap ugly. I called out the
> physical memory API shenanigans as a hack.
> 
> > > Another issue is the complete lack of endian conversions. Does it
> > > matter? It depends. Will anyone ever use this on a big endian host and
> > > move the meta data backing file to a little endian host? Probably not.
> > > So does it really matter? Probably not, but it is cutting corners.
> > >
> 
> After I had replied this, I considered a follow-up, because there are
> probably QEMU developers that would call me out on this.
> 
> This definitely DOES matter to QEMU.
> 
> >
> > Great point on endianness! Naturally, all file backed values are stored in
> > their native endianness. This way, there is no extra overhead on big endian
> > hardware architectures. Portability concerns can be easily addressed by
> > storing metadata endianness as a byte flag in its header. Then, during
> > initialization, the metadata validation

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Damien Le Moal
On 2020/09/29 19:46, Klaus Jensen wrote:
> On Sep 28 22:54, Damien Le Moal wrote:
>> On 2020/09/29 6:25, Keith Busch wrote:
>>> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
 On Sep 28 02:33, Dmitry Fomichev wrote:
> You are making it sound like the entire WDC series relies on this 
> approach.
> Actually, the persistency is introduced in the second to last patch in the
> series and it only adds a couple of lines of code in the i/o path to mark
> zones dirty. This is possible because of using mmap() and I find the way
> it is done to be quite elegant, not ugly :)
>

 No, I understand that your implementation works fine without
 persistance, but persistance is key. That is why my series adds it in
 the first patch. Without persistence it is just a toy. And the QEMU
 device is not just an "NVMe-version" of null_blk.
>>>
>>> I really think we should be a bit more cautious of commiting to an
>>> on-disk format for the persistent state. Both this and Klaus' persistent
>>> state feels a bit ad-hoc, and with all the other knobs provided, it
>>> looks too easy to have out-of-sync states, or just not being able to
>>> boot at all if a qemu versions have different on-disk formats.
>>>
>>> Is anyone really considering zone emulation for production level stuff
>>> anyway? I can't imagine a real scenario where you'd want put yourself
>>> through that: you are just giving yourself all the downsides of a zoned
>>> block device and none of the benefits. AFAIK, this is provided as a
>>> development vehicle, closer to a "toy".
>>>
>>> I think we should consider trimming this down to a more minimal set that
>>> we *do* agree on and commit for inclusion ASAP. We can iterate all the
>>> bells & whistles and flush out the meta data's data marshalling scheme
>>> for persistence later.
>>
>> +1 on this. Removing the persistence also removes the debate on endianess. 
>> With
>> that out of the way, it should be straightforward to get agreement on a 
>> series
>> that can be merged quickly to get developers started with testing ZNS 
>> software
>> with QEMU. That is the most important goal here. 5.9 is around the corner, we
>> need something for people to get started with ZNS quickly.
>>
> 
> Wait. What. No. Stop!
> 
> It is unmistakably clear that you are invalidating my arguments about
> portability and endianness issues by suggesting that we just remove
> persistent state and deal with it later, but persistence is the killer
> feature that sets the QEMU emulated device apart from other emulation
> options. It is not about using emulation in production (because yeah,
> why would you?), but persistence is what makes it possible to develop
> and test "zoned FTLs" or something that requires recovery at power up.
> This is what allows testing of how your host software deals with opened
> zones being transitioned to FULL on power up and the persistent tracking
> of LBA allocation (in my series) can be used to properly test error
> recovery if you lost state in the app.

I am not invalidating anything. I am in violent agreement with you about the
usefulness of persistence. My point was that I agree with Keith: let's first get
the base emulation in and improve on top of it. And the base emulation does not
need to include persistence and endianess of the saved zone meta for now. The
result of this would still be super useful to have in stable.

Then let's add persistence and others bells and whistles on top (see below).

> Please, work with me on this instead of just removing such an essential
> feature. Since persistence seems to be the only thing we are really
> discussing, we should have plenty of time until the soft-freeze to come
> up with a proper solution on that.
> 
> I agree that my version had a format that was pretty ad-hoc and that
> won't fly - it needs magic and version capabilities like in Dmitry's
> series, which incidentially looks a lot like what we did in the
> OpenChannel implementation, so I agree with the strategy.
> 
> ZNS-wise, the only thing my implementation stores is the zone
> descriptors (in spec-native little-endian format) and the zone
> descriptor extensions. So there are no endian issues with those. The
> allocation tracking bitmap is always stored in little endian, but
> converted to big-endian if running on a big-endian host.
> 
> Let me just conjure something up.
> 
> #define NVME_PSTATE_MAGIC ...
> #define NVME_PSTATE_V11
> 
> typedef struct NvmePstateHeader {
> uint32_t magic;
> uint32_t version;
> 
> uint64_t blk_len;
> 
> uint8_t  lbads;
> uint8_t  iocs;
> 
> uint8_t  rsvd18[3054];
> 
> struct {
> uint64_t zsze;
> uint8_t  zdes;
> } QEMU_PACKED zns;
> 
> uint8_t  rsvd3089[1007];
> } QEMU_PACKED NvmePstateHeader;
> 
> With such a header we have all we need. We can bail out if any
> parameters do not match and similar t

Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-29 Thread Klaus Jensen
On Sep 28 22:54, Damien Le Moal wrote:
> On 2020/09/29 6:25, Keith Busch wrote:
> > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> >> On Sep 28 02:33, Dmitry Fomichev wrote:
> >>> You are making it sound like the entire WDC series relies on this 
> >>> approach.
> >>> Actually, the persistency is introduced in the second to last patch in the
> >>> series and it only adds a couple of lines of code in the i/o path to mark
> >>> zones dirty. This is possible because of using mmap() and I find the way
> >>> it is done to be quite elegant, not ugly :)
> >>>
> >>
> >> No, I understand that your implementation works fine without
> >> persistance, but persistance is key. That is why my series adds it in
> >> the first patch. Without persistence it is just a toy. And the QEMU
> >> device is not just an "NVMe-version" of null_blk.
> > 
> > I really think we should be a bit more cautious of commiting to an
> > on-disk format for the persistent state. Both this and Klaus' persistent
> > state feels a bit ad-hoc, and with all the other knobs provided, it
> > looks too easy to have out-of-sync states, or just not being able to
> > boot at all if a qemu versions have different on-disk formats.
> > 
> > Is anyone really considering zone emulation for production level stuff
> > anyway? I can't imagine a real scenario where you'd want put yourself
> > through that: you are just giving yourself all the downsides of a zoned
> > block device and none of the benefits. AFAIK, this is provided as a
> > development vehicle, closer to a "toy".
> > 
> > I think we should consider trimming this down to a more minimal set that
> > we *do* agree on and commit for inclusion ASAP. We can iterate all the
> > bells & whistles and flush out the meta data's data marshalling scheme
> > for persistence later.
> 
> +1 on this. Removing the persistence also removes the debate on endianess. 
> With
> that out of the way, it should be straightforward to get agreement on a series
> that can be merged quickly to get developers started with testing ZNS software
> with QEMU. That is the most important goal here. 5.9 is around the corner, we
> need something for people to get started with ZNS quickly.
> 

Wait. What. No. Stop!

It is unmistakably clear that you are invalidating my arguments about
portability and endianness issues by suggesting that we just remove
persistent state and deal with it later, but persistence is the killer
feature that sets the QEMU emulated device apart from other emulation
options. It is not about using emulation in production (because yeah,
why would you?), but persistence is what makes it possible to develop
and test "zoned FTLs" or something that requires recovery at power up.
This is what allows testing of how your host software deals with opened
zones being transitioned to FULL on power up and the persistent tracking
of LBA allocation (in my series) can be used to properly test error
recovery if you lost state in the app.

Please, work with me on this instead of just removing such an essential
feature. Since persistence seems to be the only thing we are really
discussing, we should have plenty of time until the soft-freeze to come
up with a proper solution on that.

I agree that my version had a format that was pretty ad-hoc and that
won't fly - it needs magic and version capabilities like in Dmitry's
series, which incidentially looks a lot like what we did in the
OpenChannel implementation, so I agree with the strategy.

ZNS-wise, the only thing my implementation stores is the zone
descriptors (in spec-native little-endian format) and the zone
descriptor extensions. So there are no endian issues with those. The
allocation tracking bitmap is always stored in little endian, but
converted to big-endian if running on a big-endian host.

Let me just conjure something up.

#define NVME_PSTATE_MAGIC ...
#define NVME_PSTATE_V11

typedef struct NvmePstateHeader {
uint32_t magic;
uint32_t version;

uint64_t blk_len;

uint8_t  lbads;
uint8_t  iocs;

uint8_t  rsvd18[3054];

struct {
uint64_t zsze;
uint8_t  zdes;
} QEMU_PACKED zns;

uint8_t  rsvd3089[1007];
} QEMU_PACKED NvmePstateHeader;

With such a header we have all we need. We can bail out if any
parameters do not match and similar to nvme data structures it contains
reserved areas for future use. I'll be posting a v2 with this. If this
still feels too ad-hoc, we can be inspired by QCOW2 and the "extension"
feature.

I can agree that we drop other optional features like zone excursions
and the reset/finish recommended limit simulation, but PLEASE DO NOT
remove persistence and upstream a half-baked version when we are so
close and have time to get it right.


signature.asc
Description: PGP signature


Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-28 Thread Damien Le Moal
On 2020/09/29 6:25, Keith Busch wrote:
> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
>> On Sep 28 02:33, Dmitry Fomichev wrote:
>>> You are making it sound like the entire WDC series relies on this approach.
>>> Actually, the persistency is introduced in the second to last patch in the
>>> series and it only adds a couple of lines of code in the i/o path to mark
>>> zones dirty. This is possible because of using mmap() and I find the way
>>> it is done to be quite elegant, not ugly :)
>>>
>>
>> No, I understand that your implementation works fine without
>> persistance, but persistance is key. That is why my series adds it in
>> the first patch. Without persistence it is just a toy. And the QEMU
>> device is not just an "NVMe-version" of null_blk.
> 
> I really think we should be a bit more cautious of commiting to an
> on-disk format for the persistent state. Both this and Klaus' persistent
> state feels a bit ad-hoc, and with all the other knobs provided, it
> looks too easy to have out-of-sync states, or just not being able to
> boot at all if a qemu versions have different on-disk formats.
> 
> Is anyone really considering zone emulation for production level stuff
> anyway? I can't imagine a real scenario where you'd want put yourself
> through that: you are just giving yourself all the downsides of a zoned
> block device and none of the benefits. AFAIK, this is provided as a
> development vehicle, closer to a "toy".
> 
> I think we should consider trimming this down to a more minimal set that
> we *do* agree on and commit for inclusion ASAP. We can iterate all the
> bells & whistles and flush out the meta data's data marshalling scheme
> for persistence later.

+1 on this. Removing the persistence also removes the debate on endianess. With
that out of the way, it should be straightforward to get agreement on a series
that can be merged quickly to get developers started with testing ZNS software
with QEMU. That is the most important goal here. 5.9 is around the corner, we
need something for people to get started with ZNS quickly.


-- 
Damien Le Moal
Western Digital Research



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-28 Thread Keith Busch
On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote:
> On Sep 28 02:33, Dmitry Fomichev wrote:
> > You are making it sound like the entire WDC series relies on this approach.
> > Actually, the persistency is introduced in the second to last patch in the
> > series and it only adds a couple of lines of code in the i/o path to mark
> > zones dirty. This is possible because of using mmap() and I find the way
> > it is done to be quite elegant, not ugly :)
> > 
> 
> No, I understand that your implementation works fine without
> persistance, but persistance is key. That is why my series adds it in
> the first patch. Without persistence it is just a toy. And the QEMU
> device is not just an "NVMe-version" of null_blk.

I really think we should be a bit more cautious of commiting to an
on-disk format for the persistent state. Both this and Klaus' persistent
state feels a bit ad-hoc, and with all the other knobs provided, it
looks too easy to have out-of-sync states, or just not being able to
boot at all if a qemu versions have different on-disk formats.

Is anyone really considering zone emulation for production level stuff
anyway? I can't imagine a real scenario where you'd want put yourself
through that: you are just giving yourself all the downsides of a zoned
block device and none of the benefits. AFAIK, this is provided as a
development vehicle, closer to a "toy".

I think we should consider trimming this down to a more minimal set that
we *do* agree on and commit for inclusion ASAP. We can iterate all the
bells & whistles and flush out the meta data's data marshalling scheme
for persistence later.



Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-27 Thread Klaus Jensen
On Sep 28 02:33, Dmitry Fomichev wrote:
> > -Original Message-
> > From: Klaus Jensen 
> >
> > If it really needs to be memory mapped, then I think a hostmem-based
> > approach similar to what Andrzej did for PMR is needed (I think that
> > will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> > slightly tricky to get it to work on all platforms AFAIK).
> 
> Ok, it looks that using the HostMemoryBackendFile backend will be
> more appropriate. This will remove the need for conditional compile.
> 
> The mmap() portability is pretty decent across software platforms.
> Any poor Windows user who is forced to emulate ZNS on mingw will be
> able to do so, just without having zone state persistency. Considering
> how specialized this stuff is in first place, I estimate the number of users
> affected by this "limitation" to be exactly zero.
> 

QEMU is a cross platform project - we should strive for portability.

Alienating developers that use a Windows platform and calling them out
as "poor" is not exactly good for the zoned ecosystem.

> > But really,
> > since we do not require memory semantics for this, then I think the
> > abstraction is fundamentally wrong.
> > 
> 
> Seriously, what is wrong with using mmap :) ? It is used successfully for
> similar applications, for example -
> https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c
> 

There is nothing fundamentally wrong with mmap. I just think it is the
wrong abstraction here (and it limits portability for no good reason).
For PMR there is a good reason - it requires memory semantics.

> > I am, of course, blowing my own horn, since my implementation uses a
> > portable blockdev for this.
> > 
> 
> You are making it sound like the entire WDC series relies on this approach.
> Actually, the persistency is introduced in the second to last patch in the
> series and it only adds a couple of lines of code in the i/o path to mark
> zones dirty. This is possible because of using mmap() and I find the way
> it is done to be quite elegant, not ugly :)
> 

No, I understand that your implementation works fine without
persistance, but persistance is key. That is why my series adds it in
the first patch. Without persistence it is just a toy. And the QEMU
device is not just an "NVMe-version" of null_blk.

And I don't think I ever called the use of mmap ugly. I called out the
physical memory API shenanigans as a hack.

> > Another issue is the complete lack of endian conversions. Does it
> > matter? It depends. Will anyone ever use this on a big endian host and
> > move the meta data backing file to a little endian host? Probably not.
> > So does it really matter? Probably not, but it is cutting corners.
> > 

After I had replied this, I considered a follow-up, because there are
probably QEMU developers that would call me out on this.

This definitely DOES matter to QEMU.

> 
> Great point on endianness! Naturally, all file backed values are stored in
> their native endianness. This way, there is no extra overhead on big endian
> hardware architectures. Portability concerns can be easily addressed by
> storing metadata endianness as a byte flag in its header. Then, during
> initialization, the metadata validation code can detect the possible
> discrepancy in endianness and automatically convert the metadata to the
> endianness of the host. This part is out of scope of this series, but I would
> be able to contribute such a solution as an enhancement in the future.
> 

It is not out of scope. I don't see why we should merge something that
is arguably buggy.

Bottomline is that I just don't see why we should accept an
implementation that

  a) excludes some platforms (Windows) from using persistence; and
  b) contains endianness conversion issues

when there is a portable implementation posted that at least tries to
convert endianness as needed.


signature.asc
Description: PGP signature


RE: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-27 Thread Dmitry Fomichev
> -Original Message-
> From: Klaus Jensen 
> Sent: Thursday, September 24, 2020 5:08 PM
> To: Dmitry Fomichev 
> Cc: Keith Busch ; Klaus Jensen
> ; Kevin Wolf ; Philippe
> Mathieu-Daudé ; Maxim Levitsky
> ; Fam Zheng ; Niklas Cassel
> ; Damien Le Moal ;
> qemu-block@nongnu.org; qemu-de...@nongnu.org; Alistair Francis
> ; Matias Bjorling 
> Subject: Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types
> and Zoned Namespace Command Set
> 
> On Sep 24 03:20, Dmitry Fomichev wrote:
> > v3 -> v4
> >
> >  - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
> >to a zone happen at the new write pointer variable, zone->w_ptr,
> >that is advanced right after submitting the backend i/o. The existing
> >zone->d.wp variable is updated upon the successful write completion
> >and it is used for zone reporting. Some code has been split from
> >nvme_finalize_zoned_write() function to a new function,
> >nvme_advance_zone_wp().
> >
> 
> Same approach that I've used, +1.
> 
> >  - Make the code compile under mingw. Switch to using QEMU API for
> >mmap/msync, i.e. memory_region...(). Since mmap is not available in
> >mingw (even though there is mman-win32 library available on Github),
> >conditional compilation is added around these calls to avoid
> >undefined symbols under mingw. A better fix would be to add stub
> >functions to softmmu/memory.c for the case when CONFIG_POSIX is not
> >defined, but such change is beyond the scope of this patchset and it
> >can be made in a separate patch.
> >
> 
> E.
> 
> This feels like a hack or at the very least an abuse of the physical
> memory management API.
> 
> If it really needs to be memory mapped, then I think a hostmem-based
> approach similar to what Andrzej did for PMR is needed (I think that
> will get rid of the CONFIG_POSIX ifdef at least, but still leave it
> slightly tricky to get it to work on all platforms AFAIK).

Ok, it looks that using the HostMemoryBackendFile backend will be
more appropriate. This will remove the need for conditional compile.

The mmap() portability is pretty decent across software platforms.
Any poor Windows user who is forced to emulate ZNS on mingw will be
able to do so, just without having zone state persistency. Considering
how specialized this stuff is in first place, I estimate the number of users
affected by this "limitation" to be exactly zero.

> But really,
> since we do not require memory semantics for this, then I think the
> abstraction is fundamentally wrong.
> 

Seriously, what is wrong with using mmap :) ? It is used successfully for
similar applications, for example -
https://github.com/open-iscsi/tcmu-runner/blob/master/file_zbc.c

> I am, of course, blowing my own horn, since my implementation uses a
> portable blockdev for this.
> 

You are making it sound like the entire WDC series relies on this approach.
Actually, the persistency is introduced in the second to last patch in the
series and it only adds a couple of lines of code in the i/o path to mark
zones dirty. This is possible because of using mmap() and I find the way
it is done to be quite elegant, not ugly :)

> Another issue is the complete lack of endian conversions. Does it
> matter? It depends. Will anyone ever use this on a big endian host and
> move the meta data backing file to a little endian host? Probably not.
> So does it really matter? Probably not, but it is cutting corners.
> 

Great point on endianness! Naturally, all file backed values are stored in
their native endianness. This way, there is no extra overhead on big endian
hardware architectures. Portability concerns can be easily addressed by
storing metadata endianness as a byte flag in its header. Then, during
initialization, the metadata validation code can detect the possible
discrepancy in endianness and automatically convert the metadata to the
endianness of the host. This part is out of scope of this series, but I would
be able to contribute such a solution as an enhancement in the future.

> >  - Make the list of review comments addressed in v2 of the series
> >(see below).
> >
> 
> Very detailed! Thanks!


Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-09-24 Thread Klaus Jensen
On Sep 24 03:20, Dmitry Fomichev wrote:
> v3 -> v4
> 
>  - Fix bugs introduced in v2/v3 for QD > 1 operation. Now, all writes
>to a zone happen at the new write pointer variable, zone->w_ptr,
>that is advanced right after submitting the backend i/o. The existing
>zone->d.wp variable is updated upon the successful write completion
>and it is used for zone reporting. Some code has been split from
>nvme_finalize_zoned_write() function to a new function,
>nvme_advance_zone_wp().
> 

Same approach that I've used, +1.

>  - Make the code compile under mingw. Switch to using QEMU API for
>mmap/msync, i.e. memory_region...(). Since mmap is not available in
>mingw (even though there is mman-win32 library available on Github),
>conditional compilation is added around these calls to avoid
>undefined symbols under mingw. A better fix would be to add stub
>functions to softmmu/memory.c for the case when CONFIG_POSIX is not
>defined, but such change is beyond the scope of this patchset and it
>can be made in a separate patch.
> 

E.

This feels like a hack or at the very least an abuse of the physical
memory management API.

If it really needs to be memory mapped, then I think a hostmem-based
approach similar to what Andrzej did for PMR is needed (I think that
will get rid of the CONFIG_POSIX ifdef at least, but still leave it
slightly tricky to get it to work on all platforms AFAIK). But really,
since we do not require memory semantics for this, then I think the
abstraction is fundamentally wrong.

I am, of course, blowing my own horn, since my implementation uses a
portable blockdev for this.

Another issue is the complete lack of endian conversions. Does it
matter? It depends. Will anyone ever use this on a big endian host and
move the meta data backing file to a little endian host? Probably not.
So does it really matter? Probably not, but it is cutting corners.

>  - Make the list of review comments addressed in v2 of the series
>(see below).
> 

Very detailed! Thanks!


signature.asc
Description: PGP signature