[SeaBIOS] Re: [PATCH] nvme: fix LBA format data structure

2022-02-02 Thread Alexander Graf via SeaBIOS



On 02.02.22 16:24, Kevin O'Connor wrote:

On Wed, Feb 02, 2022 at 11:50:50AM +0100, Alexander Graf wrote:

On 02.02.22 02:52, Kevin O'Connor wrote:

On Tue, Feb 01, 2022 at 08:39:10PM +0100, Florian Larysch wrote:

On Thu, Jan 27, 2022 at 11:37:52AM -0500, Kevin O'Connor wrote:

Thanks.  I don't know enough about NVMe to review this patch though.
Maybe Julian or Alex could comment?

Happy to hear their comments on this. However, I can also try to explain
the reasoning in a bit more detail.

This follows from the NVMe spec[1]: The Identify command returns a data
structure that contains packed LBAF records (Figure 249, starting at
offset 131). This is represented by struct nvme_identify_ns in the
SeaBIOS code.

Figure 250 gives the structure of these records and this is where the
aforementioned discrepancy lies.

Thanks.  I agree that this should be fixed in SeaBIOS.

However, your patch removes the "reserved" field, which doesn't look
correct.  It sounds like the "res" struct member should remain, but
the "lbads" and "rp" members should be turned into a new "lbads_rp"
member.  Also, the nvme.c code should be updated so that the read of
lbads performs the proper bit masking.


Looking at the spec, lbads is a full, well-aligned 8 bits. The collision here
is between rp and res, with rp taking only 2 bits and res the remaining 6.

Ah, I misread the spec.  So, the original patch removing the "res"
struct member should be fine.



Yes, the original patch works too. There is currently no user of the rp 
field, so nothing to mask against.






Is using bitfields acceptable in SeaBIOS? If so, the patch below should
automatically give us masking as well.

I'd recommend against using C bitfields to access hardware registers.



This is an in-RAM data structure. But in any case, let's just go with 
the original patch.



Reviewed-by: Alexander Graf 

Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


___
SeaBIOS mailing list -- seabios@seabios.org
To unsubscribe send an email to seabios-le...@seabios.org


[SeaBIOS] Re: [PATCH] nvme: fix LBA format data structure

2022-02-02 Thread Alexander Graf via SeaBIOS


On 02.02.22 02:52, Kevin O'Connor wrote:

On Tue, Feb 01, 2022 at 08:39:10PM +0100, Florian Larysch wrote:

On Thu, Jan 27, 2022 at 11:37:52AM -0500, Kevin O'Connor wrote:

Thanks.  I don't know enough about NVMe to review this patch though.
Maybe Julian or Alex could comment?

Happy to hear their comments on this. However, I can also try to explain
the reasoning in a bit more detail.

This follows from the NVMe spec[1]: The Identify command returns a data
structure that contains packed LBAF records (Figure 249, starting at
offset 131). This is represented by struct nvme_identify_ns in the
SeaBIOS code.

Figure 250 gives the structure of these records and this is where the
aforementioned discrepancy lies.

Thanks.  I agree that this should be fixed in SeaBIOS.

However, your patch removes the "reserved" field, which doesn't look
correct.  It sounds like the "res" struct member should remain, but
the "lbads" and "rp" members should be turned into a new "lbads_rp"
member.  Also, the nvme.c code should be updated so that the read of
lbads performs the proper bit masking.



Looking at the spec, lbads is a full, well-aligned 8 bits. The collision 
here is between rp and res, with rp taking only 2 bits and res the 
remaining 6. Is using bitfields acceptable in SeaBIOS? If so, the patch 
below should automatically give us masking as well.


Alex

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index f9c807e..9b3015b 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -145,8 +145,8 @@ struct nvme_identify_ns_list {
 struct nvme_lba_format {
 u16 ms;
 u8  lbads;
-    u8  rp;
-    u8  res;
+    u8  rp : 2;
+    u8  res : 6;
 };

 struct nvme_identify_ns {






[SeaBIOS] Re: [PATCHv2 0/6] Rework NVME page transfers and avoid fsegment allocation

2022-01-21 Thread Alexander Graf via SeaBIOS



On 21.01.22 17:48, Kevin O'Connor wrote:

This is a resend of the previous series.  Changes since last time:

* Patch 1: Fix function comment on nvme_io_xfer()
* Patch 3-5: Added "goto bounce" to nvme_io_xfer() as suggested by Alex
* Patch 6: Added separate "single dma bounce buffer" patch to this series
* Patch 6: Add checking for malloc failure

-Kevin



Thanks a lot for cleaning up that code! I agree that it's a lot more 
readable after your patches :)



Alex







[SeaBIOS] Re: [PATCHv2 5/6] nvme: Build the page list in the existing dma buffer

2022-01-21 Thread Alexander Graf via SeaBIOS



On 21.01.22 17:48, Kevin O'Connor wrote:

Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
introduced multi-page requests using the NVMe PRP mechanism. To store the
list and "first page to write to" hints, it added fields to the NVMe
namespace struct.

Unfortunately, that struct resides in fseg which is read-only at runtime.
While KVM ignores the read-only part and allows writes, real hardware and
TCG adhere to the semantics and ignore writes to the fseg region. The net
effect of that is that reads and writes were always happening on address 0,
unless they went through the bounce buffer logic.

This patch builds the PRP maintenance data in the existing "dma bounce
buffer" and only builds it when needed.

Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
Reported-by: Matt DeVillier 
Signed-off-by: Alexander Graf 
Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 


Alex







[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer

2022-01-21 Thread Alexander Graf via SeaBIOS



On 21.01.22 17:54, Kevin O'Connor wrote:

On Fri, Jan 21, 2022 at 05:41:17PM +0100, Alexander Graf wrote:

On 21.01.22 17:02, Kevin O'Connor wrote:

On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:

On 19.01.22 19:45, Kevin O'Connor wrote:

if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
-/* We need to describe more than 2 pages, rely on PRP List */
-prp2 = ns->prpl;
+/* We need to describe more than 2 pages, build PRP List */
+u32 prpl_len = 0;
+u64 *prpl = (void*)ns->dma_buffer;

At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
or can we just use the stack? Will the array be guaranteed 64bit aligned or
would we need to add an attribute to it?

  u64 prpl[NVME_MAX_PRPL_ENTRIES];

Either way, I don't have strong feelings one way or another. It just seems
more natural to keep the bounce buffer purely as bounce buffer :).

In general it's possible to DMA from the stack (eg,
src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
stack alignment so it would need to be done manually.  Also, I'm not
sure how tight the nvme request completion code is - if it returns
early for some reason it could cause havoc (device dma into random
memory).

Another option might be to make a single global prpl array, but that
would consume the memory even if nvme isn't in use.

FWIW though, I don't see a harm in the single ( u64 *prpl =
(void*)ns->dma_buffer ) line.


Fair, works for me. But then we probably want to also adjust
MAX_PRPL_ENTRIES to match the full page we now have available, no?


I don't think a BIOS disk request can be over 64KiB so I don't think
it matters.  I got the impression the current checks are just "sanity
checks".  I don't see a harm in keeping the sanity check and that size
is as good as any other as far as I know.



It's a bit of both: Sanity checks and code that potentially can be 
reused outside of the SeaBIOS context. So I would try as hard as 
possible to not make assumptions that we can only handle max 64kb I/O 
requests. Plus, if we really wanted to, we could even introduce a new 
SeaBIOS specific INT to do larger I/O requests.


I won't make a fuss if you want to keep it at 64kb max request size 
though :).



Alex







[SeaBIOS] Re: [PATCHv2 3/6] nvme: Convert nvme_build_prpl() to nvme_prpl_xfer()

2022-01-21 Thread Alexander Graf via SeaBIOS



On 21.01.22 17:48, Kevin O'Connor wrote:

Rename nvme_build_prpl() to nvme_prpl_xfer() and directly invoke
nvme_io_xfer() or nvme_bounce_xfer() from that function.

Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 

Alex



---
  src/hw/nvme-int.h |  1 -
  src/hw/nvme.c | 46 --
  2 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index a4c1555..9564c17 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -125,7 +125,6 @@ struct nvme_namespace {

  /* Page List */
  u32 prpl_len;
-void *prp1;
  u64 prpl[NVME_MAX_PRPL_ENTRIES];
  };

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index d656e9b..eee7d17 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -498,10 +498,13 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 
base)
  return 0;
  }

-static int nvme_build_prpl(struct nvme_namespace *ns, void *op_buf, u16 count)
+// Transfer data using page list (if applicable)
+static int
+nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
+   int write)
  {
  int first_page = 1;
-u32 base = (long)op_buf;
+u32 base = (long)buf;
  s32 size;

  if (count > ns->max_req_size)
@@ -511,31 +514,32 @@ static int nvme_build_prpl(struct nvme_namespace *ns, 
void *op_buf, u16 count)

  size = count * ns->block_size;
  /* Special case for transfers that fit into PRP1, but are unaligned */
-if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
-ns->prp1 = op_buf;
-return count;
-}
+if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
+return nvme_io_xfer(ns, lba, buf, count, write);

  /* Every request has to be page aligned */
  if (base & ~NVME_PAGE_MASK)
-return 0;
+goto bounce;

  /* Make sure a full block fits into the last chunk */
  if (size & (ns->block_size - 1ULL))
-return 0;
+goto bounce;

  for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
  if (first_page) {
  /* First page is special */
-ns->prp1 = (void*)base;
  first_page = 0;
  continue;
  }
  if (nvme_add_prpl(ns, base))
-return 0;
+goto bounce;
  }

-return count;
+return nvme_io_xfer(ns, lba, buf, count, write);
+
+bounce:
+/* Use bounce buffer to make transfer */
+return nvme_bounce_xfer(ns, lba, buf, count, write);
  }

  static int
@@ -736,24 +740,14 @@ nvme_scan(void)
  static int
  nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
  {
-u16 i, blocks;
-
+int i;
  for (i = 0; i < op->count;) {
  u16 blocks_remaining = op->count - i;
  char *op_buf = op->buf_fl + i * ns->block_size;
-
-blocks = nvme_build_prpl(ns, op_buf, blocks_remaining);
-if (blocks) {
-int res = nvme_io_xfer(ns, op->lba + i, ns->prp1, blocks, write);
-if (res < 0)
-return DISK_RET_EBADTRACK;
-} else {
-int res = nvme_bounce_xfer(ns, op->lba + i, op_buf, blocks, write);
-if (res < 0)
-return DISK_RET_EBADTRACK;
-blocks = res;
-}
-
+int blocks = nvme_prpl_xfer(ns, op->lba + i, op_buf,
+blocks_remaining, write);
+if (blocks < 0)
+return DISK_RET_EBADTRACK;
  i += blocks;
  }

--
2.31.1



[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer

2022-01-21 Thread Alexander Graf via SeaBIOS



On 21.01.22 17:02, Kevin O'Connor wrote:

On Fri, Jan 21, 2022 at 03:28:33PM +0100, Alexander Graf wrote:

On 19.01.22 19:45, Kevin O'Connor wrote:

Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
introduced multi-page requests using the NVMe PRP mechanism. To store the
list and "first page to write to" hints, it added fields to the NVMe
namespace struct.

Unfortunately, that struct resides in fseg which is read-only at runtime.
While KVM ignores the read-only part and allows writes, real hardware and
TCG adhere to the semantics and ignore writes to the fseg region. The net
effect of that is that reads and writes were always happening on address 0,
unless they went through the bounce buffer logic.

This patch builds the PRP maintenance data in the existing "dma bounce
buffer" and only builds it when needed.

Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
Reported-by: Matt DeVillier 
Signed-off-by: Alexander Graf 
Signed-off-by: Kevin O'Connor 
---
   src/hw/nvme-int.h |  6 --
   src/hw/nvme.c | 51 +--
   2 files changed, 18 insertions(+), 39 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9564c17..f0d717d 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,8 +10,6 @@
   #include "types.h" // u32
   #include "pcidevice.h" // struct pci_device

-#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
-
   /* Data structures */

   /* The register file of a NVMe host controller. This struct follows the 
naming
@@ -122,10 +120,6 @@ struct nvme_namespace {

   /* Page aligned buffer of size NVME_PAGE_SIZE. */
   char *dma_buffer;
-
-/* Page List */
-u32 prpl_len;
-u64 prpl[NVME_MAX_PRPL_ENTRIES];
   };

   /* Data structures for NVMe admin identify commands */
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 3a73784..39b9138 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
   return res;
   }

-static void nvme_reset_prpl(struct nvme_namespace *ns)
-{
-ns->prpl_len = 0;
-}
-
-static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
-{
-if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
-return -1;
-
-ns->prpl[ns->prpl_len++] = base;
-
-return 0;
-}
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */

   // Transfer data using page list (if applicable)
   static int
   nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
  int write)
   {
-int first_page = 1;
   u32 base = (long)buf;
   s32 size;

   if (count > ns->max_req_size)
   count = ns->max_req_size;

-nvme_reset_prpl(ns);
-
   size = count * ns->block_size;
   /* Special case for transfers that fit into PRP1, but are unaligned */
   if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
@@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
   if (size & (ns->block_size - 1ULL))
   return nvme_bounce_xfer(ns, lba, buf, count, write);

-for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
-if (first_page) {
-/* First page is special */
-first_page = 0;
-continue;
-}
-if (nvme_add_prpl(ns, base))
-return nvme_bounce_xfer(ns, lba, buf, count, write);
-}
-
-void *prp2;
   if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
-/* We need to describe more than 2 pages, rely on PRP List */
-prp2 = ns->prpl;
+/* We need to describe more than 2 pages, build PRP List */
+u32 prpl_len = 0;
+u64 *prpl = (void*)ns->dma_buffer;


At 15*8=120 bytes of data, do we even need to put the prpl into dma_buffer
or can we just use the stack? Will the array be guaranteed 64bit aligned or
would we need to add an attribute to it?

 u64 prpl[NVME_MAX_PRPL_ENTRIES];

Either way, I don't have strong feelings one way or another. It just seems
more natural to keep the bounce buffer purely as bounce buffer :).

In general it's possible to DMA from the stack (eg,
src/hw/ohci.c:ohci_send_pipe() ).  However, SeaBIOS doesn't guarantee
stack alignment so it would need to be done manually.  Also, I'm not
sure how tight the nvme request completion code is - if it returns
early for some reason it could cause havoc (device dma into random
memory).

Another option might be to make a single global prpl array, but that
would consume the memory even if nvme isn't in use.

FWIW though, I don't see a harm in the single ( u64 *prpl =
(void*)ns->dma_buffer ) line.



Fair, works for me. But then we probably want to also adjust 
MAX_PRPL_ENTRIES to match the full page we now have available, no?



Alex






[SeaBIOS] Re: [PATCH] nvme: Only allocate one dma bounce buffer for all nvme drives

2022-01-21 Thread Alexander Graf via SeaBIOS



On 19.01.22 20:11, Kevin O'Connor wrote:

There is no need to create multiple dma bounce buffers as the BIOS
disk code isn't reentrant capable.

Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 

Alex




---

This is a minor cleanup on top of the previous nvme code series.

-Kevin

---
src/hw/nvme-int.h |  3 ---
  src/hw/nvme.c | 14 +-
  2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index f0d717d..f9c807e 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,9 +117,6 @@ struct nvme_namespace {
  u32 block_size;
  u32 metadata_size;
  u32 max_req_size;
-
-/* Page aligned buffer of size NVME_PAGE_SIZE. */
-char *dma_buffer;
  };

  /* Data structures for NVMe admin identify commands */
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 39b9138..0ba981c 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -20,6 +20,9 @@
  #include "nvme.h"
  #include "nvme-int.h"

+// Page aligned "dma bounce buffer" of size NVME_PAGE_SIZE in high memory
+static void *nvme_dma_buffer;
+
  static void *
  zalloc_page_aligned(struct zone_s *zone, u32 size)
  {
@@ -294,7 +297,8 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, u32 ns_idx, u8 mdts)
  ns->max_req_size = -1U;
  }

-ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
+if (!nvme_dma_buffer)
+nvme_dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);

  char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
"blocks + %u-byte metadata)",
@@ -460,12 +464,12 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  u16 blocks = count < max_blocks ? count : max_blocks;

  if (write)
-memcpy(ns->dma_buffer, buf, blocks * ns->block_size);
+memcpy(nvme_dma_buffer, buf, blocks * ns->block_size);

-int res = nvme_io_xfer(ns, lba, ns->dma_buffer, NULL, blocks, write);
+int res = nvme_io_xfer(ns, lba, nvme_dma_buffer, NULL, blocks, write);

  if (!write && res >= 0)
-memcpy(buf, ns->dma_buffer, res * ns->block_size);
+memcpy(buf, nvme_dma_buffer, res * ns->block_size);

  return res;
  }
@@ -499,7 +503,7 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
  /* We need to describe more than 2 pages, build PRP List */
  u32 prpl_len = 0;
-u64 *prpl = (void*)ns->dma_buffer;
+u64 *prpl = nvme_dma_buffer;
  int first_page = 1;
  for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
  if (first_page) {
--
2.31.1







[SeaBIOS] Re: [PATCH 5/5] nvme: Build the page list in the existing dma buffer

2022-01-21 Thread Alexander Graf via SeaBIOS


On 19.01.22 19:45, Kevin O'Connor wrote:

Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
introduced multi-page requests using the NVMe PRP mechanism. To store the
list and "first page to write to" hints, it added fields to the NVMe
namespace struct.

Unfortunately, that struct resides in fseg which is read-only at runtime.
While KVM ignores the read-only part and allows writes, real hardware and
TCG adhere to the semantics and ignore writes to the fseg region. The net
effect of that is that reads and writes were always happening on address 0,
unless they went through the bounce buffer logic.

This patch builds the PRP maintenance data in the existing "dma bounce
buffer" and only builds it when needed.

Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
Reported-by: Matt DeVillier 
Signed-off-by: Alexander Graf 
Signed-off-by: Kevin O'Connor 
---
  src/hw/nvme-int.h |  6 --
  src/hw/nvme.c | 51 +--
  2 files changed, 18 insertions(+), 39 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9564c17..f0d717d 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,8 +10,6 @@
  #include "types.h" // u32
  #include "pcidevice.h" // struct pci_device

-#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
-
  /* Data structures */

  /* The register file of a NVMe host controller. This struct follows the naming
@@ -122,10 +120,6 @@ struct nvme_namespace {

  /* Page aligned buffer of size NVME_PAGE_SIZE. */
  char *dma_buffer;
-
-/* Page List */
-u32 prpl_len;
-u64 prpl[NVME_MAX_PRPL_ENTRIES];
  };

  /* Data structures for NVMe admin identify commands */
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 3a73784..39b9138 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -470,35 +470,19 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  return res;
  }

-static void nvme_reset_prpl(struct nvme_namespace *ns)
-{
-ns->prpl_len = 0;
-}
-
-static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
-{
-if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
-return -1;
-
-ns->prpl[ns->prpl_len++] = base;
-
-return 0;
-}
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */

  // Transfer data using page list (if applicable)
  static int
  nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
 int write)
  {
-int first_page = 1;
  u32 base = (long)buf;
  s32 size;

  if (count > ns->max_req_size)
  count = ns->max_req_size;

-nvme_reset_prpl(ns);
-
  size = count * ns->block_size;
  /* Special case for transfers that fit into PRP1, but are unaligned */
  if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
@@ -512,28 +496,29 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  if (size & (ns->block_size - 1ULL))
  return nvme_bounce_xfer(ns, lba, buf, count, write);

-for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
-if (first_page) {
-/* First page is special */
-first_page = 0;
-continue;
-}
-if (nvme_add_prpl(ns, base))
-return nvme_bounce_xfer(ns, lba, buf, count, write);
-}
-
-void *prp2;
  if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
-/* We need to describe more than 2 pages, rely on PRP List */
-prp2 = ns->prpl;
+/* We need to describe more than 2 pages, build PRP List */
+u32 prpl_len = 0;
+u64 *prpl = (void*)ns->dma_buffer;



At 15*8=120 bytes of data, do we even need to put the prpl into 
dma_buffer or can we just use the stack? Will the array be guaranteed 
64bit aligned or would we need to add an attribute to it?


    u64 prpl[NVME_MAX_PRPL_ENTRIES];

Either way, I don't have strong feelings one way or another. It just 
seems more natural to keep the bounce buffer purely as bounce buffer :).



Alex



+int first_page = 1;
+for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
+if (first_page) {
+/* First page is special */
+first_page = 0;
+continue;
+}
+if (prpl_len >= NVME_MAX_PRPL_ENTRIES)
+return nvme_bounce_xfer(ns, lba, buf, count, write);
+prpl[prpl_len++] = base;
+}
+return nvme_io_xfer(ns, lba, buf, prpl, count, write);
  } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
  /* Directly embed the 2nd page if we only need 2 pages */
-prp2 = (void *)(long)ns->prpl[0];
+return nvme_io_xfer(ns, lba, buf, buf + NVME_PAGE_SIZE, count, write);
  } else {
/* One page is enough, don't expose anything else */

[SeaBIOS] Re: [PATCH 4/5] nvme: Pass prp1 and prp2 directly to nvme_io_xfer()

2022-01-21 Thread Alexander Graf via SeaBIOS



On 19.01.22 19:45, Kevin O'Connor wrote:

When using a prp2 parameter, build it in nvme_prpl_xfer() and pass it
directly to nvme_io_xfer().

Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 


Alex



---
  src/hw/nvme.c | 39 ++-
  1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index bafe8bf..3a73784 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -417,33 +417,19 @@ err:
  /* Reads count sectors into buf. Returns DISK_RET_*. The buffer cannot cross
 page boundaries. */
  static int
-nvme_io_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
- int write)
+nvme_io_xfer(struct nvme_namespace *ns, u64 lba, void *prp1, void *prp2,
+ u16 count, int write)
  {
-u32 buf_addr = (u32)buf;
-void *prp2;
-
-if (buf_addr & 0x3) {
+if (((u32)prp1 & 0x3) || ((u32)prp2 & 0x3)) {
  /* Buffer is misaligned */
  warn_internalerror();
  return -1;
  }

-if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
-/* We need to describe more than 2 pages, rely on PRP List */
-prp2 = ns->prpl;
-} else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
-/* Directly embed the 2nd page if we only need 2 pages */
-prp2 = (void *)(long)ns->prpl[0];
-} else {
-/* One page is enough, don't expose anything else */
-prp2 = NULL;
-}
-
  struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
   write ? NVME_SQE_OPC_IO_WRITE
 : NVME_SQE_OPC_IO_READ,
- NULL, buf, prp2);
+ NULL, prp1, prp2);
  io_read->nsid = ns->ns_id;
  io_read->dword[10] = (u32)lba;
  io_read->dword[11] = (u32)(lba >> 32);
@@ -476,7 +462,7 @@ nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  if (write)
  memcpy(ns->dma_buffer, buf, blocks * ns->block_size);

-int res = nvme_io_xfer(ns, lba, ns->dma_buffer, blocks, write);
+int res = nvme_io_xfer(ns, lba, ns->dma_buffer, NULL, blocks, write);

  if (!write && res >= 0)
  memcpy(buf, ns->dma_buffer, res * ns->block_size);
@@ -516,7 +502,7 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  size = count * ns->block_size;
  /* Special case for transfers that fit into PRP1, but are unaligned */
  if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
-return nvme_io_xfer(ns, lba, buf, count, write);
+return nvme_io_xfer(ns, lba, buf, NULL, count, write);

  /* Every request has to be page aligned */
  if (base & ~NVME_PAGE_MASK)
@@ -536,7 +522,18 @@ nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  return nvme_bounce_xfer(ns, lba, buf, count, write);
  }

-return nvme_io_xfer(ns, lba, buf, count, write);
+void *prp2;
+if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
+/* We need to describe more than 2 pages, rely on PRP List */
+prp2 = ns->prpl;
+} else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
+/* Directly embed the 2nd page if we only need 2 pages */
+prp2 = (void *)(long)ns->prpl[0];
+} else {
+/* One page is enough, don't expose anything else */
+prp2 = NULL;
+}
+return nvme_io_xfer(ns, lba, buf, prp2, count, write);
  }

  static int
--
2.31.1



[SeaBIOS] Re: [PATCH 3/5] nvme: Convert nvme_build_prpl() to nvme_prpl_xfer()

2022-01-21 Thread Alexander Graf via SeaBIOS


On 19.01.22 19:45, Kevin O'Connor wrote:

Rename nvme_build_prpl() to nvme_prpl_xfer() and directly invoke
nvme_io_xfer() or nvme_bounce_xfer() from that function.

Signed-off-by: Kevin O'Connor 
---
  src/hw/nvme-int.h |  1 -
  src/hw/nvme.c | 42 --
  2 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index a4c1555..9564c17 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -125,7 +125,6 @@ struct nvme_namespace {

  /* Page List */
  u32 prpl_len;
-void *prp1;
  u64 prpl[NVME_MAX_PRPL_ENTRIES];
  };

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index fd7c1d0..bafe8bf 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -499,10 +499,13 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 
base)
  return 0;
  }

-static int nvme_build_prpl(struct nvme_namespace *ns, void *op_buf, u16 count)
+// Transfer data using page list (if applicable)
+static int
+nvme_prpl_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
+   int write)
  {
  int first_page = 1;
-u32 base = (long)op_buf;
+u32 base = (long)buf;
  s32 size;

  if (count > ns->max_req_size)
@@ -512,31 +515,28 @@ static int nvme_build_prpl(struct nvme_namespace *ns, 
void *op_buf, u16 count)

  size = count * ns->block_size;
  /* Special case for transfers that fit into PRP1, but are unaligned */
-if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
-ns->prp1 = op_buf;
-return count;
-}
+if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE))
+return nvme_io_xfer(ns, lba, buf, count, write);

  /* Every request has to be page aligned */
  if (base & ~NVME_PAGE_MASK)
-return 0;
+return nvme_bounce_xfer(ns, lba, buf, count, write);

  /* Make sure a full block fits into the last chunk */
  if (size & (ns->block_size - 1ULL))
-return 0;
+return nvme_bounce_xfer(ns, lba, buf, count, write);

  for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
  if (first_page) {
  /* First page is special */
-ns->prp1 = (void*)base;
  first_page = 0;
  continue;
  }
  if (nvme_add_prpl(ns, base))
-return 0;
+return nvme_bounce_xfer(ns, lba, buf, count, write);



I think this is correct, but reasoning about all of the bounce 
invocations is truly making my head hurt :). How about we split the 
"does this fit into a PRP request" logic from the "do request" part? 
Could we just have something like this at the end of the function?


    return nvme_io_xfer(ns, lba, buf, count, write);
bounce:
    return nvme_bounce_xfer(ns, lba, buf, count, write);

and then goto bounce every time we realize we have to turn the request 
into an up-to-one-page bounce request?



Alex



  }

-return count;
+return nvme_io_xfer(ns, lba, buf, count, write);
  }

  static int
@@ -737,24 +737,14 @@ nvme_scan(void)
  static int
  nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
  {
-u16 i, blocks;
-
+int i;
  for (i = 0; i < op->count;) {
  u16 blocks_remaining = op->count - i;
  char *op_buf = op->buf_fl + i * ns->block_size;
-
-blocks = nvme_build_prpl(ns, op_buf, blocks_remaining);
-if (blocks) {
-int res = nvme_io_xfer(ns, op->lba + i, ns->prp1, blocks, write);
-if (res < 0)
-return DISK_RET_EBADTRACK;
-} else {
-int res = nvme_bounce_xfer(ns, op->lba + i, op_buf, blocks, write);
-if (res < 0)
-return DISK_RET_EBADTRACK;
-blocks = res;
-}
-
+int blocks = nvme_prpl_xfer(ns, op->lba + i, op_buf,
+blocks_remaining, write);
+if (blocks < 0)
+return DISK_RET_EBADTRACK;
  i += blocks;
  }

--
2.31.1

___
SeaBIOS mailing list -- seabios@seabios.org
To unsubscribe send an email to seabios-le...@seabios.org




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879




[SeaBIOS] Re: [PATCH 1/5] nvme: Rework nvme_io_readwrite() to return -1 on error

2022-01-21 Thread Alexander Graf via SeaBIOS



On 19.01.22 19:45, Kevin O'Connor wrote:

Rename nvme_io_readwrite() to nvme_io_xfer() and change it so it
implements the debugging dprintf() and it returns -1 on an error.

Signed-off-by: Kevin O'Connor 
---
  src/hw/nvme.c | 34 +-
  1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f035fa2..a97501b 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -417,8 +417,8 @@ err:
  /* Reads count sectors into buf. Returns DISK_RET_*. The buffer cannot cross



This comment is no longer accurate after the patch. I guess it was half 
stale before already, but now it's completely wrong :).



Alex



 page boundaries. */
  static int
-nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
-  int write)
+nvme_io_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
+ int write)
  {
  u32 buf_addr = (u32)buf;
  void *prp2;
@@ -426,7 +426,7 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char 
*buf, u16 count,
  if (buf_addr & 0x3) {
  /* Buffer is misaligned */
  warn_internalerror();
-return DISK_RET_EBADTRACK;
+return -1;
  }

  if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
@@ -457,10 +457,12 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, 
char *buf, u16 count,
  dprintf(2, "read io: %08x %08x %08x %08x\n",
  cqe.dword[0], cqe.dword[1], cqe.dword[2], cqe.dword[3]);

-return DISK_RET_EBADTRACK;
+return -1;
  }

-return DISK_RET_SUCCESS;
+dprintf(5, "ns %u %s lba %llu+%u\n", ns->ns_id, write ? "write" : "read",
+lba, count);
+return count;
  }

  static void nvme_reset_prpl(struct nvme_namespace *ns)
@@ -716,20 +718,18 @@ nvme_scan(void)
  static int
  nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
  {
-int res = DISK_RET_SUCCESS;
  u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
  u16 i, blocks;

-for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {
+for (i = 0; i < op->count;) {
  u16 blocks_remaining = op->count - i;
  char *op_buf = op->buf_fl + i * ns->block_size;

  blocks = nvme_build_prpl(ns, op_buf, blocks_remaining);
  if (blocks) {
-res = nvme_io_readwrite(ns, op->lba + i, ns->prp1, blocks, write);
-dprintf(5, "ns %u %s lba %llu+%u: %d\n", ns->ns_id, write ? "write"
-  : "read",
-op->lba, blocks, res);
+int res = nvme_io_xfer(ns, op->lba + i, ns->prp1, blocks, write);
+if (res < 0)
+return DISK_RET_EBADTRACK;
  } else {
  blocks = blocks_remaining < max_blocks ? blocks_remaining
 : max_blocks;
@@ -738,12 +738,12 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct 
disk_op_s *op, int write)
  memcpy(ns->dma_buffer, op_buf, blocks * ns->block_size);
  }

-res = nvme_io_readwrite(ns, op->lba + i, ns->dma_buffer, blocks, 
write);
-dprintf(5, "ns %u %s lba %llu+%u: %d\n", ns->ns_id, write ? "write"
-  : "read",
-op->lba + i, blocks, res);
+int res = nvme_io_xfer(ns, op->lba + i, ns->dma_buffer,
+   blocks, write);
+if (res < 0)
+return DISK_RET_EBADTRACK;

-if (!write && res == DISK_RET_SUCCESS) {
+if (!write) {
  memcpy(op_buf, ns->dma_buffer, blocks * ns->block_size);
  }
  }
@@ -751,7 +751,7 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct 
disk_op_s *op, int write)
  i += blocks;
  }

-return res;
+return DISK_RET_SUCCESS;
  }

  int
--
2.31.1







[SeaBIOS] Re: [PATCH 1/5] nvme: Rework nvme_io_readwrite() to return -1 on error

2022-01-21 Thread Alexander Graf via SeaBIOS



On 19.01.22 19:45, Kevin O'Connor wrote:

Rename nvme_io_readwrite() to nvme_io_xfer() and change it so it
implements the debugging dprintf() and it returns -1 on an error.

Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 

Alex










[SeaBIOS] Re: [PATCH 2/5] nvme: Add nvme_bounce_xfer() helper function

2022-01-21 Thread Alexander Graf via SeaBIOS



On 19.01.22 19:45, Kevin O'Connor wrote:

Move bounce buffer processing to a new helper function.

Signed-off-by: Kevin O'Connor 



Reviewed-by: Alexander Graf 

Alex



---
  src/hw/nvme.c | 35 +--
  1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index a97501b..fd7c1d0 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -465,6 +465,25 @@ nvme_io_xfer(struct nvme_namespace *ns, u64 lba, void 
*buf, u16 count,
  return count;
  }

+// Transfer up to one page of data using the internal dma bounce buffer
+static int
+nvme_bounce_xfer(struct nvme_namespace *ns, u64 lba, void *buf, u16 count,
+ int write)
+{
+u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
+u16 blocks = count < max_blocks ? count : max_blocks;
+
+if (write)
+memcpy(ns->dma_buffer, buf, blocks * ns->block_size);
+
+int res = nvme_io_xfer(ns, lba, ns->dma_buffer, blocks, write);
+
+if (!write && res >= 0)
+memcpy(buf, ns->dma_buffer, res * ns->block_size);
+
+return res;
+}
+
  static void nvme_reset_prpl(struct nvme_namespace *ns)
  {
  ns->prpl_len = 0;
@@ -718,7 +737,6 @@ nvme_scan(void)
  static int
  nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
  {
-u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
  u16 i, blocks;

  for (i = 0; i < op->count;) {
@@ -731,21 +749,10 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct 
disk_op_s *op, int write)
  if (res < 0)
  return DISK_RET_EBADTRACK;
  } else {
-blocks = blocks_remaining < max_blocks ? blocks_remaining
-   : max_blocks;
-
-if (write) {
-memcpy(ns->dma_buffer, op_buf, blocks * ns->block_size);
-}
-
-int res = nvme_io_xfer(ns, op->lba + i, ns->dma_buffer,
-   blocks, write);
+int res = nvme_bounce_xfer(ns, op->lba + i, op_buf, blocks, write);
  if (res < 0)
  return DISK_RET_EBADTRACK;
-
-if (!write) {
-memcpy(op_buf, ns->dma_buffer, blocks * ns->block_size);
-}
+blocks = res;
  }

  i += blocks;
--
2.31.1







[SeaBIOS] Re: [PATCH v3 4/4] nvme: Split requests by maximum allowed size

2022-01-19 Thread Alexander Graf via SeaBIOS

Hi Matt,

On 18.01.22 18:46, Matt DeVillier wrote:

and attempting to boot from it:

Booting from Hard Disk...
sq 0x7aa77bdf next_sqe 0
sq 0x7aa77bdf commit_sqe 0
cq 0x7aa77bf1 head 0 -> 1
sq 0x7aa77bdf advanced to 1
ns 1 read lba 0+1: 0
Booting from 0000:7c00
sq 0x7aa77bdf next_sqe 1
sq 0x7aa77bdf commit_sqe 1
cq 0x7aa77bf1 head 1 -> 2
sq 0x7aa77bdf advanced to 2
ns 1 read lba 1+1: 0
sq 0x7aa77bdf next_sqe 2
sq 0x7aa77bdf commit_sqe 2
cq 0x7aa77bf1 head 2 -> 3
sq 0x7aa77bdf advanced to 3
read io:   00010003 40270002



This error means

"PRP Offset Invalid: The Offset field for a PRP entry is invalid. This 
may occur when there is a PRP entry with a non-zero offset after the 
first entry or when the Offset field in any PRP entry is not dword 
aligned (i.e., bits 1:0 are not cleared to 00b)."



which points towards a misaligned PRP entry. The size of this request is 
105 sectors (52.5kb), so it must go through a PRP list. I don't see any 
obvious code paths that would allow us to ever get into a misaligned 
request here.
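For reference, the alignment rule from the quoted spec text can be expressed as a small standalone check (a hypothetical helper, not part of the SeaBIOS code):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u

/* Check PRP entries against the quoted rule: any offset must be dword
 * aligned (bits 1:0 cleared to 00b), and only the first entry may carry
 * a nonzero offset at all. Returns 1 if the controller would accept it. */
static int prp_entries_valid(const uint64_t *prps, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        uint64_t offset = prps[i] & (NVME_PAGE_SIZE - 1);
        if (offset & 0x3)           /* bits 1:0 not cleared */
            return 0;
        if (i > 0 && offset != 0)   /* nonzero offset after first entry */
            return 0;
    }
    return 1;
}
```

Anything SeaBIOS queues would have to pass a check like this, which is why the "PRP Offset Invalid" status is so surprising here.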


That said, while trying to understand what is happening here I did 
stumble over a different weird effect: ns->prp1 gets overwritten with 0 
by some interrupt handler. Could you please try the patch / hack below 
and see if it fixes the problem for you? That way I at least know we're 
hunting the same ghost.


If it doesn't help, I'll send you a debug patch that will give us some 
more information about the PRP list SeaBIOS assembles.



Thanks,

Alex

---


diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f035fa2..99ad7a8 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -257,7 +257,7 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, u32 ns_idx, u8 
mdts)

 goto free_buffer;
 }

-    struct nvme_namespace *ns = malloc_fseg(sizeof(*ns));
+    struct nvme_namespace *ns = malloc_low(sizeof(*ns));
 if (!ns) {
 warn_noalloc();
 goto free_buffer;





[SeaBIOS] [PATCH] nvme: Move PRP data to ZoneHigh

2022-01-19 Thread Alexander Graf via SeaBIOS
Commit 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
introduced multi-page requests using the NVMe PRP mechanism. To store the
list and "first page to write to" hints, it added fields to the NVMe
namespace struct.

Unfortunately, that struct resides in fseg which is read-only at runtime.
While KVM ignores the read-only part and allows writes, real hardware and
TCG adhere to the semantics and ignore writes to the fseg region. The net
effect of that is that reads and writes were always happening on address 0,
unless they went through the bounce buffer logic.

This patch moves all PRP maintenance data out of the actual namespace
allocation and allocates it in ZoneHigh instead. That way, the PRP list
is fully read-write at runtime.

Fixes: 01f2736cc905d ("nvme: Pass large I/O requests as PRP lists")
Reported-by: Matt DeVillier 
Signed-off-by: Alexander Graf 
---
 src/hw/nvme-int.h | 12 
 src/hw/nvme.c | 24 
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index a4c1555..ceb2c79 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -108,6 +108,13 @@ struct nvme_ctrl {
 struct nvme_cq io_cq;
 };
 
+/* Page List Maintenance Data */
+struct nvme_prp_info {
+u32 prpl_len;
+void *prp1;
+u64 prpl[NVME_MAX_PRPL_ENTRIES];
+};
+
 struct nvme_namespace {
 struct drive_s drive;
 struct nvme_ctrl *ctrl;
@@ -123,10 +130,7 @@ struct nvme_namespace {
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
 
-/* Page List */
-u32 prpl_len;
-void *prp1;
-u64 prpl[NVME_MAX_PRPL_ENTRIES];
+struct nvme_prp_info *prp;
 };
 
 /* Data structures for NVMe admin identify commands */
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f035fa2..963c31e 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -267,6 +267,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, u32 ns_idx, u8 mdts)
 ns->ns_id = ns_id;
 ns->lba_count = id->nsze;
 
+ns->prp = malloc_high(sizeof(*ns->prp));
+if (!ns->prp) {
+warn_noalloc();
+free(ns);
+goto free_buffer;
+}
+memset(ns->prp, 0, sizeof(*ns->prp));
+
 struct nvme_lba_format *fmt = &id->lbaf[current_lba_format];
 
 ns->block_size= 1U << fmt->lbads;
@@ -431,10 +439,10 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, 
char *buf, u16 count,
 
 if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
 /* We need to describe more than 2 pages, rely on PRP List */
-prp2 = ns->prpl;
+prp2 = ns->prp->prpl;
 } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
 /* Directly embed the 2nd page if we only need 2 pages */
-prp2 = (void *)(long)ns->prpl[0];
+prp2 = (void *)(long)ns->prp->prpl[0];
 } else {
 /* One page is enough, don't expose anything else */
 prp2 = NULL;
@@ -465,15 +473,15 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, 
char *buf, u16 count,
 
 static void nvme_reset_prpl(struct nvme_namespace *ns)
 {
-ns->prpl_len = 0;
+ns->prp->prpl_len = 0;
 }
 
 static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
 {
-if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
+if (ns->prp->prpl_len >= NVME_MAX_PRPL_ENTRIES)
 return -1;
 
-ns->prpl[ns->prpl_len++] = base;
+ns->prp->prpl[ns->prp->prpl_len++] = base;
 
 return 0;
 }
@@ -492,7 +500,7 @@ static int nvme_build_prpl(struct nvme_namespace *ns, void 
*op_buf, u16 count)
 size = count * ns->block_size;
 /* Special case for transfers that fit into PRP1, but are unaligned */
 if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
-ns->prp1 = op_buf;
+ns->prp->prp1 = op_buf;
 return count;
 }
 
@@ -507,7 +515,7 @@ static int nvme_build_prpl(struct nvme_namespace *ns, void 
*op_buf, u16 count)
 for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
 if (first_page) {
 /* First page is special */
-ns->prp1 = (void*)base;
+ns->prp->prp1 = (void*)base;
 first_page = 0;
 continue;
 }
@@ -726,7 +734,7 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct 
disk_op_s *op, int write)
 
 blocks = nvme_build_prpl(ns, op_buf, blocks_remaining);
 if (blocks) {
-res = nvme_io_readwrite(ns, op->lba + i, ns->prp1, blocks, write);
+res = nvme_io_readwrite(ns, op->lba + i, ns->prp->prp1, blocks, 
write);
 dprintf(5, "ns %u %s lba %llu+%u: %d\n", ns->ns_id, write ? "write"
   : "read",
 op->lba, blocks, res);
-- 
2.28.0.394.ge197136389

[SeaBIOS] Re: [PATCH 1/3] nvme: Record maximum allowed request size

2020-10-02 Thread Alexander Graf




On 30.09.20 12:22, Sironi, Filippo wrote:


From: Alexander Graf 
Sent: Tuesday, September 29, 2020 20:36
To: seabios@seabios.org
Subject: [UNVERIFIED SENDER] [SeaBIOS] [PATCH 1/3] nvme: Record maximum allowed 
request size

NVMe has a limit on how many sectors it can handle at most within a single
request. Remember that number, so that in a follow-up patch, we can verify
that we don't exceed it.

Signed-off-by: Alexander Graf 
---
  src/hw/nvme-int.h |  8 +++-
  src/hw/nvme.c | 13 +++--
  2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9f95dd8..674008a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,6 +117,7 @@ struct nvme_namespace {

  u32 block_size;
  u32 metadata_size;
+u32 max_req_size;

  /* Page aligned buffer of size NVME_PAGE_SIZE. */
  char *dma_buffer;
@@ -131,7 +132,12 @@ struct nvme_identify_ctrl {
  char mn[40];
  char fr[8];

-char _boring[516 - 72];
+u8 rab;
+u8 ieee[3];
+u8 cmic;
+u8 mdts;
+
+char _boring[516 - 78];

  u32 nn; /* number of namespaces */
  };
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 6a01204..2cde6a7 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -238,7 +238,8 @@ nvme_admin_identify_ns(struct nvme_ctrl *ctrl, u32 ns_id)
  }

  static void
-nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
+nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
+  u8 mdts)
  {
  ns->ctrl  = ctrl;
  ns->ns_id = ns_id;
@@ -281,6 +282,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct 
nvme_namespace *ns, u32 ns_id)
  ns->drive.blksize   = ns->block_size;
  ns->drive.sectors   = ns->lba_count;

+if (mdts) {
+ns->max_req_size = 1U << mdts;
+dprintf(3, "NVME NS %u max request size: %d sectors\n",
+ns->max_req_size);

The use of dprintf is incorrect: you're only providing one variable to print 
instead of two. You're missing the number of sectors.


Yikes, I fixed that one locally before I hit git format-patch. I 
promise! My machine probably just went back in time and pulled this 
broken state out of the abyss :).


Kidding aside, I'll submit the right one for v2. Thanks for noticing!


Alex





[SeaBIOS] [PATCH v3 1/4] nvme: Record maximum allowed request size

2020-09-30 Thread Alexander Graf
NVMe has a limit on how many sectors it can handle at most within a single
request. Remember that number, so that in a follow-up patch, we can verify
that we don't exceed it.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - fix dprintf

v2 -> v3:

  - Adjust the maximum request size to sector units. The hardware field
describes it in pages.
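As a standalone sketch of the conversion this revision describes (assuming the usual 4 KiB NVMe memory page size; `max_req_sectors` is a hypothetical name for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u

/* MDTS gives the limit as a power of two of memory pages; the driver
 * wants it in sectors of the namespace block size. MDTS == 0 means the
 * controller reports no limit, modeled here as UINT32_MAX (-1U). */
static uint32_t max_req_sectors(uint8_t mdts, uint32_t block_size)
{
    if (!mdts)
        return 0xFFFFFFFFu;
    return ((1u << mdts) * NVME_PAGE_SIZE) / block_size;
}
```

For example, MDTS = 5 with 512-byte blocks allows 32 pages, i.e. 256 sectors per request.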
---
 src/hw/nvme-int.h |  8 +++-
 src/hw/nvme.c | 13 +++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9f95dd8..674008a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,6 +117,7 @@ struct nvme_namespace {
 
 u32 block_size;
 u32 metadata_size;
+u32 max_req_size;
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
@@ -131,7 +132,12 @@ struct nvme_identify_ctrl {
 char mn[40];
 char fr[8];
 
-char _boring[516 - 72];
+u8 rab;
+u8 ieee[3];
+u8 cmic;
+u8 mdts;
+
+char _boring[516 - 78];
 
 u32 nn; /* number of namespaces */
 };
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 6a01204..5bc2586 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -238,7 +238,8 @@ nvme_admin_identify_ns(struct nvme_ctrl *ctrl, u32 ns_id)
 }
 
 static void
-nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
+nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
+  u8 mdts)
 {
 ns->ctrl  = ctrl;
 ns->ns_id = ns_id;
@@ -281,6 +282,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct 
nvme_namespace *ns, u32 ns_id)
 ns->drive.blksize   = ns->block_size;
 ns->drive.sectors   = ns->lba_count;
 
+if (mdts) {
+ns->max_req_size = ((1U << mdts) * NVME_PAGE_SIZE) / ns->block_size;
+dprintf(3, "NVME NS %u max request size: %d sectors\n",
+ns_id, ns->max_req_size);
+} else {
+ns->max_req_size = -1U;
+}
+
 ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
 
 char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
@@ -567,7 +576,7 @@ nvme_controller_enable(struct nvme_ctrl *ctrl)
 /* Populate namespace IDs */
 int ns_idx;
 for (ns_idx = 0; ns_idx < ctrl->ns_count; ns_idx++) {
-nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1);
+nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1, identify->mdts);
 }
 
 dprintf(3, "NVMe initialization complete!\n");
-- 
2.16.4






[SeaBIOS] [PATCH v3 0/4] nvme: Add PRP List support

2020-09-30 Thread Alexander Graf
While looking at VM bootup times, we stumbled over the fact that the NVMe
code only does I/O operations of up to 4kb at a given point in time. This
is usually ok, but if you have an OS that loads a lot of data on boot in
combination with network-backed storage, it shows in bootup times.

There is no need to restrict ourselves to 4kb though. The INT13 call we
receive gives us much larger chunks, which we can map directly into a single
larger native NVMe I/O call if the request buffer is page aligned.

This patch implements all logic required to do the above and gives a
substantial performance boost on boot.

v1 -> v2:

  - fix dprintf
  - Fix bounds check on PRP list add logic
  - Reduce PRP list to 15 entries embedded in the ns struct.
This reduces BIOS reserved memory footprint by 4kb.

v2 -> v3:

  - new patch: nvme: Split requests by maximum allowed size
  - Adjust the maximum request size to sector units. The hardware field
describes it in pages.

Alexander Graf (4):
  nvme: Record maximum allowed request size
  nvme: Allow to set PRP2
  nvme: Pass large I/O requests as PRP lists
  nvme: Split requests by maximum allowed size

 src/hw/nvme-int.h |  16 ++-
 src/hw/nvme.c | 122 +++---
 2 files changed, 121 insertions(+), 17 deletions(-)

-- 
2.16.4






[SeaBIOS] [PATCH v3 4/4] nvme: Split requests by maximum allowed size

2020-09-30 Thread Alexander Graf
Some NVMe controllers only support small maximum request sizes, such as
the AWS EBS NVMe implementation which only supports NVMe requests of up
to 32 pages (256kb) at once.

BIOS callers can exceed those request sizes by defining sector counts
above this threshold. Currently we fall back to the bounce buffer
implementation for those. This is slow.

This patch introduces splitting logic to the NVMe I/O request code so
that every NVMe I/O request gets handled in a chunk size that is
consumable by the NVMe adapter, while maintaining the fast path PRPL
logic we just introduced.

Signed-off-by: Alexander Graf 
---
 src/hw/nvme.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct 
disk_op_s *op, int write)
 u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
 u16 i;
 
+/* Split up requests that are larger than the device can handle */
+if (op->count > ns->max_req_size) {
+u16 count = op->count;
+
+/* Handle the first max_req_size elements */
+op->count = ns->max_req_size;
+if (nvme_cmd_readwrite(ns, op, write))
+return res;
+
+/* Handle the remainder of the request */
+op->count = count - ns->max_req_size;
+op->lba += ns->max_req_size;
+op->buf_fl += (ns->max_req_size * ns->block_size);
+return nvme_cmd_readwrite(ns, op, write);
+}
+
 if (!nvme_build_prpl(ns, op)) {
 /* Request goes via PRP List logic */
 return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
-- 
2.16.4
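The recursive splitting in the patch above effectively peels off one max_req_size chunk per call until the remainder fits. A standalone sketch that just counts the resulting sub-requests (a hypothetical helper, independent of the SeaBIOS types):

```c
#include <assert.h>
#include <stdint.h>

/* Count the sub-requests the splitting produces: one full chunk per
 * iteration, plus the final remainder (or the whole request if it
 * already fits within max_req_size). */
static int split_request_count(uint32_t count, uint32_t max_req_size)
{
    int reqs = 1;
    while (count > max_req_size) {
        count -= max_req_size;   /* one max-sized chunk handled */
        reqs++;
    }
    return reqs;
}
```

So a request of 600 sectors against a 256-sector limit becomes three NVMe I/O requests instead of 150 single-page bounce transfers.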






[SeaBIOS] [PATCH v3 2/4] nvme: Allow to set PRP2

2020-09-30 Thread Alexander Graf
When creating a PRP based I/O request, we pass in the pointer to operate
on. Going forward, we will want to be able to pass additional pointers
though for mappings above 4k.

This patch adds a parameter to nvme_get_next_sqe() to pass in the PRP2
value of an NVMe I/O request, paving the way for a future patch to
implement PRP lists.
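To illustrate how the new parameter ends up being used later in this series, here is a standalone sketch of the three PRP2 cases (a hypothetical helper; the real code works on the namespace struct):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define NVME_PAGE_SIZE 4096u

/* The three PRP2 cases: NULL when PRP1 covers the whole transfer, the
 * second page embedded directly when two pages are enough, otherwise a
 * pointer to the PRP list itself. */
static const void *select_prp2(uint32_t xfer_bytes, const uint64_t *prpl)
{
    if (xfer_bytes > NVME_PAGE_SIZE * 2)
        return prpl;                               /* PRP list pointer */
    if (xfer_bytes > NVME_PAGE_SIZE)
        return (const void *)(uintptr_t)prpl[0];   /* embed 2nd page */
    return NULL;                                   /* PRP1 suffices */
}
```

The extra nvme_get_next_sqe() parameter simply carries whichever of these values applies.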

Signed-off-by: Alexander Graf 
Reviewed-by: Filippo Sironi 
---
 src/hw/nvme.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 5bc2586..f1b9331 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -152,7 +152,7 @@ nvme_wait(struct nvme_sq *sq)
 /* Returns the next submission queue entry (or NULL if the queue is full). It
also fills out Command Dword 0 and clears the rest. */
 static struct nvme_sqe *
-nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
+nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void 
*data2)
 {
 if (((sq->head + 1) & sq->common.mask) == sq->tail) {
 dprintf(3, "submission queue is full");
@@ -166,6 +166,7 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void 
*metadata, void *data)
 sqe->cdw0 = opc | (sq->tail << 16 /* CID */);
 sqe->mptr = (u32)metadata;
 sqe->dptr_prp1 = (u32)data;
+sqe->dptr_prp2 = (u32)data2;
 
 if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
 /* Data buffer not page aligned. */
@@ -200,7 +201,7 @@ nvme_admin_identify(struct nvme_ctrl *ctrl, u8 cns, u32 
nsid)
 struct nvme_sqe *cmd_identify;
 cmd_identify = nvme_get_next_sqe(&ctrl->admin_sq,
  NVME_SQE_OPC_ADMIN_IDENTIFY, NULL,
- identify_buf);
+ identify_buf, NULL);
 
 if (!cmd_identify) {
 warn_internalerror();
@@ -338,7 +339,7 @@ nvme_create_io_cq(struct nvme_ctrl *ctrl, struct nvme_cq 
*cq, u16 q_idx)
 
 cmd_create_cq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_CQ, NULL,
-  cq->cqe);
+  cq->cqe, NULL);
 if (!cmd_create_cq) {
 goto err_destroy_cq;
 }
@@ -382,7 +383,7 @@ nvme_create_io_sq(struct nvme_ctrl *ctrl, struct nvme_sq 
*sq, u16 q_idx, struct
 
 cmd_create_sq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_SQ, NULL,
-  sq->sqe);
+  sq->sqe, NULL);
 if (!cmd_create_sq) {
 goto err_destroy_sq;
 }
@@ -429,7 +430,7 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char 
*buf, u16 count,
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf);
+ NULL, buf, NULL);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
-- 
2.16.4






[SeaBIOS] [PATCH v3 3/4] nvme: Pass large I/O requests as PRP lists

2020-09-30 Thread Alexander Graf
Today, we split every I/O request into at most 4kb chunks and wait for these
requests to finish. We encountered issues where the backing storage is network
based, so every I/O request needs to go over the network with associated
latency cost. A few ms of latency when loading 100MB initrd in 4kb chunks
does add up.

NVMe implements a feature to allow I/O requests spanning multiple pages,
called PRP lists. This patch takes larger I/O operations and checks if
they can be directly passed to the NVMe backing device as PRP list.
At least for grub, read operations can always be mapped directly into
PRP list items.

This reduces the number of I/O operations required during a typical boot
path by roughly a factor of 5.
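Conceptually, building such a PRP list means handing the (possibly unaligned) first page to PRP1 and one page-aligned address per further page to the list. A standalone sketch under those assumptions (a hypothetical helper, with the fixed 15-entry list used in this series):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u
#define MAX_PRPL 15

/* Decompose a buffer into PRP1 plus a PRP list: PRP1 is the (possibly
 * offset) first page, each further entry a page-aligned address.
 * Returns the number of list entries, or -1 if MAX_PRPL is exceeded. */
static int build_prpl_sketch(uint64_t base, uint32_t size,
                             uint64_t *prp1, uint64_t prpl[MAX_PRPL])
{
    int n = 0;
    *prp1 = base;
    /* Bytes PRP1 covers: from base up to the end of its page */
    uint32_t covered = (uint32_t)(NVME_PAGE_SIZE
                                  - (base & (NVME_PAGE_SIZE - 1)));
    for (base += covered; covered < size; base += NVME_PAGE_SIZE) {
        if (n >= MAX_PRPL)
            return -1;           /* too large, caller must fall back */
        prpl[n++] = base;
        covered += NVME_PAGE_SIZE;
    }
    return n;
}
```

A transfer that overflows the list is exactly the case the series hands back to the bounce-buffer path.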

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - Fix bounds check on PRP list add logic
  - Reduce PRP list to 15 entries embedded in the ns struct.
This reduces BIOS reserved memory footprint by 4kb.
---
 src/hw/nvme-int.h |  8 ++
 src/hw/nvme.c | 84 ---
 2 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 674008a..7947b29 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,6 +10,8 @@
 #include "types.h" // u32
 #include "pcidevice.h" // struct pci_device
 
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
+
 /* Data structures */
 
 /* The register file of a NVMe host controller. This struct follows the naming
@@ -121,6 +123,11 @@ struct nvme_namespace {
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
+
+/* Page List */
+u32 prpl_len;
+void *prp1;
+u64 prpl[NVME_MAX_PRPL_ENTRIES];
 };
 
 /* Data structures for NVMe admin identify commands */
@@ -195,6 +202,7 @@ union nvme_identify {
 #define NVME_CQE_DW3_P (1U << 16)
 
 #define NVME_PAGE_SIZE 4096
+#define NVME_PAGE_MASK ~(NVME_PAGE_SIZE - 1)
 
 /* Length for the queue entries. */
 #define NVME_SQE_SIZE_LOG 6
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f1b9331..b92ca52 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -168,11 +168,6 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void 
*metadata, void *data, void *
 sqe->dptr_prp1 = (u32)data;
 sqe->dptr_prp2 = (u32)data2;
 
-if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
-/* Data buffer not page aligned. */
-warn_internalerror();
-}
-
 return sqe;
 }
 
@@ -418,19 +413,29 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
   int write)
 {
 u32 buf_addr = (u32)buf;
+void *prp2;
 
-if ((buf_addr & 0x3) ||
-((buf_addr & ~(NVME_PAGE_SIZE - 1)) !=
- ((buf_addr + ns->block_size * count - 1) & ~(NVME_PAGE_SIZE - 1 {
-/* Buffer is misaligned or crosses page boundary */
+if (buf_addr & 0x3) {
+/* Buffer is misaligned */
 warn_internalerror();
 return DISK_RET_EBADTRACK;
 }
 
+if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
+/* We need to describe more than 2 pages, rely on PRP List */
+prp2 = ns->prpl;
+} else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
+/* Directly embed the 2nd page if we only need 2 pages */
+prp2 = (void *)(long)ns->prpl[0];
+} else {
+/* One page is enough, don't expose anything else */
+prp2 = NULL;
+}
+
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf, NULL);
+ NULL, buf, prp2);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
@@ -450,6 +455,60 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
 return DISK_RET_SUCCESS;
 }
 
+static void nvme_reset_prpl(struct nvme_namespace *ns)
+{
+ns->prpl_len = 0;
+}
+
+static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
+{
+if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
+return -1;
+
+ns->prpl[ns->prpl_len++] = base;
+
+return 0;
+}
+
+int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
+{
+int first_page = 1;
+u32 base = (long)op->buf_fl;
+s32 size = op->count * ns->block_size;
+
+if (op->count > ns->max_req_size)
+return -1;
+
+nvme_reset_prpl(ns);
+
+/* Special case for transfers that fit into PRP1, but are unaligned */
+if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
+ns->prp1 = op->buf_fl;
+return 0;
+}
+
+/* Every request has to be page aligned */
+if (base & ~NVM

[SeaBIOS] [PATCH] nvme: Split requests by maximum allowed size

2020-09-30 Thread Alexander Graf
Some NVMe controllers only support small maximum request sizes, such as
the AWS EBS NVMe implementation, which only supports NVMe requests of up
to 64 sectors at once.

BIOS callers can easily exceed those request sizes by defining sector
counts above this threshold. Currently we fall back to the bounce buffer
implementation for those. This is slow.

This patch introduces splitting logic to the NVMe I/O request code so
that every NVMe I/O request gets handled in a chunk size that is
consumable by the NVMe adapter, while maintaining the fast path PRPL
logic we just introduced.
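The splitting strategy described above can be sketched as a simple recursion: handle the first chunk of at most `max_req_size` sectors, then recurse on the remainder. A minimal illustration (the helper name `split_request` is invented for this sketch and is not part of the patch):

```c
#include <assert.h>

/* Hypothetical sketch of the recursive splitting strategy: issue one
 * device command for the first `max` sectors, then recurse on the
 * remainder.  Returns how many commands a `count`-sector request is
 * split into. */
static int split_request(unsigned count, unsigned max)
{
    if (count <= max)
        return 1;                        /* fits in a single command */
    /* one command for `max` sectors, then handle the rest */
    return 1 + split_request(count - max, max);
}
```

With a 64-sector limit as on AWS EBS, a 200-sector request becomes four commands (64 + 64 + 64 + 8).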

Signed-off-by: Alexander Graf 

---

This patch is a follow-up to the NVMe PRP list patch set
---
 src/hw/nvme.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index ec05836..4eb9a02 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
 u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
 u16 i;
 
+/* Split up requests that are larger than the device can handle */
+if (op->count > ns->max_req_size) {
+u16 count = op->count;
+
+/* Handle the first max_req_size elements */
+op->count = ns->max_req_size;
+if (nvme_cmd_readwrite(ns, op, write))
+return res;
+
+/* Handle the remainder of the request */
+op->count = count - ns->max_req_size;
+op->lba += ns->max_req_size;
+op->buf_fl += (ns->max_req_size * ns->block_size);
+return nvme_cmd_readwrite(ns, op, write);
+}
+
 if (!nvme_build_prpl(ns, op)) {
 /* Request goes via PRP List logic */
 return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
-- 
2.16.4




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879


___
SeaBIOS mailing list -- seabios@seabios.org
To unsubscribe send an email to seabios-le...@seabios.org


[SeaBIOS] [PATCH v2 3/3] nvme: Pass large I/O requests as PRP lists

2020-09-30 Thread Alexander Graf
Today, we split every I/O request into at most 4kb chunks and wait for these
requests to finish. We encountered issues where the backing storage is
network-based, so every I/O request needs to go over the network with an
associated latency cost. A few ms of latency when loading a 100MB initrd in
4kb chunks does add up.

NVMe implements a feature to allow I/O requests spanning multiple pages,
called PRP lists. This patch takes larger I/O operations and checks if
they can be passed directly to the NVMe backing device as a PRP list.
At least for grub, read operations can always be mapped directly into
PRP list items.

This reduces the number of I/O operations required during a typical boot
path by roughly a factor of 5.
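For background, a PRP list is just an array of page-aligned addresses, one per 4kb page of the transfer beyond the page described by PRP1. A minimal sketch of that accounting (illustrative only; `prpl_entries` and `PAGE_SIZE` are names invented here, not SeaBIOS code):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Illustrative only: count how many PRP list entries a transfer of
 * `size` bytes starting at page-aligned address `base` needs.  The
 * first page is described by PRP1; every following page gets one
 * list entry. */
static unsigned prpl_entries(uint64_t base, uint32_t size)
{
    unsigned entries = 0;
    for (uint64_t p = base; p < base + size; p += PAGE_SIZE)
        if (p != base)          /* first page lives in PRP1 */
            entries++;
    return entries;
}
```

A 64kb transfer spans 16 pages, i.e. 15 list entries, which matches the 15-entry list embedded in the namespace struct by this patch.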

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - Fix bounds check on PRP list add logic
  - Reduce PRP list to 15 entries embedded in the ns struct.
This reduces BIOS reserved memory footprint by 4kb.
---
 src/hw/nvme-int.h |  8 ++
 src/hw/nvme.c | 84 ---
 2 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 674008a..7947b29 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,6 +10,8 @@
 #include "types.h" // u32
 #include "pcidevice.h" // struct pci_device
 
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
+
 /* Data structures */
 
 /* The register file of a NVMe host controller. This struct follows the naming
@@ -121,6 +123,11 @@ struct nvme_namespace {
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
+
+/* Page List */
+u32 prpl_len;
+void *prp1;
+u64 prpl[NVME_MAX_PRPL_ENTRIES];
 };
 
 /* Data structures for NVMe admin identify commands */
@@ -195,6 +202,7 @@ union nvme_identify {
 #define NVME_CQE_DW3_P (1U << 16)
 
 #define NVME_PAGE_SIZE 4096
+#define NVME_PAGE_MASK ~(NVME_PAGE_SIZE - 1)
 
 /* Length for the queue entries. */
 #define NVME_SQE_SIZE_LOG 6
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 04bbba4..ec05836 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -168,11 +168,6 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *data2)
 sqe->dptr_prp1 = (u32)data;
 sqe->dptr_prp2 = (u32)data2;
 
-if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
-/* Data buffer not page aligned. */
-warn_internalerror();
-}
-
 return sqe;
 }
 
@@ -418,19 +413,29 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
   int write)
 {
 u32 buf_addr = (u32)buf;
+void *prp2;
 
-if ((buf_addr & 0x3) ||
-((buf_addr & ~(NVME_PAGE_SIZE - 1)) !=
- ((buf_addr + ns->block_size * count - 1) & ~(NVME_PAGE_SIZE - 1 {
-/* Buffer is misaligned or crosses page boundary */
+if (buf_addr & 0x3) {
+/* Buffer is misaligned */
 warn_internalerror();
 return DISK_RET_EBADTRACK;
 }
 
+if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
+/* We need to describe more than 2 pages, rely on PRP List */
+prp2 = ns->prpl;
+} else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
+/* Directly embed the 2nd page if we only need 2 pages */
+prp2 = (void *)(long)ns->prpl[0];
+} else {
+/* One page is enough, don't expose anything else */
+prp2 = NULL;
+}
+
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf, NULL);
+ NULL, buf, prp2);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
@@ -450,6 +455,60 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
 return DISK_RET_SUCCESS;
 }
 
+static void nvme_reset_prpl(struct nvme_namespace *ns)
+{
+ns->prpl_len = 0;
+}
+
+static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
+{
+if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
+return -1;
+
+ns->prpl[ns->prpl_len++] = base;
+
+return 0;
+}
+
+int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
+{
+int first_page = 1;
+u32 base = (long)op->buf_fl;
+s32 size = op->count * ns->block_size;
+
+if (op->count > ns->max_req_size)
+return -1;
+
+nvme_reset_prpl(ns);
+
+/* Special case for transfers that fit into PRP1, but are unaligned */
+if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
+ns->prp1 = op->buf_fl;
+return 0;
+}
+
+/* Every request has to be page aligned */
+if (base & ~NVM

[SeaBIOS] [PATCH v2 1/3] nvme: Record maximum allowed request size

2020-09-30 Thread Alexander Graf
NVMe has a limit on how many sectors it can handle within a single request.
Remember that number so that, in a follow-up patch, we can verify that we
don't exceed it.
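The decoding this patch uses is simply `1 << mdts`, with a zero MDTS meaning "no reported limit". A hedged sketch of that conversion (note the NVMe spec defines MDTS in units of the controller's minimum memory page size, so a fully spec-accurate conversion would scale by that; this mirrors the patch's simpler convention):

```c
#include <assert.h>
#include <stdint.h>

/* Decode MDTS the way the patch does: 2^mdts, or "unlimited" (all
 * bits set) when the controller reports 0. */
static uint32_t max_req_size(uint8_t mdts)
{
    return mdts ? (1u << mdts) : ~0u;
}
```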

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - fix dprintf
---
 src/hw/nvme-int.h |  8 +++-
 src/hw/nvme.c | 13 +++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9f95dd8..674008a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,6 +117,7 @@ struct nvme_namespace {
 
 u32 block_size;
 u32 metadata_size;
+u32 max_req_size;
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
@@ -131,7 +132,12 @@ struct nvme_identify_ctrl {
 char mn[40];
 char fr[8];
 
-char _boring[516 - 72];
+u8 rab;
+u8 ieee[3];
+u8 cmic;
+u8 mdts;
+
+char _boring[516 - 78];
 
 u32 nn; /* number of namespaces */
 };
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 6a01204..7e56134 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -238,7 +238,8 @@ nvme_admin_identify_ns(struct nvme_ctrl *ctrl, u32 ns_id)
 }
 
 static void
-nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
+nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
+  u8 mdts)
 {
 ns->ctrl  = ctrl;
 ns->ns_id = ns_id;
@@ -281,6 +282,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
 ns->drive.blksize   = ns->block_size;
 ns->drive.sectors   = ns->lba_count;
 
+if (mdts) {
+ns->max_req_size = 1U << mdts;
+dprintf(3, "NVME NS %u max request size: %d sectors\n",
+ns_id, ns->max_req_size);
+} else {
+ns->max_req_size = -1U;
+}
+
 ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
 
 char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
@@ -567,7 +576,7 @@ nvme_controller_enable(struct nvme_ctrl *ctrl)
 /* Populate namespace IDs */
 int ns_idx;
 for (ns_idx = 0; ns_idx < ctrl->ns_count; ns_idx++) {
-nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1);
+nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1, identify->mdts);
 }
 
 dprintf(3, "NVMe initialization complete!\n");
-- 
2.16.4






[SeaBIOS] [PATCH v2 0/3] nvme: Add PRP List support

2020-09-30 Thread Alexander Graf
While looking at VM bootup times, we stumbled over the fact that the NVMe
code only does I/O operations of up to 4kb at a given point in time. This
is usually ok, but if you have an OS that loads a lot of data on boot in
combination with network-backed storage, it shows in bootup times.

There is no need to restrict ourselves to 4kb though. The INT13 call we
receive gives us much larger chunks, which we can map directly into a single
larger NVMe I/O call if the request buffer is page aligned.

This patch implements all logic required to do the above and gives a
substantial performance boost on boot.

v1 -> v2:

  - fix dprintf
  - Fix bounds check on PRP list add logic
  - Reduce PRP list to 15 entries embedded in the ns struct.
This reduces BIOS reserved memory footprint by 4kb.

Alexander Graf (3):
  nvme: Record maximum allowed request size
  nvme: Allow to set PRP2
  nvme: Pass large I/O requests as PRP lists

 src/hw/nvme-int.h |  16 -
 src/hw/nvme.c | 106 +-
 2 files changed, 105 insertions(+), 17 deletions(-)

-- 
2.16.4






[SeaBIOS] [PATCH v2 2/3] nvme: Allow to set PRP2

2020-09-30 Thread Alexander Graf
When creating a PRP based I/O request, we pass in the pointer to operate
on. Going forward, we will also want to be able to pass additional pointers
for mappings above 4k.

This patch adds a parameter to nvme_get_next_sqe() to pass in the PRP2
value of an NVMe I/O request, paving the way for a future patch to
implement PRP lists.
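For context, what PRP2 must carry depends on the transfer length: it is unused when one page suffices, holds the second page's address for two-page transfers, and points at a PRP list beyond that. A sketch of that case analysis (illustrative; the enum and function names are invented, mirroring the logic a later patch adds to nvme_io_readwrite):

```c
#include <assert.h>

#define PAGE_SIZE 4096u

enum prp2_use { PRP2_UNUSED, PRP2_SECOND_PAGE, PRP2_LIST_POINTER };

/* Illustrative: what PRP2 must carry for a transfer of `bytes` bytes
 * starting on a page boundary. */
static enum prp2_use prp2_for(unsigned bytes)
{
    if (bytes <= PAGE_SIZE)
        return PRP2_UNUSED;          /* PRP1 alone is enough */
    if (bytes <= 2 * PAGE_SIZE)
        return PRP2_SECOND_PAGE;     /* embed the 2nd page directly */
    return PRP2_LIST_POINTER;        /* point at a PRP list */
}
```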

Signed-off-by: Alexander Graf 
Reviewed-by: Filippo Sironi 
---
 src/hw/nvme.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 7e56134..04bbba4 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -152,7 +152,7 @@ nvme_wait(struct nvme_sq *sq)
 /* Returns the next submission queue entry (or NULL if the queue is full). It
also fills out Command Dword 0 and clears the rest. */
 static struct nvme_sqe *
-nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
+nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *data2)
 {
 if (((sq->head + 1) & sq->common.mask) == sq->tail) {
 dprintf(3, "submission queue is full");
@@ -166,6 +166,7 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
 sqe->cdw0 = opc | (sq->tail << 16 /* CID */);
 sqe->mptr = (u32)metadata;
 sqe->dptr_prp1 = (u32)data;
+sqe->dptr_prp2 = (u32)data2;
 
 if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
 /* Data buffer not page aligned. */
@@ -200,7 +201,7 @@ nvme_admin_identify(struct nvme_ctrl *ctrl, u8 cns, u32 nsid)
 struct nvme_sqe *cmd_identify;
 cmd_identify = nvme_get_next_sqe(&ctrl->admin_sq,
  NVME_SQE_OPC_ADMIN_IDENTIFY, NULL,
- identify_buf);
+ identify_buf, NULL);
 
 if (!cmd_identify) {
 warn_internalerror();
@@ -338,7 +339,7 @@ nvme_create_io_cq(struct nvme_ctrl *ctrl, struct nvme_cq *cq, u16 q_idx)
 
 cmd_create_cq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_CQ, NULL,
-  cq->cqe);
+  cq->cqe, NULL);
 if (!cmd_create_cq) {
 goto err_destroy_cq;
 }
@@ -382,7 +383,7 @@ nvme_create_io_sq(struct nvme_ctrl *ctrl, struct nvme_sq *sq, u16 q_idx, struct
 
 cmd_create_sq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_SQ, NULL,
-  sq->sqe);
+  sq->sqe, NULL);
 if (!cmd_create_sq) {
 goto err_destroy_sq;
 }
@@ -429,7 +430,7 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf);
+ NULL, buf, NULL);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
-- 
2.16.4






[SeaBIOS] [PATCH 0/3] nvme: Add PRP List support

2020-09-29 Thread Alexander Graf
While looking at VM bootup times, we stumbled over the fact that the NVMe
code only does I/O operations of up to 4kb at a given point in time. This
is usually ok, but if you have an OS that loads a lot of data on boot in
combination with network-backed storage, it shows in bootup times.

There is no need to restrict ourselves to 4kb though. The INT13 call we
receive gives us much larger chunks, which we can map directly into a single
larger NVMe I/O call if the request buffer is page aligned.

This patch implements all logic required to do the above and gives a
substantial performance boost on boot.

Alexander Graf (3):
  nvme: Record maximum allowed request size
  nvme: Allow to set PRP2
  nvme: Pass large I/O requests as PRP lists

 src/hw/nvme-int.h |  14 ++-
 src/hw/nvme.c | 107 ++
 2 files changed, 104 insertions(+), 17 deletions(-)

-- 
2.16.4






[SeaBIOS] [PATCH 1/3] nvme: Record maximum allowed request size

2020-09-29 Thread Alexander Graf
NVMe has a limit on how many sectors it can handle within a single request.
Remember that number so that, in a follow-up patch, we can verify that we
don't exceed it.

Signed-off-by: Alexander Graf 
---
 src/hw/nvme-int.h |  8 +++-
 src/hw/nvme.c | 13 +++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9f95dd8..674008a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,6 +117,7 @@ struct nvme_namespace {
 
 u32 block_size;
 u32 metadata_size;
+u32 max_req_size;
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
@@ -131,7 +132,12 @@ struct nvme_identify_ctrl {
 char mn[40];
 char fr[8];
 
-char _boring[516 - 72];
+u8 rab;
+u8 ieee[3];
+u8 cmic;
+u8 mdts;
+
+char _boring[516 - 78];
 
 u32 nn; /* number of namespaces */
 };
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 6a01204..2cde6a7 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -238,7 +238,8 @@ nvme_admin_identify_ns(struct nvme_ctrl *ctrl, u32 ns_id)
 }
 
 static void
-nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
+nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
+  u8 mdts)
 {
 ns->ctrl  = ctrl;
 ns->ns_id = ns_id;
@@ -281,6 +282,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
 ns->drive.blksize   = ns->block_size;
 ns->drive.sectors   = ns->lba_count;
 
+if (mdts) {
+ns->max_req_size = 1U << mdts;
+dprintf(3, "NVME NS %u max request size: %d sectors\n",
+ns->max_req_size);
+} else {
+ns->max_req_size = -1U;
+}
+
 ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
 
 char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
@@ -567,7 +576,7 @@ nvme_controller_enable(struct nvme_ctrl *ctrl)
 /* Populate namespace IDs */
 int ns_idx;
 for (ns_idx = 0; ns_idx < ctrl->ns_count; ns_idx++) {
-nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1);
+nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1, identify->mdts);
 }
 
 dprintf(3, "NVMe initialization complete!\n");
-- 
2.16.4






[SeaBIOS] [PATCH 3/3] nvme: Pass large I/O requests as PRP lists

2020-09-29 Thread Alexander Graf
Today, we split every I/O request into at most 4kb chunks and wait for these
requests to finish. We encountered issues where the backing storage is
network-based, so every I/O request needs to go over the network with an
associated latency cost. A few ms of latency when loading a 100MB initrd in
4kb chunks does add up.

NVMe implements a feature to allow I/O requests spanning multiple pages,
called PRP lists. This patch takes larger I/O operations and checks if
they can be passed directly to the NVMe backing device as a PRP list.
At least for grub, read operations can always be mapped directly into
PRP list items.

This reduces the number of I/O operations required during a typical boot
path by roughly a factor of 5.

Signed-off-by: Alexander Graf 
---
 src/hw/nvme-int.h |  6 
 src/hw/nvme.c | 85 ---
 2 files changed, 81 insertions(+), 10 deletions(-)

diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 674008a..82b778a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -121,6 +121,11 @@ struct nvme_namespace {
 
 /* Page aligned buffer of size NVME_PAGE_SIZE. */
 char *dma_buffer;
+
+/* Page List */
+u32 prpl_len;
+void *prp1;
+u64 *prpl;
 };
 
 /* Data structures for NVMe admin identify commands */
@@ -195,6 +200,7 @@ union nvme_identify {
 #define NVME_CQE_DW3_P (1U << 16)
 
 #define NVME_PAGE_SIZE 4096
+#define NVME_PAGE_MASK ~(NVME_PAGE_SIZE - 1)
 
 /* Length for the queue entries. */
 #define NVME_SQE_SIZE_LOG 6
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 406ed19..49fb8f5 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -168,11 +168,6 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *data2)
 sqe->dptr_prp1 = (u32)data;
 sqe->dptr_prp2 = (u32)data2;
 
-if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
-/* Data buffer not page aligned. */
-warn_internalerror();
-}
-
 return sqe;
 }
 
@@ -292,6 +287,7 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
 }
 
 ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
+ns->prpl = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
 
 char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
   "blocks + %u-byte metadata)\n",
@@ -418,19 +414,29 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
   int write)
 {
 u32 buf_addr = (u32)buf;
+void *prp2;
 
-if ((buf_addr & 0x3) ||
-((buf_addr & ~(NVME_PAGE_SIZE - 1)) !=
- ((buf_addr + ns->block_size * count - 1) & ~(NVME_PAGE_SIZE - 1 {
-/* Buffer is misaligned or crosses page boundary */
+if (buf_addr & 0x3) {
+/* Buffer is misaligned */
 warn_internalerror();
 return DISK_RET_EBADTRACK;
 }
 
+   if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
+   /* We need to describe more than 2 pages, rely on PRP List */
+   prp2 = ns->prpl;
+   } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
+   /* Directly embed the 2nd page if we only need 2 pages */
+   prp2 = (void *)(long)*ns->prpl;
+   } else {
+   /* One page is enough, don't expose anything else */
+   prp2 = NULL;
+   }
+
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf, NULL);
+ NULL, buf, prp2);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
@@ -450,6 +456,60 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
 return DISK_RET_SUCCESS;
 }
 
+static void nvme_reset_prpl(struct nvme_namespace *ns)
+{
+ns->prpl_len = 0;
+}
+
+static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
+{
+ns->prpl[ns->prpl_len++] = base;
+
+if (ns->prpl_len > (NVME_PAGE_SIZE / sizeof(ns->prpl[0])))
+return -1;
+
+return 0;
+}
+
+int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
+{
+int first_page = 1;
+u32 base = (long)op->buf_fl;
+s32 size = op->count * ns->block_size;
+
+if (op->count > ns->max_req_size)
+return -1;
+
+nvme_reset_prpl(ns);
+
+/* Special case for transfers that fit into PRP1, but are unaligned */
+if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
+ns->prp1 = op->buf_fl;
+return 0;
+}
+
+/* Every request has to be page aligned */
+if (base & ~NVME_PAGE_MASK)
+re

[SeaBIOS] [PATCH 2/3] nvme: Allow to set PRP2

2020-09-29 Thread Alexander Graf
When creating a PRP based I/O request, we pass in the pointer to operate
on. Going forward, we will also want to be able to pass additional pointers
for mappings above 4k.

This patch adds a parameter to nvme_get_next_sqe() to pass in the PRP2
value of an NVMe I/O request, paving the way for a future patch to
implement PRP lists.

Signed-off-by: Alexander Graf 
---
 src/hw/nvme.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 2cde6a7..406ed19 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -152,7 +152,7 @@ nvme_wait(struct nvme_sq *sq)
 /* Returns the next submission queue entry (or NULL if the queue is full). It
also fills out Command Dword 0 and clears the rest. */
 static struct nvme_sqe *
-nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
+nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *data2)
 {
 if (((sq->head + 1) & sq->common.mask) == sq->tail) {
 dprintf(3, "submission queue is full");
@@ -166,6 +166,7 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
 sqe->cdw0 = opc | (sq->tail << 16 /* CID */);
 sqe->mptr = (u32)metadata;
 sqe->dptr_prp1 = (u32)data;
+sqe->dptr_prp2 = (u32)data2;
 
 if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
 /* Data buffer not page aligned. */
@@ -200,7 +201,7 @@ nvme_admin_identify(struct nvme_ctrl *ctrl, u8 cns, u32 nsid)
 struct nvme_sqe *cmd_identify;
 cmd_identify = nvme_get_next_sqe(&ctrl->admin_sq,
  NVME_SQE_OPC_ADMIN_IDENTIFY, NULL,
- identify_buf);
+ identify_buf, NULL);
 
 if (!cmd_identify) {
 warn_internalerror();
@@ -338,7 +339,7 @@ nvme_create_io_cq(struct nvme_ctrl *ctrl, struct nvme_cq *cq, u16 q_idx)
 
 cmd_create_cq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_CQ, NULL,
-  cq->cqe);
+  cq->cqe, NULL);
 if (!cmd_create_cq) {
 goto err_destroy_cq;
 }
@@ -382,7 +383,7 @@ nvme_create_io_sq(struct nvme_ctrl *ctrl, struct nvme_sq *sq, u16 q_idx, struct
 
 cmd_create_sq = nvme_get_next_sqe(&ctrl->admin_sq,
   NVME_SQE_OPC_ADMIN_CREATE_IO_SQ, NULL,
-  sq->sqe);
+  sq->sqe, NULL);
 if (!cmd_create_sq) {
 goto err_destroy_sq;
 }
@@ -429,7 +430,7 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
 struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
  write ? NVME_SQE_OPC_IO_WRITE
: NVME_SQE_OPC_IO_READ,
- NULL, buf);
+ NULL, buf, NULL);
 io_read->nsid = ns->ns_id;
 io_read->dword[10] = (u32)lba;
 io_read->dword[11] = (u32)(lba >> 32);
-- 
2.16.4






[SeaBIOS] Re: [PATCH v2] csm: Fix boot priority translation

2019-06-28 Thread Alexander Graf

On 20.06.19 14:07, David Woodhouse wrote:

For CSM, the highest priority for a boot entry is zero. SeaBIOS doesn't
use zero, and the highest priority is 1.

Make the results of csm_bootprio_*() conform to that convention. Also
explicitly handle the BBS_DO_NOT_BOOT_FROM and BBS_IGNORE_ENTRY values.

Signed-off-by: David Woodhouse 
---
v2: No code change, just correct the commit message.

  src/fw/csm.c | 20 +---
  1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/fw/csm.c b/src/fw/csm.c
index 3fcc252..7663d31 100644
--- a/src/fw/csm.c
+++ b/src/fw/csm.c
@@ -319,6 +319,20 @@ handle_csm(struct bregs *regs)
  csm_return(regs);
  }
  
+static int csm_prio_to_seabios(u16 csm_prio)

+{
+switch (csm_prio) {
+case BBS_DO_NOT_BOOT_FROM:
+case BBS_IGNORE_ENTRY:
+return -1;
+
+case BBS_LOWEST_PRIORITY:
+case BBS_UNPRIORITIZED_ENTRY:
+default:
+return csm_prio + 1;


Can you please add an inline comment explaining why exactly the + 1 is
needed? It's non-obvious enough that the git log is too distant an
information source.
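For illustration, the translation under discussion shifts every CSM priority up by one so that CSM's highest priority (0) maps to SeaBIOS's highest usable priority (1), with the sentinel BBS values mapping to "do not boot". A sketch carrying roughly the comment being asked for (BBS constant values reproduced here from edk2's LegacyBios.h for self-containedness):

```c
#include <assert.h>

/* BBS sentinel values, as defined in edk2's LegacyBios.h. */
#define BBS_DO_NOT_BOOT_FROM    0xFFFC
#define BBS_IGNORE_ENTRY        0xFFFF

/* CSM priorities start at 0 (highest); SeaBIOS does not use priority
 * 0 and starts at 1, hence the "+ 1" when translating. */
static int csm_prio_to_seabios(unsigned csm_prio)
{
    if (csm_prio == BBS_DO_NOT_BOOT_FROM || csm_prio == BBS_IGNORE_ENTRY)
        return -1;                   /* never boot this entry */
    return csm_prio + 1;
}
```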



Alex


Re: [SeaBIOS] [Qemu-devel] [PATCH RFC V2 14/17] hw/pci: piix - support multiple host bridges

2015-02-20 Thread Alexander Graf


On 16.02.15 14:29, Marcel Apfelbaum wrote:
 On 02/16/2015 03:00 PM, Alexander Graf wrote:


 On 16.02.15 10:54, Marcel Apfelbaum wrote:
 From: Marcel Apfelbaum marce...@redhat.com

 Instead of assuming it has only one bus, it
 enumerates all the host bridges until it finds
 the one whose bus number corresponds to the
 config register.

 Signed-off-by: Marcel Apfelbaum mar...@redhat.com

 Hi Alexander,
 Thank you for the review.
 
 
 How 440 specific is this? Wouldn't we need similar code for q35 and gpxe?
 For gpxe: I have no idea.
 For Q35: PCI Express have native support for extra root bridges by
 having multiple Root Complexes(RC), but in this case each RC
 handles its separate configuration space.
 
 It may be possible to use the same hack as in PC to expose a
 PCIe Root Port as a different host bridge and a primary bus behind
 it, but it is out of this series scope.
 
 The series aims to address the limitation that PC machines support
 NUMA nodes for CPU/memory but not PCI.
 
 Anyway, we can move the code when needed, but since it will
 take some time and the implementation is not certain, for the moment
 it is 440 specific.

Sorry, I didn't realize that this was legacy PCI specific :).


Alex

___
SeaBIOS mailing list
SeaBIOS@seabios.org
http://www.seabios.org/mailman/listinfo/seabios


Re: [SeaBIOS] [Qemu-devel] [PATCH RFC V2 12/17] hw/pci: introduce PCI Expander Bridge (PXB)

2015-02-16 Thread Alexander Graf


On 16.02.15 10:54, Marcel Apfelbaum wrote:
 From: Marcel Apfelbaum marce...@redhat.com
 
 PXB is a light-weight host bridge whose purpose is to enable
 the main host bridge to support multiple PCI root buses.
 
 As oposed to PCI-2-PCI bridge's secondary bus, PXB's bus
 is a primary bus and can be associated with a NUMA node
 (different from the main host bridge) allowing the guest OS
 to recognize the proximity of a pass-through device to
 other resources as RAM and CPUs.
 
 The PXB is composed from:
  - A primary PCI bus (can be associated with a NUMA node)
Acts like a normal pci bus and from the functionality point
of view is an expansion of the bus behind the
main host bridge.
  - A pci-2-pci bridge behind the primary PCI bus where the actual
devices will be attached.
  - A host-bridge PCI device
Situated on the bus behind the main host bridge, allows
the BIOS to configure the bus number and IO/mem resources.
It does not have its own config/data register for configuration
cycles, this being handled by the main host bridge.
 -  A host-bridge sysbus to comply with QEMU current design.
 
 Signed-off-by: Marcel Apfelbaum mar...@redhat.com
 ---
  hw/pci-bridge/Makefile.objs |   1 +
  hw/pci-bridge/pci_expander_bridge.c | 173 
 
  include/hw/pci/pci.h|   1 +
  3 files changed, 175 insertions(+)
  create mode 100644 hw/pci-bridge/pci_expander_bridge.c
 
 diff --git a/hw/pci-bridge/Makefile.objs b/hw/pci-bridge/Makefile.objs
 index 968b369..632e442 100644
 --- a/hw/pci-bridge/Makefile.objs
 +++ b/hw/pci-bridge/Makefile.objs
 @@ -1,4 +1,5 @@
  common-obj-y += pci_bridge_dev.o
 +common-obj-y += pci_expander_bridge.o
  common-obj-y += ioh3420.o xio3130_upstream.o xio3130_downstream.o
  common-obj-y += i82801b11.o
  # NewWorld PowerMac
 diff --git a/hw/pci-bridge/pci_expander_bridge.c 
 b/hw/pci-bridge/pci_expander_bridge.c
 new file mode 100644
 index 000..941f3c8
 --- /dev/null
 +++ b/hw/pci-bridge/pci_expander_bridge.c
 @@ -0,0 +1,173 @@
 +/*
 + * PCI Expander Bridge Device Emulation
 + *
 + * Copyright (C) 2014 Red Hat Inc
 + *
 + * Authors:
 + *   Marcel Apfelbaum marce...@redhat.com
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2 or later.
 + * See the COPYING file in the top-level directory.
 + */
 +
 +#include "hw/pci/pci.h"
 +#include "hw/pci/pci_bus.h"
 +#include "hw/pci/pci_host.h"
 +#include "hw/pci/pci_bus.h"
 +#include "qemu/range.h"
 +#include "qemu/error-report.h"
 +
 +#define TYPE_PXB_BUS "pxb-bus"
 +#define PXB_BUS(obj) OBJECT_CHECK(PXBBus, (obj), TYPE_PXB_BUS)
 +
 +typedef struct PXBBus {
 +/* private */
 +PCIBus parent_obj;
 +/* public */
 +
 +char bus_path[8];
 +} PXBBus;
 +
 +#define TYPE_PXB_DEVICE "pxb-device"
 +#define PXB_DEV(obj) OBJECT_CHECK(PXBDev, (obj), TYPE_PXB_DEVICE)
 +
 +typedef struct PXBDev {
 +/* private */
 +PCIDevice parent_obj;
 +/* public */
 +
 +uint8_t bus_nr;
 +} PXBDev;
 +
 +#define TYPE_PXB_HOST "pxb-host"
 +
 +static int pxb_bus_num(PCIBus *bus)
 +{
 +PXBDev *pxb = PXB_DEV(bus->parent_dev);
 +
 +return pxb->bus_nr;
 +}
 +
 +static bool pxb_is_root(PCIBus *bus)
 +{
 +return true; /* by definition */
 +}
 +
 +static void pxb_bus_class_init(ObjectClass *class, void *data)
 +{
 +PCIBusClass *pbc = PCI_BUS_CLASS(class);
 +
 +pbc->bus_num = pxb_bus_num;
 +pbc->is_root = pxb_is_root;
 +}
 +
 +static const TypeInfo pxb_bus_info = {
 +.name  = TYPE_PXB_BUS,
 +.parent= TYPE_PCI_BUS,
 +.instance_size = sizeof(PXBBus),
 +.class_init= pxb_bus_class_init,
 +};
 +
 +static const char *pxb_host_root_bus_path(PCIHostState *host_bridge,
 +  PCIBus *rootbus)
 +{
 +PXBBus *bus = PXB_BUS(rootbus);
 +
 +snprintf(bus->bus_path, 8, ":%02x", pxb_bus_num(rootbus));
 +return bus->bus_path;
 +}
 +
 +static void pxb_host_class_init(ObjectClass *class, void *data)
 +{
 +DeviceClass *dc = DEVICE_CLASS(class);
 +PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(class);
 +
 +dc->fw_name = "pci";
 +hc->root_bus_path = pxb_host_root_bus_path;
 +}
 +
 +static const TypeInfo pxb_host_info = {
 +.name  = TYPE_PXB_HOST,
 +.parent= TYPE_PCI_HOST_BRIDGE,
 +.class_init= pxb_host_class_init,
 +};
 +
 +static int pxb_dev_initfn(PCIDevice *dev)
 +{
 +PXBDev *pxb = PXB_DEV(dev);
 +DeviceState *ds, *bds;
 +PCIHostState *phs;
 +PCIBus *bus;
 +const char *dev_name = NULL;
 +
 +HOST_BRIDGE_FOREACH(phs) {
 +if (pxb->bus_nr == pci_bus_num(phs->bus)) {
 +error_report("Bus nr %d is already used by %s.",
 + pxb->bus_nr, phs->bus->qbus.name);
 +return -EINVAL;
 +}
 +}
 +
 +if (dev->qdev.id && *dev->qdev.id) {
 +dev_name = dev->qdev.id;
 +}
 +
 +ds = qdev_create(NULL, TYPE_PXB_HOST);
 +bus = pci_bus_new(ds, "pxb-internal", NULL, NULL, 0, 

Re: [SeaBIOS] [Qemu-devel] [PATCH RFC V2 14/17] hw/pci: piix - support multiple host bridges

2015-02-16 Thread Alexander Graf


On 16.02.15 10:54, Marcel Apfelbaum wrote:
 From: Marcel Apfelbaum marce...@redhat.com
 
 Instead of assuming it has only one bus, it
 enumerates all the host bridges until it finds
 the one with bus number corresponding with the
 config register.
 
 Signed-off-by: Marcel Apfelbaum mar...@redhat.com

How 440 specific is this? Wouldn't we need similar code for q35 and gpxe?


Alex

___
SeaBIOS mailing list
SeaBIOS@seabios.org
http://www.seabios.org/mailman/listinfo/seabios


Re: [SeaBIOS] [Qemu-devel] KVM call agenda for 2013-06-25

2013-06-25 Thread Alexander Graf

On 22.06.2013, at 03:01, Alexander Graf wrote:

 
 On 20.06.2013, at 14:47, Michael S. Tsirkin wrote:
 
 Please, send any topic that you are interested in covering.
 
 VFIO with platform devices

A few notes ahead of the discussion:


-problem we are trying to solve is:
  1.  how the kernel can expose device tree based platform
  devices to user space
  2.  how QEMU exposes these devices in virtual machines

-general
  -this is not just a QEMU/KVM issue, we have other user space 
   drivers that should use the same framework (e.g. USDPAA)
  -acpi/dsdt based device discovery is on the horizon, but this 
   should be treated separately
  -we want the same linux drivers to be used in host and
   guest...don't want 'paravirt' drivers in guest

-3 aspects to this

  1. kernel side
 -a device must be able to be unbound from the host
  and bound to vfio
 -vfio can determine the mappable register regions and
  interrupts from the associated device tree node and
  expose these to user space...details TBD
 -keep it simple: expose register regions and interrupts

  2. user-space/QEMU - device tree

 -there is potentially a large variety of devices, the
  simplest are represented by a simple self-contained
  node, others get complicated and could have phandles to
  other nodes (e.g. etsec has phandle to phy node).  Some
  devices may need device tree tweaks.

 -think what we need is custom handling in QEMU for
  each type of device.  QEMU uses what the kernel
  exposed via vfio to pass the device through.
  For example, to pass through a SATA controller,
  something like:

  -device platform-vfio-sata,dev=/soc/sata@221000,bus=platform.0

 -the device specific support in QEMU knows how to
  create a device tree node for the given device type

 -it is up to user space to deal with device inter-
  relationships if they exist

 -SEC engine job ring example-- what we want ideally is a 
  Linux driver that can operate on plain job rings
  so we don't need the complexity of dealing with the
  main SEC node:

[cut]
crypto@30 {
   compatible = "fsl,sec-v4.0";
   #address-cells = <0x1>;
   #size-cells = <0x1>;
   reg = <0x30 0x1>;
   ranges = <0x0 0x30 0x1>;
   interrupts = <0x5c 0x2 0x0 0x0>;

   jr@1000 {
  compatible = "fsl,sec-v4.0-job-ring";
  reg = <0x1000 0x1000>;
  interrupts = <0x58 0x2 0x0 0x0>;
   };

   jr@2000 {
  compatible = "fsl,sec-v4.0-job-ring";
  reg = <0x2000 0x1000>;
  interrupts = <0x59 0x2 0x0 0x0>;
   };
[cut]

  3. user-space/QEMU - dynamic interrupt allocation

 -currently devices and interrupts on our system/platform
  bus are statically allocated.  We need a way to
  dynamically get devices added

 -simple example would be adding a virtual UART via -device

 -it would be nice to solve this generally for both
  ARM and Power



Re: [SeaBIOS] [Qemu-devel] KVM call agenda for 2013-06-25

2013-06-21 Thread Alexander Graf

On 20.06.2013, at 14:47, Michael S. Tsirkin wrote:

 Please, send any topic that you are interested in covering.

VFIO with platform devices


Alex




Re: [SeaBIOS] [PATCH 3/3] acpi: revert d9f5cdbdf (DSDT: Fix HPET _CRS Method)

2012-12-11 Thread Alexander Graf


On 11.12.2012, at 09:21, Gerd Hoffmann kra...@redhat.com wrote:

  Hi,
 
 Sounds sensible.  Although the smc emulation in qemu looks
 incomplete, there are a bunch of #defines which are never ever
 used.  Maybe a more complete emulation offers a way to detect the
 smc without side effects. Failing that some magic register looks ok
 to me.  We should make sure this doesn't conflict with anything
 the real hardware does though.
 
 So what if SeaBIOS maps some PCI BAR to 0x300
 
 It wouldn't.
 
 The whole point of a DSDT is to free the OS from probing devices.
 
 Or provide safe methods to probe devices.
 
 For serial/parallel/floppy there are bits in the piix4/q35 chipset which
 indicate whenever those devices are present.  We are using that, and
 this is actually similar to real hardware.
 
 For the hpet we are accessing the hardware directly and checking for a
 valid vendor id to figure whenever the device is present or not (due to
 -no-hpet).
 
 Assuming there is a side-effect free way to detect the applesmc I can't
 see what is wrong with using that.  If we can't detect it without side
 effects we'll need a paravirtualized way to handle it.  The options I
 see are:
 
  (1) as suggested add a special detection register to our applesmc
  emulation.

This means we are different than real hw.

  (2) find a spare bit in q35, next to the present bits for
  serial/parallel/floppy

Which means we differ from real hw - bad.

 (osx wouldn't boot on piix4 anyway,
  so we don't have to care, right?)

It could, but you need to manually add an lpc device. I don't think we really 
care about the piix4 case though :).

  (3) use fw_cfg.
 
 (1) + (2) are easy to handle from AML code.

Why is accessing fw_cfg harder than accessing the applesmc? Poking fw_cfg pios 
rather than applesmc pios shouldn't be an issue, right? After all, fw_cfg's 
purpose in life is to tell us details about the device model we can't easily 
find out otherwise.


Alex

 
 cheers,
  Gerd
 



Re: [SeaBIOS] DSDT: SMC device node (was: Fix HPET _CRS Method)

2012-12-11 Thread Alexander Graf

On 12.12.2012, at 01:30, Kevin O'Connor wrote:

 On Tue, Dec 11, 2012 at 09:53:47AM -0500, Gabriel L. Somlo wrote:
 On Tue, Dec 11, 2012 at 09:33:21AM +0100, Alexander Graf wrote:
 (1) as suggested add a special detection register to our applesmc
 emulation.
 
 This means we are different than real hw.
 
 (2) find a spare bit in q35, next to the present bits for
 serial/parallel/floppy
 
 Which means we differ from real hw - bad.
 
 (osx wouldn't boot on piix4 anyway,
 so we don't have to care, right?)
 
 It could, but you need to manually add an lpc device. I don't think we 
 really care about the piix4 case though :).
 
 (3) use fw_cfg.
 
 (1) + (2) are easy to handle from AML code.
 [...]
 What do you guys think ?
 
 How about generating an SSDT with just the AppleSMC info in it, and
 then have qemu pass the SSDT into SeaBIOS using the existing ACPI
 table passing mechanism when the device is present.

That would mean we move the present detection to SeaBIOS which also simply 
talks to fw_cfg. So it doesn't really buy us anything over an _STA function, 
does it?


Alex




Re: [SeaBIOS] DSDT: SMC device node (was: Fix HPET _CRS Method)

2012-12-11 Thread Alexander Graf

On 12.12.2012, at 01:51, Kevin O'Connor wrote:

 On Wed, Dec 12, 2012 at 01:40:52AM +0100, Alexander Graf wrote:
 On 12.12.2012, at 01:30, Kevin O'Connor wrote:
 On Tue, Dec 11, 2012 at 09:53:47AM -0500, Gabriel L. Somlo wrote:
 What do you guys think ?
 
 How about generating an SSDT with just the AppleSMC info in it, and
 then have qemu pass the SSDT into SeaBIOS using the existing ACPI
 table passing mechanism when the device is present.
 
 That would mean we move the present detection to SeaBIOS which also simply 
 talks to fw_cfg. So it doesn't really buy us anything over an _STA function, 
 does it?
 
 
 QEMU can already pass an SSDT table to SeaBIOS - I was suggesting that
 QEMU could make the decision whether or not to send the table based on
 its direct knowledge of whether or not the device is present.

Another problem with the SSDT is that today it's mostly used for CPU nodes. So 
whether it works correctly with device nodes in a particular OS implementation 
is an open question.

 There is, of course, complexities with this type of implementation
 too.  However, I felt compelled to reiterate my desire for ACPI to be
 moved from SeaBIOS to QEMU as that's really the place that has all the
 knowledge to do it correctly.

I agree, the ideal case would be that QEMU assembles the DSDT on the fly, 
similar to how we do it for the fdt on e500 or the fdt on spapr. For the time 
being, I think doing an fw_cfg variable for whether we have an applesmc or not 
is a valid and safe solution though.


Alex




Re: [SeaBIOS] [PATCH 3/3] acpi: revert d9f5cdbdf (DSDT: Fix HPET _CRS Method)

2012-12-10 Thread Alexander Graf


On 11.12.2012, at 07:57, Gerd Hoffmann kra...@redhat.com wrote:

  Hi,
 
 But _INI relies on _STA being evaluated first, so unless I'm missing
 something that'd be a catch-22 :(
 
 Oops, indeed.
 
 So, I can think of a couple of alternatives:
 
1. have a SMC._STA() method that queries something
 
a) could be fw_cfg (may require hacking
   qemu/hw/applesmc.c *and* fw_cfg.c to make that happen)
 
 I'd avoid fw_cfg if possible.
 
b) could be the emulated SMC itself:
 
- right now, only the Data (0x300) and
  Command (0x304) ports are ever read/written,
  so we could set aside a magic qemu-smc-only
  port (e.g. 0x308, but anything within the
  32-byte range should work) and have it
  always return a magic number when read
  (via a qemu/hw/applesmc.c patch);
 
 Sounds sensible.  Although the smc emulation in qemu looks incomplete,
 there are a bunch of #defines which are never ever used.  Maybe a more
 complete emulation offers a way to detect the smc without side effects.
 Failing that some magic register looks ok to me.  We should make sure
 this doesn't conflict with anything the real hardware does though.

So what if SeaBIOS maps some PCI BAR to 0x300 with a device that goes wild if 
you write that detection sequence?

The whole point of a DSDT is to free the OS from probing devices.

Alex

 
b) Kevin mentioned the SSDT and/or BDAT as other
   potential mechanisms to make something like this
 
 I don't think this helps much.  This just moves the smc detection from
 AML code to C code (and BDAT would be used to store the result in a way
 accessible from AML).
 
 cheers,
  Gerd



Re: [SeaBIOS] [PATCH 3/3] acpi: revert d9f5cdbdf (DSDT: Fix HPET _CRS Method)

2012-12-06 Thread Alexander Graf

On 06.12.2012, at 20:11, Gabriel L. Somlo wrote:

 On Thu, Dec 06, 2012 at 06:10:46PM +0100, Alexander Graf wrote:
 If I could figure out how to write a reliable _STA method for the SMC,
 that would detect whether or not it was supplied as -device applesmc
 on the qemu command line, I think we might be on to something...
 
 You should be able to poke fw_cfg in the _STA method. The machine file could 
 search its bus on init (or on machine create notify) and populate a fw_cfg 
 variable to indicate whether it found an applesmc.
 
 So the fw_cfg looks like this in 'info qtree':
 
  dev: fw_cfg, id 
ctl_iobase = 0x510
data_iobase = 0x511
irq 0
mmio /0002
mmio /0001
 
 I'm confused by the mmio values (start at the very top of memory, go
 for one or two bytes from there ?)

No idea what they're doing there. Just ignore them for now.

 I also found this patch by Gerd, which doesn't seem to have made it
 upstream (yet):
 
 http://www.mail-archive.com/qemu-devel@nongnu.org/msg110131.html
 
 That patch suggests that I could just write stuff to 0x510 and read
 from 0x511, and if I get it right the values I read will end up
 making sense :)

Yup :). And the value could be the STA result.

 Can anyone can point me to an example of how to interact with fw_cfg ?

Sure! Check out

  
http://git.qemu.org/?p=qemu.git;a=blob;f=pc-bios/optionrom/optionrom.h;h=3daf7da49576091e337b1dab814e87310d5ea25f;hb=HEAD


Alex




Re: [SeaBIOS] [PATCH v3] DSDT: Fix HPET _CRS Method

2012-11-19 Thread Alexander Graf

On 16.11.2012, at 19:46, Kevin O'Connor wrote:

 On Fri, Nov 16, 2012 at 01:02:18PM -0500, Gabriel L. Somlo wrote:
 ping
 
 On Thu, Nov 08, 2012 at 12:35:17PM -0500, Gabriel L. Somlo wrote:
 Updated _CRS method for HPET, bringing it in line with the way it is
 presented on recent hardware (e.g. Dell Latitude D630, MacPro5,1, etc);
 Allows it to be detected and utilized from Mac OS X; Also tested OK on
 Linux (F16 64-bit install DVD) and Windows (Win7 64-bit install DVD).
 
 Signed-off-by: Gabriel Somlo so...@cmu.edu
 
 I'm okay with the patch.  I'm looking to see an Ack from one of the
 kvm/qemu developers.  (I saw Gerd was okay with the last version of
 the patch.)

IIRC Marcelo was the one changing the DSDT code from what this patch produces 
to what we have today. My initial HPET DSDT entry was also copied from real 
hardware.

Marcelo, any memory on this? :) Keep in mind I might be misremembering - maybe 
it was someone else after all ;).


Alex




Re: [SeaBIOS] [Qemu-devel] [PATCH v3] DSDT: Fix HPET _CRS Method

2012-11-19 Thread Alexander Graf

On 19.11.2012, at 12:33, Gleb Natapov wrote:

 On Mon, Nov 19, 2012 at 11:22:48AM +0100, Alexander Graf wrote:
 
 On 16.11.2012, at 19:46, Kevin O'Connor wrote:
 
 On Fri, Nov 16, 2012 at 01:02:18PM -0500, Gabriel L. Somlo wrote:
 ping
 
 On Thu, Nov 08, 2012 at 12:35:17PM -0500, Gabriel L. Somlo wrote:
 Updated _CRS method for HPET, bringing it in line with the way it is
 presented on recent hardware (e.g. Dell Latitude D630, MacPro5,1, etc);
 Allows it to be detected and utilized from Mac OS X; Also tested OK on
 Linux (F16 64-bit install DVD) and Windows (Win7 64-bit install DVD).
 
 Signed-off-by: Gabriel Somlo so...@cmu.edu
 
 I'm okay with the patch.  I'm looking to see an Ack from one of the
 kvm/qemu developers.  (I saw Gerd was okay with the last version of
 the patch.)
 
 IIRC Marcelo was the one changing the DSDT code from what this patch 
 produces to what we have today. My initial HPET DSDT entry was also copied 
 from real hardware.
 
 Marcelo, any memory on this? :) Keep in mind I might be misremembering - 
 maybe it was someone else after all ;).
 
 I think you are misremembering. Git shows that HPET entry was taken from
 pcbios code and my old pcbios git tree says that the dsdt code was
 contributed by Beth Kon and it is the same as we have it now in SeaBIOS.

Hrm. I'm 100% sure that from my code (which Beth based on) to the code that we 
have, someone had completely rewritten the DSDT entry. Either way, not 
complaining as long as we get it sorted out eventually :).

 The patch looks OK.

Mind to make this a formal acked-by? :)


Alex




Re: [SeaBIOS] [Qemu-devel] [PATCH v3] DSDT: Fix HPET _CRS Method

2012-11-19 Thread Alexander Graf

On 19.11.2012, at 12:51, Gleb Natapov wrote:

 On Mon, Nov 19, 2012 at 12:35:47PM +0100, Alexander Graf wrote:
 
 On 19.11.2012, at 12:33, Gleb Natapov wrote:
 
 On Mon, Nov 19, 2012 at 11:22:48AM +0100, Alexander Graf wrote:
 
 On 16.11.2012, at 19:46, Kevin O'Connor wrote:
 
 On Fri, Nov 16, 2012 at 01:02:18PM -0500, Gabriel L. Somlo wrote:
 ping
 
 On Thu, Nov 08, 2012 at 12:35:17PM -0500, Gabriel L. Somlo wrote:
 Updated _CRS method for HPET, bringing it in line with the way it is
 presented on recent hardware (e.g. Dell Latitude D630, MacPro5,1, etc);
 Allows it to be detected and utilized from Mac OS X; Also tested OK on
 Linux (F16 64-bit install DVD) and Windows (Win7 64-bit install DVD).
 
 Signed-off-by: Gabriel Somlo so...@cmu.edu
 
 I'm okay with the patch.  I'm looking to see an Ack from one of the
 kvm/qemu developers.  (I saw Gerd was okay with the last version of
 the patch.)
 
 IIRC Marcelo was the one changing the DSDT code from what this patch 
 produces to what we have today. My initial HPET DSDT entry was also copied 
 from real hardware.
 
 Marcelo, any memory on this? :) Keep in mind I might be misremembering - 
 maybe it was someone else after all ;).
 
 I think you are misremembering. Git shows that HPET entry was taken from
 pcbios code and my old pcbios git tree says that the dsdt code was
 contributed by Beth Kon and it is the same as we have it now in SeaBIOS.
 
 Hrm. I'm 100% sure that from my code (which Beth based on) to the code that 
 we have, someone had completely rewritten the DSDT entry. Either way, not 
 complaining as long as we get it sorted out eventually :).
 
 Here is Beth's commit in bochs  :)
 http://0x0badc0.de/gitweb?p=bochs/.git;a=commitdiff;h=874040e817e49811c30405d7c5d14b4c42161ca1;hp=1d2bac352688c75e8fdf8010f012b8ec85ba68d1

Ah, apparently Beth's DSDT bits were a rewrite done by Ryan. My original 
version looked like this:

  http://lists.gnu.org/archive/html/qemu-devel/2008-01/msg00180.html

but obviously broke quite a few things along the way ;). Also, I do remember 
that I extracted the HPET bits from there and submitted them individually, but 
I might really just be misremembering and it's not like that makes any 
difference.

 
 I added _STA logic in SeaBIOS but otherwise the code is the same.
 
 The patch looks OK.
 
 Mind to make this a formal acked-by? :)
 
 Sure, will do.

Just reply with the line ;)


Alex




Re: [SeaBIOS] [Qemu-devel] Hack integrating SeaBios / LinuxBoot option rom with QEMU trace backends

2011-10-10 Thread Alexander Graf

On 10.10.2011, at 20:53, Anthony Liguori wrote:

 On 10/10/2011 12:08 PM, Daniel P. Berrange wrote:
 I've been investigating where time disappears to when booting Linux guests.
 
 Initially I enabled DEBUG_BIOS in QEMU's hw/pc.c, and then hacked it so
 that it could print a timestamp before each new line of debug output. The
 problem with that is that it slowed down startup, so the timings I was
 examining all changed.
 
 What I really wanted was to use QEMU's trace infrastructure with a simple
 SystemTAP script. This is easy enough in the QEMU layer, but I also need
 to see where time goes to inside the various BIOS functions, and the
 options ROMs such as LinuxBoot. So I came up with a small hack to insert
 probes into SeaBios and LinuxBoot, which trigger a special IO port
 (0x404), which then cause QEMU to emit a trace event.
 
 The implementation is really very crude and does not allow any arguments
 to be passed each probes, but since all I care about is timing information,
 it is good enough for my needs.
 
 I'm not really expecting these patches to be merged into QEMU/SeaBios
 since they're just a crude hack  I don't have time to write something
 better. I figure they might be useful for someone else though...
 
 With the attached patches applied to QEMU and SeaBios, the attached
 systemtap script can be used to debug timings in QEMU startup.
 
 For example, one execution of QEMU produced the following log:
 
   $ stap qemu-timing.stp
   0.000 Start
   0.036 Run
   0.038 BIOS post
   0.180 BIOS int 19
   0.181 BIOS boot OS
   0.181 LinuxBoot copy kernel
   1.371 LinuxBoot copy initrd
 
 Yeah, there was a thread a bit ago about the performance of the interface to 
 read the kernel/initrd.  I think at it was using single byte access 
 instructions and there were patches to use string accessors instead?  I can't 
 remember where that threaded ended up.

IIRC we're already using string accessors, but it's still slow. Richard had a 
nice patch cooked up to basically have the fw_cfg interface be able to DMA its 
data to the guest. I like the idea. Avi did not.

And yes, bad -kernel performance does hurt in some workloads. A lot.


Alex




Re: [SeaBIOS] Graphics card pass-through working with two pass pci-initialization

2011-06-01 Thread Alexander Graf

On 01.06.2011, at 09:30, Gerd Hoffmann wrote:

  Hi,
 
 0xE000 is hard-coded in the DSDT for both piix and q35 as below.
 If the range is determined dynamically, the area also needs to be
 updated somehow dynamically.
 
 ...
 Name (_CRS, ResourceTemplate ()
 ...
 DWordMemory (ResourceProducer, PosDecode, MinFixed, 
 MaxFixed, NonCacheable, ReadWrite,
 0x, // Address Space Granularity
 0xE000, // Address Range Minimum
 0xFEBF, // Address Range Maximum
 0x, // Address Translation Offset
 0x1EC0, // Address Length
 ,, , AddressRangeMemory, TypeStatic)
 
 Uhm, indeed.  I know next to nothing about ACPI though.  Ideas anyone how 
 this could be done?

We're facing similar issues on PPC. The equivalent of the DSDT there is the 
device tree, which is currently passed in as binary blob and slightly appended 
for dynamic configuration. I'd much rather like to see it fully generated 
inside of Qemu from all the information we have available there, so we don't 
run into consistency issues.

This will be even more required when we pass through SoC devices to the guest, 
which are not on a PCI bus. Without specifying them in the DT, the guest 
doesn't know about them. X86 has a similar issue. Take a look at the HPET for 
example. If you don't want an HPET inside the guest, the DSDT needs to be 
modified. So you need to change things at 2 places - the DSDT and Qemu.

I don't know how much work it would be to generate the DSDT dynamically from 
Qemu, but IMHO that's the sanest way to make things flexible. We could probably 
even extract most information from the Qdev tree.


Alex




Re: [SeaBIOS] Graphics card pass-through working with two pass pci-initialization

2011-06-01 Thread Alexander Graf

On 01.06.2011, at 13:13, Avi Kivity wrote:

 On 06/01/2011 12:56 PM, Alexander Graf wrote:
 On 01.06.2011, at 09:30, Gerd Hoffmann wrote:
 
Hi,
 
   0xE000 is hard-coded in the DSDT for both piix and q35 as below.
   If the range is determined dynamically, the area also needs to be
   updated somehow dynamically.
 
   ...
   Name (_CRS, ResourceTemplate ()
   ...
   DWordMemory (ResourceProducer, PosDecode, MinFixed, 
  MaxFixed, NonCacheable, ReadWrite,
   0x, // Address Space Granularity
   0xE000, // Address Range Minimum
   0xFEBF, // Address Range Maximum
   0x, // Address Translation Offset
   0x1EC0, // Address Length
   ,, , AddressRangeMemory, TypeStatic)
 
   Uhm, indeed.  I know next to nothing about ACPI though.  Ideas anyone how 
  this could be done?
 
 We're facing similar issues on PPC. The equivalent of the DSDT there is the 
 device tree, which is currently passed in as binary blob and slightly 
 appended for dynamic configuration. I'd much rather like to see it fully 
 generated inside of Qemu from all the information we have available there, 
 so we don't run into consistency issues.
 
 This will be even more required when we pass through SoC devices to the 
 guest, which are not on a PCI bus. Without specifying them in the DT, the 
 guest doesn't know about them. X86 has a similar issue. Take a look at the 
 HPET for example. If you don't want an HPET inside the guest, the DSDT needs 
 to be modified. So you need to change things at 2 places - the DSDT and Qemu.
 
 I don't know how much work it would be to generate the DSDT dynamically from 
 Qemu, but IMHO that's the sanest way to make things flexible. We could 
 probably even extract most information from the Qdev tree.
 
 
 Generating the DSDT dynamically is hard, but the DSDT itself is dynamic.  You 
 can make any function talk to the firmware configuration interface and return 
 results that depend on the information there.

Does that hold true for nodes as well? I thought you can only use 'functions' 
for specific elements?


Alex

