On 27/09/2024 0:44, Daniel Verkamp wrote:
On Wed, Sep 25, 2024 at 4:42 PM Max Gurtovoy <mgurto...@nvidia.com> wrote:
Hi Daniel,
On 25/09/2024 23:57, Daniel Verkamp wrote:
On Wed, Sep 25, 2024 at 8:05 AM Max Gurtovoy <mgurto...@nvidia.com> wrote:
This field is only valid when the VIRTIO_BLK_F_BLK_SIZE feature bit is
set.
The blk_size field represents the smallest addressable unit of data that
can be read from or written to the device. It is always a power of two
and typically ranges from 512 bytes to larger values such as 4 KB.
I agree that this is what the blk_size field probably *should* have
meant (matching other storage protocols like ATA, SCSI, NVMe, ...),
but I believe rewriting the spec like this changes the meaning of the
blk_size field in an incompatible way.
I believe that this is more of a clarification than a re-write.
Previously, blk_size was described as the "optimal" sector size, so it
seems clear that the driver is still allowed to send requests that are
not aligned to blk_size ("awareness of the correct value can affect
performance" but no mention of correctness). A device may need to do
extra work to support this, of course, such as turning an unaligned
write into a read-modify-write sequence if the underlying storage
device doesn't support such an access directly, but such requests are
still valid. It's also perfectly fine for a driver to ignore
device-offered features like VIRTIO_BLK_F_BLK_SIZE and still operate
correctly.
A device should report its block size (logical) using the VIRTIO_BLK_F_BLK_SIZE
feature. For legacy devices that don't report this feature, Linux assumes a
512-byte block size.
Sure, I agree with both points here.
Great.
So I guess we agree that blk_size == logical_block_size.
Drivers should not ignore this feature when it's offered by the device. It's
crucial for correct operation IMO.
This is a "should", not a "must", though - adding a new MUST is a
breaking change.
I can change it to 'SHOULD', but this modification has implications:
1. drivers that ignore the VIRTIO_BLK_F_BLK_SIZE feature will make
assumptions about the logical block size.
2. Incorrect assumptions could lead to undefined behavior.
I'm ok with 'SHOULD', since it's a strong recommendation for drivers to
negotiate and respect this feature when available.
I'm not really familiar with drivers that ignore this feature so I won't
argue here.
The statement "driver is still allowed to send requests that are not aligned to
blk_size" needs explicit clarification in the spec.
It definitely would be better to have it written out explicitly, one
way or the other, no disagreement there.
From my understanding of the spec, the driver is not allowed to do so
(otherwise error will be returned or the behavior is undefined).
This doesn't seem to follow from my reading of the spec. Otherwise,
phrases like "optimal sector size" and "can affect performance" would
not make sense (such a request would instead be described as
illegal/returning an error).
Is it possible that certain parts of the spec could be improved ?
Is it possible that certain implicit parts of the spec could be
understood differently by various implementers or users?
In reality "optimal block size that can effect performance" probably
refers to logical vs. physical block size.
This patch tries to make the spec clearer.
In Linux, the reported blk_size is used to set logical_block_size, while
physical_block_exp is used to calculate the physical_block_size.
Sure, but this is describing the behavior of one driver
implementation, not really relevant to the meaning of the spec.
It is relevant IMO.
It's worth considering the historical context of the specification. To
my understanding, the spec was initially derived from the Linux
implementation, rather than the other way around.
This translation from implementation to specification could have
introduced some inconsistencies or ambiguities.
Also worth considering virtio-block devices that were implemented before
this feature bit. The WG probably tried to maintain backward
compatibility with these devices.
Regarding the specification statement: "This does not affect the units used in the
protocol (always 512 bytes)":
This likely refers to how certain fields in the virtio-blk protocol are
expressed, not the actual logical block size used for I/O operations. For
example: capacity reporting, max_discard_sectors and Other fields that are
expressed in 512 byte units.
The way the fields are expressed in the virtio-blk protocol is what is
relevant in the virtio spec. Even if a virtio-blk device is sending
requests to a physical SCSI/ATA/... device behind the scenes, it's the
job of that virtio-blk device implementation to handle the translation
and whatever mismatches there may be between the protocols.
This is true.
Particularly, the "sector" value in a virtio_blk_req used for
VIRTIO_BLK_T_IN/VIRTIO_BLK_T_OUT is always in 512-byte units,
regardless of what blk_size the device reports, and regardless of what
underlying physical media may be used.
From the virtio spec: "The sector number indicates the offset
(multiplied by 512) where the read or write is to occur."
(This could probably use clarification that the "offset" is in bytes,
but it still seems pretty clear that it does not depend on blk_size.)
So a request with sector = 1 always means an offset of 512 bytes from
the beginning of the block device, even if the advertised blk_size is
4096.
I appreciate your perspective, but I disagree with it.
When a block device has a logical_block_size of 4K, it's not possible to
perform read/write operations at arbitrary offsets like 512 bytes. The
smallest addressable unit is the full 4K logical block in that case.
This is a fundamental principle of block device operations, not just an
optimization.
Also IO granularity.
I believe it's important to ensure that the specification accurately
reflects these fundamental block device principles.
I believe that the reason that the "sector" is multiplication of 512 is
to maintain backward compatibility, and not to allow breaking these
fundamental principles.
This is different from (e.g.) a SCSI READ where the logical block
address is in units of the block length returned by READ CAPACITY, so
it is impossible to send a request that is not block size aligned.
IMO, it is impossible for virtio block device and any other block device
as well.
[...]
diff --git a/device-types/blk/description.tex b/device-types/blk/description.tex
index 2712ada..88a7591 100644
--- a/device-types/blk/description.tex
+++ b/device-types/blk/description.tex
@@ -3,7 +3,10 @@ \section{Block Device}\label{sec:Device Types / Block Device}
The virtio block device is a simple virtual block device (ie.
disk). Read and write requests (and other exotic requests) are
placed in one of its queues, and serviced (probably out of order) by the
-device except where noted.
+device except where noted. Each block device consists of a sequence of logical
+blocks. A logical block represents the smallest addressable unit of data that
+can be read from or written to the device. The logical block size is always a
+power of two. Logical block sizes may be 512 bytes, 1KB, 2KB, 4KB, 8KB, etc.
Does this imply that a device may return an error for a request where
the sector or num_sectors fields are not aligned to the blk_size? If
so, it should be called out as a more explicit normative requirement,
but that would definitely be an incompatible change.
As mentioned, the logical block size, such as 512 bytes, defines the smallest
addressable unit for I/O operations on the storage device.
This has important implications for I/O requests:
1. Minimum I/O size: You cannot send I/O requests smaller than the logical
block size. For example, with a 512-byte logical block size, requests of 128
bytes are not valid.
128-byte requests are not valid, certainly, but that is already
covered by the wording in the spec ("in multiples of 512 bytes" in the
VIRTIO_BLK_T_IN/VIRTIO_BLK_T_OUT description, and "The length of data
MUST be a multiple of 512 bytes for VIRTIO_BLK_T_IN and
VIRTIO_BLK_T_OUT requests" in the driver requirements).
I guess I should have given 4KB logical block size as a better example.
In case logical block size is 4KB, the minimum IO size is 4KB and not 512.
Agreed ?
2. I/O alignment: All I/O requests must be aligned to the logical block size.
This means the target start address (on the media device) of the I/O must be a
multiple of the logical block size (the expressed units in the spec are not
relevant).
Unless there is a change to the spec (like this proposed patch), I
don't think this is required. I suppose a device that can't manage to
perform a misaligned I/O could choose to deny such requests and return
VIRTIO_BLK_S_IOERR, but this is not explicitly allowed or denied by
the current spec.
This is a clarification of the existing specification, not a change.
I believe I've identified the main point of disagreement regarding this
patch:
I believe that I/O operations to storage block devices MUST be aligned
to the logical block size. In the other hand, you believe it is just a
recommendation for virtio block device.
Maybe let's see what others think as well ?
Stefan ?
3. I/O size granularity: The size of I/O requests must be a multiple of the
logical block size. For a 512-byte logical block size, valid I/O sizes would be
512 bytes, 1024 bytes, 1536 bytes, and so on.
Same as point 1 above, the only requirement (that I can see, anyway)
is that the I/O request has to be a multiple of 512 bytes.
Do you think that for logical block size of 4k one can send IO of 1K to
such device and succeed ?
The "smallest addressable unit" wording is also more confusing than
clarifying here, since in virtio-blk, the "smallest addressable unit"
is always 512 bytes, regardless of blk_size (even if a change like
this would enforce limits on the sector values that would actually be
allowed, it would still be possible to create a request that refers to
a smaller unit than blk_size, unlike other storage protocols where
logical block size changes the multiplier of the address, making it
impossible to address smaller units).
Can you please point me to the place in spec that explicitly say that "in
virtio-blk, the smallest addressable unit is always 512 bytes, regardless of
blk_size" ?
The description of struct virtio_blk_req's sector field:
"The sector number indicates the offset (multiplied by 512) where the
read or write is to occur."
Unless we have different definitions of "smallest addressable unit",
that seems pretty clear. It is always possible to address a part of
the disk in 512-byte multiples (e.g. sectors 1 through 3) even when
the blk_size is larger than 512. Maybe those requests won't succeed,
but it is always possible to describe a region of the block device
with an address that is not a multiple of blk_size.
It doesn't say anything explicit about IO alignment and granularity.
I didn't find it in the spec and I don't agree with this call if it is implicit.
I agree it should be clarified. The current wording requires too much
reading between the lines.
The implementation of the Linux kernel driver isn't implemented in this manner
as well.
When filling the sector field, the Linux virtio-blk driver uses the
value of blk_rq_pos(), which is also always in units of 512 bytes, as
far as I can tell (it returns sector_t, which is in units of 512 bytes
per the definition in include/linux/blk_types.h).
Ignoring the units for the discussion, can it ever be unaligned to the
logical block size?
[...]
@@ -228,7 +234,7 @@ \subsection{Device Initialization}\label{sec:Device Types /
Block Device / Devic
\item The device size can be read from \field{capacity}.
\item If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated,
- \field{blk_size} can be read to determine the optimal sector size
+ \field{blk_size} can be read to determine the logical block size
for the driver to use. This does not affect the units used in
the protocol (always 512 bytes), but awareness of the correct
value can affect performance.
@@ -282,6 +288,12 @@ \subsection{Device Initialization}\label{sec:Device Types
/ Block Device / Devic
\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
Device / Device Initialization}
+Drivers MUST negotiate VIRTIO_BLK_F_BLK_SIZE if the feature is offered by the
+device.
Why is this a MUST? This is also an incompatible change since there
was no such requirement before, and drivers are free to ignore
features they don't care about.
I can say SHOULD, but then the driver that is ignoring this feature may
guess/assume the logical block size and it can be wrong.
This may lead to an undefined behavior.
Unless the spec is changed, if a device can't correctly handle
non-blk_size-aligned requests, this seems like it would be a device
bug ("undefined behavior") or at least non-spec-derived limitation
(returning an error would not be undefined, but also probably wouldn't
result in a working system).
Again, this is a clarification of the existing specification, not a change.
IMO, storage block device implementations should be able to assume that
I/O requests from drivers are aligned to the logical block size. If a
device receives misaligned I/O requests, it might be appropriate for the
device to fail these requests. It is not a device "bug" IMO.
I'm curious if you've encountered any actual implementations where
drivers send non-block-size-aligned requests to block devices. I never
saw such implementation.
Sounds like a theoretical objection.
IMO, the specification could benefit from explicitly stating this
alignment principal, which would guide both device and driver
implementations.
I believe that this patch is clarifying important points and would lead
to more robust definition.
Certainly it would be preferable for the driver to send aligned
requests, but the existing wording seems to cover that pretty well
already.
Thanks,
-- Daniel