Interesting. I didn't know about the alternate meaning of stripesize. I agree, then, that there's currently no way to tune ZFS to respect NVMe's 128KB boundaries. One could set vfs.zfs.vdev.aggregation_limit to 128KB, but that would only half solve the problem, because allocations could still be unaligned. Frankly, I'm surprised that NVMe drives should have such a small limit when SATA and SAS devices commonly handle single commands that span multiple MB. I don't think there's any way to adapt ZFS to this limit without hurting it in other ways, for example by restricting its ability to use large _or_ small record sizes.
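To make the alignment caveat concrete: capping vfs.zfs.vdev.aggregation_limit at 128KB bounds the size of an aggregated I/O, but nothing forces its starting offset onto a 128KB boundary, so it can still straddle one. A minimal sketch of the check (hypothetical helper, not code from nvme(4) or ZFS):

#include <stdbool.h>
#include <stdint.h>

#define	NVME_OPT_BOUNDARY	(128 * 1024)	/* 128KB "native boundary" */

/*
 * True if a transfer of 'len' bytes starting at byte offset 'off'
 * crosses a 128KB boundary.  Limiting 'len' to 128KB says nothing
 * about the alignment of 'off'.
 */
static bool
crosses_128k_boundary(uint64_t off, uint64_t len)
{
	if (len == 0)
		return (false);
	return (off / NVME_OPT_BOUNDARY != (off + len - 1) / NVME_OPT_BOUNDARY);
}

For example, a 16KB write at offset 120KB is well under the limit yet still crosses the first boundary.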
Hopefully the NVMe slow path isn't _too_ slow.

On Fri, Mar 11, 2016 at 2:07 AM, Alexander Motin <m...@freebsd.org> wrote:
> On 11.03.16 06:58, Alan Somers wrote:
> > Do they behave badly for writes that cross a 128KB boundary, but are
> > nonetheless aligned to 128KB boundaries? Then I don't understand how
> > this change (or mav's replacement) is supposed to help. The stripesize
> > is supposed to be the minimum write that the device can accept without
> > requiring a read-modify-write. ZFS guarantees that it will never issue
> > a write smaller than the stripesize, nor will it ever issue a write that
> > is not aligned to a stripesize boundary. But even if ZFS worked with
> > 128KB stripesizes, it would still happily issue writes that are a
> > multiple of 128KB in size, and those would cross such boundaries. Am I
> > not understanding something here?
>
> stripesize is not necessarily related to read-modify-write. It reports
> "some" native boundary of the device. For example, a RAID0 array has
> stripes, crossing which does not cause read-modify-write cycles, but
> does cause I/O splits and head seeks on extra disks. This, as I
> understand, is the case for some of Intel's NVMe device models here, and
> is the reason why a 128KB stripesize was originally reported.
>
> We can not demand that all file systems never issue I/Os smaller than
> stripesize, since it can be 128KB, 1MB or even more (and then it would
> be called sectorsize). If ZFS (in this case) doesn't support allocation
> block sizes above 8K (and even that is very space-inefficient), and it
> has no other mechanism to optimize I/O alignment, then it is not a
> problem of the NVMe device or driver, but only of ZFS itself. So what I
> have done here is move the workaround from the improper place (NVMe) to
> the proper one (ZFS): NVMe now correctly reports its native 128K
> boundaries, which will be respected, for example, by gpart, which in
> turn helps UFS align its 32K blocks, while ZFS will correctly ignore
> values it can't optimize for, falling back to efficient 512-byte
> allocations.
>
> PS about the meaning of stripesize not being limited to read-modify-write:
> for example, a RAID5 of 5 512e disks actually has three stripe sizes: 4K,
> 64K and 256K. Aligned writes of 4K avoid read-modify-write inside the
> drive, I/Os that don't cross 64K boundaries without reason improve
> parallel performance, and aligned writes of 256K avoid read-modify-write
> at the RAID5 level. Obviously not all of those optimizations are
> achievable in all environments, and the bigger the stripe size the
> harder it is to optimize for, but that does not mean such optimization
> is impossible. It would be good to be able to report all of them,
> allowing each consumer to use as many of them as it can.
>
> > On Thu, Mar 10, 2016 at 9:34 PM, Warner Losh <i...@bsdimp.com
> > <mailto:i...@bsdimp.com>> wrote:
> >
> >     Some Intel NVMe drives behave badly when the LBA range crosses a
> >     128k boundary. Their performance is worse for those transactions
> >     than for ones that don't cross the 128k boundary.
> >
> >     Warner
> >
> >     On Thu, Mar 10, 2016 at 11:01 AM, Alan Somers <asom...@freebsd.org
> >     <mailto:asom...@freebsd.org>> wrote:
> >
> >         Are you saying that Intel NVMe controllers perform poorly for
> >         all I/Os that are less than 128KB, or just for I/Os of any size
> >         that cross a 128KB boundary?
> >
> >         On Thu, Dec 10, 2015 at 7:06 PM, Steven Hartland
> >         <s...@freebsd.org <mailto:s...@freebsd.org>> wrote:
> >
> >             Author: smh
> >             Date: Fri Dec 11 02:06:03 2015
> >             New Revision: 292074
> >             URL: https://svnweb.freebsd.org/changeset/base/292074
> >
> >             Log:
> >               Limit stripesize reported from nvd(4) to 4K
> >
> >               Intel NVMe controllers have a slow path for I/Os that
> >               span a 128KB stripe boundary, but ZFS limits ashift,
> >               which is derived from d_stripesize, to 13 (8KB), so we
> >               limit the stripesize reported to geom(8) to 4KB.
> >
> >               This may result in a small number of additional I/Os
> >               requiring splitting in nvme(4); however, the NVMe I/O
> >               path is very efficient, so these additional I/Os will
> >               cause very minimal (if any) difference in performance or
> >               CPU utilisation.
> >
> >               This can be controlled by the new sysctl
> >               kern.nvme.max_optimal_sectorsize.
> >
> >               MFC after:      1 week
> >               Sponsored by:   Multiplay
> >               Differential Revision:  https://reviews.freebsd.org/D4446
> >
> >             Modified:
> >               head/sys/dev/nvd/nvd.c
> >               head/sys/dev/nvme/nvme.h
> >               head/sys/dev/nvme/nvme_ns.c
> >               head/sys/dev/nvme/nvme_sysctl.c
>
> --
> Alexander Motin
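For anyone reading the quoted commit message without the diff handy, the mechanism it describes has roughly the shape below. This is only an illustrative sketch with hypothetical names, not the actual nvd(4)/nvme(4) code from r292074:

#include <stdint.h>

/* Hypothetical illustration, not the r292074 diff. */
#define	NVME_NATIVE_BOUNDARY	(128 * 1024)	/* device's preferred 128KB boundary */

/* Stand-in for the kern.nvme.max_optimal_sectorsize tunable; 4K default. */
static uint32_t max_optimal_sectorsize = 4096;

static uint32_t
report_stripesize(void)
{
	uint32_t ss = NVME_NATIVE_BOUNDARY;

	/*
	 * Clamp the advertised stripesize so consumers that can only
	 * exploit small alignments (e.g. ZFS with ashift <= 13) still
	 * get a usable value instead of ignoring it entirely.
	 */
	if (max_optimal_sectorsize != 0 && ss > max_optimal_sectorsize)
		ss = max_optimal_sectorsize;
	return (ss);
}

Any I/Os that still cross the native 128KB boundary then get split inside nvme(4), which is the cost the log message argues is negligible.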