On Sat, Aug 17, 2019 at 12:57 AM John Snow <js...@redhat.com> wrote:

> On 8/16/19 5:21 PM, Nir Soffer wrote:
> > When creating an image with preallocation "off" or "falloc", the first
> > block of the image is typically not allocated. When using Gluster
> > storage backed by XFS filesystem, reading this block using direct I/O
> > succeeds regardless of request length, fooling alignment detection.
> >
> > In this case we fall back to a safe value (4096) instead of the optimal
> > value (512), which may lead to unneeded data copying when aligning
> > requests. Allocating the first block avoids the fallback.
>
> Where does this detection/fallback happen? (Can it be improved?)
In raw_probe_alignment(). This patch explains the issues:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html

Here Kevin and I discussed ways to improve it:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00426.html

> > When using preallocation=off, we always allocate at least one filesystem
> > block:
> >
> > $ ./qemu-img create -f raw test.raw 1g
> > Formatting 'test.raw', fmt=raw size=1073741824
> >
> > $ ls -lhs test.raw
> > 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw
> >
> > I did quick performance tests for these flows:
> > - Provisioning a VM with a new raw image.
> > - Copying disks with qemu-img convert to a new raw target image.
> >
> > I installed Fedora 29 server on a raw sparse image, measuring the time
> > from clicking "Begin installation" until the "Reboot" button appears:
> >
> > Before(s)  After(s)  Diff(%)
> > ------------------------------
> > 356        389       +8.4
> >
> > I ran this only once, so we cannot tell much from these results.
>
> That seems like a pretty big difference for just having pre-allocated a
> single block. What was the actual command line / block graph for that test?

Having the first block allocated changes the alignment.

Before this patch, we detect request_alignment=1, so we fall back to
4096. Then we detect buf_align=1, so we fall back to the value of
request_alignment.

The guest sees a disk with:

    logical_block_size = 512
    physical_block_size = 512

But qemu uses:

    request_alignment = 4096
    buf_align = 4096

Storage uses:

    logical_block_size = 512
    physical_block_size = 512

If the guest does direct I/O using 512-byte alignment, qemu has to copy
the buffers to align them to 4096 bytes.

After this patch, qemu detects the alignment correctly, so we have:

guest:
    logical_block_size = 512
    physical_block_size = 512

qemu:
    request_alignment = 512
    buf_align = 512

storage:
    logical_block_size = 512
    physical_block_size = 512

We expect this to be more efficient because qemu does not have to
emulate anything.
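The fallback described above can be sketched as a minimal Python model
(this is an illustration, not qemu's actual C code in block/file-posix.c;
the name probe_request_alignment and the read_ok callback, which stands
in for a direct-I/O read attempt of a given size, are hypothetical):

```python
SAFE_ALIGNMENT = 4096

def probe_request_alignment(read_ok):
    """Return the smallest direct-I/O size that works, with a safe
    fallback when detection is fooled (sketch of the qemu logic)."""
    for align in (1, 512, 1024, 2048, 4096):
        if read_ok(align):
            # A 1-byte direct read should be impossible. If it "works"
            # (e.g. the first block is unallocated on Gluster/XFS, so any
            # read length succeeds), detection was fooled: fall back to
            # the safe value instead of trusting the bogus result.
            return align if align != 1 else SAFE_ALIGNMENT
    return SAFE_ALIGNMENT

# Normal 512-byte storage: only multiples of 512 succeed.
print(probe_request_alignment(lambda n: n % 512 == 0))  # -> 512
# Fooled case: every read succeeds, so we fall back.
print(probe_request_alignment(lambda n: True))          # -> 4096
```

Allocating the first block makes the misaligned reads fail again, so the
probe lands on 512 instead of taking the fallback path.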
> Was this over a network that could explain the variance?

Maybe. This is a complete install of Fedora 29 server; I'm not sure
whether the installation accesses the network.

> > The second test was cloning the installation image with qemu-img
> > convert, doing 10 runs:
> >
> > for i in $(seq 10); do
> >     rm -f dst.raw
> >     sleep 10
> >     time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
> > done
> >
> > Here is a table comparing the total time spent:
> >
> > Type   Before(s)   After(s)   Diff(%)
> > -------------------------------------
> > real   530.028     469.123    -11.4
> > user   17.204      10.768     -37.4
> > sys    17.881      7.011      -60.7
> >
> > Here we see a very clear improvement in CPU usage.
>
> Hard to argue much with that. I feel a little strange trying to force
> the allocation of the first block, but I suppose in practice "almost no
> preallocation" is indistinguishable from "exactly no preallocation" if
> you squint.

Right. The real issue is that filesystems and block devices do not
expose the alignment requirements for direct I/O, so we need to use
these hacks and assumptions.

With local XFS we use xfsctl(XFS_IOC_DIOINFO) to get request_alignment,
but this does not help for an XFS filesystem used by Gluster on the
server side.

I hope that Niels is working on adding a similar ioctl to Gluster, so it
can expose the properties of the remote filesystem.

Nir
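For reference, the xfsctl(XFS_IOC_DIOINFO) probe mentioned above can be
sketched from Python. This is an illustrative sketch, not how qemu calls
it (qemu uses xfsctl() from C): the ioctl number is computed by hand from
Linux's asm-generic _IOR('X', 30, struct dioattr) layout, where struct
dioattr is three __u32 fields, and the dioinfo() helper is hypothetical:

```python
import fcntl
import struct
import tempfile

# _IOR('X', 30, struct dioattr): read direction | 12-byte payload |
# type 'X' | number 30 (assumes the common asm-generic ioctl encoding).
XFS_IOC_DIOINFO = 0x80000000 | (12 << 16) | (ord("X") << 8) | 30

def dioinfo(fd):
    """Return (d_mem, d_miniosz, d_maxiosz) on XFS, or None where the
    ioctl is not supported (any non-XFS filesystem)."""
    try:
        buf = fcntl.ioctl(fd, XFS_IOC_DIOINFO, b"\0" * 12)
    except OSError:
        return None
    # d_mem: memory buffer alignment, d_miniosz: minimum (and alignment
    # of) I/O size, d_maxiosz: maximum I/O size.
    return struct.unpack("=III", buf)

with tempfile.TemporaryFile() as f:
    print(dioinfo(f.fileno()))
```

On local XFS this reports the direct I/O limits directly, with no
guessing; the problem in this thread is that no equivalent exists when
the XFS filesystem sits behind a Gluster server.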