Peter Xu <pet...@redhat.com> writes: > On Wed, Nov 01, 2023 at 02:28:24PM +0000, Daniel P. Berrangé wrote: >> On Wed, Nov 01, 2023 at 10:21:07AM -0400, Peter Xu wrote: >> > On Wed, Nov 01, 2023 at 09:26:46AM +0000, Daniel P. Berrangé wrote: >> > > On Tue, Oct 31, 2023 at 03:03:50PM -0400, Peter Xu wrote: >> > > > On Wed, Oct 25, 2023 at 11:07:33AM -0300, Fabiano Rosas wrote: >> > > > > >> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock >> > > > > >> *block, ram_addr_t length) >> > > > > >> +{ >> > > > > >> + g_autofree unsigned long *bitmap = NULL; >> > > > > >> + struct FixedRamHeader header; >> > > > > >> + size_t bitmap_size; >> > > > > >> + long num_pages; >> > > > > >> + int ret = 0; >> > > > > >> + >> > > > > >> + ret = fixed_ram_read_header(f, &header); >> > > > > >> + if (ret < 0) { >> > > > > >> + error_report("Error reading fixed-ram header"); >> > > > > >> + return -EINVAL; >> > > > > >> + } >> > > > > >> + >> > > > > >> + block->pages_offset = header.pages_offset; >> > > > > > >> > > > > > Do you think it is worth sanity checking that 'pages_offset' is >> > > > > > aligned >> > > > > > in some way. >> > > > > > >> > > > > > It is nice that we have flexibility to change the alignment in >> > > > > > future >> > > > > > if we find 1 MB is not optimal, so I wouldn't want to force 1MB >> > > > > > align >> > > > > > check htere. Perhaps we could at least sanity check for alignment >> > > > > > at >> > > > > > TARGET_PAGE_SIZE, to detect a gross data corruption problem ? >> > > > > > >> > > > > >> > > > > I don't see why not. I'll add it. >> > > > >> > > > Is there any explanation on why that 1MB offset, and how the number is >> > > > chosen? Thanks, >> > > >> > > The fixed-ram format is anticipating the use of O_DIRECT. >> > > >> > > With O_DIRECT both the buffers in memory, and the file handle offset >> > > have alignment requirements. The buffer alignments are usually page >> > > sized, and QEMU RAM blocks will trivially satisfy those. >> > > >> > > The file handle offset alignment varies per filesystem. While you can >> > > query the alignment for the FS holding the file with statx(), that is >> > > not appropriate todo. If a user saves/restores QEMU state to file, we >> > > must assume there is a chance the user will copy the saved state to a >> > > different filesystem. >> > > >> > > IOW, we want alignment to satisfy the likely worst case. >> > > >> > > Picking 1 MB is a nice round number that is large enough that it is >> > > almost certainly going to satisfy any filesystem alignment. In fact >> > > it is likely massive overkill. None the less 1 MB is also still tiny >> > >> > Is that calculated by something like max of possible host (small) page >> > sizes? I've no idea what's it for all archs, the max small page size I'm >> > aware of is 64K, but I don't know a lot archs. >> >> It wasn't anything as precise as that. It is literally just "1MB" looks >> large enough that we don't need to spend time to investigate per arch >> page sizes. > > IMHO we need that precision on reasoning and document it, even if not on > the exact number we prefer, which can be prone to change later. Otherwise > that value will be a pure magic soon after a few years or even less, it'll > be more of a challenge later to figure things out. > >> >> Having said that I'm now having slight self-doubt wrt huge pages, though >> I swear we investigated it last year when first discussing this feature. >> The guest memory will of course already be suitably aligned, but I'm >> wondering if the filesystem I/O places any offset alignment constraints >> related to non-default page size. > > AFAIU direct IO is about pinning the IO buffers, playing the role of fs > cache instead. If my understanding is correct, huge pages shouldn't be a > problem for such pinning, because it's legal to pin partial of a huge page. > > After the partial huge pages pinned, they should be treated as normal fs > buffers when doing block IO. And then the offset of file should, per my > understanding, not relevant to what is the type of backend of that user > buffer anymore that triggers read()/write(). > > But maybe I missed something, if so that will need to be part of > documentation of 1MB magic value, IMHO. We may want to double check with > that by doing fixed-ram migration on e.g. 1GB hugetlb memory-backend-file > with 1MB file offset per-ramblock.
Does anyone have any indication that we need to relate the aligment to the page size? All I find online points to device block size being the limiting factor for filesystems. There's also raw_probe_alignment() at file-posix.c which seems to check up to 4k and recommend to disable O_DIRECT if an alignment is not found. Note that we shouldn't have any problems changing the alignment we choose since we have a pointer to the start of the aligned region which goes along with the fixed-ram header. We could even do some probing like the block layer does if we wanted.