On Fri, 19 Jan 2018 17:37:58 +0100 Veaceslav Falico <veaceslav.fal...@huawei.com> wrote:
> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
> >
> >
> > On 1/19/2018 11:27 AM, Greg Kurz wrote:
> >> On Mon, 15 Jan 2018 11:49:31 +0800
> >> Antonios Motakis <antonios.mota...@huawei.com> wrote:
> >>
> >>> On 13-Jan-18 00:14, Greg Kurz wrote:
> >>>> On Fri, 12 Jan 2018 19:32:10 +0800
> >>>> Antonios Motakis <antonios.mota...@huawei.com> wrote:
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>
> >>>> Hi Antonios,
> >>>>
> >>>> I see you have attached a patch to this email... this really isn't
> >>>> the preferred way to do things, since it prevents commenting on the
> >>>> patch (at least with my mail client). The appropriate way would have
> >>>> been to send the patch with a cover letter, using git-send-email for
> >>>> example.
> >>>
> >>> I apologize for attaching the patch, I should have known better!
> >>>
> >>
> >> np :)
> >>
> >>>>
> >>>>> We have found an issue in the 9p implementation of QEMU, with how
> >>>>> qid paths are generated, which can cause qid path collisions and
> >>>>> several issues caused by them. In our use case (running containers
> >>>>> under VMs) these have proven to be critical.
> >>>>>
> >>>>
> >>>> Ouch...
> >>>>
> >>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path
> >>>>> using the inode number of the file as input. According to the 9p
> >>>>> spec, the path should uniquely identify a file: distinct files
> >>>>> should not share a path value.
> >>>>>
> >>>>> The current implementation, which defines qid.path = inode nr,
> >>>>> works fine as long as no files from multiple partitions are visible
> >>>>> under the 9p share. Otherwise, distinct files from different
> >>>>> devices are allowed to have the same inode number, so with multiple
> >>>>> partitions we have a very high probability of qid path collisions.
> >>>>>
> >>>>> How to demonstrate the issue:
> >>>>> 1) Prepare a problematic share:
> >>>>> - mount one partition under share/p1/ with some files inside
> >>>>> - mount another one *with identical contents* under share/p2/
> >>>>> - confirm that both partitions have files with the same inode nr,
> >>>>>   size, etc.
> >>>>> 2) Demonstrate breakage:
> >>>>> - start a VM with a virtio-9p pointing to the share
> >>>>> - mount the 9p share with FSCACHE on
> >>>>> - keep share/p1/file open
> >>>>> - open and write to share/p2/file
> >>>>>
> >>>>> What happens is that the guest considers share/p1/file and
> >>>>> share/p2/file to be the same file, and since we are using the cache
> >>>>> it will not reopen it. We intended to write to partition 2, but we
> >>>>> just wrote to partition 1. This is just one example of how the
> >>>>> guest may rely on qid paths being unique.
> >>>>>
> >>>>> In the use case of containers, where we commonly have a few
> >>>>> containers per VM, all based on similar images, these kinds of qid
> >>>>> path collisions are very common and they seem to cause all kinds of
> >>>>> funny behavior (sometimes very subtle).
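To make the collision mechanics concrete, here is a minimal sketch of what
the current scheme effectively does - a simplified illustration, not the
literal hw/9pfs/9p.c code:

    #include <stdint.h>
    #include <sys/stat.h>

    /* Today's mapping in spirit: the QID path is derived from st_ino
     * alone; st_dev never enters the calculation. */
    uint64_t qid_path_legacy(const struct stat *st)
    {
        return (uint64_t)st->st_ino;
    }

share/p1/file and share/p2/file live on different partitions, but as soon
as they happen to share an inode number this returns the same value for
both, and the guest sees one file where the host has two.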
> >>>>> To avoid this situation, the device id of a file needs to be taken
> >>>>> as input for generating a qid path as well. Unfortunately, inode nr
> >>>>> + device id together would be 96 bits, while we have only 64 bits
> >>>>> for the qid path, so we can't just append them and call it a day :(
> >>>>>
> >>>>> We have thought of a few approaches, but we would definitely like
> >>>>> to hear what the upstream maintainers and community think:
> >>>>>
> >>>>> * Full fix: Change the 9p protocol
> >>>>>
> >>>>> We would need to support a longer qid path, based on a virtio
> >>>>> feature flag. This would take reworking of host and guest parts of
> >>>>> virtio-9p, so both QEMU and Linux for most users.
> >>>>>
> >>>>
> >>>> I agree on a longer qid path, but we shouldn't tie it to a virtio
> >>>> flag since 9p is transport agnostic. And it happens to be used with
> >>>> a variety of transports: QEMU has both virtio-9p and a Xen backend,
> >>>> for example.
> >>>>
> >>>>> * Fallback and/or interim solutions
> >>>>>
> >>>>> A virtio feature flag may be refused by the guest, so we think we
> >>>>> still need to make collisions less likely even with 64-bit paths.
> >>>>> E.g.:
> >>>>
> >>>> In all cases, we would need a fallback solution to support current
> >>>> guest setups. Also there are several 9p server implementations out
> >>>> there (ganesha, diod, kvmtool) that are currently used with linux
> >>>> clients... it will take some time to get everyone in sync :-\
> >>>>
> >>>>> 1. XOR the device id with the inode nr to produce the qid path (we
> >>>>> attach a proof of concept patch)
> >>>>
> >>>> Hmm... this would still allow collisions. Not good for fallback.
> >>>>
> >>>>> 2. 64-bit hash of device id and inode nr
> >>>>
> >>>> Same here.
> >>>>
> >>>>> 3. other ideas, such as allocating new qid paths and keeping a
> >>>>> lookup table of some sort (possibly too expensive)
> >>>>>
> >>>>
> >>>> This would be acceptable for fallback.
> >>>
> >>> Maybe we can use the QEMU hash table
> >>> (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if
> >>> it scales to millions of entries. Do you think it is appropriate in
> >>> this case?
> >>>
> >>
> >> I don't know QHT, hence Cc'ing Emilio for insights.
> >>
> >>> I was thinking about how to implement something like this without
> >>> having to maintain millions of entries... One option we could
> >>> consider is to split the bits into a directly-mapped part and a
> >>> lookup part. For example:
> >>>
> >>> Inode =
> >>> [ 10 first bits ] + [ 54 lowest bits ]
> >>>
> >>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit
> >>> prefix ]
> >>> The prefix is uniquely allocated for each input.
> >>>
> >>> Qid path =
> >>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
> >>>
> >>> Since inodes are not completely random, and we usually have a handful
> >>> of device IDs, we get a much smaller number of entries to track in
> >>> the hash table.
> >>>
> >>> So what this would give:
> >>> (1) It would be faster and take less memory than mapping the full
> >>>     (inode_nr, dev_id) tuple to unique QID paths.
> >>> (2) It is guaranteed not to run out of bits as long as inode numbers
> >>>     stay within the lowest 54 bits and we have fewer than 1024
> >>>     devices.
> >>> (3) When we get beyond this limit, there is a chance we run out of
> >>>     bits to allocate new QID paths, but we can detect this and refuse
> >>>     to serve the offending files instead of allowing a collision.
> >>>
> >>> We could tweak the prefix size to match the scenarios that we
> >>> consider more likely, but I think something close to 10-16 bits
> >>> sounds reasonable enough. What do you think?
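For what it's worth, here is a toy sketch of how I read the proposed split.
A plain linear table stands in for QHT (or whatever structure ends up being
used), and all names are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define PREFIX_BITS  10
    #define DIRECT_BITS  54                 /* 64 - PREFIX_BITS */
    #define MAX_PREFIXES (1u << PREFIX_BITS)

    struct prefix_entry {
        bool     used;
        uint64_t dev;        /* device id */
        uint64_t ino_high;   /* inode bits above DIRECT_BITS */
    };

    static struct prefix_entry table[MAX_PREFIXES];

    /* qid_path = [ allocated 10-bit prefix ][ 54 low bits of the inode ].
     * Returns false once all prefixes are taken, which is the point where
     * the server would refuse to serve the file rather than risk a
     * collision. */
    bool qid_path_prefixed(uint64_t dev, uint64_t ino, uint64_t *path)
    {
        uint64_t ino_high = ino >> DIRECT_BITS;
        uint64_t ino_low  = ino & ((1ULL << DIRECT_BITS) - 1);
        uint32_t i;

        for (i = 0; i < MAX_PREFIXES; i++) {
            if (!table[i].used) {
                /* first unused slot: allocate a new prefix for this
                 * (device id, inode high bits) pair */
                table[i].used = true;
                table[i].dev = dev;
                table[i].ino_high = ino_high;
            }
            if (table[i].used && table[i].dev == dev &&
                table[i].ino_high == ino_high) {
                *path = ((uint64_t)i << DIRECT_BITS) | ino_low;
                return true;
            }
        }
        return false;   /* out of prefixes: refuse instead of colliding */
    }

A real implementation would presumably hash the (device id, inode high
bits) pair instead of scanning, and PREFIX_BITS is the knob you mention
tweaking above.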
> >>>
> >> I think that should work. Please send a patchset :)
> >
> > Hi Greg,
> >
> > This is based on the assumption that the high bits of inode numbers
> > are always zero, which is unacceptable from my standpoint. The inode
> > number allocator is a per-volume file system feature, so allocators
> > may appear which don't use a simple increment and which assign inode
> > numbers with non-zero high bits even for the first file of a volume.
> >
> > As I understand it, the problem is that the guest doesn't have enough
> > information for proper file identification (in our case share/p1/file
> > and share/p2/file are wrongly identified as the same file). It seems
> > that we need to transmit the device ID to the guest, and not in the
> > high bits of the inode number.
> >
> > AFAIK Slava has patches for such transmission, I hope that he will
> > send them eventually.
>
> They're pretty trivial, however it'll require great effort and
> coordination for all parties involved, as they break backwards
> compatibility (the QID becomes bigger and, thus, breaks "Q"
> transmission).
>
> I'd like to raise another issue - it seems that st_dev and i_no aren't
> enough:
>
> mkdir -p /tmp/mounted
> touch /tmp/mounted/file
> mount -o bind /tmp/mounted t1
> mount -o bind /tmp/mounted t2
> stat t{1,2}/file | grep Inode
> Device: 805h/2053d Inode: 42729487 Links: 3
> Device: 805h/2053d Inode: 42729487 Links: 3
>
> In that case, even with the patch applied, we'll still have the issue
> of colliding QIDs guest-side - so, for example, after umount t1,
> t2/file becomes inaccessible, as the inode points to t1...

t1/file and t2/file really point to the same file on the server, so it is
expected they have the same QIDs. (A quick host-side check of the
(st_dev, st_ino) pair, appended at the end of this mail, confirms there is
nothing left to tell them apart.)

> I'm not sure how to fix this one. Ideas?

fscache bug ?

> > Thanks,
> > Eduard.
> >
> >>
> >>>>
> >>>>> With our proof of concept patch, the issues caused by qid path
> >>>>> collisions go away, so it can be seen as an interim solution of
> >>>>> sorts. However, the chance of collisions is not eliminated; we are
> >>>>> just replacing the current strategy, which is almost guaranteed to
> >>>>> cause collisions in certain use cases, with one that makes them
> >>>>> more rare. We think that a virtio feature flag for longer qid paths
> >>>>> is the only way to eliminate these issues completely.
> >>>>>
> >>>>> This is the extent to which we were able to analyze the issue from
> >>>>> our side; we would like to hear if there are some better ideas, or
> >>>>> which approach would be more appropriate for upstream.
> >>>>>
> >>>>> Best regards,
> >>>>
> >>>> Cheers,
> >>>>
> >>>> --
> >>>> Greg
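As promised, the tiny host-side check against the t1/t2 bind mounts from
the example above - it only confirms that the (st_dev, st_ino) pair is
identical for both paths, i.e. there is no information left on the host
that could tell them apart:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat a, b;

        /* stat both bind-mounted views of the same file */
        if (stat("t1/file", &a) || stat("t2/file", &b)) {
            perror("stat");
            return 1;
        }
        printf("t1/file: dev=%llu ino=%llu\n",
               (unsigned long long)a.st_dev, (unsigned long long)a.st_ino);
        printf("t2/file: dev=%llu ino=%llu\n",
               (unsigned long long)b.st_dev, (unsigned long long)b.st_ino);
        printf("identical: %s\n",
               (a.st_dev == b.st_dev && a.st_ino == b.st_ino) ? "yes" : "no");
        return 0;
    }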