Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, Mar 29, 2022 at 12:59:52PM +0800, Jason Wang wrote: > vDPA has a backend feature negotiation, then actually, userspace can > tell vDPA to go with the new accounting approach. Not sure RDMA can do > the same. A security feature userspace can ask to turn off is not really a security feature. Jason
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Mon, Mar 28, 2022 at 8:23 PM Jason Gunthorpe wrote: > > On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote: > > To me, it looks easier to not answer this question by letting > > userspace know about the change, > > That is not backwards compatible, so I don't think it helps unless we > say if you open /dev/vfio/vfio you get old behavior and if you open > /dev/iommu you get new... Actually, this is one way to go. Trying to behave exactly like type1 might not be easy. > > Nor does it answer if I can fix RDMA or not :\ > vDPA has a backend feature negotiation, then actually, userspace can tell vDPA to go with the new accounting approach. Not sure RDMA can do the same. Thanks > So we really do need to know what exactly is the situation here. > > Jason >
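To make Jason Wang's negotiation point concrete, here is a minimal C sketch. The VHOST_GET_BACKEND_FEATURES/VHOST_SET_BACKEND_FEATURES ioctls are the real vhost negotiation mechanism, but the accounting feature bit is hypothetical, invented here purely to illustrate how userspace could opt in to per-user accounting:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* HYPOTHETICAL bit -- no such flag exists in the vhost uAPI. */
	#define VHOST_BACKEND_F_USER_LOCKED_VM (1ULL << 62)

	/* Returns 1 if the new per-user accounting was negotiated, 0 if
	 * the device only supports the old per-mm accounting, -1 on error. */
	static int negotiate_accounting(int vdpa_fd)
	{
		uint64_t features;

		if (ioctl(vdpa_fd, VHOST_GET_BACKEND_FEATURES, &features) < 0)
			return -1;
		if (!(features & VHOST_BACKEND_F_USER_LOCKED_VM))
			return 0;	/* device predates the change */

		features = VHOST_BACKEND_F_USER_LOCKED_VM;
		if (ioctl(vdpa_fd, VHOST_SET_BACKEND_FEATURES, &features) < 0)
			return -1;
		return 1;
	}

Jason Gunthorpe's objection above is to exactly this pattern: an enforcement change that userspace can decline to negotiate does not enforce anything.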
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Mon, Mar 28, 2022 at 02:14:26PM +0100, Sean Mooney wrote: > On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote: > > On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe wrote: > > > > > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote: > > > > > > > It's simply because we don't want to break existing userspace. [1] > > > > > > I'm still waiting to hear what exactly breaks in real systems. > > > > > > As I explained this is not a significant change, but it could break > > > something in a few special scenarios. > > > > > > Also the one place we do have ABI breaks is security, and ulimit is a > > > security mechanism that isn't working right. So we do clearly need to > > > understand *exactly* what real thing breaks - if anything. > > > > > > Jason > > > > > > > To tell the truth, I don't know. I remember that Openstack may do some > > accounting so adding Sean for more comments. But we really can't imagine > > openstack is the only userspace that may use this. > sorry there is a lot of context to this discussion i have tried to read back > the thread but i may have missed part of it. Thanks Sean, this is quite interesting, though I'm not sure it entirely reaches the question. > tl;dr openstack does not currently track locked/pinned memory per > user or per vm because we have no idea when libvirt will request it > or how much is needed per device. when ulimits are configured today > for nova/openstack it's done at the qemu user level outside of > openstack in our installer tooling. e.g. in tripleo the ulimits > would be set on the nova_libvirt container to constrain all vms > spawned, not per vm/process. So, today, you expect the ulimit to be machine wide, like if your machine has 1TB of memory you'd set the ulimit at 0.9 TB and you'd like the stuff underneath to limit memory pinning to 0.9TB globally for all qemus? To be clear it doesn't work that way today at all, you might as well just not bother setting ulimit to anything less than unlimited at the openstack layer. >hard_limit > >The optional hard_limit element is the maximum memory the >guest can use. The units for this value are kibibytes >(i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly >advised not to set this limit as domain may get killed by the >kernel if the guess is too low, and determining the memory needed >for a process to run is an undecidable problem; that said, if you >already set locked in memory backing because your workload >demands it, you'll have to take into account the specifics of >your deployment and figure out a value for hard_limit that is >large enough to support the memory requirements of your guest, >but small enough to protect your host against a malicious guest >locking all memory. And hard_limit is the ulimit that Alex was talking about? So now we switched from talking about global per-user things to per-qemu-instance things? > we could not figure out how to automatically compute a hard_limit in > nova that would work for everyone and we felt exposing this to our > users/operators was a bit of a cop out when they likely can't calculate > that properly either. Not surprising.. > As a result we can't actually account for them > today when scheduling workloads to a host. I'm not sure this would > change even if you exposed new user space apis unless we had a way > to inspect each VF to know how much locked memory that VF would need > to lock? We are not talking about a new uAPI, we are talking about changing the meaning of the existing ulimit.
You can see it in your message above: at the openstack level you were talking about global limits and then at the libvirt level you are talking about per-qemu limits. In the kernel both of these are being used by the same control and one of the users is wrong. The kernel consensus is that the ulimit is per-user and is used by all kernel entities consistently. Currently vfio is different and uses it per-process and effectively has its own private bucket. When you talk about VDPA you start to see the problems here because VDPA uses a different accounting from VFIO. If you run VFIO and VDPA together then you should need 2x the ulimit, but today you only need 1x because they don't share accounting buckets. This also means the ulimit doesn't actually work the way it is supposed to. The question is how to fix it and, if we do fix it, how much userspace cares that things work differently. Jason
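The double-counting Jason describes can be shown with a toy model in C; the bucket structure and page counts are illustrative only, not kernel code. With today's split accounting, VFIO and VDPA each charge a private bucket, so one user can pin twice the ulimit; with a single per-user bucket the second charge is correctly refused:

	#include <stdbool.h>
	#include <stdio.h>

	struct bucket { long pinned; };	/* stands in for a locked_vm counter */

	static bool charge(struct bucket *b, long npages, long limit)
	{
		if (b->pinned + npages > limit)
			return false;	/* over the ulimit, refuse the pin */
		b->pinned += npages;
		return true;
	}

	int main(void)
	{
		long limit = 1000;	/* RLIMIT_MEMLOCK in pages, illustrative */
		struct bucket vfio = {0}, vdpa = {0}, user = {0};

		/* Today: separate buckets, so both pins succeed and the user
		 * holds 2000 pinned pages against a 1000-page limit. */
		printf("vfio pin: %d\n", charge(&vfio, 1000, limit));	/* 1 */
		printf("vdpa pin: %d\n", charge(&vdpa, 1000, limit));	/* 1 */

		/* Shared per-user bucket: the second pin is refused. */
		printf("first pin: %d\n", charge(&user, 1000, limit));	/* 1 */
		printf("second pin: %d\n", charge(&user, 1000, limit));	/* 0 */
		return 0;
	}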
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote: > On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe wrote: > > > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote: > > > > > It's simply because we don't want to break existing userspace. [1] > > > > I'm still waiting to hear what exactly breaks in real systems. > > > > As I explained this is not a significant change, but it could break > > something in a few special scenarios. > > > > Also the one place we do have ABI breaks is security, and ulimit is a > > security mechanism that isn't working right. So we do clearly need to > > understand *exactly* what real thing breaks - if anything. > > > > Jason > > > > To tell the truth, I don't know. I remember that Openstack may do some > accounting so adding Sean for more comments. But we really can't imagine > openstack is the only userspace that may use this. sorry there is a lot of context to this discussion i have tried to read back the thread but i may have missed part of it. tl;dr openstack does not currently track locked/pinned memory per user or per vm because we have no idea when libvirt will request it or how much is needed per device. when ulimits are configured today for nova/openstack it's done at the qemu user level outside of openstack in our installer tooling. e.g. in tripleo the ulimits would be set on the nova_libvirt container to constrain all vms spawned, not per vm/process. full response below --- openstack's history with locked/pinned/unswappable memory is a bit complicated. we currently only request locked memory explicitly in 2 cases directly https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/driver.py#L5769-L5784 when the administrator configures the vm flavor to request amd's SEV feature or configures the flavor for realtime scheduling priority. i say explicitly as libvirt invents a request for locked/pinned pages implicitly for sriov VFs and a number of other cases which we were not aware of. this only became apparent when we went to add vdpa support to openstack and libvirt did not make that implicit request and we had to fall back to requesting realtime instances as a workaround. nova/openstack does have the ability to generate the libvirt xml element that configures hard and soft limits https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/config.py#L2559-L2590 however it is only ever used in our test code https://github.com/openstack/nova/search?q=LibvirtConfigGuestMemoryTune the description of hard_limit in the libvirt docs strongly discourages its use with a small caveat for locked memory https://libvirt.org/formatdomain.html#memory-tuning hard_limit The optional hard_limit element is the maximum memory the guest can use. The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly advised not to set this limit as domain may get killed by the kernel if the guess is too low, and determining the memory needed for a process to run is an undecidable problem; that said, if you already set locked in memory backing because your workload demands it, you'll have to take into account the specifics of your deployment and figure out a value for hard_limit that is large enough to support the memory requirements of your guest, but small enough to protect your host against a malicious guest locking all memory.
we could not figure out how to automatically compute a hard_limit in nova that would work for everyone and we felt exposing this to our users/operators was a bit of a cop out when they likely can't calculate that properly either. As a result we can't actually account for them today when scheduling workloads to a host. I'm not sure this would change even if you exposed new user space apis unless we had a way to inspect each VF to know how much locked memory that VF would need to lock? same for vdpa devices, mdevs etc. cloud systems don't normally have quotas on "locked" memory used transitively via passthrough devices so even if we had this info it's not immediately apparent how we would consume it without altering our existing quotas. Openstack is a self service cloud platform where end users can upload their own workload images so it's basically impossible for the operator of the cloud to know how much memory to set the hard limit to without setting it overly large in most cases. from a management application point of view we currently have no insight into how memory will be pinned in the kernel or when libvirt will invent additional requests for pinned/locked memory or how large they are. instead of going down that route operators are encouraged to use ulimit to set a global limit on the amount of memory the nova/qemu user can use. while nova/openstack supports multi tenancy we do not expose that multi tenancy to the underlying hypervisor hosts. the agents are typically deployed as the nova
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote: > To me, it looks easier to not answer this question by letting > userspace know about the change, That is not backwards compatible, so I don't think it helps unless we say if you open /dev/vfio/vfio you get old behavior and if you open /dev/iommu you get new... Nor does it answer if I can fix RDMA or not :\ So we really do need to know what exactly is the situation here. Jason
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe wrote: > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote: > > > It's simply because we don't want to break existing userspace. [1] > > I'm still waiting to hear what exactly breaks in real systems. > > As I explained this is not a significant change, but it could break > something in a few special scenarios. > > Also the one place we do have ABI breaks is security, and ulimit is a > security mechanism that isn't working right. So we do clearly need to > understand *exactly* what real thing breaks - if anything. > > Jason > To tell the truth, I don't know. I remember that Openstack may do some accounting so adding Sean for more comments. But we really can't imagine openstack is the only userspace that may use this. To me, it looks easier to not answer this question by letting userspace know about the change. Thanks
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, 24 Mar 2022 19:27:39 -0300 Jason Gunthorpe wrote: > On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote: > > On Tue, 22 Mar 2022 13:15:21 -0300 > > Jason Gunthorpe via iommu wrote: > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > I'm still picking my way through the series, but the later compat > > > > interface doesn't mention this difference as an outstanding issue. > > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > > resource limits? > > > > > > AFACIT, no, but it should be checked. > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > memory limits. > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > { > > > rlim = tsk->signal->rlim + resource; > > > [..] > > > if (new_rlim) > > > *rlim = *new_rlim; > > > > > > Which vfio reads back here: > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > And iommufd does the same read back: > > > > > > lock_limit = > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > npages = pages->npinned - pages->last_npinned; > > > do { > > > cur_pages = atomic_long_read(&pages->source_user->locked_vm); > > > new_pages = cur_pages + npages; > > > if (new_pages > lock_limit) > > > return -ENOMEM; > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages, > > >new_pages) != cur_pages); > > > > > > So it does work essentially the same. > > > > Well, except for the part about vfio updating mm->locked_vm and iommufd > > updating user->locked_vm, a per-process counter versus a per-user > > counter. prlimit specifically sets process resource limits, which get > > reflected in task_rlimit. > > Indeed, but that is not how the majority of other things seem to > operate it. > > > For example, let's say a user has two 4GB VMs and they're hot-adding > > vfio devices to each of them, so libvirt needs to dynamically modify > > the locked memory limit for each VM. AIUI, libvirt would look at the > > VM size and call prlimit to set that value. If libvirt does this to > > both VMs, then each has a task_rlimit of 4GB. In vfio we add pinned > > pages to mm->locked_vm, so this works well. In the iommufd loop above, > > we're comparing a per-task/process limit to a per-user counter. So I'm > > a bit lost how both VMs can pin their pages here. > > I don't know anything about libvirt - it seems strange to use a > securityish feature like ulimit but not security isolate processes > with real users. > > But if it really does this then it really does this. > > So at the very least VFIO container has to keep working this way. > > The next question is if we want iommufd's own device node to work this > way and try to change libvirt somehow. It seems libvirt will have to > deal with this at some point as iouring will trigger the same problem. > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > IMHO, but with most of the places doing pins voting to use > > > user->locked_vm as the charge it seems the right path in today's > > > kernel. 
> > > > The philosophy of whether it's ultimately a better choice for the > > kernel aside, if userspace breaks because we're accounting in a > > per-user pool rather than a per-process pool, then our compatibility > > layer ain't so transparent. > > Sure, if it doesn't work it doesn't work. Let's be sure and clearly > document what the compatibility issue is and then we have to keep it > per-process. > > And the same reasoning likely means I can't change RDMA either as qemu > will break just as well when qemu uses rdma mode. > > Which is pretty sucky, but it is what it is.. I added Daniel Berrangé to the cc list for my previous reply, hopefully he can comment whether libvirt has the sort of user security model you allude to above that maybe makes this a non-issue for this use case. Unfortunately it's extremely difficult to prove that there are no such use cases out there even if libvirt is ok. Thanks, Alex
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote: > On Tue, 22 Mar 2022 13:15:21 -0300 > Jason Gunthorpe via iommu wrote: > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > I'm still picking my way through the series, but the later compat > > > interface doesn't mention this difference as an outstanding issue. > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > resource limits? > > > > AFAICT, no, but it should be checked. > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > memory limits. > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > { > > rlim = tsk->signal->rlim + resource; > > [..] > > if (new_rlim) > > *rlim = *new_rlim; > > > > Which vfio reads back here: > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > And iommufd does the same read back: > > > > lock_limit = > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > npages = pages->npinned - pages->last_npinned; > > do { > > cur_pages = atomic_long_read(&pages->source_user->locked_vm); > > new_pages = cur_pages + npages; > > if (new_pages > lock_limit) > > return -ENOMEM; > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > cur_pages, > > new_pages) != cur_pages); > > > > So it does work essentially the same. > > Well, except for the part about vfio updating mm->locked_vm and iommufd > updating user->locked_vm, a per-process counter versus a per-user > counter. prlimit specifically sets process resource limits, which get > reflected in task_rlimit. Indeed, but that is not how the majority of other things seem to operate. > For example, let's say a user has two 4GB VMs and they're hot-adding > vfio devices to each of them, so libvirt needs to dynamically modify > the locked memory limit for each VM. AIUI, libvirt would look at the > VM size and call prlimit to set that value. If libvirt does this to > both VMs, then each has a task_rlimit of 4GB. In vfio we add pinned > pages to mm->locked_vm, so this works well. In the iommufd loop above, > we're comparing a per-task/process limit to a per-user counter. So I'm > a bit lost how both VMs can pin their pages here. I don't know anything about libvirt - it seems strange to use a security-ish feature like ulimit but not security-isolate processes with real users. But if it really does this then it really does this. So at the very least VFIO container has to keep working this way. The next question is if we want iommufd's own device node to work this way and try to change libvirt somehow. It seems libvirt will have to deal with this at some point as io_uring will trigger the same problem. > > This whole area is a bit peculiar (eg mlock itself works differently), > > IMHO, but with most of the places doing pins voting to use > > user->locked_vm as the charge it seems the right path in today's > > kernel.
Let's be sure and clearly document what the compatibility issue is and then we have to keep it per-process. And the same reasoning likely means I can't change RDMA either as qemu will break just as well when qemu uses rdma mode. Which is pretty sucky, but it is what it is.. Jason
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, 22 Mar 2022 13:15:21 -0300 Jason Gunthorpe via iommu wrote: > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > I'm still picking my way through the series, but the later compat > > interface doesn't mention this difference as an outstanding issue. > > Doesn't this difference need to be accounted in how libvirt manages VM > > resource limits? > > AFAICT, no, but it should be checked. > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > memory limits. > > Yes, and ulimit does work fully. prlimit adjusts the value: > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > struct rlimit *new_rlim, struct rlimit *old_rlim) > { > rlim = tsk->signal->rlim + resource; > [..] > if (new_rlim) > *rlim = *new_rlim; > > Which vfio reads back here: > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > And iommufd does the same read back: > > lock_limit = > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; > npages = pages->npinned - pages->last_npinned; > do { > cur_pages = atomic_long_read(&pages->source_user->locked_vm); > new_pages = cur_pages + npages; > if (new_pages > lock_limit) > return -ENOMEM; > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages, >new_pages) != cur_pages); > > So it does work essentially the same. Well, except for the part about vfio updating mm->locked_vm and iommufd updating user->locked_vm, a per-process counter versus a per-user counter. prlimit specifically sets process resource limits, which get reflected in task_rlimit. For example, let's say a user has two 4GB VMs and they're hot-adding vfio devices to each of them, so libvirt needs to dynamically modify the locked memory limit for each VM. AIUI, libvirt would look at the VM size and call prlimit to set that value. If libvirt does this to both VMs, then each has a task_rlimit of 4GB. In vfio we add pinned pages to mm->locked_vm, so this works well. In the iommufd loop above, we're comparing a per-task/process limit to a per-user counter. So I'm a bit lost how both VMs can pin their pages here. Am I missing some assumption about how libvirt uses prlimit or sandboxes users? > The difference is more subtle, io_uring/etc puts the charge in the user > so it is additive with things like io_uring and additively spans all > the user's processes. > > However vfio is accounting only per-process and only for itself - no > other subsystem uses locked as the charge variable for DMA pins. > > The user visible difference will be that a limit X that worked with > VFIO may start to fail after a kernel upgrade as the charge accounting > is now cross user and additive with things like iommufd. And that's exactly the concern. > This whole area is a bit peculiar (eg mlock itself works differently), > IMHO, but with most of the places doing pins voting to use > user->locked_vm as the charge it seems the right path in today's > kernel. The philosophy of whether it's ultimately a better choice for the kernel aside, if userspace breaks because we're accounting in a per-user pool rather than a per-process pool, then our compatibility layer ain't so transparent. > Certainly having qemu concurrently using three different subsystems > (vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for > RLIMIT_MEMLOCK differently cannot be sane or correct.
I think everyone would agree with that, but it also seems there are real differences between task_rlimits and per-user vs per-process accounting buckets and I'm confused how that's not a blocker for trying to implement transparent compatibility. Thanks, Alex
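Alex's two-VM example can be written out as a toy model: a minimal sketch assuming two 4 GiB guests and the iommufd-style check quoted above (a per-task rlimit compared against a per-user counter); the numbers and names are illustrative:

	#include <stdio.h>

	int main(void)
	{
		/* Each qemu got RLIMIT_MEMLOCK = 4 GiB via prlimit(), i.e.
		 * 1048576 4 KiB pages -- this mirrors task_rlimit() above. */
		long rlimit_pages = 4L << 18;
		long user_locked_vm = 0;	/* one counter shared by the user */

		for (int vm = 1; vm <= 2; vm++) {
			long new_pages = user_locked_vm + (4L << 18);
			if (new_pages > rlimit_pages)	/* iommufd-style check */
				printf("VM%d: pin refused (%ld pages > %ld limit)\n",
				       vm, new_pages, rlimit_pages);
			else {
				user_locked_vm = new_pages;
				printf("VM%d: pinned ok\n", vm);
			}
		}
		return 0;
	}

The first VM pins successfully; the second fails even though it is within its own rlimit, which is exactly the compatibility break being debated.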
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote: > It's simply because we don't want to break existing userspace. [1] I'm still waiting to hear what exactly breaks in real systems. As I explained this is not a significant change, but it could break something in a few special scenarios. Also the one place we do have ABI breaks is security, and ulimit is a security mechanism that isn't working right. So we do clearly need to understand *exactly* what real thing breaks - if anything. Jason
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 11:51 AM > > > > > > > > In the end vfio type1 will be replaced by iommufd compat layer. With > > that goal in mind iommufd has to inherit type1 behaviors. > > So the compatibility should be provided by the compat layer instead of > the core iommufd. > > And I wonder what happens if iommufd is used by other subsystems like > vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do > we need a per subsystem new uAPI to expose this capability? If yes, > why can't VFIO have such an API then we don't even need the compat > layer at all? > No, the compat layer is just for vfio. Other subsystems, including vdpa, are expected to use the iommufd uAPI directly, except for having their own bind_iommufd and attach_ioas uAPIs to build the connection between device and iommufd/ioas. And having a compat layer for vfio is just for transition purposes. Yi has demonstrated how vfio can follow what other subsystems are expected to do here: https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6 (specifically commits related to "cover-letter: Adapting vfio-pci to iommufd") Thanks Kevin
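For readers unfamiliar with the bind-then-attach flow Kevin describes, here is a minimal C sketch using the names that eventually landed upstream with the vfio device cdev (IOMMU_IOAS_ALLOC, VFIO_DEVICE_BIND_IOMMUFD, VFIO_DEVICE_ATTACH_IOMMUFD_PT); the RFC under discussion predates them, so treat this as illustrative, and error handling is omitted:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/iommufd.h>
	#include <linux/vfio.h>

	int main(void)
	{
		int iommufd = open("/dev/iommu", O_RDWR);
		int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);

		/* Allocate an IO address space (IOAS) to hold DMA mappings. */
		struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
		ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);

		/* Bind the device to the iommufd context... */
		struct vfio_device_bind_iommufd bind = {
			.argsz = sizeof(bind), .iommufd = iommufd,
		};
		ioctl(devfd, VFIO_DEVICE_BIND_IOMMUFD, &bind);

		/* ...then attach it to the IOAS so maps/pins apply to it. */
		struct vfio_device_attach_iommufd_pt attach = {
			.argsz = sizeof(attach), .pt_id = alloc.out_ioas_id,
		};
		ioctl(devfd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
		return 0;
	}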
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:57 AM > > > > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > > > > > From: Jason Wang > > > > Sent: Thursday, March 24, 2022 10:28 AM > > > > > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin > > wrote: > > > > > > > > > > > From: Jason Gunthorpe > > > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > > > Doesn't this difference need to be accounted in how libvirt > > > > > > > manages > > VM > > > > > > > resource limits? > > > > > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > > > memory limits. > > > > > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > > > { > > > > > > rlim = tsk->signal->rlim + resource; > > > > > > [..] > > > > > > if (new_rlim) > > > > > > *rlim = *new_rlim; > > > > > > > > > > > > Which vfio reads back here: > > > > > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > > > > > And iommufd does the same read back: > > > > > > > > > > > > lock_limit = > > > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > > > PAGE_SHIFT; > > > > > > npages = pages->npinned - pages->last_npinned; > > > > > > do { > > > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > > > >locked_vm); > > > > > > new_pages = cur_pages + npages; > > > > > > if (new_pages > lock_limit) > > > > > > return -ENOMEM; > > > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > > > cur_pages, > > > > > >new_pages) != cur_pages); > > > > > > > > > > > > So it does work essentially the same. > > > > > > > > > > > > The difference is more subtle, iouring/etc puts the charge in the > > > > > > user > > > > > > so it is additive with things like iouring and additively spans all > > > > > > the users processes. > > > > > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > > > > > The user visible difference will be that a limit X that worked with > > > > > > VFIO may start to fail after a kernel upgrade as the charge > > > > > > accounting > > > > > > is now cross user and additive with things like iommufd. > > > > > > > > > > > > This whole area is a bit peculiar (eg mlock itself works > > > > > > differently), > > > > > > IMHO, but with most of the places doing pins voting to use > > > > > > user->locked_vm as the charge it seems the right path in today's > > > > > > kernel. > > > > > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. 
> > > > > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > > > consistency within qemu. > > > > > > > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > > > the same accounting scheme given the management stack > > > > > has no insight into qemu on which one is actually used thus > > > > > cannot adapt to the subtle difference in between. in this > > > > > regard either we start fixing vfio type1 to use user->locked_vm > > > > > now or have iommufd follow vfio type1 for upward compatibility > > > > > and then change them together at a later point. > > > > > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > > > point w/o certain kernel changes to actually break the userspace > > > > > policy built on a wrong accounting scheme... > > > > > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > > > > > I didn't get this. This thread is about that VFIO uses a wrong accounting > > > scheme and then the discussion is about the impact of fixing it to the > > > userspace. > > > > It's probably too late to fix the kernel considering it may break the > > userspace. > > > > > I didn't see the question on the right place part. > > > > I meant it would be easier
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 10:57 AM > > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > > > From: Jason Wang > > > Sent: Thursday, March 24, 2022 10:28 AM > > > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin > wrote: > > > > > > > > > From: Jason Gunthorpe > > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > > Doesn't this difference need to be accounted in how libvirt manages > VM > > > > > > resource limits? > > > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > > memory limits. > > > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > > { > > > > > rlim = tsk->signal->rlim + resource; > > > > > [..] > > > > > if (new_rlim) > > > > > *rlim = *new_rlim; > > > > > > > > > > Which vfio reads back here: > > > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > > > And iommufd does the same read back: > > > > > > > > > > lock_limit = > > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > > PAGE_SHIFT; > > > > > npages = pages->npinned - pages->last_npinned; > > > > > do { > > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > > >locked_vm); > > > > > new_pages = cur_pages + npages; > > > > > if (new_pages > lock_limit) > > > > > return -ENOMEM; > > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > > cur_pages, > > > > >new_pages) != cur_pages); > > > > > > > > > > So it does work essentially the same. > > > > > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > > > so it is additive with things like iouring and additively spans all > > > > > the users processes. > > > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > > > The user visible difference will be that a limit X that worked with > > > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > > > is now cross user and additive with things like iommufd. > > > > > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > > > IMHO, but with most of the places doing pins voting to use > > > > > user->locked_vm as the charge it seems the right path in today's > > > > > kernel. > > > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > > consistency within qemu. 
> > > > > > I have an impression that iommufd and vfio type1 must use > > > > the same accounting scheme given the management stack > > > > has no insight into qemu on which one is actually used thus > > > > cannot adapt to the subtle difference in between. in this > > > > regard either we start fixing vfio type1 to use user->locked_vm > > > > now or have iommufd follow vfio type1 for upward compatibility > > > > and then change them together at a later point. > > > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > > point w/o certain kernel changes to actually break the userspace > > > > policy built on a wrong accounting scheme... > > > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > > > I didn't get this. This thread is about that VFIO uses a wrong accounting > > scheme and then the discussion is about the impact of fixing it to the > > userspace. > > It's probably too late to fix the kernel considering it may break the > userspace. > > > I didn't see the question on the right place part. > > I meant it would be easier to let userspace know the difference than > trying to fix or workaround in the kernel. Jason already plans to fix RDMA, which will also lead to user-visible impact when Qemu uses both VFIO and RDMA. Why is VFIO so special and left behind to carry the legacy misdesign? > > > > > > so management layer can know the difference
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin wrote: > > > From: Jason Wang > > Sent: Thursday, March 24, 2022 10:28 AM > > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > > > I'm still picking my way through the series, but the later compat > > > > > interface doesn't mention this difference as an outstanding issue. > > > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > > > resource limits? > > > > > > > > AFACIT, no, but it should be checked. > > > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > > memory limits. > > > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > > { > > > > rlim = tsk->signal->rlim + resource; > > > > [..] > > > > if (new_rlim) > > > > *rlim = *new_rlim; > > > > > > > > Which vfio reads back here: > > > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > And iommufd does the same read back: > > > > > > > > lock_limit = > > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > > PAGE_SHIFT; > > > > npages = pages->npinned - pages->last_npinned; > > > > do { > > > > cur_pages = atomic_long_read(&pages->source_user- > > > > >locked_vm); > > > > new_pages = cur_pages + npages; > > > > if (new_pages > lock_limit) > > > > return -ENOMEM; > > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > > cur_pages, > > > >new_pages) != cur_pages); > > > > > > > > So it does work essentially the same. > > > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > > so it is additive with things like iouring and additively spans all > > > > the users processes. > > > > > > > > However vfio is accounting only per-process and only for itself - no > > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > > > The user visible difference will be that a limit X that worked with > > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > > is now cross user and additive with things like iommufd. > > > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > > IMHO, but with most of the places doing pins voting to use > > > > user->locked_vm as the charge it seems the right path in today's > > > > kernel. > > > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > > > I plan to fix RDMA like this as well so at least we can have > > > > consistency within qemu. > > > > > > > > > > I have an impression that iommufd and vfio type1 must use > > > the same accounting scheme given the management stack > > > has no insight into qemu on which one is actually used thus > > > cannot adapt to the subtle difference in between. 
in this > > > regard either we start fixing vfio type1 to use user->locked_vm > > > now or have iommufd follow vfio type1 for upward compatibility > > > and then change them together at a later point. > > > > > > I prefer to the former as IMHO I don't know when will be a later > > > point w/o certain kernel changes to actually break the userspace > > > policy built on a wrong accounting scheme... > > > > I wonder if the kernel is the right place to do this. We have new uAPI > > I didn't get this. This thread is about that VFIO uses a wrong accounting > scheme and then the discussion is about the impact of fixing it to the > userspace. It's probably too late to fix the kernel considering it may break the userspace. > I didn't see the question on the right place part. I meant it would be easier to let userspace know the difference than trying to fix or work around it in the kernel. > > > so management layer can know the difference of the accounting in > > advance by > > > > -device vfio-pci,iommufd=on > > > > I suppose iommufd will be used once Qemu supports it, as long as > the compatibility opens that Jason/Alex discussed in another thread > are well addressed. It is not necessarily to be a control knob exposed > to the caller. It has a lot of implications if we do this: it means iommufd needs to inherit all the userspace-noticeable behaviour as well as the "bugs" of VFIO. We know it's e
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Wang > Sent: Thursday, March 24, 2022 10:28 AM > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > > > I'm still picking my way through the series, but the later compat > > > > interface doesn't mention this difference as an outstanding issue. > > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > > resource limits? > > > > > > AFACIT, no, but it should be checked. > > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > > memory limits. > > > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > > { > > > rlim = tsk->signal->rlim + resource; > > > [..] > > > if (new_rlim) > > > *rlim = *new_rlim; > > > > > > Which vfio reads back here: > > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > And iommufd does the same read back: > > > > > > lock_limit = > > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > > PAGE_SHIFT; > > > npages = pages->npinned - pages->last_npinned; > > > do { > > > cur_pages = atomic_long_read(&pages->source_user- > > > >locked_vm); > > > new_pages = cur_pages + npages; > > > if (new_pages > lock_limit) > > > return -ENOMEM; > > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > > cur_pages, > > >new_pages) != cur_pages); > > > > > > So it does work essentially the same. > > > > > > The difference is more subtle, iouring/etc puts the charge in the user > > > so it is additive with things like iouring and additively spans all > > > the users processes. > > > > > > However vfio is accounting only per-process and only for itself - no > > > other subsystem uses locked as the charge variable for DMA pins. > > > > > > The user visible difference will be that a limit X that worked with > > > VFIO may start to fail after a kernel upgrade as the charge accounting > > > is now cross user and additive with things like iommufd. > > > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > > IMHO, but with most of the places doing pins voting to use > > > user->locked_vm as the charge it seems the right path in today's > > > kernel. > > > > > > Ceratinly having qemu concurrently using three different subsystems > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > > > I plan to fix RDMA like this as well so at least we can have > > > consistency within qemu. > > > > > > > I have an impression that iommufd and vfio type1 must use > > the same accounting scheme given the management stack > > has no insight into qemu on which one is actually used thus > > cannot adapt to the subtle difference in between. in this > > regard either we start fixing vfio type1 to use user->locked_vm > > now or have iommufd follow vfio type1 for upward compatibility > > and then change them together at a later point. > > > > I prefer to the former as IMHO I don't know when will be a later > > point w/o certain kernel changes to actually break the userspace > > policy built on a wrong accounting scheme... 
> > I wonder if the kernel is the right place to do this. We have new uAPI I didn't get this. This thread is about VFIO using a wrong accounting scheme, and the discussion is about the impact on userspace of fixing it. I didn't see the question on the right place part. > so management layer can know the difference of the accounting in > advance by > > -device vfio-pci,iommufd=on > I suppose iommufd will be used once Qemu supports it, as long as the open compatibility issues that Jason/Alex discussed in another thread are well addressed. It does not necessarily have to be a control knob exposed to the caller. Thanks Kevin
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin wrote: > > > From: Jason Gunthorpe > > Sent: Wednesday, March 23, 2022 12:15 AM > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > > > I'm still picking my way through the series, but the later compat > > > interface doesn't mention this difference as an outstanding issue. > > > Doesn't this difference need to be accounted in how libvirt manages VM > > > resource limits? > > > > AFACIT, no, but it should be checked. > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > > memory limits. > > > > Yes, and ulimit does work fully. prlimit adjusts the value: > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > > struct rlimit *new_rlim, struct rlimit *old_rlim) > > { > > rlim = tsk->signal->rlim + resource; > > [..] > > if (new_rlim) > > *rlim = *new_rlim; > > > > Which vfio reads back here: > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > And iommufd does the same read back: > > > > lock_limit = > > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > > PAGE_SHIFT; > > npages = pages->npinned - pages->last_npinned; > > do { > > cur_pages = atomic_long_read(&pages->source_user- > > >locked_vm); > > new_pages = cur_pages + npages; > > if (new_pages > lock_limit) > > return -ENOMEM; > > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > > cur_pages, > >new_pages) != cur_pages); > > > > So it does work essentially the same. > > > > The difference is more subtle, iouring/etc puts the charge in the user > > so it is additive with things like iouring and additively spans all > > the users processes. > > > > However vfio is accounting only per-process and only for itself - no > > other subsystem uses locked as the charge variable for DMA pins. > > > > The user visible difference will be that a limit X that worked with > > VFIO may start to fail after a kernel upgrade as the charge accounting > > is now cross user and additive with things like iommufd. > > > > This whole area is a bit peculiar (eg mlock itself works differently), > > IMHO, but with most of the places doing pins voting to use > > user->locked_vm as the charge it seems the right path in today's > > kernel. > > > > Ceratinly having qemu concurrently using three different subsystems > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > > RLIMIT_MEMLOCK differently cannot be sane or correct. > > > > I plan to fix RDMA like this as well so at least we can have > > consistency within qemu. > > > > I have an impression that iommufd and vfio type1 must use > the same accounting scheme given the management stack > has no insight into qemu on which one is actually used thus > cannot adapt to the subtle difference in between. in this > regard either we start fixing vfio type1 to use user->locked_vm > now or have iommufd follow vfio type1 for upward compatibility > and then change them together at a later point. > > I prefer to the former as IMHO I don't know when will be a later > point w/o certain kernel changes to actually break the userspace > policy built on a wrong accounting scheme... I wonder if the kernel is the right place to do this. We have new uAPI so management layer can know the difference of the accounting in advance by -device vfio-pci,iommufd=on ? 
> > Thanks > Kevin
RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
> From: Jason Gunthorpe > Sent: Wednesday, March 23, 2022 12:15 AM > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > I'm still picking my way through the series, but the later compat > > interface doesn't mention this difference as an outstanding issue. > > Doesn't this difference need to be accounted in how libvirt manages VM > > resource limits? > > AFAICT, no, but it should be checked. > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > memory limits. > > Yes, and ulimit does work fully. prlimit adjusts the value: > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > struct rlimit *new_rlim, struct rlimit *old_rlim) > { > rlim = tsk->signal->rlim + resource; > [..] > if (new_rlim) > *rlim = *new_rlim; > > Which vfio reads back here: > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > And iommufd does the same read back: > > lock_limit = > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > PAGE_SHIFT; > npages = pages->npinned - pages->last_npinned; > do { > cur_pages = atomic_long_read(&pages->source_user->locked_vm); > new_pages = cur_pages + npages; > if (new_pages > lock_limit) > return -ENOMEM; > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > cur_pages, >new_pages) != cur_pages); > > So it does work essentially the same. > > The difference is more subtle, io_uring/etc puts the charge in the user > so it is additive with things like io_uring and additively spans all > the user's processes. > > However vfio is accounting only per-process and only for itself - no > other subsystem uses locked as the charge variable for DMA pins. > > The user visible difference will be that a limit X that worked with > VFIO may start to fail after a kernel upgrade as the charge accounting > is now cross user and additive with things like iommufd. > > This whole area is a bit peculiar (eg mlock itself works differently), > IMHO, but with most of the places doing pins voting to use > user->locked_vm as the charge it seems the right path in today's > kernel. > > Certainly having qemu concurrently using three different subsystems > (vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for > RLIMIT_MEMLOCK differently cannot be sane or correct. > > I plan to fix RDMA like this as well so at least we can have > consistency within qemu. > I have an impression that iommufd and vfio type1 must use the same accounting scheme given the management stack has no insight into qemu on which one is actually used thus cannot adapt to the subtle difference in between. In this regard either we start fixing vfio type1 to use user->locked_vm now or have iommufd follow vfio type1 for upward compatibility and then change them together at a later point. I prefer the former as IMHO I don't know when a later point will come w/o certain kernel changes to actually break the userspace policy built on a wrong accounting scheme... Thanks Kevin
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, Mar 22, 2022 at 05:31:26PM +0100, Niklas Schnelle wrote: > > > In fact I stumbled over that because the wrong accounting in > > > io_uring exhausted the limit applied to vfio (I was using a QEMU utilizing > > > io_uring itself). > > > > I'm pretty interested in this as well, do you have anything you can > > share? > > This was the issue reported in the following BZ. Sorry, I was talking about the io_uring usage in qemu :) Thanks, Jason
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, 2022-03-22 at 11:57 -0300, Jason Gunthorpe wrote: > On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote: > > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote: > > > Following the pattern of io_uring, perf, skb, and bpf iommfd will use > > iommufd ^ > > > user->locked_vm for accounting pinned pages. Ensure the value is included > > > in the struct and export free_uid() as iommufd is modular. > > > > > > user->locked_vm is the correct accounting to use for ulimit because it is > > > per-user, and the ulimit is not supposed to be per-process. Other > > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or > > > mm->locked_vm for accounting pinned pages, but this is only per-process > > > and inconsistent with the majority of the kernel. > > > > Since this will replace parts of vfio this difference seems > > significant. Can you explain this a bit more? > > I'm not sure what to say more, this is the correct way to account for > this. It is natural to see it is right because the ulimit is supposed > to be global to the user, not effectively reset every time the user > creates a new process. > > So checking the ulimit against a per-process variable in the mm_struct > doesn't make too much sense. Yes I agree that logically this makes more sense. I was kind of aiming in the same direction as Alex i.e. it's a conscious decision to do it right and we need to know where this may lead to differences and how to handle them. > > > I'm also a bit confused how io_uring handles this. When I stumbled over > > the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm > > accounting") and from that commit description I seem to remember that > > io_uring also accounts in mm->locked_vm too? > > locked/pinned_pages in the mm is kind of a debugging counter, it > indicates how many pins the user obtained through this mm. AFAICT its > only correct use is to report through proc. Things are supposed to > update it, but there is no reason to limit it as the user limit > supersedes it. > > The commit you pointed at is fixing that io_uring corrupted the value. > > Since VFIO checks locked/pinned_pages against the ulimit, it would blow up > when the value was wrong. > > > In fact I stumbled over that because the wrong accounting in > > io_uring exhausted the limit applied to vfio (I was using a QEMU utilizing > > io_uring itself). > > I'm pretty interested in this as well, do you have anything you can > share? This was the issue reported in the following BZ. https://bugzilla.kernel.org/show_bug.cgi?id=209025 I stumbled over the same problem on my x86 box and also on s390. I don't remember exactly what limit this ran into but I suspect it had something to do with the libvirt resource limits Alex mentioned. Meaning io_uring had an accounting bug and then vfio / QEMU couldn't pin memory. I think that libvirt limit is set to allow pinning all of guest memory plus a bit so the io_uring misaccounting easily tipped it over. > > Thanks, > Jason
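The bug class Niklas hit is easy to state schematically: a long-lived counter like mm->locked_vm only stays meaningful if every charge is paired with exactly one uncharge on the same counter. A minimal sketch of the failure mode (illustrative only, not the actual io_uring code from commit 6b7898eb180d):

	#include <assert.h>

	static long locked_vm;	/* stands in for mm->locked_vm */

	static void account(long npages)   { locked_vm += npages; }
	static void unaccount(long npages) { locked_vm -= npages; }

	int main(void)
	{
		/* A buggy path charges twice (say, once on setup and again
		 * on a retry) but uncharges only once on teardown... */
		account(100);
		account(100);
		unaccount(100);

		/* ...so 100 pages stay "pinned" forever. Every later pin
		 * that checks this counter against RLIMIT_MEMLOCK (as VFIO
		 * does) sees less headroom than really exists and can fail
		 * even though nothing is actually locked. */
		assert(locked_vm == 0);	/* fires: the counter has drifted */
		return 0;
	}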
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> I'm still picking my way through the series, but the later compat
> interface doesn't mention this difference as an outstanding issue.
> Doesn't this difference need to be accounted in how libvirt manages VM
> resource limits?

AFAICT, no, but it should be checked.

> AIUI libvirt uses some form of prlimit(2) to set process locked
> memory limits.

Yes, and ulimit does work fully. prlimit adjusts the value:

int do_prlimit(struct task_struct *tsk, unsigned int resource,
	       struct rlimit *new_rlim, struct rlimit *old_rlim)
{
	rlim = tsk->signal->rlim + resource;
	[..]
	if (new_rlim)
		*rlim = *new_rlim;

Which vfio reads back here:

drivers/vfio/vfio_iommu_type1.c: unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
drivers/vfio/vfio_iommu_type1.c: unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

And iommufd does the same read back:

	lock_limit =
		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	npages = pages->npinned - pages->last_npinned;
	do {
		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
		new_pages = cur_pages + npages;
		if (new_pages > lock_limit)
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
				     cur_pages, new_pages) != cur_pages);

So it does work essentially the same.

The difference is more subtle: io_uring/etc puts the charge in the
user, so it is additive with things like io_uring and additively spans
all the user's processes. However vfio is accounting only per-process
and only for itself - no other subsystem uses locked as the charge
variable for DMA pins.

The user-visible difference will be that a limit X that worked with
VFIO may start to fail after a kernel upgrade as the charge accounting
is now cross user and additive with things like iommufd.

This whole area is a bit peculiar (eg mlock itself works differently),
IMHO, but with most of the places doing pins voting to use
user->locked_vm as the charge it seems the right path in today's
kernel.

Certainly having qemu concurrently using three different subsystems
(vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for
RLIMIT_MEMLOCK differently cannot be sane or correct.

I plan to fix RDMA like this as well so at least we can have
consistency within qemu.

Thanks,
Jason
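To spell out the pattern the iommufd excerpt above relies on, here is a
minimal, self-contained sketch (the helper names are hypothetical; only
task_rlimit(), atomic_long_read(), atomic_long_cmpxchg() and
atomic_long_sub() are real kernel APIs) of charging and uncharging
pinned pages against the per-user counter:

#include <linux/sched/signal.h>
#include <linux/sched/user.h>
#include <linux/mm.h>

/* Charge npages against user->locked_vm, racing safely with other
 * chargers via cmpxchg. The limit is read per-task, but the counter is
 * shared by every process of the user, which is what makes the charge
 * additive across processes and subsystems. */
static int charge_locked_vm(struct user_struct *user, unsigned long npages)
{
	unsigned long lock_limit =
		task_rlimit(current, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long cur_pages, new_pages;

	do {
		cur_pages = atomic_long_read(&user->locked_vm);
		new_pages = cur_pages + npages;
		if (new_pages > lock_limit)
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
				     new_pages) != cur_pages);
	return 0;
}

/* The matching uncharge on unpin. */
static void uncharge_locked_vm(struct user_struct *user, unsigned long npages)
{
	atomic_long_sub(npages, &user->locked_vm);
}

The contrast with today's vfio is that vfio compares against a counter
in the mm_struct, which starts at zero in every new process, so it
never sees charges made by the user's other processes or by other
subsystems.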
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, 22 Mar 2022 11:57:41 -0300 Jason Gunthorpe wrote:
> On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > >
> > > user->locked_vm is the correct accounting to use for ulimit because it is
> > > per-user, and the ulimit is not supposed to be per-process. Other
> > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > > mm->locked_vm for accounting pinned pages, but this is only per-process
> > > and inconsistent with the majority of the kernel.
> >
> > Since this will replace parts of vfio, this difference seems
> > significant. Can you explain this a bit more?
>
> I'm not sure what to say more, this is the correct way to account for
> this. It is natural to see it is right because the ulimit is supposed
> to be global to the user, not effectively reset every time the user
> creates a new process.
>
> So checking the ulimit against a per-process variable in the mm_struct
> doesn't make too much sense.

I'm still picking my way through the series, but the later compat
interface doesn't mention this difference as an outstanding issue.
Doesn't this difference need to be accounted in how libvirt manages VM
resource limits? AIUI libvirt uses some form of prlimit(2) to set
process locked memory limits.

A compat interface might be an interesting feature, but does it only
provide ioctl compatibility and not resource ulimit compatibility?

Thanks,
Alex
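As a concrete illustration of the mechanism Alex refers to (a hedged
userspace sketch: prlimit(2) and RLIMIT_MEMLOCK are real, while the pid
and the sizes are made up), management software can adjust a running
QEMU's locked-memory limit like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(void)
{
	pid_t qemu_pid = 12345;			/* hypothetical target pid */
	struct rlimit new_lim = {
		.rlim_cur = (rlim_t)8 << 30,	/* 8 GiB, in bytes */
		.rlim_max = (rlim_t)8 << 30,
	};

	/* Writes the target's RLIMIT_MEMLOCK (tsk->signal->rlim in the
	 * do_prlimit() excerpt below), which vfio and iommufd later read
	 * back with rlimit() / task_rlimit(). */
	if (prlimit(qemu_pid, RLIMIT_MEMLOCK, &new_lim, NULL) != 0) {
		perror("prlimit");
		return 1;
	}
	return 0;
}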
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > Following the pattern of io_uring, perf, skb, and bpf iommfd will use
>                                                  iommufd ^
> > user->locked_vm for accounting pinned pages. Ensure the value is included
> > in the struct and export free_uid() as iommufd is modular.
> >
> > user->locked_vm is the correct accounting to use for ulimit because it is
> > per-user, and the ulimit is not supposed to be per-process. Other
> > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > mm->locked_vm for accounting pinned pages, but this is only per-process
> > and inconsistent with the majority of the kernel.
>
> Since this will replace parts of vfio, this difference seems
> significant. Can you explain this a bit more?

I'm not sure what to say more, this is the correct way to account for
this. It is natural to see it is right because the ulimit is supposed
to be global to the user, not effectively reset every time the user
creates a new process.

So checking the ulimit against a per-process variable in the mm_struct
doesn't make too much sense.

> I'm also a bit confused how io_uring handles this. When I stumbled over
> the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
> accounting") and from that commit description I seem to remember that
> io_uring also accounts in mm->locked_vm too?

locked/pinned_pages in the mm is kind of a debugging counter, it
indicates how many pins the user obtained through this mm. AFAICT its
only correct use is to report through proc. Things are supposed to
update it, but there is no reason to limit it as the user limit
supersedes it.

The commit you pointed at is fixing that io_uring corrupted the value.
Since VFIO checks locked/pinned_pages against the ulimit, it would blow
up when the value was wrong.

> In fact I stumbled over that because the wrong accounting in
> io_uring exhausted the limit applied to vfio (I was using a QEMU
> utilizing io_uring itself).

I'm pretty interested in this as well, do you have anything you can
share?

Thanks,
Jason
Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> Following the pattern of io_uring, perf, skb, and bpf iommfd will use
                                                 iommufd ^
> user->locked_vm for accounting pinned pages. Ensure the value is included
> in the struct and export free_uid() as iommufd is modular.
>
> user->locked_vm is the correct accounting to use for ulimit because it is
> per-user, and the ulimit is not supposed to be per-process. Other
> places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> mm->locked_vm for accounting pinned pages, but this is only per-process
> and inconsistent with the majority of the kernel.

Since this will replace parts of vfio, this difference seems
significant. Can you explain this a bit more?

I'm also a bit confused how io_uring handles this. When I stumbled over
the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
accounting") and from that commit description I seem to remember that
io_uring also accounts in mm->locked_vm too?

In fact I stumbled over that because the wrong accounting in io_uring
exhausted the limit applied to vfio (I was using a QEMU utilizing
io_uring itself).

>
> Signed-off-by: Jason Gunthorpe
> ---
>  include/linux/sched/user.h | 2 +-
>  kernel/user.c              | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
> index 00ed419dd46413..c47dae71dad3c8 100644
> --- a/include/linux/sched/user.h
> +++ b/include/linux/sched/user.h
> @@ -24,7 +24,7 @@ struct user_struct {
>  	kuid_t uid;
>
>  #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
> -defined(CONFIG_NET) || defined(CONFIG_IO_URING)
> +defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
>  	atomic_long_t locked_vm;
>  #endif
>  #ifdef CONFIG_WATCH_QUEUE
> diff --git a/kernel/user.c b/kernel/user.c
> index e2cf8c22b539a7..d667debeafd609 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
>  	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
>  		free_user(up, flags);
>  }
> +EXPORT_SYMBOL_GPL(free_uid);
>
>  struct user_struct *alloc_uid(kuid_t uid)
>  {
[PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd
Following the pattern of io_uring, perf, skb, and bpf iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is
included in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the correct accounting to use for ulimit because it
is per-user, and the ulimit is not supposed to be per-process. Other
places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
mm->locked_vm for accounting pinned pages, but this is only per-process
and inconsistent with the majority of the kernel.

Signed-off-by: Jason Gunthorpe
---
 include/linux/sched/user.h | 2 +-
 kernel/user.c              | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 00ed419dd46413..c47dae71dad3c8 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -24,7 +24,7 @@ struct user_struct {
 	kuid_t uid;

 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-defined(CONFIG_NET) || defined(CONFIG_IO_URING)
+defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
 	atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
 	if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
 		free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);

 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.35.1
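Why the export matters: a minimal sketch (hypothetical struct and
function names; get_uid(), current_user() and free_uid() are the real
kernel APIs) of how a modular consumer like iommufd would need to hold
the user_struct for as long as it has charges outstanding against
user->locked_vm:

#include <linux/cred.h>
#include <linux/sched/user.h>

/* Hypothetical per-context state for a module that charges pins
 * against user->locked_vm. */
struct pin_ctx {
	struct user_struct *user;
};

static void pin_ctx_init(struct pin_ctx *ctx)
{
	/* Take a reference so the user_struct (and its locked_vm counter)
	 * stays valid while this context holds charges against it. */
	ctx->user = get_uid(current_user());
}

static void pin_ctx_destroy(struct pin_ctx *ctx)
{
	/* Dropping a possibly-last reference from module code is only
	 * possible because free_uid() is exported by this patch. */
	free_uid(ctx->user);
}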