Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-29 Thread Jason Gunthorpe via iommu
On Tue, Mar 29, 2022 at 12:59:52PM +0800, Jason Wang wrote:

> vDPA has backend feature negotiation, so userspace can actually
> tell vDPA to go with the new accounting approach. I'm not sure RDMA can do
> the same.

A security feature userspace can ask to turn off is not really a
security feature.

Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-28 Thread Jason Wang
On Mon, Mar 28, 2022 at 8:23 PM Jason Gunthorpe  wrote:
>
> On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote:
> > To me, it looks easier to not answer this question by letting
> > userspace know about the change,
>
> That is not backwards compatible, so I don't think it helps unless we
> say if you open /dev/vfio/vfio you get old behavior and if you open
> /dev/iommu you get new...

Actually, this is one way to go. Trying to behave exactly like type1
might not be easy.

>
> Nor does it answer if I can fix RDMA or not :\
>

vDPA has backend feature negotiation, so userspace can actually
tell vDPA to go with the new accounting approach. I'm not sure RDMA can do
the same.
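
Roughly, the negotiation from userspace could look like the sketch below.
Note the accounting feature bit is purely hypothetical (it does not exist in
the vhost uAPI today); only the two ioctls are real:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* hypothetical bit, NOT part of the vhost uAPI today */
    #define VHOST_BACKEND_F_USER_ACCOUNTING 0x10

    static void negotiate_accounting(int vdpa_fd)
    {
            uint64_t features = 0;

            /* ask the backend which features it offers */
            ioctl(vdpa_fd, VHOST_GET_BACKEND_FEATURES, &features);
            if (features & (1ULL << VHOST_BACKEND_F_USER_ACCOUNTING)) {
                    /* explicitly opt in to the new accounting behaviour */
                    features = 1ULL << VHOST_BACKEND_F_USER_ACCOUNTING;
                    ioctl(vdpa_fd, VHOST_SET_BACKEND_FEATURES, &features);
            }
    }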

Thanks

> So we really do need to know what exactly is the situation here.
>
> Jason
>



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-28 Thread Jason Gunthorpe via iommu
On Mon, Mar 28, 2022 at 02:14:26PM +0100, Sean Mooney wrote:
> On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> > On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe  wrote:
> > > 
> > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > > 
> > > > It's simply because we don't want to break existing userspace. [1]
> > > 
> > > I'm still waiting to hear what exactly breaks in real systems.
> > > 
> > > As I explained this is not a significant change, but it could break
> > > something in a few special scenarios.
> > > 
> > > Also the one place we do have ABI breaks is security, and ulimit is a
> > > security mechanism that isn't working right. So we do clearly need to
> > > understand *exactly* what real thing breaks - if anything.
> > > 
> > > Jason
> > > 
> > 
> > To tell the truth, I don't know. I remember that Openstack may do some
> > accounting so adding Sean for more comments. But we really can't imagine
> > openstack is the only userspace that may use this.
> sorry, there is a lot of context to this discussion; i have tried to read
> back the thread but i may have missed part of it.

Thanks Sean, this is quite interesting, though I'm not sure it
entirely answers the question.

> tl;dr openstack does not currently track locked/pinned memory per
> user or per vm because we have no idea when libvirt will request it
> or how much is needed per device. when ulimits are configured today
> for nova/openstack it's done at the qemu user level outside of
> openstack in our installer tooling.  e.g. in tripleo the ulimits
> would be set on the nova_libvirt container to constrain all vms
> spawned, not per vm/process.

So, today, you expect the ulimit to be machine wide, like if your
machine has 1TB of memory you'd set the ulimit at 0.9 TB and you'd
expect the layers underneath to limit memory pinning to 0.9TB globally for all
qemus?

To be clear it doesn't work that way today at all, you might as well
just not bother setting ulimit to anything less than unlimited at the
openstack layer.

>hard_limit
>
>The optional hard_limit element is the maximum memory the
>guest can use. The units for this value are kibibytes
>(i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly
>advised not to set this limit as domain may get killed by the
>kernel if the value is too low, and determining the memory needed
>for a process to run is an undecidable problem; that said, if you
>already set locked in memory backing because your workload
>demands it, you'll have to take into account the specifics of
>your deployment and figure out a value for hard_limit that is
>large enough to support the memory requirements of your guest,
>but small enough to protect your host against a malicious guest
>locking all memory.

And hard_limit is the ulimit that Alex was talking about?

So now we switched from talking about global per-user things to
per-qemu-instance things?

> we could not figure out how to automatically compute a hard_limit in
> nova that would work for everyone and we felt exposing this to our
> users/operators was a bit of a cop out when they likely can't calculate
> that properly either.

Not surprising..

> As a result we can't actually account for them
> today when scheduling workloads to a host. I'm not sure this would
> change even if you exposed new user space apis unless we had a way
> to inspect each VF to know how much locked memory that VF would need
> to lock?

We are not talking about a new uAPI, we are talking about changing the
meaning of the existing ulimit. You can see it in your message above,
at the openstack level you were talking about global limits and then
in the libvirt level you are talking about per-qemu limits.

In the kernel both of these are being used by the same control and one
of the users is wrong.

The kernel consensus is that the ulimit is per-user and is used by all
kernel entities consistently

Currently vfio is different and uses it per-process and effectively
has its own private bucket.

When you talk about VDPA you start to see the problems here because
VDPA uses a different accounting from VFIO. If you run VFIO and VDPA
together then you should need 2x the ulimit, but today you only need
1x because they don't share accounting buckets.

This also means the ulimit doesn't actually work the way it is
supposed to.

The question is how to fix it and, if we do fix it, how much out there
cares that things work differently.
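
To make the difference concrete, here is a rough sketch of the two charging
schemes (not the actual driver code, just the shape of it), both checked
against the same RLIMIT_MEMLOCK:

    unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

    /* vfio type1 today: charge the process that did the pin */
    mmap_write_lock(current->mm);
    if (current->mm->locked_vm + npages > lock_limit)
            ret = -ENOMEM;
    else
            current->mm->locked_vm += npages;
    mmap_write_unlock(current->mm);

    /* iommufd/io_uring style: charge the user, shared across all of the
     * user's processes and across subsystems */
    if (atomic_long_add_return(npages, &user->locked_vm) > lock_limit) {
            atomic_long_sub(npages, &user->locked_vm);
            ret = -ENOMEM;
    }

With the second form two qemus owned by the same user draw from one bucket,
which is where the compatibility question comes from.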

Jason

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-28 Thread Sean Mooney
On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe  wrote:
> > 
> > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > 
> > > It's simply because we don't want to break existing userspace. [1]
> > 
> > I'm still waiting to hear what exactly breaks in real systems.
> > 
> > As I explained this is not a significant change, but it could break
> > something in a few special scenarios.
> > 
> > Also the one place we do have ABI breaks is security, and ulimit is a
> > security mechanism that isn't working right. So we do clearly need to
> > understand *exactly* what real thing breaks - if anything.
> > 
> > Jason
> > 
> 
> To tell the truth, I don't know. I remember that Openstack may do some
> accounting so adding Sean for more comments. But we really can't imagine
> openstack is the only userspace that may use this.
sorry, there is a lot of context to this discussion; i have tried to read back the
thread but i may have missed part of it.

tl;dr openstack does not currently track locked/pinned memory per user or per vm
because we have no idea when libvirt will request it or how much is needed per
device. when ulimits are configured today for nova/openstack it's done at the
qemu user level outside of openstack in our installer tooling. e.g. in tripleo
the ulimits would be set on the nova_libvirt container to constrain all vms
spawned, not per vm/process.

full response below
---

openstack's history with locked/pinned/unswappable memory is a bit complicated.
we currently only request locked memory explicitly in 2 cases directly
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/driver.py#L5769-L5784
when the administrator configures the vm flavor to request amd's SEV feature or
configures the flavor for realtime scheduling priority.
i say explicitly as libvirt invented an implicit request for locked/pinned pages
for sriov VFs and a number of other cases which we were not aware of. this only
became apparent when we went to add vdpa support to openstack and libvirt did
not make that implicit request and we had to fall back to requesting realtime
instances as a workaround.

nova/openstack does have the ability to generate the libvirt xml elements that
configure hard and soft limits
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/config.py#L2559-L2590
however it is only ever used in our test code
https://github.com/openstack/nova/search?q=LibvirtConfigGuestMemoryTune
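
for reference, the element that code generates would end up in the domain xml
looking something like this (the numbers are invented, nova does not compute
them today):

    <memtune>
      <hard_limit unit='KiB'>9437184</hard_limit>
      <soft_limit unit='KiB'>8388608</soft_limit>
    </memtune>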

the description of hard_limit in the libvirt docs strongly discourages its use,
with a small caveat for locked memory
https://libvirt.org/formatdomain.html#memory-tuning

   hard_limit

   The optional hard_limit element is the maximum memory the guest can use.
   The units for this value are kibibytes (i.e. blocks of 1024 bytes). Users
   of QEMU and KVM are strongly advised not to set this limit as domain may get
   killed by the kernel if the value is too low, and determining the memory
   needed for a process to run is an undecidable problem; that said, if you
   already set locked in memory backing because your workload demands it, you'll
   have to take into account the specifics of your deployment and figure out a
   value for hard_limit that is large enough to support the memory
   requirements of your guest, but small enough to protect your host against a
   malicious guest locking all memory.
we could not figure out how to automatically compute a hard_limit in nova that
would work for everyone and we felt exposing this to our users/operators was a
bit of a cop out when they likely can't calculate that properly either. As a
result we can't actually account for them today when scheduling workloads to a
host. I'm not sure this would change even if you exposed new user space apis
unless we had a way to inspect each VF to know how much locked memory that VF
would need to lock? same for vdpa devices, mdevs etcetera. cloud systems don't
normally have quotas on "locked" memory used transitively via passthrough
devices so even if we had this info it's not immediately apparent how we would
consume it without altering our existing quotas. Openstack is a self service
cloud platform where end users can upload their own workload images so it's
basically impossible for the operator of the cloud to know how much memory to
set the hard limit to without setting it overly large in most cases. from a
management application point of view we currently have no insight into how
memory will be pinned in the kernel or when libvirt will invent additional
requests for pinned/locked memory or how large they are.

instead of going down that route operators are encouraged to use ulimit to set a
global limit on the amount of memory the nova/qemu user can use.
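
as an illustration, in the installer tooling that ends up being something along
the lines of the following (the exact file and values are deployment specific
and invented here):

    # /etc/security/limits.d/99-nova-memlock.conf on the compute host
    nova    soft    memlock    unlimited
    nova    hard    memlock    unlimited

or the equivalent --ulimit memlock=-1:-1 option on the nova_libvirt container.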
while nova/openstack supports multi tenancy we do not expose that multi tenancy
to the underlying hypervisor hosts. the agents are typically
deployed as the 

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-28 Thread Jason Gunthorpe via iommu
On Mon, Mar 28, 2022 at 09:53:27AM +0800, Jason Wang wrote:
> To me, it looks easier to not answer this question by letting
> userspace know about the change,

That is not backwards compatible, so I don't think it helps unless we
say if you open /dev/vfio/vfio you get old behavior and if you open
/dev/iommu you get new...

Nor does it answer if I can fix RDMA or not :\

So we really do need to know what exactly is the situation here.

Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-27 Thread Jason Wang
On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe  wrote:
>
> On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
>
> > It's simply because we don't want to break existing userspace. [1]
>
> I'm still waiting to hear what exactly breaks in real systems.
>
> As I explained this is not a significant change, but it could break
> something in a few special scenarios.
>
> Also the one place we do have ABI breaks is security, and ulimit is a
> security mechanism that isn't working right. So we do clearly need to
> understand *exactly* what real thing breaks - if anything.
>
> Jason
>

To tell the truth, I don't know. I remember that Openstack may do some
accounting so adding Sean for more comments. But we really can't imagine
openstack is the only userspace that may use this.

To me, it looks easier to not answer this question by letting
userspace know about the change,

Thanks



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Alex Williamson
On Thu, 24 Mar 2022 19:27:39 -0300
Jason Gunthorpe  wrote:

> On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> > On Tue, 22 Mar 2022 13:15:21 -0300
> > Jason Gunthorpe via iommu  wrote:
> >   
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >   
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?  
> > > 
> > > AFAICT, no, but it should be checked.
> > >   
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.
> > > 
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > 
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > >   rlim = tsk->signal->rlim + resource;
> > > [..]
> > >   if (new_rlim)
> > >   *rlim = *new_rlim;
> > > 
> > > Which vfio reads back here:
> > > 
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > 
> > > And iommufd does the same read back:
> > > 
> > >   lock_limit =
> > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >   npages = pages->npinned - pages->last_npinned;
> > >   do {
> > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > >   new_pages = cur_pages + npages;
> > >   if (new_pages > lock_limit)
> > >   return -ENOMEM;
> > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> > >new_pages) != cur_pages);
> > > 
> > > So it does work essentially the same.  
> > 
> > Well, except for the part about vfio updating mm->locked_vm and iommufd
> > updating user->locked_vm, a per-process counter versus a per-user
> > counter.  prlimit specifically sets process resource limits, which get
> > reflected in task_rlimit.  
> 
> Indeed, but that is not how the majority of other things seem to
> operate it.
> 
> > For example, let's say a user has two 4GB VMs and they're hot-adding
> > vfio devices to each of them, so libvirt needs to dynamically modify
> > the locked memory limit for each VM.  AIUI, libvirt would look at the
> > VM size and call prlimit to set that value.  If libvirt does this to
> > both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> > pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> > we're comparing a per-task/process limit to a per-user counter.  So I'm
> > a bit lost how both VMs can pin their pages here.  
> 
> I don't know anything about libvirt - it seems strange to use a
> security-ish feature like ulimit but not securely isolate processes
> with real users.
> 
> But if it really does this then it really does this.
> 
> So at the very least VFIO container has to keep working this way.
> 
> The next question is if we want iommufd's own device node to work this
> way and try to change libvirt somehow. It seems libvirt will have to
> deal with this at some point as iouring will trigger the same problem.
> 
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.  
> > 
> > The philosophy of whether it's ultimately a better choice for the
> > kernel aside, if userspace breaks because we're accounting in a
> > per-user pool rather than a per-process pool, then our compatibility
> > layer ain't so transparent.  
> 
> Sure, if it doesn't work it doesn't work. Let's be sure to clearly
> document what the compatibility issue is and then we have to keep it
> per-process.
> 
> And the same reasoning likely means I can't change RDMA either as qemu
> will break just as well when qemu uses rdma mode.
> 
> Which is pretty sucky, but it is what it is..

I added Daniel Berrangé to the cc list for my previous reply, hopefully
he can comment whether libvirt has the sort of user security model you
allude to above that maybe makes this a non-issue for this use case.
Unfortunately it's extremely difficult to prove that there are no such
use cases out there even if libvirt is ok.  Thanks,

Alex


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> On Tue, 22 Mar 2022 13:15:21 -0300
> Jason Gunthorpe via iommu  wrote:
> 
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > 
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> > 
> > AFAICT, no, but it should be checked.
> > 
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.  
> > 
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> > 
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> > rlim = tsk->signal->rlim + resource;
> > [..]
> > if (new_rlim)
> > *rlim = *new_rlim;
> > 
> > Which vfio reads back here:
> > 
> > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > 
> > And iommufd does the same read back:
> > 
> > lock_limit =
> > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > npages = pages->npinned - pages->last_npinned;
> > do {
> > cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > new_pages = cur_pages + npages;
> > if (new_pages > lock_limit)
> > return -ENOMEM;
> > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> >  new_pages) != cur_pages);
> > 
> > So it does work essentially the same.
> 
> Well, except for the part about vfio updating mm->locked_vm and iommufd
> updating user->locked_vm, a per-process counter versus a per-user
> counter.  prlimit specifically sets process resource limits, which get
> reflected in task_rlimit.

Indeed, but that is not how the majority of other things seem to
operate it.

> For example, let's say a user has two 4GB VMs and they're hot-adding
> vfio devices to each of them, so libvirt needs to dynamically modify
> the locked memory limit for each VM.  AIUI, libvirt would look at the
> VM size and call prlimit to set that value.  If libvirt does this to
> both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> we're comparing a per-task/process limit to a per-user counter.  So I'm
> a bit lost how both VMs can pin their pages here.

I don't know anything about libvirt - it seems strange to use a
security-ish feature like ulimit but not securely isolate processes
with real users.

But if it really does this then it really does this.

So at the very least VFIO container has to keep working this way.

The next question is if we want iommufd's own device node to work this
way and try to change libvirt somehow. It seems libvirt will have to
deal with this at some point as iouring will trigger the same problem.

> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> 
> The philosophy of whether it's ultimately a better choice for the
> kernel aside, if userspace breaks because we're accounting in a
> per-user pool rather than a per-process pool, then our compatibility
> layer ain't so transparent.

Sure, if it doesn't work it doesn't work. Let's be sure to clearly
document what the compatibility issue is and then we have to keep it
per-process.

And the same reasoning likely means I can't change RDMA either as qemu
will break just as well when qemu uses rdma mode.

Which is pretty sucky, but it is what it is..

Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Alex Williamson
On Tue, 22 Mar 2022 13:15:21 -0300
Jason Gunthorpe via iommu  wrote:

> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.  
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
>   struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
>   rlim = tsk->signal->rlim + resource;
> [..]
>   if (new_rlim)
>   *rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
>   lock_limit =
>   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>   npages = pages->npinned - pages->last_npinned;
>   do {
>   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
>   new_pages = cur_pages + npages;
>   if (new_pages > lock_limit)
>   return -ENOMEM;
>   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
>new_pages) != cur_pages);
> 
> So it does work essentially the same.

Well, except for the part about vfio updating mm->locked_vm and iommufd
updating user->locked_vm, a per-process counter versus a per-user
counter.  prlimit specifically sets process resource limits, which get
reflected in task_rlimit.

For example, let's say a user has two 4GB VMs and they're hot-adding
vfio devices to each of them, so libvirt needs to dynamically modify
the locked memory limit for each VM.  AIUI, libvirt would look at the
VM size and call prlimit to set that value.  If libvirt does this to
both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
pages to mm->locked_vm, so this works well.  In the iommufd loop above,
we're comparing a per-task/process limit to a per-user counter.  So I'm
a bit lost how both VMs can pin their pages here.

Am I missing some assumption about how libvirt uses prlimit or
sandboxes users?
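
(For reference, AIUI the dynamic adjustment libvirt makes is roughly equivalent
to running, per QEMU process:

    # raise RLIMIT_MEMLOCK to 4GB for an already running QEMU
    prlimit --pid <qemu-pid> --memlock=4294967296:4294967296

so the limit itself is clearly per-process; it's only the charge counter that
differs between the two schemes.)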

> The difference is more subtle, iouring/etc puts the charge in the user
> so it is additive with things like iouring and additively spans all
> the user's processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.

And that's exactly the concern.
 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.

The philosophy of whether it's ultimately a better choice for the
kernel aside, if userspace breaks because we're accounting in a
per-user pool rather than a per-process pool, then our compatibility
layer ain't so transparent.

> Certainly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.

I think everyone would agree with that, but it also seems there are
real differences between task_rlimits and per-user vs per-process
accounting buckets and I'm confused how that's not a blocker for trying
to implement transparent compatibility.  Thanks,

Alex



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:

> It's simply because we don't want to break existing userspace. [1]

I'm still waiting to hear what exactly breaks in real systems.

As I explained this is not a significant change, but it could break
something in a few special scenarios.

Also the one place we do have ABI breaks is security, and ulimit is a
security mechanism that isn't working right. So we do clearly need to
understand *exactly* what real thing breaks - if anything.

Jason


RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 11:51 AM
> 
> > >
> >
> > In the end vfio type1 will be replaced by iommufd compat layer. With
> > that goal in mind iommufd has to inherit type1 behaviors.
> 
> So the compatibility should be provided by the compat layer instead of
> the core iommufd.
> 
> And I wonder what happens if iommufd is used by other subsystems like
> vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do
> we need a per subsystem new uAPI to expose this capability? If yes,
> why can't VFIO have such an API then we don't even need the compat
> layer at all?
> 

No, the compat layer is just for vfio. Other subsystems, including vdpa, are
expected to use the iommu uAPI directly, except for having their own
bind_iommufd and attach_ioas uAPIs to build the connection between
device and iommufd/ioas.

And having a compat layer for vfio is just for transition purposes. Yi has
demonstrated how vfio can follow what other subsystems are
expected to do here:

https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6
(specifically commits related to "cover-letter: Adapting vfio-pci to iommufd")

Thanks
Kevin


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:57 AM
> >
> > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Wang 
> > > > Sent: Thursday, March 24, 2022 10:28 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin 
> > wrote:
> > > > >
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > > >
> > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > I'm still picking my way through the series, but the later compat
> > > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > > Doesn't this difference need to be accounted in how libvirt 
> > > > > > > manages
> > VM
> > > > > > > resource limits?
> > > > > >
> > > > > > AFAICT, no, but it should be checked.
> > > > > >
> > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > > memory limits.
> > > > > >
> > > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > > >
> > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > > {
> > > > > >   rlim = tsk->signal->rlim + resource;
> > > > > > [..]
> > > > > >   if (new_rlim)
> > > > > >   *rlim = *new_rlim;
> > > > > >
> > > > > > Which vfio reads back here:
> > > > > >
> > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >
> > > > > > And iommufd does the same read back:
> > > > > >
> > > > > >   lock_limit =
> > > > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > > PAGE_SHIFT;
> > > > > >   npages = pages->npinned - pages->last_npinned;
> > > > > >   do {
> > > > > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > > > >   new_pages = cur_pages + npages;
> > > > > >   if (new_pages > lock_limit)
> > > > > >   return -ENOMEM;
> > > > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > > cur_pages,
> > > > > >new_pages) != cur_pages);
> > > > > >
> > > > > > So it does work essentially the same.
> > > > > >
> > > > > > The difference is more subtle, iouring/etc puts the charge in the 
> > > > > > user
> > > > > > so it is additive with things like iouring and additively spans all
> > > > > > the user's processes.
> > > > > >
> > > > > > However vfio is accounting only per-process and only for itself - no
> > > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > > >
> > > > > > The user visible difference will be that a limit X that worked with
> > > > > > VFIO may start to fail after a kernel upgrade as the charge 
> > > > > > accounting
> > > > > > is now cross user and additive with things like iommufd.
> > > > > >
> > > > > > This whole area is a bit peculiar (eg mlock itself works 
> > > > > > differently),
> > > > > > IMHO, but with most of the places doing pins voting to use
> > > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > > kernel.
> > > > > >
> > > > > > Certainly having qemu concurrently using three different subsystems
> > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > > >
> > > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > > consistency within qemu.
> > > > > >
> > > > >
> > > > > I have an impression that iommufd and vfio type1 must use
> > > > > the same accounting scheme given the management stack
> > > > > has no insight into qemu on which one is actually used thus
> > > > > cannot adapt to the subtle difference in between. in this
> > > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > > and then change them together at a later point.
> > > > >
> > > > > I prefer to the former as IMHO I don't know when will be a later
> > > > > point w/o certain kernel changes to actually break the userspace
> > > > > policy built on a wrong accounting scheme...
> > > >
> > > > I wonder if the kernel is the right place to do this. We have new uAPI
> > >
> > > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > > scheme and then the discussion is about the impact of fixing it to the
> > > userspace.
> >
> > It's probably too late to fix the kernel considering it may break the 
> > userspace.
> >
> > > I didn't see the question on the right place part.
> >
> > I meant it would be easier to let userspace know the difference than
> > trying to fix or workaround in the kernel.

RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 10:57 AM
> 
> On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
> >
> > > From: Jason Wang 
> > > Sent: Thursday, March 24, 2022 10:28 AM
> > >
> > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin 
> wrote:
> > > >
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > >
> > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > >
> > > > > > I'm still picking my way through the series, but the later compat
> > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > Doesn't this difference need to be accounted in how libvirt manages
> VM
> > > > > > resource limits?
> > > > >
> > > > > AFAICT, no, but it should be checked.
> > > > >
> > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > memory limits.
> > > > >
> > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > >
> > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > {
> > > > >   rlim = tsk->signal->rlim + resource;
> > > > > [..]
> > > > >   if (new_rlim)
> > > > >   *rlim = *new_rlim;
> > > > >
> > > > > Which vfio reads back here:
> > > > >
> > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > >
> > > > > And iommufd does the same read back:
> > > > >
> > > > >   lock_limit =
> > > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > > PAGE_SHIFT;
> > > > >   npages = pages->npinned - pages->last_npinned;
> > > > >   do {
> > > > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > > >   new_pages = cur_pages + npages;
> > > > >   if (new_pages > lock_limit)
> > > > >   return -ENOMEM;
> > > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > cur_pages,
> > > > >new_pages) != cur_pages);
> > > > >
> > > > > So it does work essentially the same.
> > > > >
> > > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > > so it is additive with things like iouring and additively spans all
> > > > > the user's processes.
> > > > >
> > > > > However vfio is accounting only per-process and only for itself - no
> > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > >
> > > > > The user visible difference will be that a limit X that worked with
> > > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > > is now cross user and additive with things like iommufd.
> > > > >
> > > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > > IMHO, but with most of the places doing pins voting to use
> > > > > user->locked_vm as the charge it seems the right path in today's
> > > > > kernel.
> > > > >
> > > > > Certainly having qemu concurrently using three different subsystems
> > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > >
> > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > consistency within qemu.
> > > > >
> > > >
> > > > I have an impression that iommufd and vfio type1 must use
> > > > the same accounting scheme given the management stack
> > > > has no insight into qemu on which one is actually used thus
> > > > cannot adapt to the subtle difference in between. in this
> > > > regard either we start fixing vfio type1 to use user->locked_vm
> > > > now or have iommufd follow vfio type1 for upward compatibility
> > > > and then change them together at a later point.
> > > >
> > > > I prefer to the former as IMHO I don't know when will be a later
> > > > point w/o certain kernel changes to actually break the userspace
> > > > policy built on a wrong accounting scheme...
> > >
> > > I wonder if the kernel is the right place to do this. We have new uAPI
> >
> > I didn't get this. This thread is about that VFIO uses a wrong accounting
> > scheme and then the discussion is about the impact of fixing it to the
> > userspace.
> 
> It's probably too late to fix the kernel considering it may break the 
> userspace.
> 
> > I didn't see the question on the right place part.
> 
> I meant it would be easier to let userspace know the difference than
> trying to fix or workaround in the kernel.

Jason already plans to fix RDMA, which will also lead to user-visible
impact when Qemu uses both VFIO and RDMA. Why is VFIO so special
and left behind to carry the legacy misdesign?

> 
> >
> > > so management layer can know the difference of the accounting in
> > > advance by
> > >
> > > -device vfio-pci,iommufd=on

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin  wrote:
>
> > From: Jason Wang 
> > Sent: Thursday, March 24, 2022 10:28 AM
> >
> > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
> > >
> > > > From: Jason Gunthorpe 
> > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > >
> > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > >
> > > > > I'm still picking my way through the series, but the later compat
> > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > > resource limits?
> > > >
> > > > AFAICT, no, but it should be checked.
> > > >
> > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > memory limits.
> > > >
> > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > >
> > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > {
> > > >   rlim = tsk->signal->rlim + resource;
> > > > [..]
> > > >   if (new_rlim)
> > > >   *rlim = *new_rlim;
> > > >
> > > > Which vfio reads back here:
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > >
> > > > And iommufd does the same read back:
> > > >
> > > >   lock_limit =
> > > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > > PAGE_SHIFT;
> > > >   npages = pages->npinned - pages->last_npinned;
> > > >   do {
> > > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > >   new_pages = cur_pages + npages;
> > > >   if (new_pages > lock_limit)
> > > >   return -ENOMEM;
> > > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > cur_pages,
> > > >new_pages) != cur_pages);
> > > >
> > > > So it does work essentially the same.
> > > >
> > > > The difference is more subtle, iouring/etc puts the charge in the user
> > > > so it is additive with things like iouring and additively spans all
> > > > the user's processes.
> > > >
> > > > However vfio is accounting only per-process and only for itself - no
> > > > other subsystem uses locked as the charge variable for DMA pins.
> > > >
> > > > The user visible difference will be that a limit X that worked with
> > > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > > is now cross user and additive with things like iommufd.
> > > >
> > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > IMHO, but with most of the places doing pins voting to use
> > > > user->locked_vm as the charge it seems the right path in today's
> > > > kernel.
> > > >
> > > > Certainly having qemu concurrently using three different subsystems
> > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > >
> > > > I plan to fix RDMA like this as well so at least we can have
> > > > consistency within qemu.
> > > >
> > >
> > > I have an impression that iommufd and vfio type1 must use
> > > the same accounting scheme given the management stack
> > > has no insight into qemu on which one is actually used thus
> > > cannot adapt to the subtle difference in between. in this
> > > regard either we start fixing vfio type1 to use user->locked_vm
> > > now or have iommufd follow vfio type1 for upward compatibility
> > > and then change them together at a later point.
> > >
> > > I prefer to the former as IMHO I don't know when will be a later
> > > point w/o certain kernel changes to actually break the userspace
> > > policy built on a wrong accounting scheme...
> >
> > I wonder if the kernel is the right place to do this. We have new uAPI
>
> I didn't get this. This thread is about that VFIO uses a wrong accounting
> scheme and then the discussion is about the impact of fixing it to the
> userspace.

It's probably too late to fix the kernel considering it may break the userspace.

> I didn't see the question on the right place part.

I meant it would be easier to let userspace know the difference than
trying to fix or workaround in the kernel.

>
> > so management layer can know the difference of the accounting in
> > advance by
> >
> > -device vfio-pci,iommufd=on
> >
>
> I suppose iommufd will be used once Qemu supports it, as long as
> the open compatibility issues that Jason/Alex discussed in another thread
> are well addressed. It does not necessarily have to be a control knob exposed
> to the caller.

It has a lot of implications if we do this; it means iommufd needs to
inherit all the userspace-noticeable behaviour as well as the "bugs"
of VFIO.

We know it's easier to find 

RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Wang 
> Sent: Thursday, March 24, 2022 10:28 AM
> 
> On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
> >
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, March 23, 2022 12:15 AM
> > >
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?
> > >
> > > AFAICT, no, but it should be checked.
> > >
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.
> > >
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > >
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > >   rlim = tsk->signal->rlim + resource;
> > > [..]
> > >   if (new_rlim)
> > >   *rlim = *new_rlim;
> > >
> > > Which vfio reads back here:
> > >
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >
> > > And iommufd does the same read back:
> > >
> > >   lock_limit =
> > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > > PAGE_SHIFT;
> > >   npages = pages->npinned - pages->last_npinned;
> > >   do {
> > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > >   new_pages = cur_pages + npages;
> > >   if (new_pages > lock_limit)
> > >   return -ENOMEM;
> > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > cur_pages,
> > >new_pages) != cur_pages);
> > >
> > > So it does work essentially the same.
> > >
> > > The difference is more subtle, iouring/etc puts the charge in the user
> > > so it is additive with things like iouring and additively spans all
> > > the user's processes.
> > >
> > > However vfio is accounting only per-process and only for itself - no
> > > other subsystem uses locked as the charge variable for DMA pins.
> > >
> > > The user visible difference will be that a limit X that worked with
> > > VFIO may start to fail after a kernel upgrade as the charge accounting
> > > is now cross user and additive with things like iommufd.
> > >
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.
> > >
> > > Certainly having qemu concurrently using three different subsystems
> > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > >
> > > I plan to fix RDMA like this as well so at least we can have
> > > consistency within qemu.
> > >
> >
> > I have an impression that iommufd and vfio type1 must use
> > the same accounting scheme given the management stack
> > has no insight into qemu on which one is actually used thus
> > cannot adapt to the subtle difference in between. in this
> > regard either we start fixing vfio type1 to use user->locked_vm
> > now or have iommufd follow vfio type1 for upward compatibility
> > and then change them together at a later point.
> >
> > I prefer to the former as IMHO I don't know when will be a later
> > point w/o certain kernel changes to actually break the userspace
> > policy built on a wrong accounting scheme...
> 
> I wonder if the kernel is the right place to do this. We have new uAPI

I didn't get this. This thread is about that VFIO uses a wrong accounting
scheme and then the discussion is about the impact of fixing it to the
userspace. I didn't see the question on the right place part.

> so management layer can know the difference of the accounting in
> advance by
> 
> -device vfio-pci,iommufd=on
> 

I suppose iommufd will be used once Qemu supports it, as long as
the open compatibility issues that Jason/Alex discussed in another thread
are well addressed. It does not necessarily have to be a control knob exposed
to the caller.

Thanks
Kevin


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Jason Wang
On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin  wrote:
>
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 23, 2022 12:15 AM
> >
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> >
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> >
> > AFAICT, no, but it should be checked.
> >
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.
> >
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> >
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> >   rlim = tsk->signal->rlim + resource;
> > [..]
> >   if (new_rlim)
> >   *rlim = *new_rlim;
> >
> > Which vfio reads back here:
> >
> > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >
> > And iommufd does the same read back:
> >
> >   lock_limit =
> >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> > PAGE_SHIFT;
> >   npages = pages->npinned - pages->last_npinned;
> >   do {
> >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> >   new_pages = cur_pages + npages;
> >   if (new_pages > lock_limit)
> >   return -ENOMEM;
> >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > cur_pages,
> >new_pages) != cur_pages);
> >
> > So it does work essentially the same.
> >
> > The difference is more subtle, iouring/etc puts the charge in the user
> > so it is additive with things like iouring and additively spans all
> > the user's processes.
> >
> > However vfio is accounting only per-process and only for itself - no
> > other subsystem uses locked as the charge variable for DMA pins.
> >
> > The user visible difference will be that a limit X that worked with
> > VFIO may start to fail after a kernel upgrade as the charge accounting
> > is now cross user and additive with things like iommufd.
> >
> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> >
> > Certainly having qemu concurrently using three different subsystems
> > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > RLIMIT_MEMLOCK differently cannot be sane or correct.
> >
> > I plan to fix RDMA like this as well so at least we can have
> > consistency within qemu.
> >
>
> I have an impression that iommufd and vfio type1 must use
> the same accounting scheme given the management stack
> has no insight into qemu on which one is actually used thus
> cannot adapt to the subtle difference in between. in this
> regard either we start fixing vfio type1 to use user->locked_vm
> now or have iommufd follow vfio type1 for upward compatibility
> and then change them together at a later point.
>
> I prefer to the former as IMHO I don't know when will be a later
> point w/o certain kernel changes to actually break the userspace
> policy built on a wrong accounting scheme...

I wonder if the kernel is the right place to do this. We have new uAPI
so management layer can know the difference of the accounting in
advance by

-device vfio-pci,iommufd=on

?

>
> Thanks
> Kevin
>



RE: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-23 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, March 23, 2022 12:15 AM
> 
> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
>   struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
>   rlim = tsk->signal->rlim + resource;
> [..]
>   if (new_rlim)
>   *rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:unsigned long limit =
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
>   lock_limit =
>   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >>
> PAGE_SHIFT;
>   npages = pages->npinned - pages->last_npinned;
>   do {
>   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
>   new_pages = cur_pages + npages;
>   if (new_pages > lock_limit)
>   return -ENOMEM;
>   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> cur_pages,
>new_pages) != cur_pages);
> 
> So it does work essentially the same.
> 
> The difference is more subtle, iouring/etc puts the charge in the user
> so it is additive with things like iouring and additively spans all
> the user's processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.
> 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.
> 
> Certainly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.
> 
> I plan to fix RDMA like this as well so at least we can have
> consistency within qemu.
> 

I have an impression that iommufd and vfio type1 must use
the same accounting scheme given the management stack
has no insight into qemu on which one is actually used thus
cannot adapt to the subtle difference in between. in this
regard either we start fixing vfio type1 to use user->locked_vm
now or have iommufd follow vfio type1 for upward compatibility
and then change them together at a later point.

I prefer to the former as IMHO I don't know when will be a later
point w/o certain kernel changes to actually break the userspace
policy built on a wrong accounting scheme... 

Thanks
Kevin


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Jason Gunthorpe via iommu
On Tue, Mar 22, 2022 at 05:31:26PM +0100, Niklas Schnelle wrote:

> > > In fact I stumbled over that because the wrong accounting in
> > > io_uring exhausted the limit applied to vfio (I was using a QEMU utilizing
> > > io_uring itself).
> > 
> > I'm pretty interested in this as well, do you have anything you can
> > share?
> 
> This was the issue reported in the following BZ.

Sorry, I was talking about the iouring usage in qemu :)

Thanks,
Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Niklas Schnelle
On Tue, 2022-03-22 at 11:57 -0300, Jason Gunthorpe wrote:
> On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > > Following the pattern of io_uring, perf, skb, and bpf iommfd will use
> > iommufd ^
> > > user->locked_vm for accounting pinned pages. Ensure the value is included
> > > in the struct and export free_uid() as iommufd is modular.
> > > 
> > > user->locked_vm is the correct accounting to use for ulimit because it is
> > > per-user, and the ulimit is not supposed to be per-process. Other
> > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > > mm->locked_vm for accounting pinned pages, but this is only per-process
> > > and inconsistent with the majority of the kernel.
> > 
> > Since this will replace parts of vfio this difference seems
> > significant. Can you explain this a bit more?
> 
> I'm not sure what to say more, this is the correct way to account for
> this. It is natural to see it is right because the ulimit is supposed
> to be global to the user, not effectively reset every time the user
> creates a new process.
> 
> So checking the ulimit against a per-process variable in the mm_struct
> doesn't make too much sense.

Yes I agree that logically this makes more sense. I was kind of aiming
in the same direction as Alex i.e. it's a conscious decision to do it
right and we need to know where this may lead to differences and how to
handle them.

> 
> > I'm also a bit confused how io_uring handles this. When I stumbled over
> > the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
> > accounting") and from that commit description I seem to rember that
> > io_uring also accounts in mm->locked_vm too? 
> 
> locked/pinned_pages in the mm is kind of a debugging counter, it
> indicates how many pins the user obtained through this mm. AFAICT its
> only correct use is to report through proc. Things are supposed to
> update it, but there is no reason to limit it as the user limit
> supersedes it.
> 
> The commit you pointed at is fixing that io_uring corrupted the value.
> 
> Since VFIO checks locked/pinned_pages against the ulimit would blow up
> when the value was wrong.
> 
> > In fact I stumbled over that because the wrong accounting in
> > io_uring exhausted the limit applied to vfio (I was using a QEMU utilizing
> > io_uring itself).
> 
> I'm pretty interested in this as well, do you have anything you can
> share?

This was the issue reported in the following BZ.

https://bugzilla.kernel.org/show_bug.cgi?id=209025

I stumbled over the same problem on my x86 box and also on s390. I
don't remember exactly what limit this ran into but I suspect it had
something to do with the libvirt resource limits Alex mentioned.
Meaning io_uring had an accounting bug and then vfio / QEMU couldn't
pin memory. I think that libvirt limit is set to allow pinning all of
guest memory plus a bit so the io_uring misaccounting easily tipped it
over.

> 
> Thanks,
> Jason




Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Jason Gunthorpe via iommu
On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:

> I'm still picking my way through the series, but the later compat
> interface doesn't mention this difference as an outstanding issue.
> Doesn't this difference need to be accounted for in how libvirt manages VM
> resource limits?  

AFAICT, no, but it should be checked.

> AIUI libvirt uses some form of prlimit(2) to set process locked
> memory limits.

Yes, and ulimit does work fully. prlimit adjusts the value:

int do_prlimit(struct task_struct *tsk, unsigned int resource,
struct rlimit *new_rlim, struct rlimit *old_rlim)
{
rlim = tsk->signal->rlim + resource;
[..]
if (new_rlim)
*rlim = *new_rlim;

Which vfio reads back here:

drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
drivers/vfio/vfio_iommu_type1.c:unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

And iommufd does the same read back:

lock_limit =
task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
npages = pages->npinned - pages->last_npinned;
do {
cur_pages = atomic_long_read(&pages->source_user->locked_vm);
new_pages = cur_pages + npages;
if (new_pages > lock_limit)
return -ENOMEM;
} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
 new_pages) != cur_pages);

So it does work essentially the same.
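
As an aside, the userspace half of that is just prlimit(2) against the
QEMU process. A minimal sketch of that side (purely illustrative, not
taken from libvirt; the pid/size handling here is made up):

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	pid_t pid;
	struct rlimit lim;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <memlock-bytes>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);
	lim.rlim_cur = lim.rlim_max = strtoull(argv[2], NULL, 0);

	/* Set the target's RLIMIT_MEMLOCK; NULL means don't read back
	 * the old value. This is the value do_prlimit() stores above. */
	if (prlimit(pid, RLIMIT_MEMLOCK, &lim, NULL)) {
		perror("prlimit");
		return 1;
	}
	return 0;
}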

The difference is more subtle: iouring/etc puts the charge in the user,
so it is additive with things like iouring and additively spans all of
the user's processes.

However vfio is accounting only per-process and only for itself - no
other subsystem uses mm->locked_vm as the charge variable for DMA pins.

The user visible difference will be that a limit X that worked with
VFIO may start to fail after a kernel upgrade as the charge accounting
is now cross user and additive with things like iommufd.
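
To make that concrete, a simplified sketch of the two charge points
(illustrative kernel-style C only, not the actual vfio or iommufd
code; locking, task handling and error paths are stripped down to the
charge logic):

/* vfio type1 style: charge the mm of the pinning process, so every
 * process measures against the limit starting from zero. */
static int charge_mm(struct mm_struct *mm, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if (mm->locked_vm + npages > limit)	/* assumes mmap_lock is held */
		return -ENOMEM;
	mm->locked_vm += npages;
	return 0;
}

/* iommufd/io_uring style: charge the user_struct, so pins from all of
 * the user's processes and subsystems accumulate against one limit. */
static int charge_user(struct user_struct *user, unsigned long npages)
{
	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	unsigned long cur, new;

	do {
		cur = atomic_long_read(&user->locked_vm);
		new = cur + npages;
		if (new > limit)
			return -ENOMEM;
	} while (atomic_long_cmpxchg(&user->locked_vm, cur, new) != cur);
	return 0;
}

With the first scheme each new process starts against the limit from
zero; with the second, everything the user pins anywhere counts
against the same number, which is what makes a previously working
limit X able to fail after the switch.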

This whole area is a bit peculiar (eg mlock itself works differently),
IMHO, but with most of the places doing pins voting to use
user->locked_vm as the charge it seems the right path in today's
kernel.

Certainly having qemu concurrently using three different subsystems
(vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
RLIMIT_MEMLOCK differently cannot be sane or correct.

I plan to fix RDMA like this as well so at least we can have
consistency within qemu.

Thanks,
Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Alex Williamson
On Tue, 22 Mar 2022 11:57:41 -0300
Jason Gunthorpe  wrote:

> On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:  
> > > 
> > > user->locked_vm is the correct accounting to use for ulimit because it is
> > > per-user, and the ulimit is not supposed to be per-process. Other
> > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > > mm->locked_vm for accounting pinned pages, but this is only per-process
> > > and inconsistent with the majority of the kernel.  
> > 
> > Since this will replace parts of vfio this difference seems
> > significant. Can you explain this a bit more?  
> 
> I'm not sure what to say more, this is the correct way to account for
> this. It is natural to see it is right because the ulimit is supposed
> to be global to the user, not effectively reset every time the user
> creates a new process.
> 
> So checking the ulimit against a per-process variable in the mm_struct
> doesn't make too much sense.

I'm still picking my way through the series, but the later compat
interface doesn't mention this difference as an outstanding issue.
Doesn't this difference need to be accounted for in how libvirt manages VM
resource limits?  AIUI libvirt uses some form of prlimit(2) to set
process locked memory limits.  A compat interface might be an
interesting feature, but does it only provide ioctl compatibility and
not resource ulimit compatibility?  Thanks,

Alex



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Jason Gunthorpe via iommu
On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote:
> On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> > Following the pattern of io_uring, perf, skb, and bpf iommfd will use
> iommufd ^
> > user->locked_vm for accounting pinned pages. Ensure the value is included
> > in the struct and export free_uid() as iommufd is modular.
> > 
> > user->locked_vm is the correct accounting to use for ulimit because it is
> > per-user, and the ulimit is not supposed to be per-process. Other
> > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> > mm->locked_vm for accounting pinned pages, but this is only per-process
> > and inconsistent with the majority of the kernel.
> 
> Since this will replace parts of vfio this difference seems
> significant. Can you explain this a bit more?

I'm not sure what to say more, this is the correct way to account for
this. It is natural to see it is right because the ulimit is supposed
to be global to the user, not effectively reset every time the user
creates a new process.

So checking the ulimit against a per-process variable in the mm_struct
doesn't make too much sense.

> I'm also a bit confused how io_uring handles this. When I stumbled over
> the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
> accounting") and from that commit description I seem to rember that
> io_uring also accounts in mm->locked_vm too? 

locked/pinned_pages in the mm is kind of a debugging counter, it
indicates how many pins the user obtained through this mm. AFAICT its
only correct use is to report through proc. Things are supposed to
update it, but there is no reason to limit it as the user limit
supersedes it.
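
As a minimal illustration of that proc reporting (a userspace sketch,
assuming the usual /proc/<pid>/status fields; nothing here enforces
any limit):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* VmLck mirrors mm->locked_vm, VmPin mirrors mm->pinned_vm */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmLck:", 6) || !strncmp(line, "VmPin:", 6))
			fputs(line, stdout);
	fclose(f);
	return 0;
}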

The commit you pointed at is fixing that io_uring corrupted the value.

Since VFIO checks locked/pinned_pages against the ulimit would blow up
when the value was wrong.

> In fact I stumbled over that because the wrong accounting in
> io_uring exhausted the applied to vfio (I was using a QEMU utilizing
> io_uring itself).

I'm pretty interested in this as well, do you have anything you can
share?

Thanks,
Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-22 Thread Niklas Schnelle
On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> Following the pattern of io_uring, perf, skb, and bpf iommfd will use
iommufd ^
> user->locked_vm for accounting pinned pages. Ensure the value is included
> in the struct and export free_uid() as iommufd is modular.
> 
> user->locked_vm is the correct accounting to use for ulimit because it is
> per-user, and the ulimit is not supposed to be per-process. Other
> places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
> mm->locked_vm for accounting pinned pages, but this is only per-process
> and inconsistent with the majority of the kernel.

Since this will replace parts of vfio this difference seems
significant. Can you explain this a bit more?

I'm also a bit confused how io_uring handles this. When I stumbled over
the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm
accounting") and from that commit description I seem to rember that
io_uring also accounts in mm->locked_vm too? In fact I stumbled over
that because the wrong accounting in io_uring exhausted the limit applied to
vfio (I was using a QEMU utilizing io_uring itself).

> 
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/linux/sched/user.h | 2 +-
>  kernel/user.c  | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
> index 00ed419dd46413..c47dae71dad3c8 100644
> --- a/include/linux/sched/user.h
> +++ b/include/linux/sched/user.h
> @@ -24,7 +24,7 @@ struct user_struct {
>   kuid_t uid;
>  
>  #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
> -defined(CONFIG_NET) || defined(CONFIG_IO_URING)
> +defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
>   atomic_long_t locked_vm;
>  #endif
>  #ifdef CONFIG_WATCH_QUEUE
> diff --git a/kernel/user.c b/kernel/user.c
> index e2cf8c22b539a7..d667debeafd609 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
> if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
>   free_user(up, flags);
>  }
> +EXPORT_SYMBOL_GPL(free_uid);
>  
>  struct user_struct *alloc_uid(kuid_t uid)
>  {




[PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-18 Thread Jason Gunthorpe via iommu
Following the pattern of io_uring, perf, skb, and bpf iommfd will use
user->locked_vm for accounting pinned pages. Ensure the value is included
in the struct and export free_uid() as iommufd is modular.

user->locked_vm is the correct accounting to use for ulimit because it is
per-user, and the ulimit is not supposed to be per-process. Other
places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or
mm->locked_vm for accounting pinned pages, but this is only per-process
and inconsistent with the majority of the kernel.

Signed-off-by: Jason Gunthorpe 
---
 include/linux/sched/user.h | 2 +-
 kernel/user.c  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 00ed419dd46413..c47dae71dad3c8 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -24,7 +24,7 @@ struct user_struct {
kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-defined(CONFIG_NET) || defined(CONFIG_IO_URING)
+defined(CONFIG_NET) || defined(CONFIG_IO_URING) || IS_ENABLED(CONFIG_IOMMUFD)
atomic_long_t locked_vm;
 #endif
 #ifdef CONFIG_WATCH_QUEUE
diff --git a/kernel/user.c b/kernel/user.c
index e2cf8c22b539a7..d667debeafd609 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -185,6 +185,7 @@ void free_uid(struct user_struct *up)
if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
free_user(up, flags);
 }
+EXPORT_SYMBOL_GPL(free_uid);
 
 struct user_struct *alloc_uid(kuid_t uid)
 {
-- 
2.35.1
