On Thu, Jul 2, 2009 at 11:55 PM, Andrew Morton<a...@linux-foundation.org> wrote:
>> The seq_file iterator for these files relies on them being sorted so
>> that it can pick up where it left off even in the event of the pid set
>> changing between reads - it does a binary search to find the first pid
>> greater than the last one that was returned, so as to guarantee that
>> we return every pid that was in the cgroup before the scan started and
>> remained in the cgroup until after the scan finished; there are no
>> guarantees about pids that enter/leave the cgroup during the scan.
>
> OIC.  Gee we made it hard for ourselves.  That tears it.

Without a guarantee like that, the file becomes much less useful - it
would just be "a random subset of the pids that were in the cgroup at
some time during the read".

>
> Was it a mistake to try to present an ordered, dupes-removed view of a
> large amount of data from the kernel?

The dupe-removal is a red herring in this argument - it's only added
by this patch, and doesn't affect the existing "tasks" file. I think
we could even change it into a "mostly dupe-removed" view very cheaply,
by simply skipping a tgid when building the array if it's the same as
the one we just saw, and leaving userspace to deal with any
remaining dupes.

The ordered semantics of the tasks file come from cpusets; I kept
that API (which actually allocated a potentially even larger
array containing the pre-formatted data).

I'd be surprised if removing the ordering constraint broke anything in
userspace, but the more important part about the ordering is to allow
the kernel to present a consistent view in the presence of changing
pids.

Hmm, I guess we could use a combination of the IDR approach that you
suggested and the present shared-array approach:

- when opening a tasks file:
  - populate an IDR with all the pids/tgids in the cgroup
  - find any existing IDR open for the cgroup in the list keyed by
namespace and filetype ("procs"/"tasks")
  - replace (and free) the existing IDR or extend the list with a new one
  - bump the refcount

- when reading:
  - iterate from the last reported pid/tgid

- when closing:
  - drop the refcount, and free the IDR if the count reaches 0

That maintains the property of preventing userspace from consuming
arbitrary amounts of memory just by holding an fd open on a large
cgroup, while also maintaining a consistency guarantee, and we get the
ordering for free as a side-effect of the IDR, with no large memory
allocations. It's not hugely different from the current solution - all
we're doing is replacing the large array in the cgroup_pidlist
structure with an IDR, and populating/reading it appropriately.

The downside, I suspect, would be a higher fixed cost - setting up an
IDR and populating/scanning it is bound to be more expensive than
filling/reading a linear buffer. But it's probably not enough extra
overhead to worry about.

Paul
_______________________________________________
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
