from:"H. Peter Anvin"

[Devel] Re: [RFC PATCH 0/5] net: socket bind to file descriptor introduced

2012-08-15 Thread H. Peter Anvin

On 08/15/2012 12:49 PM, Eric W. Biederman wrote:
> 
> There is also the trick of getting a shorter directory name using
> /proc/self/fd if you are threaded and can't change the directory.
> 
> The obvious choices at this point are
> - Teach bind and connect and af_unix sockets to take longer AF_UNIX
>   socket path names.
> 
> - introduce sockaddr_fd that can be applied to AF_UNIX sockets,
>   and teach unix_bind and unix_connect how to deal with a second type of
>   sockaddr.
>   struct sockaddr_fd { short fd_family; short pad; int fd; };
> 
> - introduce sockaddr_unix_at that takes a directory file descriptor
>   as well as a unix path, and teach unix_bind and unix_connect to deal with a
>   second sockaddr type.
>   struct sockaddr_unix_at { short family; short pad; int dfd; char path[102]; }
>   AF_UNIX_AT
> 
> I don't know what the implications are of breaking connect up into 3
> system calls and changing the semantics, and I would really rather
> not have to think about it.
> 
> But it certainly does not look to me like you need to introduce new
> system calls to do what you want.
> 

How would you distinguish the new sockaddr types from the traditional
one?  New AF_?
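
To make the question concrete, here is a minimal sketch of what "new AF_" dispatch could look like. The struct layout is copied from the quoted proposal; the constant AF_UNIX_AT_SKETCH, its value, and classify_addr() are made up for illustration and exist in no kernel header.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical family value, chosen only for this sketch. */
#define AF_UNIX_AT_SKETCH 0x7ffe

/* Layout taken from the quoted proposal. */
struct sockaddr_unix_at {
    short family;
    short pad;
    int   dfd;
    char  path[102];
};

enum addr_kind { ADDR_INVALID, ADDR_UNIX, ADDR_UNIX_AT };

/* Decide which layout a caller-supplied address uses by looking only at
 * the leading family field, as bind()/connect() would have to. */
static enum addr_kind classify_addr(const struct sockaddr *sa, socklen_t len)
{
    if (sa == NULL || len < sizeof(sa->sa_family))
        return ADDR_INVALID;
    if (sa->sa_family == AF_UNIX)
        return ADDR_UNIX;
    if (sa->sa_family == AF_UNIX_AT_SKETCH &&
        len >= offsetof(struct sockaddr_unix_at, path))
        return ADDR_UNIX_AT;
    return ADDR_INVALID;
}
```

The family field is the only part every sockaddr variant shares, so it is the natural discriminator; the open question in the thread is whether a second family for the same socket domain is acceptable.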

-hpa

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [RFC PATCH 0/5] net: socket bind to file descriptor introduced

2012-08-15 Thread H. Peter Anvin

On 08/15/2012 09:52 AM, Ben Pfaff wrote:
> Stanislav Kinsbursky  writes:
> 
>> This system call is especially required for UNIX sockets, which have a name
>> length limitation.
> 
> The worst of the name length limitations can be worked around by
> opening the directory where the socket is to go as a file
> descriptor, then using /proc/self/fd// as the name
> of the socket.  This technique also works with "connect" and in
> other contexts where a struct sockaddr is needed.  At first
> glance, it looks like your patches only help with "bind".
> 

The really hard part is what to do with things that are supposed to
return a struct sockaddr.  I also have some reservations about using a
new system call to deal with what at least theoretically is only part of
one socket domain.
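
For reference, the /proc/self/fd workaround described in the quoted text can be sketched roughly as follows. This is an illustration only: it assumes Linux with /proc mounted, and the helper names fill_addr_at() and unix_socket_in_dir() are invented here.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Build "/proc/self/fd/<dirfd>/<name>" in addr.  Returns 0 on success,
 * -1 if even the shortened name does not fit in sun_path. */
static int fill_addr_at(struct sockaddr_un *addr, int dirfd, const char *name)
{
    int n;

    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    n = snprintf(addr->sun_path, sizeof(addr->sun_path),
                 "/proc/self/fd/%d/%s", dirfd, name);
    return (n < 0 || (size_t)n >= sizeof(addr->sun_path)) ? -1 : 0;
}

/* Create a stream socket and bind it (do_bind != 0) or connect it to
 * <dir>/<name>, where <dir> may be too long for sun_path.
 * Returns the socket fd, or -1 on error. */
static int unix_socket_in_dir(const char *dir, const char *name, int do_bind)
{
    struct sockaddr_un addr;
    int dirfd, sock, rc;

    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0 || fill_addr_at(&addr, dirfd, name) < 0)
        rc = -1;
    else if (do_bind)
        rc = bind(sock, (struct sockaddr *)&addr, sizeof(addr));
    else
        rc = connect(sock, (struct sockaddr *)&addr, sizeof(addr));

    close(dirfd);   /* only needed while the path is being resolved */
    if (rc < 0 && sock >= 0) {
        close(sock);
        sock = -1;
    }
    return sock;
}
```

Note that the address recorded for the bound socket is the /proc/self/fd string itself, so anything that hands a struct sockaddr back to the caller would be expected to report a name that only made sense in the binding process, which is the difficulty raised above.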

-hpa


[Devel] Re: [RFC PATCH 5/5] syscall: sys_fbind() introduced

2012-08-15 Thread H. Peter Anvin


On 08/15/2012 09:22 AM, Stanislav Kinsbursky wrote:

This syscall allows binding a socket to a specified file descriptor.
The descriptor can be obtained by a simple open() with the O_PATH flag.
The socket node can be created by sys_mknod().

Signed-off-by: Stanislav Kinsbursky 
---
  arch/x86/syscalls/syscall_32.tbl |    1 +
  arch/x86/syscalls/syscall_64.tbl |    1 +
  include/linux/syscalls.h         |    1 +
  kernel/sys_ni.c                  |    3 +++
  net/socket.c                     |   25 +++++++++++++++++++++++++
  5 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 7a35a6e..9594b82 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -356,3 +356,4 @@
 347   i386   process_vm_readv    sys_process_vm_readv    compat_sys_process_vm_readv
 348   i386   process_vm_writev   sys_process_vm_writev   compat_sys_process_vm_writev
 349   i386   kcmp                sys_kcmp
+350   i386   fbind               sys_fbind


i386 uses socketcalls... perhaps it shouldn't (socketcalls are pretty 
much an abomination), but for socketcall-based architectures this really 
should be a socketcall.


Don't you also need fconnect()?  Or is that simply handled by allowing 
open() without O_PATH?


    -hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC PATCH 0/2] net: connect to UNIX sockets from specified root

2012-08-10 Thread H. Peter Anvin

On 08/10/2012 12:28 PM, Alan Cox wrote:
> Explicitly for Linux yes - this is not generally true of the AF_UNIX
> socket domain and even the permissions aspect isn't guaranteed to be
> supported on some BSD environments !

Yes, but let's worry about what the Linux behavior should be.

> The name is however just a proxy for the socket itself. You don't even
> get a device node in the usual sense or the same inode in the file system
> space.

No, but it is looked up the same way any other inode is (the difference
between FIFOs and sockets is that sockets have separate connections,
which is also why open() on sockets would be nice.)

However, there is a fundamental difference between AF_UNIX sockets and
open(), and that is how the pathname is delivered.  It thus would make
more sense to provide the openat()-like information in struct
sockaddr_un, but that may be very hard to do in a sensible way.  In that
sense it perhaps would be cleaner to be able to do an open[at]() on the
socket node with O_PATH (perhaps there should be an O_SOCKET option,
even?) and pass the resulting file descriptor to bind() or connect().

-hpa


[Devel] Re: [RFC PATCH 0/2] net: connect to UNIX sockets from specified root

2012-08-10 Thread H. Peter Anvin

On 08/10/2012 11:40 AM, Alan Cox wrote:
> 
> Agreed on open() for sockets.. the lack of open is a Berklix derived
> peculiarity of the interface. It would equally be useful to be able to
> open "/dev/socket/ipv4/1.2.3.4/1135" and the like for scripts and stuff
> 
> That needs VFS changes however so you can pass the remainder of a path to
> a device node. It also lets you do a lot of other sane stuff like
> 
>   open /dev/ttyS0/9600/8n1
> 

Well, supporting device node subpaths would be nice, but I don't think
that that is a requirement either for being able to open() a socket (as
a Linux extension) or for supporting something like your above
/dev/socket/... since that could be done with a filesystem rather than
just a device node.

-hpa


[Devel] Re: [RFC PATCH 0/2] net: connect to UNIX sockets from specified root

2012-08-10 Thread H. Peter Anvin

On 08/10/2012 11:26 AM, Alan Cox wrote:
>> On that whole subject...
>>
>> Do we need a Unix domain socket equivalent to openat()?
> 
> I don't think so. The name is just a file system indexing trick, it's not
> really the socket proper. It's little more than "ascii string with
> permissions attached" - indeed we also support an abstract name space
> which for a lot of uses is actually more convenient.
> 

I don't really understand why Unix domain sockets are different from any
other pathname user in this sense.  (Actually, I have never understood
why open() on a Unix domain socket doesn't give the equivalent of a
socket() + connect() -- it would make logical sense and would provide
additional functionality).

It would be different if the Unix domain sockets simply required an
absolute pathname (it is not just about the root, it is also about the
cwd, which is where the -at() functions come into play), but that is not
the case.

The abstract namespace is irrelevant for this, obviously.

> AF_UNIX between roots raises some interesting semantic questions when
> you begin passing file descriptors down them as well.

Why is that?  A file descriptor carries all that information with it...

-hpa


[Devel] Re: [RFC PATCH 0/2] net: connect to UNIX sockets from specified root

2012-08-10 Thread H. Peter Anvin

On 08/10/2012 05:57 AM, Stanislav Kinsbursky wrote:
> Today, there is a problem in connecting local SUNRPC transports. These
> transports use UNIX sockets, and the connection itself is done by the rpciod
> workqueue.
> But UNIX socket lookup is done in the context of the process file system
> root, i.e. all local transports are connected in rpciod context.
> This works nicely until we try to mount NFS from a process with another root,
> for example in a container. This container can have its own (nested) root and
> rpcbind process, listening on its own UNIX sockets. But an NFS mount attempt in
> this container will register a new service (Lockd for example) in the global
> rpcbind, not the container's one.
> 
> This patch set introduces kernel connect helper for UNIX stream sockets and
> modifies unix_find_other() to be able to search from specified root.
> It also replaces generic socket connect call for local transports by new
> helper in SUNRPC layer.
> 
> The following series implements...

On that whole subject...

Do we need a Unix domain socket equivalent to openat()?

-hpa


[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

2010-07-06 Thread H. Peter Anvin

On 07/06/2010 08:12 AM, Oren Laadan wrote:
>>
>> The child returns from vfork, via the same return address that
>> the parent will later use. (on the stack for many architectures)
>> The child then calls a function which might not have the same
>> stack layout as vfork, scrambling whatever may be on the stack
>> that the parent will be using to return from vfork. The parent may
>> then end up using a return address that has been corrupted.
>> To make this work, gcc actually recognizes vfork and has
>> special handling for it.
> 
> I assumed that this is taken care of by libc rather than the
> compiler, like it is done for clone(2).
> 

No, vfork is *really* special, because the two threads share a stack.
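
A small sketch of the only pattern that is safe with a shared stack: the child does nothing except exec or _exit. The helper name run_with_vfork() and the use of /bin/sh are choices made for this illustration.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "/bin/sh -c cmd" via vfork() and return its exit status, or -1. */
static int run_with_vfork(const char *cmd)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: runs on the parent's stack while the parent is suspended.
         * It must not return from this function or touch parent state;
         * it may only exec or _exit. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);     /* reached only if exec failed */
    }
    /* Parent: resumes once the child has exec'd or exited. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the child returned from run_with_vfork() and called anything else, it would overwrite the stack frame the parent later returns through, which is the corruption described in the quoted text.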

-hpa

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers


[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

2010-07-04 Thread H. Peter Anvin

On 07/04/2010 04:39 PM, Matt Helsley wrote:
>>
>> 1. can you implement it for i386 (register starved) using eclone?
> 
> That's a very good question. I'm going to punt on a direct answer for
> now. Instead,  I wonder if it's even worth enabling vfork through eclone.
> vfork is rarely used, is supported by the "old" clone syscall, and any
> old code adapted to use eclone for vfork would need significant
> changes because of vfork's specialness. (A consequence of the way vfork
> borrows page tables and must avoid clobbering parent's registers..)
> 

vfork is its own system call for a reason.  We used to do it with
sys_clone, and it turned out to be a mess.  Doing it in a separate
system call -- even though the internals are largely the same -- is cleaner.

-hpa
-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

2009-10-22 Thread H. Peter Anvin

On 10/22/2009 09:14 PM, Michael Kerrisk wrote:
>
> So, sometimes, a number in a system call should be the bit width of
> some argument(s), sometimes it should be the number of arguments, and
> sometimes (well, just occasionally, as in mmap2() and clone()) -- it
> should be a version number? Does the weather play any part in the
> decision? ;-)
>

The notion is that they are *some* kind of description of how the system 
call has been augmented.  The bitwidths and argument numbers are 
non-overlapping and visually very different, so are not subject to 
confusion.  Your argument makes about as much sense as saying the letter 
"a" should have the same meaning in every context.

-hpa

[Devel] Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

2009-10-22 Thread H. Peter Anvin

On 10/22/2009 07:26 PM, Michael Kerrisk wrote:
>>
>> "3" is number of arguments.
>
> sys_clone3(struct clone_struct __user *ucs, pid_t __user *pids)
>
> It appears to me that the number of arguments is 2.
>

It was 3 at one point... I'm not sure when that changed last :-/

>> It's better than "extended" or something like
>> that simply because "extended" just means "more than", and a number at least
>> tells you *how much more than*.
>
> I'm not sure why you think including a number in the name tells us
> "how much more than". Unless you are considering the numbering to be
> version numbers, which apparently is not what you mean.

It is a version number of sorts.

-hpa

[Devel] Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

2009-10-21 Thread H. Peter Anvin

On 10/22/2009 04:44 AM, Sukadev Bhattiprolu wrote:
>>
>> "3" is number of arguments.
>
> To me, it is a version number.
>
> mmap() and mmap2() both have 6 parameters.
>

You keep bringing this up.  mmap2() is (a) a non-user-visible call; (b) 
an exception (a mistake, if you want.)

> Besides if wait4() were born before wait3(), would it still be wait4() :-)

Yes.  wait3() came before wait4(), but there never was a wait2().

> But I see that it is hard to get one-convention-that-fits-all.
>
>> It's better than "extended" or something
>> like that simply because "extended" just means "more than", and a number
>> at least tells you *how much more than*.
>
> And extended assumes we won't extend again.

Exactly.

-hpa

[Devel] Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

2009-10-21 Thread H. Peter Anvin

On 10/21/2009 01:26 PM, Michael Kerrisk wrote:
>
> My question here is: what does "3" actually mean? In general, system
> calls have not followed any convention of numbering to indicate
> successive versions -- clone2() being the one possible exception that
> I know of.
>

"3" is number of arguments.  It's better than "extended" or something 
like that simply because "extended" just means "more than", and a number 
at least tells you *how much more than*.

-hpa

[Devel] Re: [RFC][v8][PATCH 9/10]: Define clone3() syscall

2009-10-19 Thread H. Peter Anvin

On 10/20/2009 02:44 AM, Matt Helsley wrote:
>> |
>> | I know I'm late to this discussion, but why the name clone3()? It's
>> | not consistent with any other convention used for syscall naming,

This assumption, of course, is just plain wrong.  Look at the wait 
system calls, for example.  However, when a small integer is used like 
that, it pretty much always reflects numbers of arguments.
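
As an illustration of the wait example, the two calls side by side, with the argument counts matching the digits in their names. The wrapper names reap_with_wait3() and reap_with_wait4() are invented for this sketch.

```c
#define _DEFAULT_SOURCE
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, reap it with wait3(), and return
 * the exit status (or -1 on error).  Assumes no other children exist. */
static int reap_with_wait3(int code)
{
    struct rusage ru;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);
    /* three arguments: status, options, rusage (waits for any child) */
    if (wait3(&status, 0, &ru) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Same, but with wait4(), which adds the pid to wait for. */
static int reap_with_wait4(int code)
{
    struct rusage ru;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);
    /* four arguments: pid, status, options, rusage */
    if (wait4(pid, &status, 0, &ru) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```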

>> | AFAICS. I think a name like clone_ext() or clonex() (for extended)
>> | might make more sense.
>>
>> Sure, we talked about calling it clone_extended() and I can go back
>> to that.
>>
>> Only minor concern with that name was if this new call ever needs to
>> be extended, what would we call it :-). With clone3() we could add a
>> real/fake parameter and call it clone4() :-p
>
> Perhaps clone64 (somewhat like stat64 for example)?
>

I think that doesn't exactly reflect the nature of the changes.

clone3() is actually pretty good.

-hpa

[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-14 Thread H. Peter Anvin

On 10/14/2009 03:36 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [...@zytor.com] wrote:
> | 
> | Overall it seems sane to:
> | 
> | a) make it an actual 3-argument call;
> | b) make the existing flags a u32 forever, and make it a separate
> |    argument;
> | c) any new expansion can be via the struct, which may want to have
> |    a "c3_flags" field first in the structure.
> 
> Ok, So will this work ?
> 
>   struct clone_args {
>       u32 flags_high; /* new clone flags (higher bits) */
>       u32 reserved1;
>       u32 nr_pids;
>       u32 reserved2;
>       u64 child_stack_base;
>       u64 child_stack_size;
>       u64 parent_tid_ptr;
>       u64 child_tid_ptr;
>       u64 reserved3;
>   };
> 
>   sys_clone3(u32 flags_low, struct clone_args *args, pid_t *pid_list)
> 
> Even on 64bit architectures the applications have to use sys_clone3() for
> the extended features.

Yes, although I'd just make flags_high a u64.  The other thing that
might be worthwhile is to have a length field on the structure; that way
we could add new fields at the end if ever necessary in the future.
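
A rough userspace sketch of how a length field lets the structure grow. The struct is a cut-down, hypothetical variant of the one quoted above (with a leading size field and a u64 flags_high); copy_args() is an invented name and only mimics what the kernel-side copy might do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a size field first, then a few of the quoted fields. */
struct clone_args_sketch {
    uint32_t size;              /* sizeof the struct as the caller knows it */
    uint32_t nr_pids;
    uint64_t flags_high;
    uint64_t child_stack_base;
    uint64_t child_stack_size;
};

/* Copy a caller-supplied struct, whose first field gives its length, into
 * the layout this code knows about.  Shorter (older) callers get a
 * zero-filled tail; longer (newer) callers are accepted only if the bytes
 * we do not understand are all zero.  Returns 0 on success, -1 otherwise.
 * The caller must guarantee that `src` really holds `size` bytes. */
static int copy_args(struct clone_args_sketch *dst, const void *src)
{
    const unsigned char *p = src;
    size_t known = sizeof(*dst);
    uint32_t usize;
    size_t i;

    memcpy(&usize, src, sizeof(usize));
    if (usize < sizeof(usize))
        return -1;

    memset(dst, 0, known);
    memcpy(dst, src, usize < known ? usize : known);

    for (i = known; i < usize; i++)
        if (p[i] != 0)
            return -1;  /* caller uses a feature we do not know */
    return 0;
}
```

With this scheme new fields can only ever be appended, and zero has to mean "feature not used" for every future field.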

-hpa

[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-13 Thread H. Peter Anvin

On 10/13/2009 09:40 PM, Sukadev Bhattiprolu wrote:
> H. Peter Anvin [...@zytor.com] wrote:
> | > 
> | > Except we can't use clone2() because it conflicts on ia64.  Care to propose
> | > a name you would prefer?
> 
> Yes, I am running out of ideas :-)
> 
> How about clone64_with_pids() ? - hope we don't need a 65th clone-flag :p

never_clone_alone()  ;)

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-13 Thread H. Peter Anvin

On 10/13/2009 09:36 PM, Sukadev Bhattiprolu wrote:
> 
> Would it help to use a type clone_flags_64_t to make the distinction
> between types more explicit ?
> 

The problem with using the same flags in two places, one as a 32-bit and
one as a 64-bit number, is that using one in the wrong place will cause
silent, but deadly, truncation.
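
A tiny sketch of the failure mode. The flag values and function names are made up for illustration and are not real CLONE_* constants.

```c
#include <stdint.h>

/* Hypothetical flags: one below and one above bit 31. */
#define SKETCH_FLAG_LOW  (UINT64_C(1) << 8)
#define SKETCH_FLAG_HIGH (UINT64_C(1) << 32)

/* Entry point that takes the full 64-bit flag word. */
static int wide_has_flag(uint64_t flags, uint64_t flag)
{
    return (flags & flag) != 0;
}

/* Legacy entry point with room for only 32 flag bits.  Passing a 64-bit
 * value here compiles without complaint by default; the upper half is
 * simply dropped in the implicit conversion. */
static int legacy_has_flag(uint32_t flags, uint64_t flag)
{
    return ((uint64_t)flags & flag) != 0;
}
```

Given flags = SKETCH_FLAG_LOW | SKETCH_FLAG_HIGH, wide_has_flag(flags, SKETCH_FLAG_HIGH) is true while legacy_has_flag(flags, SKETCH_FLAG_HIGH) is false, and nothing at the call site signals the difference.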

    -hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-13 Thread H. Peter Anvin

On 10/13/2009 06:39 PM, Matt Helsley wrote:
> On Tue, Oct 13, 2009 at 04:49:05PM -0700, H. Peter Anvin wrote:
>> On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
>>>
>>> This patchset implements a new system call, clone3() that lets a process
>>> specify the pids of the child process.
>>>
>>
>> A system call named clone3() taking two parameters is just too weird to
>> live.  No, please.
> 
> Except we can't use clone2() because it conflicts on ia64.  Care to propose
> a name you would prefer?
> 
> Also I was a bit surprised to discover there are plenty of examples where this
> convention has not been followed: vm86, lseek64, and mmap2 to name a few. In
> fact, of the 46 __NR_foo[[:digit:]]+, 36 break this convention on x86-32.
> 

The -86, -64 and so on are visually obviously not a parameter count.
sys_mmap2 is not user visible, and so doesn't really matter.

-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-13 Thread H. Peter Anvin

On 10/13/2009 04:53 PM, Roland McGrath wrote:
>> My only concern is the support of 64-bit clone flags on 32-bit architectures.
> 
> Oy.  I didn't realize there was serious consideration of having more than
> 32 flags.  IMHO it would be a bad choice, since they could only be used via
> clone3.  Having high-bit flags work in clone on 64-bit machines but not on
> 32-bit machines just seems like a wrongly confusing way for things to be.
> If any high-bits flags are constrained even on 64-bit machines to uses in
> clone3 calls for sanity purposes, then it seems questionable IMHO to have
> them be more flags in the same u64 at all.
> 
> Since all new features will be via this struct, various new kinds of things
> could potentially be done by other new struct fields independent of flags.
> But that would of course require putting enough reserved fields in now and
> requiring that they be zero-filled now in anticipation of such future uses,
> which is not very pleasant either.
> 
> In short, I guess I really am saying that "clone_flags_high" (or
> "more_flags" or something) does seem better to me than any of the
> possibilities for having more than 32 CLONE_* in the current flags word.
> 

Overall it seems sane to:

a) make it an actual 3-argument call;
b) make the existing flags a u32 forever, and make it a separate
   argument;
c) any new expansion can be via the struct, which may want to have
   a "c3_flags" field first in the structure.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v8][PATCH 0/10] Implement clone3() system call

2009-10-13 Thread H. Peter Anvin

On 10/12/2009 09:49 PM, Sukadev Bhattiprolu wrote:
> 
> This patchset implements a new system call, clone3() that lets a process
> specify the pids of the child process.
> 

A system call named clone3() taking two parameters is just too weird to
live.  No, please.

    -hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall

2009-09-29 Thread H. Peter Anvin

On 09/29/2009 03:11 PM, Linus Torvalds wrote:
> 
> Ok, I agree with that. The kernel side is easy (we have magic calling 
> conventions there and need to turn registers into arguments anyway before 
> you get to the shared code), but your point about the user side prototype 
> is valid.
> 

I think it would also apply to kernel-side munging.  It's quite possible
you're right in that clone is such a special case anyway, but it seems
pointless to make it more special in the short bus sort of way even if
it is possible.

Let's just make it another system call.  It doesn't have any downside
that I can see, might prevent problems, and avoids setting a bad
precedent that someone can misinterpret.

-hpa


[Devel] Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall

2009-09-29 Thread H. Peter Anvin

On 09/29/2009 12:10 PM, Linus Torvalds wrote:
> 
> On Tue, 29 Sep 2009, Arjan van de Ven wrote:
>>>
>>> We already have a syscall layer which is painful to thunk in places,
>>> and this would make it much worse.
>>
>> syscalls are cheap as well.
>> cheaper than decades of dealing with such multiplexer mess ;/
> 
> Well, I'd agree, except the clone flags really _are_ about multiplexer 
> issues, and the new flag wouldn't really change anything. 
> 
> If the new system call actually had appreciably separate code-paths, I'd 
> buy the "multiplexer" argument. But it doesn't really. It's going to call 
> down to the same basic clone functionality, and the core clone code ends 
> up de-multiplexing the cases anyway.
> 
> So this would not at all be like the socket calls (to pick the traditional 
> Linux system call multiplexing example) in that sense.
> 

That's not the main issue here, though.  The main issue is that the
prototype of the function now depends on one of its arguments, which is
absolute hell for anything that needs to thunk arguments in a systematic
way, which we have to do on several architectures, and which would be
useful to be able to do for others, too.

-hpa


[Devel] Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall

2009-09-29 Thread H. Peter Anvin

On 09/29/2009 12:02 PM, Arjan van de Ven wrote:
> On Tue, 29 Sep 2009 11:44:52 -0700
> "H. Peter Anvin"  wrote:
> 
>> On 09/29/2009 11:40 AM, Roland McGrath wrote:
>>> Why add a new syscall at all instead of just using a new CLONE_*
>>> flag to indicate that the argument layout is different?
>>
>> What an absolutely atrociously bad idea.
>>
>> We already have a syscall layer which is painful to thunk in places,
>> and this would make it much worse.
>>
> syscalls are cheap as well.
> cheaper than decades of dealing with such multiplexer mess ;/
> 

It really comes down to wanting all the dispatch to happen in one
central place.

-hpa


[Devel] Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall

2009-09-29 Thread H. Peter Anvin

On 09/29/2009 11:40 AM, Roland McGrath wrote:
> Why add a new syscall at all instead of just using a new CLONE_* flag to
> indicate that the argument layout is different?

What an absolutely atrociously bad idea.

We already have a syscall layer which is painful to thunk in places, and
this would make it much worse.

-hpa


[Devel] Re: [RFC][v5][PATCH 8/8]: Define clone_with_pids() syscall

2009-09-09 Thread H. Peter Anvin

On 09/09/2009 11:03 AM, Sukadev Bhattiprolu wrote:
> 
> C90 or C99 below should work. Is it ok to use a data structure that is
> not in C89 ? 
> 

C89 is the same as C90 (C89 refers to the ANSI standard, C90 to the ISO
standard, but they're functionally identical.)

> BTW, would it work if we defined 
> 
>   struct pid_set {
>       u64 pids;
>       int num_pids;
>   };
> 
> where ->pids can be still be a pointer ? The data structure would
> have the same size on all architectures.
> 

It would rather suck in terms of usability, though.

-hpa


[Devel] Re: [RFC][v5][PATCH 8/8]: Define clone_with_pids() syscall

2009-09-09 Thread H. Peter Anvin

On 09/09/2009 05:19 AM, Arnd Bergmann wrote:
> 
> This is a complex problem. The structure above would need a conversion
> for the pointer size that you can avoid by using a u64, but that introduces
> another problem:
> 
> 2. use a single pointer, with variable length data structures:
> 
> struct pid_set {
>   int num_pids;
>   pid_t pids[0];
> };
> 
> Since pid_t is always an int, you have no problem with padding or
> incompatible types, but rely on a data structure definition that
> is not in C89 (not sure about C99).
> 

C90 has these data structures, but you have to give the array a nonzero
length:

struct pid_set {
    int num_pids;
    pid_t pids[1];
};

In C99, this is spelt:

struct pid_set {
    int num_pids;
    pid_t pids[];
};
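
For completeness, a sketch of how the C99 form is typically allocated and filled. pid_set_new() is a name invented for this example.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct pid_set {
    int   num_pids;
    pid_t pids[];       /* C99 flexible array member */
};

/* Allocate a pid_set with room for n entries and copy them in.
 * Returns NULL on failure; the caller releases it with free(). */
static struct pid_set *pid_set_new(const pid_t *src, int n)
{
    struct pid_set *s;

    if (n < 0)
        return NULL;
    /* sizeof(*s) covers the header only; the array is added on top. */
    s = malloc(sizeof(*s) + (size_t)n * sizeof(s->pids[0]));
    if (s == NULL)
        return NULL;
    s->num_pids = n;
    if (n > 0)
        memcpy(s->pids, src, (size_t)n * sizeof(s->pids[0]));
    return s;
}
```

With the C90 pids[1] spelling the same allocation works, but sizeof(*s) already includes one element, so the size computation has to subtract it.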

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: BUG in tty_open when using containers and ptrace

2009-07-23 Thread H. Peter Anvin

Grzegorz Nosek wrote:
> 
> I tried that and while it did not oops, I wasn't sure where to dput it
> back, so it leaked like a sieve. I'm probably missing something obvious
> but I couldn't find a function whose calls balanced calls to
> devpts_get_tty.
> 
> BTW, what would the semantics be? I.e. what can we do with a pty that has
> its master side long gone?
> 

Nothing, but as long as something is keeping the pts file entry open, it
should not be garbage-collected.

-hpa


[Devel] Re: [RFCv2][PATCH] flexible array implementation

2009-07-22 Thread H. Peter Anvin

On 07/21/2009 03:00 PM, Dave Hansen wrote:
> 
> Here's an alternative.  I think it's what Andrew was
> suggesting  here:
> 
>   http://lkml.org/lkml/2009/7/2/518 
> 
> I call it a flexible array.  It does all of its work in
> PAGE_SIZE bits, so never does an order>0 allocation.
> The base level has PAGE_SIZE-2*sizeof(int) bytes of
> storage for pointers to the second level.  So, with a
> 32-bit arch, you get about 4MB (4186112 bytes) of total
> storage when the objects pack nicely into a page.  It
> is half that on 64-bit because the pointers are twice
> the size.
> 

I'm wondering if there is any use case which would require scaling below
the PAGE_SIZE level... in which case it would be nice for it to
gracefully decay to a single kmalloc allocation + some metadata.
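
The capacity arithmetic in the quoted description can be checked with a few lines. This is only a sketch of the sizing described there, assuming one base page of pointers (minus two ints of metadata) and one page of storage per pointer; flex_array_capacity() is an invented name, not part of the patch.

```c
#include <stddef.h>

/* Total bytes reachable through a two-level layout: a base page holding
 * (page_size - 2 * int_size) / ptr_size pointers, each pointing at one
 * page of object storage. */
static size_t flex_array_capacity(size_t page_size, size_t int_size,
                                  size_t ptr_size)
{
    size_t nr_ptrs = (page_size - 2 * int_size) / ptr_size;

    return nr_ptrs * page_size;
}
```

With 4096-byte pages and 4-byte ints, flex_array_capacity(4096, 4, 4) is 1022 * 4096 = 4186112 bytes, and flex_array_capacity(4096, 4, 8) is 511 * 4096 = 2093056 bytes, i.e. about 4MB on 32-bit and half that on 64-bit.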

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: BUG in tty_open when using containers and ptrace

2009-07-22 Thread H. Peter Anvin

On 07/22/2009 06:27 PM, Sukadev Bhattiprolu wrote:
> | 
> | Immediate crash. I tried 2.6.18-something (Debian etch kernel) that I
> | had lying around on the VM. The result:
> 
> Interesting.
> 
> Attaching test program and Ccing Peter Anvin for any insights.
> 
> | idr_remove called for id=0 which is not allocated.
> |  [] idr_remove+0xd4/0x137
> |  [] release_mem+0x1d5/0x1e1
> |  [] release_dev+0x5d6/0x5ee
> |  [] __wake_up+0x2a/0x3d
> |  [] tty_ldisc_enable+0x1f/0x21
> |  [] init_dev+0x378/0x49f
> |  [] tty_open+0x2a9/0x2e8
> |  [] chrdev_open+0x126/0x141
> |  [] chrdev_open+0x0/0x141
> |  [] __dentry_open+0xc8/0x1ac
> |  [] nameidata_to_filp+0x19/0x28
> |  [] do_filp_open+0x2b/0x31
> |  [] do_nanosleep+0x43/0x6a
> |  [] do_sigaction+0x99/0x156
> |  [] do_sys_open+0x3e/0xb3
> |  [] sys_open+0x16/0x18
> |  [] syscall_call+0x7/0xb
> | 
> | (on the bright side, the machine is still usable afterwards).
> | 
> | However, 2.6.26 (both mine and Debian) survives the test so it may indeed
> | be a recent regression (was it broken again after fixing sometime
> | between .18 and .26?)
> | 
> | Bisecting...

Interesting... I have to say I'm more than a bit surprised that you can
mount a filesystem on top of a character device node at all, but there
isn't really a fundamental reason why you couldn't do it, so...

I am assuming that what causes the problem is that you have found a way
(vfsmount) to hold the pts device node busy which doesn't involve the
tty subsystem.  This isn't inherently a problem, but it does have
implications for freeing: in particular, the pts node cannot be removed
until the vfsmount is gone, *and* the device number cannot be reclaimed.
It sounds like it's the latter piece which causes problems.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-23 Thread H. Peter Anvin

Daniel Lezcano wrote:
> 
> Yep,  I changed my mind, I think Eric and HPA are right. devpts is a 
> file system and not a namespace even if the result is the same. That 
> makes sense to keep a global sysctl for the root container and handle 
> security problem with user namespace and mount option.
> 

No, it's more dramatic than that.

Namespaces are not resource allocation boundaries, even though in the 
container use case you probably want both.

Furthermore, namespaces are relatively straightforward in comparison: 
you generally either want to share a namespace or you don't.  Resource 
control policies are much more complex.  In the general case you want to 
be able to support a hierarchical cascade of policies; at the least you 
want to have global and local limits.
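
A minimal sketch of what a cascade of limits could look like, assuming a simple tree in which a charge must fit under the local limit and under every ancestor up to the global one. The structure and function names are hypothetical and do not correspond to any existing kernel interface.

```c
#include <stddef.h>

/* Hypothetical resource group: a limit, the current usage, and an optional
 * parent. The root of the tree plays the role of the global limit. */
struct res_group {
	unsigned long limit;
	unsigned long usage;
	struct res_group *parent;	/* NULL for the global (root) group */
};

/* Charge one unit to grp and to every ancestor. Returns 0 on success, or -1
 * if any level is already at its limit; in that case nothing is charged. */
static int res_charge(struct res_group *grp)
{
	struct res_group *g;

	for (g = grp; g != NULL; g = g->parent)
		if (g->usage >= g->limit)
			return -1;
	for (g = grp; g != NULL; g = g->parent)
		g->usage++;
	return 0;
}
```

Note that nothing in this sketch refers to a namespace: the groups only account for usage, which is the separation being argued for here.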

Furthermore, there are a number of use cases for resource allocation 
boundaries that do *not* involve namespaces.

-hpa

[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-23 Thread H. Peter Anvin

Serge E. Hallyn wrote:
>>
>> If you want security and permission arguments get with Serge and finish
>> the uid namespace.  Then you will have a user that looks like root but
>> does not have permissions to do most things.
> 
> Right, and in particular the way it would partially solve this issue is
> that the procsys limit file would be owned by root in the initial uid
> namespace, so root in a child container would not be able to write to
> it.
> 

No, uid namespace is not the right thing for this.  If anything, it 
should be controlled by a capability flag.  This is a general issue for 
procfs and sysfs as used for controlling any kind of system resources, 
though.

> Defining a new mount option to set a per-sb limit seems useful though,
> as I could easily see wanting to limit containers (on a 1000-container
> system) to 3 ptys each for instance.

What probably would make more sense is to limit containers to a specific 
number of inodes or open file descriptors.  The pty limit was a quick 
hack to avoid DoS, but it's really equivalent (with a small multiplier, 
~3 or so) to "open inodes".

-hpa


[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-19 Thread H. Peter Anvin

Eric W. Biederman wrote:
>>
>> Really.  You have the same classes of issues with ANY allocatable
>> resource in the system.  Period.  Furthermore, there are quite a few
>> applications which want one and not the other.  Trying to entangle
>> them is broken.
> 
> Peter they are entangled issues because the limits frequently show up
> in the naming.  pids are a good example of that.
> 

No.  The *reason* for these limits is a matter of resource control, and
that has to at least have the ability to be global.

Entangling them because it peers through somewhat in the naming is
nonsense, at least in this case.

Containers may very well want resource control, but it's a separate
issue from naming.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-19 Thread H. Peter Anvin

Daniel Lezcano wrote:
> 
> But if I am able to create a new instance of devpts for a container and 
> modify the configuration of another devpts from this container, is it 
> acceptable ? Can we convince people to use the containers for security 
> and have anybody able to make a pty starvation from one container to 
> another ?
> If it is too complicated to handle one value per new devpts 
> instance, IMHO /proc/sys/kernel/pty/max should be, at least, read-only 
> for the new instance, no ?
> 

First of all, there is no such thing... the devpts instance is simply 
another filesystem, whereas the /proc/sys entry is a global limit on the 
total number of ptys in the system.  Again, one of thousands, and yes, 
they probably should ALL be readonly in a container environment.  That 
has to be set up separately from the devpts filesystem, because the 
devpts filesystem is not tied to procfs or even containers in any way.

-hpa

[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-19 Thread H. Peter Anvin

Daniel Lezcano wrote:
>>
>> Resource limit partitioning is a much bigger and orthogonal problem.
>>   
> In this case we don't have the pty allocated independently, no ?
> I mean one container can allocate 4095 pty, making a pty starvation for 
> other containers. Or imagine I am a villain and I want to mess with the other 
> containers, I can do echo 0 > /proc/sys/kernel/pty/max.
> AFAIR, we said people making isolation of a resource is in charge (if it 
> is relevant), to take into account the /proc/sys part.
> 
> For example, making the network per namespace all the network 
> configuration variable located in /proc/sys/net are per namespace too. 
> When it is irrelevant the file is read-only or just not displayed.
> 
> IMHO, pty/max and pty/nr is part of the "multiple devpts instances" 
> feature.
> 

Naming and resource partitioning are two orthogonal issues, regardless 
of what's IYHO.

Really.  You have the same classes of issues with ANY allocatable 
resource in the system.  Period.  Furthermore, there are quite a few 
applications which want one and not the other.  Trying to entangle them 
is broken.

-hpa


[Devel] Re: [PATCH 0/9] Multiple devpts instances

2009-02-19 Thread H. Peter Anvin

Daniel Lezcano wrote:
> suka...@linux.vnet.ibm.com wrote:
>> Enable multiple instances of devpts filesystem so each container can
>> allocate
>> ptys independently.
>>   
> Hi suka,
> 
> It looks like the /proc/sys/kernel/pty/max and nr are not virtualized.
> Modifying in the container the "max" pty, that impacts the init_pty.
> Same as nr which does not show the real number of pty allocated for the
> container.
> 
> Are you planning to fix this ?
> 

That's a separate issue, i.e. a resource allocation
localization/globalization issue.  The main reason for these limits is
to put a cap on the amount of low kernel memory used on 32-bit systems
especially, which is somewhat inherently global.

Resource limit partitioning is a much bigger and orthogonal problem.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


[Devel] Re: [PATCH 9/9] Document usage of multiple-instances of devpts

2008-10-15 Thread H. Peter Anvin

Serge E. Hallyn wrote:
> Looks good.  In the very last part, you might say just a little more to
> make sure it's clear:  You want to mount -o newinstance before sshd
> or gnome is started in the root container, so that a child container
> can't reach your devpts by doing a mount -t devpts without -o
> newinstance.  It's not that it's not clear in what you write, it's
> more that it's at the very end and brief, so I'm afraid it's not
> attention-grabbing enough as is.

Actually, you should just enable newinstance everywhere, in particular 
in your fstab, so that ALL instances of devpts in the system have 
newinstance (leaving the legacy one unreachable.)

In that sense I think your text above is more confusing than what 
Sukadev had.

-hpa

[Devel] Re: [PATCH 10/10] Document usage of multiple-instances of devpts

2008-09-19 Thread H. Peter Anvin

Alan Cox wrote:
> Ok I'm happy with this patch set. It appears correct as far as the tty
> side is concerned, it looks sensible in terms of interface with the
> devpts layer.
> 
> Really depends what everyone else thinks about the vfs bits and the API

The last version looks fine to me.  To be fair, I have only reviewed it, 
not actually tested it.

Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-06 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> Agree in general. Not sure if you are implying remount is necessary just
> to change permissions of pts/ptmx. Why not "chmod 0666 /dev/pts/ptmx" ?
> The remount changes the 'ptmxmode' setting, but since the node exists,
> the 'ptmxmode' setting is never used again and we need to chmod.

A chmod requires bigger changes to existing scripts than an option which 
can be set in /etc/fstab.

> ptmx node in multi-instance mounts continue to get PTMX_DEFAULT_MODE
> permissions (not 000) right ? (unless -o ptmxmode is specified)

It's probably easier to always default it to zero and expect that the 
mode is set explicitly.

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-05 Thread H. Peter Anvin

Alan Cox wrote:
>> Does presence of /dev/pts/ptmx in single-instance case break userspace ?
> 
> It changes the permission rules and subverts any permissions and security
> labels applied to the current node.
> 
> If it was there and defaulted to no permission I doubt anything would
> care - ie presence is not the problem, rights management is.

It would be easy enough to have it default to mode 000 unless otherwise 
specified.  For the default instance it is important that a remount can 
update the permissions (since the original mount will be the kernel 
version), but that's pretty straightforward.

That might be the best option?

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:

> 
> When both modes are used simultaneously, we have following options:
> 
> 1. Let container-startup deal with it i.e use above bind-mount approach
>    or, as Serge mentioned, have containers chroot and make ptmx->pts/ptmx
>    symlink or another option ?
> 
> 2. Have the ptmx-node even in the initial mount and a "permanent" ptmx
>    symlink -  Did we fully rule it out :-)
> 
> 3. Choose #2 with a (yet-another) config token. Not sure if it adds
>    value or further complicates the matrix.
> 
> Both #1 and #2 have their pros/cons.  Long term, one advantage I see with #2
> is that we don't force container-scripts do something now that they can/should
> potentially undo later if we ever want to remove the single-instance 
> semantics.
> 
> Does presence of /dev/pts/ptmx in single-instance case break userspace ?
> If it only surprises, will adding notes to pts(4) man page help ?
> 

Well, userspaces which implement the #2 option should add the 
newinstance mount option to ALL mounts of devpts, including the first 
one.  That way the "default" pts instance is never actually exposed.

Container scripts which need to work in both modes can trivially 
determine if they need to do the bind-mount, simply by seeing if 
/dev/ptmx is already a symlink.
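
The check described here can be sketched in C with lstat(), which examines the link itself rather than its target. The helper names are hypothetical; only lstat() and S_ISLNK() are standard.

```c
#include <sys/stat.h>

/* Returns 1 if path is a symbolic link, 0 if it exists but is something
 * else, and -1 if it cannot be examined (for example, it does not exist). */
static int path_is_symlink(const char *path)
{
	struct stat st;

	if (lstat(path, &st) != 0)
		return -1;
	return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* Hypothetical decision helper for a container start-up program: the bind
 * mount is only needed when the ptmx path is not already a symlink. */
static int need_ptmx_bind_mount(const char *ptmx_path)
{
	return path_is_symlink(ptmx_path) != 1;
}
```

A start-up program would call need_ptmx_bind_mount("/dev/ptmx") and perform the bind mount only when it returns nonzero.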

-hpa


[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> But that node will not be accessible if there is a newinstance mount
> without the bind mount ? IOW
> 
>   1. mount -t devpts -o newinstance lxcpts /dev/pts
>   2. mount -o bind /dev/pts/ptmx /dev/ptmx
> 
> If both #1 and #2 or neither happen there is no problem.
> 
> If #1 is NOT followed by #2, ptys break in new namespace.
> 
> An open of /dev/ptmx in this case will allocate a pty in the
> initial namespace, but since #1 is complete, we lookup the
> pty (/dev/pts/7) in the new namespace and fail.
> 

That is correct.

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-04 Thread H. Peter Anvin

Alan Cox wrote:
>> We can't, really, because it will open the global ptmx.  This is an 
>> unfortunate side effect of the backwards-compatibility code.
>>
>> This is also why I don't like the bind mount; the symlink option has the 
>> nice property that f*ckups are more obvious.
> 
> It's asking for trouble with existing systems and users that
> upgrade. /dev/ptmx should remain a proper device file for the non
> container case.

I did say that as being the desired *eventual* goal.

> Should /dev/ptmx give you a node in the 'master' pty namespace or a node
> in your current container's pty namespace ?

Well, since there is no "current container's pty namespace" per se, it 
will give you a node in the default (initial) pty namespace unless the 
bind mount is set up.

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> Ah, ok.  Well, I will remove that para from the patch description.
> 
> If the -o newinstance is NOT followed by the bind mount, ptys won't
> work and would be nice if we can print a useful message when opening
> /dev/ptmx.
> 

We can't, really, because it will open the global ptmx.  This is an 
unfortunate side effect of the backwards-compatibility code.

This is also why I don't like the bind mount; the symlink option has the 
nice property that f*ckups are more obvious.

The non-legacy option should be as follows, IMNSHO:

- ALL mounts of devpts use -o newinstance;
- /dev/ptmx -> pts/ptmx symlink.
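
The two steps listed above could be carried out from C roughly as follows. This is only a sketch under assumptions: it uses the "newinstance" and "ptmxmode" option names discussed in this thread, the helper names and the dev parameter are hypothetical, and the caller needs sufficient privilege. Replacing an existing /dev/ptmx node is destructive, so a real program would want more care than this.

```c
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Build the devpts option string, e.g. "newinstance,ptmxmode=0666".
 * Returns 0 on success, -1 if buf is too small. */
static int devpts_opts(char *buf, size_t len, unsigned int ptmx_mode)
{
	int n = snprintf(buf, len, "newinstance,ptmxmode=%04o", ptmx_mode);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Mount a private devpts instance on <dev>/pts and point <dev>/ptmx at it.
 * dev is a hypothetical parameter such as "/dev". Returns 0 on success. */
static int setup_private_devpts(const char *dev, unsigned int ptmx_mode)
{
	char opts[64], pts[256], ptmx[256];

	if (devpts_opts(opts, sizeof(opts), ptmx_mode) != 0)
		return -1;
	if (snprintf(pts, sizeof(pts), "%s/pts", dev) >= (int)sizeof(pts) ||
	    snprintf(ptmx, sizeof(ptmx), "%s/ptmx", dev) >= (int)sizeof(ptmx))
		return -1;
	if (mount("devpts", pts, "devpts", 0, opts) != 0)
		return -1;
	unlink(ptmx);	/* replace any existing node; errors are ignored */
	return symlink("pts/ptmx", ptmx);
}
```

The relative link target "pts/ptmx" matches the symlink named in the list, so the link keeps working if the directory is reached through a different path, for example inside a chroot.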

-hpa

[Devel] Re: [PATCH 11/11][v3]: Enable multiple instances of devpts

2008-09-03 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
>   2. To effectively use the multi-instance mode, applications/libraries
>   should, open "/dev/pts/ptmx" instead of "/dev/ptmx" but obviously
>   this would fail in the legacy mode.
>   

NOT SO!

/dev/ptmx is required by Unix98 (which is arguably obsolete, but still.) 
Applications SHOULD NOT try to open /dev/pts/ptmx.  This should be 
considered strictly an internal implementation detail.

Applications should use posix_openpt(), openpty() or forkpty(); 
libraries should use /dev/ptmx.
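
To illustrate the library side of this split, here is a Linux-specific sketch of roughly what sits underneath posix_openpt(), unlockpt() and ptsname(): it opens /dev/ptmx, never /dev/pts/ptmx, and uses the TIOCSPTLCK and TIOCGPTN ioctls. The function name is hypothetical, and the slave path is assumed to live under /dev/pts. Application code should still call posix_openpt(), openpty() or forkpty() instead of doing this by hand.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Library-side sketch: open the master via /dev/ptmx, unlock the slave, and
 * write the slave's path into name.
 * Returns the master fd, or -1 on failure. */
static int ptmx_open_master(char *name, size_t len)
{
	unsigned int ptn;
	int unlock = 0;
	int fd = open("/dev/ptmx", O_RDWR | O_NOCTTY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, TIOCSPTLCK, &unlock) != 0 ||	/* clear the pty lock */
	    ioctl(fd, TIOCGPTN, &ptn) != 0 ||		/* get the pty number */
	    snprintf(name, len, "/dev/pts/%u", ptn) >= (int)len) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Because the path opened is always /dev/ptmx, the same code works whether that name is a device node, a bind mount, or a symlink to pts/ptmx.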

-hpa

[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-26 Thread H. Peter Anvin

Alan Cox wrote:
> On Tue, 26 Aug 2008 09:40:20 -0700
> "H. Peter Anvin" <[EMAIL PROTECTED]> wrote:
> 
>> Alan Cox wrote:
>>> In the case of the initial open you don't yet know the tty pointer and
>>> may be creating it. So the tty isn't a reference because it doesn't exist.
>>>
>> Got it.  I was under the (apparently mistaken) notion that only pty tty 
>> structures were created dynamically.
> 
> tty is dynamically created and attached to the file handle. The port side
> structure is currently port specific and does last. That's what the
> tty_port stuff is intended to slowly standardise but won't help ptys as
> they don't have a physical port anyway.
> 

Got it, sorry for the confusion.

-hpa

[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-26 Thread H. Peter Anvin

Alan Cox wrote:
> 
> In the case of the initial open you don't yet know the tty pointer and
> may be creating it. So the tty isn't a reference because it doesn't exist.
> 

Got it.  I was under the (apparently mistaken) notion that only pty tty 
structures were created dynamically.

-hpa

[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-25 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> Yes, we know the driver, but do we need to pass it into ->get_tty() ?
> 
> Passing it in (or having the operation compute from inode) has advantage
> of allowing drivers to share code if necessary.
> 

Yes, and it gets access to its own data.  It's how you implement an 
object-oriented method call in C.
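
A minimal sketch of that pattern, assuming a simplified get_tty() that takes an index rather than an inode. All type and function names here are hypothetical stand-ins and not the real tty layer definitions.

```c
#include <stddef.h>

struct tty;		/* opaque in this sketch */
struct driver;

/* Table of "methods"; each one takes the driver itself as first argument. */
struct driver_ops {
	struct tty *(*get_tty)(struct driver *drv, int idx);
};

struct driver {
	const struct driver_ops *ops;
	struct tty **ttys;	/* the driver's own data */
	int num;
};

/* One implementation that any driver keeping a plain table can share,
 * because it reaches its data only through the drv argument. */
static struct tty *table_get_tty(struct driver *drv, int idx)
{
	if (idx < 0 || idx >= drv->num)
		return NULL;
	return drv->ttys[idx];
}

static const struct driver_ops table_ops = { .get_tty = table_get_tty };
```

A caller writes d->ops->get_tty(d, idx): the driver pointer appears twice, once to select the method and once as the explicit "this" argument.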

-hpa

[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-25 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
>>
>>  tty = driver->ops->get_tty(driver, inode [, other_stuff?]);
> 
> Can the inode be used to identify the driver too ?  (but inode to driver
> mapping is not trivial atm).

It can, but it's an O(n) operation in the number of registered drivers. 
However, we can only call the above if we know the driver in the first 
place so such a lookup is rather pointless.

-hpa

[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-25 Thread H. Peter Anvin

Alan Cox wrote:
>> This seems more than a bit redundant.  The "instance", IMO, *is* the tty 
>> structure; so the interface should be:
> 
> Only for a re-open - which is very different to an initial open,
> and /dev/tty is deep magic in this situation.

I guess I fail to understand something here, perhaps because I haven't 
looked at the code in very much detail for several years.  How is there 
not a 1:1 mapping between tty structures and instances, even in the 
presence of /dev/tty?  (/dev/tty, of course, points to a real tty.)

>> Not "index", but "inode".  If, as a courtesy to the generic driver, we 
>> want to precalculate the index number we can do that, but otherwise that 
>> is of course available as:
> 
> That's a much bigger step and raises problems later on with consoles. We
> might want to end up there - but not in one leap.

*Nod.*  It may mean that for consoles we have to provide transient 
inodes in rootfs.

-hpa


[Devel] Re: [RFC][PATCH 1/3] Move parts of init_dev() into new functions

2008-08-25 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> By extension, maybe the tty layer would need another interface to determine
> the instance:
> 
>   instance =  driver->ops->get_instance(driver, inode, other_stuff) 
> 
> using this we find the tty
> 
>   tty = driver->ops->something(driver, instance, idx);

This seems more than a bit redundant.  The "instance", IMO, *is* the tty 
structure; so the interface should be:

tty = driver->ops->get_tty(driver, inode [, other_stuff?]);

Not "index", but "inode".  If, as a courtesy to the generic driver, we 
want to precalculate the index number we can do that, but otherwise that 
is of course available as:

index = inode->i_rdev - MKDEV(device->major, device->minor_start);

... and if we replaced device->{major,minor_start} with device->base and 
made the drivers use MKDEV() in the initializers, it would be even simpler.
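
The subtraction can be tried outside the kernel with a small stand-alone sketch. It copies the kernel's device number layout (the macro is spelled MKDEV there, with 20 minor bits below the major); the structure and function names are hypothetical stand-ins for the driver fields used in the formula.

```c
/* Same layout as the kernel's dev_t helpers: 20 minor bits below the major. */
#define MINORBITS	20
#define MKDEV(ma, mi)	(((unsigned int)(ma) << MINORBITS) | (mi))

/* Hypothetical stand-in for the driver fields used in the formula. */
struct tty_range {
	unsigned int major;
	unsigned int minor_start;
};

/* Index of a device number within the driver's range of minors. */
static unsigned int tty_index(unsigned int rdev, const struct tty_range *r)
{
	return rdev - MKDEV(r->major, r->minor_start);
}
```

For a range starting at major 136, minor 0, the device number MKDEV(136, 7) yields index 7. Storing a single precomputed base instead of the two fields would reduce the body to one subtraction, which is the simplification suggested here.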

-hpa

[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

Eric W. Biederman wrote:
> 
> The point of making it a bind is to address the concerns about
> backwards compatibility in user space.  In particular security
> conscious applications and applications that perform sanity checks
> are known to ignore things if they are the wrong type in the filesystem.
> 

A.k.a. broken applications...

>> This is *only* required to support back-and-forth, and can be introduced
>> at any time after this patch is in the kernel -- or even before.
> 
> You can use a file bind mount just as easily as a symlink.
> 
> As for udev I haven't seen a version that is accessible to mere mortals yet
> and it doesn't seem like they plan on it being so.  Eventually I will get
> around to making sense of it as we need to make it work in a container
> but so far it seems to be much more complex than it should be.

I have not had that experience... I find it relatively simple to deal 
with.  The biggest problem is the fact that the rules aren't bundled 
with the kernel, which causes nasty chicken and egg problems.

-hpa


[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> Hmm, so, single and multi-mount don't coexist ? i.e some are multi-mounts
> while others are single-mounts.
> 
> The way I looked at is that even if a distro has not yet updated the
> startup script (fstab), we could use the multi-mount. Maybe a container
> startup script could change /dev/ptmx to symlink and both types of
> mounts can work simultaneously.
> 
> Would that be unnecessary ?
> 

Yes, that's unnecessary; you can still use a "newinstance" elsewhere in 
the system (in a container) where you rather by definition don't have a 
backward compatibility issue.

-hpa

[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

Eric W. Biederman wrote:
>> I had the new ptmx node only in 'multi-mount' mode initially. But if users
>> want the multi-mount semantics, /dev/ptmx must be a symlink. If it's a
>> symlink, we break in the single-mount case (which does not have the ptmx
>> node and we don't support mknod in pts).
> 
> Then have user space make it a file bind mount instead of symlink.
> That should address all of the backwards compatibility concerns, and
> allow us to only create it when open.

The right thing is that, if you want to support back-and-forth flipping, 
to introduce a udev rule which looks for pts/ptmx, links to it if 
present, and otherwise creates the ptmx device node.

This is *only* required to support back-and-forth, and can be introduced 
at any time after this patch is in the kernel -- or even before.

-hpa

[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> I had the new ptmx node only in 'multi-mount' mode initially. But if users
> want the multi-mount semantics, /dev/ptmx must be a symlink. If it's a symlink,
> we break in the single-mount case (which does not have the ptmx node and
> we don't support mknod in pts).
> 

True, but changing that is still a configuration change (adding newns to 
the fstab); it's not that much more work to change whatever else needs 
to change.

I personally don't expect a whole lot of back-and-forth; I suspect 
people will switch from the legacy model to the newns model mostly as 
part of a distro upgrade.

>>> I'm open to being convinced and the
>>> other problems with that code are more pressing.
> 
> Yes, I will look at the latest in linux-next and the ->driver_data
> approach.
> 
> But just to confirm, we do want try and keep single-mount semantics.

Certainly for several years at least.

-hpa


[Devel] Re: [RFC][PATCH 0/8][v2]: Enable multiple mounts of devpts

2008-08-21 Thread H. Peter Anvin

Cedric Le Goater wrote:
> H. Peter Anvin wrote:
>> Cedric Le Goater wrote:
>>>> I suggest "newinstance", but "newns" works, too.
>>> Could we also use this mount option to 'unshare' a new posix message
>>> queue namespace ?
>> Sorry, I fail to see the connection with devpts here?  Are you
>> suggesting using the same option for another filesystem (if so, which)?
> 
> yes. the posix message queues are also using a single superblock filesystem. 
> 
> If we want to isolate them (for container needs for example), we also need to 
> create a new sb. The patchset I have uses a clone flag but using a mount 
> 'newns' really sounds like a better idea.
> 

OK, so this is a very good motivation for using a nonspecific term like 
"instance" or "namespace".

-hpa

[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

Alan Cox wrote:
>> auto-created, than supporting mknod(2) inside the devpts filesystem. 
>> It's not a matter of "changing the user space"; it's a matter of what 
>> makes most sense inside the kernel.
> 
> Having an extra node with different permissions suddenly appear without
> warning isn't I think good behaviour.

Hm.  Given that the single-instance mode is the backwards compatibility 
mode (and it's accessible from outside the filesystem), it probably 
makes sense to suppress creating this device node when *not* applying 
the "newns" option, or whatever we want to call it.

> I'm open to being convinced and the
> other problems with that code are more pressing.

Agreed.

-hpa

[Devel] Re: [RFC][PATCH 0/8][v2]: Enable multiple mounts of devpts

2008-08-21 Thread H. Peter Anvin

Cedric Le Goater wrote:
>>>
>> I suggest "newinstance", but "newns" works, too.
> 
> Could we also use this mount option to 'unshare' a new posix message queue 
> namespace ? 
> 

Sorry, I fail to see the connection with devpts here?  Are you 
suggesting using the same option for another filesystem (if so, which)?

-hpa

[Devel] Re: [RFC][PATCH 7/8]: Auto-create ptmx node when mounting devpts

2008-08-21 Thread H. Peter Anvin

Alan Cox wrote:
>> This patch has the kernel internally create the [ptmx, c, 5:2] device
>> when mounting devpts filesystem. The permissions for the device node
>> can be specified by the '-o ptmx_mode=0666' option. The default mode
>> is 0666.
> 
> NAK
> 
>>  Hopefully, presence of the 'ptmx' node in /dev/pts does not surprise
>>  user space.
> 
> If you are going to make major changes requiring user space changes to
> use them then you can change the user space rather than playing "gee
> where did that come from" with the existing system.
> 

This particular one I think is the right thing, despite everything.  In 
particular, it makes a *hell* of a lot more sense to have this 
auto-created, than supporting mknod(2) inside the devpts filesystem. 
It's not a matter of "changing the user space"; it's a matter of what 
makes most sense inside the kernel.

His new implementation doesn't require user space changes unless you 
want to take advantage of the multi-instance features.  It thus allows 
for a soft transition.  However, soft or hard transition is irrelevant 
-- creating ptmx inside devpts is necessary, and it doesn't make sense 
to require userspace to mknod it, since the kernel code ends up being 
more complex.

-hpa


[Devel] Re: [RFC][PATCH 0/8][v2]: Enable multiple mounts of devpts

2008-08-20 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
>>
>> I don't like the name "newmnt" for the option; it is not just another 
>> mount, but a whole new instance of the pty space.
> 
> I agree.  It's mostly a place-holder for now. How about newns or newptsns ?
> 

I suggest "newinstance", but "newns" works, too.

>> I observe you didn't incorporate my feedback with regards to get_node().   
> 
> Yes, I have not addressed it yet. Will look into it in the next pass,
> but will add to the todo list now.

OK.

-hpa

[Devel] Re: [RFC][PATCH 0/8][v2]: Enable multiple mounts of devpts

2008-08-20 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> TODO:
>   - Remove even initial kernel mount of devpts ? (If we do, how
> do we preserve single-mount semantics) ?

Doesn't make sense unless we decide to drop single-mount semantics in 
the (far) future. As long as we have an instance that services 
unconnected ptmx instances, it makes sense to have that instance 
available to the kernel at all times.

I don't like the name "newmnt" for the option; it is not just another 
mount, but a whole new instance of the pty space.

I observe you didn't incorporate my feedback with regard to get_node().
In this scheme, any and all uses of get_node() are bogus; as such,
you're missing the huge opportunity for cleanup that comes along with
this whole thing.

This means breaking compatibility in one very minor way, which is if 
people copy device nodes out of /dev/pts, but I am feeling pretty sure 
that that is much better than carrying the ugliness that goes along with 
the current code.  Furthermore, if there is anyone who does something
that silly, they need to fix it anyway.

The *entire* implementation of devpts_get_tty(), for example, should 
look like:

struct tty_struct *devpts_get_tty(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (sb->s_magic == DEVPTS_SUPER_MAGIC)
		return (struct tty_struct *)inode->i_private;
	else
		return NULL;	/* Higher layer should return -ENXIO */
}

I really appreciate your tackling this implementation.

-hpa

[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-06 Thread H. Peter Anvin

Kyle Moffett wrote:
> On Tue, Aug 5, 2008 at 2:15 AM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>>> There definitely needs to be a mount option (and possibly a config
>>> option to forcibly enable the mount option).  I personally have 5 or 6
>>> different custom scripts that depend on being able to unmount and
>>> remount devpts without losing access to the TTYs therein.  Eventually
>>> I will need to port those over to use "mount --move", but it would be
>>> bad to have a random kernel upgrade just break my imaging/cloning
>>> system.
>> An interesting point.  What should the semantics be.  If we unmount /dev/pts
>> and people still have ptys open.  -EBUSY?  Except for lazy unmounts?
> 
> Well, even if it's unmounted you can still access your pty with
> /dev/tty.  As it stands right now it's possible to "umount /dev/pts"
> from an SSH login and still have a mostly-functional system.

This is only because there is always an instance in the kernel.

> The only
> failure will be when somebody needs a pseudo-TTY and you have devpts
> unmounted and UNIX98 ptys turned off.
> 
> So for the legacy case, the behavior should be exactly as it is now.
> In the CONFIG_DEVPTY_FORCE_PERMOUNT/"permount"-option case, I agree
> that you could easily go either way.

In the multimount case, we should refuse umounts (except, obviously, 
lazy umounts) if there are ptys open, or ptys used as current terminals, 
in that filesystem.  Simple.
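
A minimal userspace sketch of that rule, for illustration only.  The
struct and function names are made up here and are not kernel
interfaces; the counters stand in for whatever bookkeeping a real
per-instance devpts would keep.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-instance bookkeeping; not a kernel structure. */
struct pts_instance_sketch {
	unsigned int open_ptys;		/* ptys currently open in this instance */
	unsigned int current_ttys;	/* ptys in use as a process's current terminal */
};

/* Returns 0 if the umount may proceed, -EBUSY if it must be refused. */
int pts_may_umount(const struct pts_instance_sketch *inst, bool lazy)
{
	assert(inst != NULL);

	/* Lazy umounts are always allowed; the instance goes away later. */
	if (lazy)
		return 0;

	if (inst->open_ptys > 0 || inst->current_ttys > 0)
		return -EBUSY;

	return 0;
}
```

A busy instance is refused with -EBUSY unless the caller asked for a
lazy umount; an idle instance can always be unmounted.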

-hpa

[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> Ok. But was wondering if we can pass the ptmx symlink burden to the
> 'container-startup scripts' since they are the ones that need the second
> or subsequent mount of devpts.
> 
> So, initially and for systems that don't need multiple mounts of devpts,
> existing behavior can continue (/dev/ptmx is a node).
> 
> Container startup scripts have to anyway remount /dev/pts and mknod
> /dev/pts/ptmx. These scripts could additionally check if /dev/ptmx is
> a node and make it a symlink. The container script would have to do
> this check while it still has access to the first mount of devpts
> and mknod in the first devpts mnt.
> 
> But then again, the first mount is still special in the kernel.
> 

You're right, I think we can do this and still retain most of the 
advantages, at least for a transition period.

The idea would be that you'd have a mount option, that if you do not 
specify it, you get a bind to the in-kernel mount; otherwise you get a 
new instance.  ptmx, if not invoked from inside a devpts filesystem, 
would default to the kernel-mounted instance.

Unfortunately I believe that means parsing the command options in 
devpts_get_sb() to know if we do have the "multi" option, but that isn't 
really all that difficult; it just means breaking the parser out as a 
separate subroutine.
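
As a rough sketch of what such a stand-alone parser could look like,
here is a userspace version.  Everything in it is an assumption made
for illustration: the struct, the function name, and the choice of the
option names "newinstance" and "ptmx_mode=" (both names come up
elsewhere in these threads; this message calls the option "multi").
It is not the kernel's parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing a devpts mount option string. */
struct pts_mount_opts_sketch {
	bool newinstance;	/* private instance requested? */
	unsigned int ptmx_mode;	/* permissions for the ptmx node */
};

/*
 * Parse a comma-separated option string without modifying it.
 * Unknown options are ignored.  Returns 0 on success, -1 if
 * ptmx_mode= carries something that is not a valid octal mode.
 */
int pts_parse_options(const char *data, struct pts_mount_opts_sketch *opts)
{
	assert(opts != NULL);

	opts->newinstance = false;
	opts->ptmx_mode = 0666;

	if (data == NULL)
		return 0;

	const char *p = data;
	while (*p != '\0') {
		size_t len = strcspn(p, ",");	/* length of this option */

		if (len == 11 && strncmp(p, "newinstance", 11) == 0) {
			opts->newinstance = true;
		} else if (len > 10 && strncmp(p, "ptmx_mode=", 10) == 0) {
			char buf[16];
			size_t vlen = len - 10;
			char *end;

			if (vlen >= sizeof(buf))
				return -1;
			memcpy(buf, p + 10, vlen);
			buf[vlen] = '\0';

			unsigned long mode = strtoul(buf, &end, 8);
			if (*end != '\0' || mode > 07777)
				return -1;
			opts->ptmx_mode = (unsigned int)mode;
		}

		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

Because the parser only fills in a struct, the same routine could be
called once to decide between "bind to the in-kernel instance" and
"create a new instance", and again later to apply the remaining
options.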

-hpa


[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> Appreciate comments on overall approach of my mapping from the inode
> to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
> and also on the tweak to init_dev() (patch 6).
> 

First of all, thanks for taking this on :)  It's always delightful to 
spout some ideas and have patches appear as a result :)

Once you have the notion of the device nodes tied to a specific devpts 
filesystem, a lot of the operations can be trivialized; for example, the 
whole devpts_get_tty() mechanism can be reduced to:

	if (inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC) {
		/* do cleanup */
		return -ENXIO;
	}
	tty = inode->i_private;

This is part of what makes this whole approach so desirable: it actually 
allows for some dramatic simplifications of the existing code.
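
For readers who want to poke at the same idea from user space, here is
a small sketch that asks "does this path live on a devpts filesystem?"
via statfs(2).  It is only an analogue of the in-kernel s_magic test
above, not part of the proposed patch; the function name is invented,
and the magic value is written out locally on the assumption that it
matches DEVPTS_SUPER_MAGIC in <linux/magic.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>

/* Assumed to match DEVPTS_SUPER_MAGIC from <linux/magic.h>. */
#define DEVPTS_MAGIC_SKETCH 0x1cd1L

/*
 * Returns 1 if path is on a devpts filesystem, 0 if it is on some
 * other filesystem, and -1 if statfs() fails (errno is left set).
 */
int path_is_on_devpts(const char *path)
{
	struct statfs st;

	assert(path != NULL);

	if (statfs(path, &st) != 0)
		return -1;

	return st.f_type == DEVPTS_MAGIC_SKETCH ? 1 : 0;
}
```

On a typical Linux system one would expect /dev/pts/0 to report 1 and
a device node copied out to another filesystem to report 0, which is
exactly the case the simplified kernel code would reject with -ENXIO.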

One can even bind special operations to both the ptmx node and slave 
nodes, to bypass most of the character device and tty dispatch.  That 
might require too much hacking at the tty core to be worth it, though.

-hpa

[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> If devpts is mounted more than once, then '/dev/ptmx' must be a symlink
> to '/dev/pts/ptmx' and in each new devpts mount we must create the
> device node '/dev/pts/ptmx' [c, 5;2] by hand.
> 

This should be auto-created.  That also eliminates any need to support 
the mknod system call.

> Appreciate comments on overall approach of my mapping from the inode
> to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
> and also on the tweak to init_dev() (patch 6).
> 
> Todo:
>   User-space impact of /dev/ptmx symlink - Options are being
>   discussed on mailing list (new mount option and config token,
>   new fs name, etc)
> 
>   Remove even initial kernel mount of devpts ?

The initial kernel mount of devpts should be removed, since that 
instance will never be accessible.

-hpa

[Devel] Re: Per-instance devpts

2008-08-03 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> 
> IIRC, /dev/tty also needs a similar symlink.
> 

Why?  I do not believe that is correct.

-hpa


[Devel] Re: Per-instance devpts

2008-08-02 Thread H. Peter Anvin

Kyle Moffett wrote:
> 
> Here's my suggestion:
> 
> By default, without any mount options, use the current "legacy"
> behavior.  The devpts filesystem would point to a "global" instance on
> the whole box, controlled by the traditional /dev/ptmx device node.
> There would *NOT* be a /dev/pts/ptmx node.
> 
> If the devpts filesystem is mounted with a special option ("permount"?
> "noglobal"?), then it will create a new devpts instance associated
> with the filesystem.  A devpts mounted that way *WILL* have a magic
> /dev/pts/ptmx node.
> 
> If the kernel is built with CONFIG_DEVPTS_FORCE_PERMOUNT then the
> traditional /dev/ptmx device node will be neutered (IE: always return
> -ENODEV) and the "permount" option will be forced for all devpts
> mounts.  This will also remove the static global devpts instance.
> 

Hm.  This might work if we can get the mount behaviour to work right. 
I'll think about it.  It definitely seems like a reasonable way to get 
from A to B.

-hpa

[Devel] Re: Per-instance devpts

2008-08-01 Thread H. Peter Anvin

Dave Hansen wrote:
> On Fri, 2008-08-01 at 11:12 -0700, H. Peter Anvin wrote:
>> 1. /dev/ptmx would have to change to a symlink, ptmx -> pts/ptmx.
> ...
>> I worry #1 would have substantial user-space impact, but I don't see a 
>> way around it, since there would be no obvious way to associate 
>> /dev/ptmx with a filesystem.
> 
> Are your worries just about replacing what is now a normal file with a
> symlink, and the behavioral changes that come with that?  
> 
> I wonder if using a bind mount for the file would be more robust.  We
> wouldn't, of course, be able to do it persistently, but I bet it would
> be something we could count on udev to do for us.

No, I'm concerned about the changes needed for udev and setup scripts.

-hpa

[Devel] Per-instance devpts

2008-08-01 Thread H. Peter Anvin

Since the issue of PTY namespaces came up (and was rejected) back in 
April, I have thought a little bit about changing ptys to be tied 
directly into a devpts instance.  devpts would then be a "normal" 
filesystem, which can be mounted multiple times (or not at all).  pty's 
would then become private to a devpts instance.

This is what it would appear would have to change, and I'd like to get 
people's feeling for the user-space impact:

1. /dev/ptmx would have to change to a symlink, ptmx -> pts/ptmx.
2. Permissions on /dev/ptmx would not be persistent, and would have to
be set via devpts mount options (unless they're 0666 root.tty, which
would presumably be the default.)
3. The /proc/sys/kernel/pty limit would be global; a per-filesystem
limit could be added on top or instead (presumably via a filesystem
mount option.)

I worry #1 would have substantial user-space impact, but I don't see a 
way around it, since there would be no obvious way to associate 
/dev/ptmx with a filesystem.
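
To get a feel for the user-space impact of #1, a setup script or tool
would mostly need to tell a device node from a symlink.  A minimal
sketch using lstat(2) follows; the enum and function names are invented
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical classification of what currently sits at /dev/ptmx. */
enum ptmx_kind {
	PTMX_MISSING = -1,	/* lstat() failed */
	PTMX_CHARDEV = 0,	/* traditional character device node */
	PTMX_SYMLINK = 1,	/* e.g. ptmx -> pts/ptmx */
	PTMX_OTHER   = 2	/* anything else */
};

enum ptmx_kind classify_ptmx(const char *path)
{
	struct stat st;

	assert(path != NULL);

	/* lstat() does not follow symlinks, so a link is seen as a link. */
	if (lstat(path, &st) != 0)
		return PTMX_MISSING;
	if (S_ISLNK(st.st_mode))
		return PTMX_SYMLINK;
	if (S_ISCHR(st.st_mode))
		return PTMX_CHARDEV;
	return PTMX_OTHER;
}
```

Calling classify_ptmx("/dev/ptmx") on a system using the traditional
layout should give PTMX_CHARDEV; under the proposal it would give
PTMX_SYMLINK.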

-hpa

[Devel] Re: [LTP] [PATCH 0/4] Helper patches for PTY namespaces

2008-04-23 Thread H. Peter Anvin

Serge E. Hallyn wrote:
> 
> Subrata,
> 
> pty namespaces as such are not going to happen.  We'll be pursuing
> full-scale device namespaces instead.
> 

Again, either that, or tie Unix98 pty's closer into devpts (which would 
have other advantages, in particular avoiding the double lookup) which 
would permit the ordinary filesystem namespace mechanisms to be used.

-hpa

[Devel] Re: [PATCH]: Factor out PTY index allocation

2008-04-17 Thread H. Peter Anvin

Serge E. Hallyn wrote:
> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
>> We noticed this while working on pts namespaces and believe this might
>> be a useful change even as we rework our pts/device namespace approach.
>>
>> ---
>>
>> From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
>> Subject: [PATCH]: Factor out PTY index allocation
>>
>> Factor out the code used to allocate/free a pts index into new interfaces,
>> devpts_new_index() and devpts_kill_index().  This localizes the external
>> data structures used in managing the pts indices.
>>
>> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
>> Signed-off-by: Serge Hallyn<[EMAIL PROTECTED]>
>> Signed-off-by: Matt Helsley<[EMAIL PROTECTED]>
> 
> No traces of devpts namespaces here, so I assume this should be
> non-offensive and fine for inclusion.
> 

Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>

[Devel] Re: [PATCH]: Propagate error code from devpts_pty_new

2008-04-17 Thread H. Peter Anvin

Serge E. Hallyn wrote:
> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
>> From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
>> Subject: [PATCH]: Propagate error code from devpts_pty_new
>>
>> Have ptmx_open() propagate any error code returned by devpts_pty_new()
>> (which returns either 0 or -ENOMEM anyway).
>>
>> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
> 
> Seems nice and non-contentious.
> 
> Acked-by: Serge Hallyn <[EMAIL PROTECTED]>

Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>

[Devel] Re: Multiple instances of devpts

2008-04-12 Thread H. Peter Anvin

H. Peter Anvin wrote:
> 
> Thinking about it further, allowing this restriction would also allow a 
> whole lot of cleanups inside the pty setup, since it would eliminate the 
> need to do a separate lookup to find the corresponding devpts entry in 
> pty_open().  The benefit here comes from the closer coupling between the 
> pty and the devpts filesystem and isn't at all related to namespaces, 
> but it's a very nice side benefit.
> 

Minor correction: the lookup is actually in init_dev() in tty_io.c; I'm 
specifically referring to devpts_get_tty().

-hpa

[Devel] Re: Multiple instances of devpts

2008-04-12 Thread H. Peter Anvin

Eric W. Biederman wrote:
>>
>> /dev/ptmx can be a symlink ptmx -> pts/ptmx, and we add a ptmx instance 
>> inside the devpts filesystem.  Each devpts filesystem is responsible for 
>> its own pool of ptys, with own numbering, etc.
>>
>> This does mean that entries in /dev/pts are more than just plain device 
>> nodes, which they are now (you can cp -a a device node from /dev/pts 
>> into another filesystem and it will still "just work"), but I doubt this 
>> actually matters to anyone.  If anyone cares, now I guess would be a 
>> good time to speak up.
> 
> Agreed.   That is another legitimate path.  And if all you care about is
> isolation and not dealing with the general class of problems with the
> global device number to device mapping that is sane.  I know we have
> several other virtual devices that we tend to care about but ptys are
> the real world pain point.
> 

Thinking about it further, allowing this restriction would also allow a 
whole lot of cleanups inside the pty setup, since it would eliminate the 
need to do a separate lookup to find the corresponding devpts entry in 
pty_open().  The benefit here comes from the closer coupling between the 
pty and the devpts filesystem and isn't at all related to namespaces, 
but it's a very nice side benefit.

-hpa

[Devel] Multiple instances of devpts

2008-04-12 Thread H. Peter Anvin

Al Viro wrote:
> 
> *boggle*
> 
> Care to explain how that "namespace" is different from devpts instance?
> IOW, why the devil do you guys ignore Occam's Razor?
> 
> Frankly, this nonsense has gone far enough; I can buy the need to compensate
> for shitty APIs (sockets, non-fs-based-IPC, etc.), but devpts *is* *a*
> *fucking* *filesystem*.  Already.  And as such it's already present in
> normal, real, we-really-shouldn't-have-any-other-if-not-for-ancient-stupidity
> namespace.
> 
> Why not simply allow independent instances of devpts and be done with that?

In particular:

/dev/ptmx can be a symlink ptmx -> pts/ptmx, and we add a ptmx instance 
inside the devpts filesystem.  Each devpts filesystem is responsible for 
its own pool of ptys, with own numbering, etc.

This does mean that entries in /dev/pts are more than just plain device 
nodes, which they are now (you can cp -a a device node from /dev/pts 
into another filesystem and it will still "just work"), but I doubt this 
actually matters to anyone.  If anyone cares, now I guess would be a 
good time to speak up.

-hpa

[Devel] Re: [PATCH 0/4] Helper patches for PTY namespaces

2008-04-12 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> Some simple helper patches to enable implementation of multiple PTY
> (or device) namespaces.
> 
>   [PATCH 1/4]: Propagate error code from devpts_pty_new
>   [PATCH 2/4]: Factor out PTY index allocation
>   [PATCH 3/4]: Move devpts globals into init_pts_ns
>   [PATCH 3/4]: Enable multiple mounts of /dev/pts
> 
> This patchset is based on earlier versions developed by Serge Hallyn
> and Matt Helsley.

Any measurable performance impact when not using these kinds of namespaces?

-hpa

[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-10 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> | 
> | I thought that the consensus was that adding a new system call was
> | better than trying to force extensibility on to the existing
> | non-extensible system call.
> 
> There were couple of objections to extensible system calls like
> sys_indirect() and to Pavel's approach.
> 

This is a very different thing, though.  sys_indirect is pretty much a 
mechanism for having a sideband channel -- a second ABI -- into each and 
every system call, making it extremely hard to analyze what the full set 
of impact of a specific system call is.  Worse, as it was being proposed 
to have been used, it would have set state variables inside the kernel 
in a very opaque manner.

> | But if we are adding a new system call, why not make the new one
> | extensible to reduce the need for yet another new call in the future?
> 
> hypothetically, can we make a variant of clone() extensible to the point
> of requiring a copy_from_user() ?

The only issue is whether or not it's acceptable from a performance 
standpoint.  clone() is reasonably expensive, though.

-hpa

[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-10 Thread H. Peter Anvin

Cedric Le Goater wrote:
> 
> OK. I didn't know that. I took sys_llseek() as an example of an interface 
> to follow when I coded clone64(). 
> 

llseek() was the first system call that took a doublewidth argument. 
It's not the one you want to mimic.

-hpa

[Devel] Re: [PATCH 3/3] add the clone64() and unshare64() syscalls

2008-04-09 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> Jakub Jelinek [EMAIL PROTECTED] wrote:
> | On Wed, Apr 09, 2008 at 03:34:59PM -0700, [EMAIL PROTECTED] wrote:
> | > From: Cedric Le Goater <[EMAIL PROTECTED]>
> | > Subject: [PATCH 3/3] add the clone64() and unshare64() syscalls
> | > 
> | > This patch adds 2 new syscalls :
> | > 
> | >  long sys_clone64(unsigned long flags_high, unsigned long flags_low,
> | >   unsigned long newsp);
> | > 
> | >  long sys_unshare64(unsigned long flags_high, unsigned long flags_low);
> | 
> | Can you explain why are you adding it for 64-bit arches too?  unsigned long
> | is there already 64-bit, and both sys_clone and sys_unshare have unsigned
> | long flags, rather than unsigned int.
> 
> Hmm,
> 
> By simply reusing clone() on 64 bit and adding a new call for 32-bit won't
> the semantics of clone() differ between the two ?
> 
> i.e clone() on 64 bit supports say CLONE_NEWPTS clone() on 32bit does not ?
> 
> Wouldn't it be simpler/cleaner if clone() and clone64() behaved the same
> on both 32 and 64 bit systems ?
> 

No, not really.  The way this works on the libc side is pretty much "use 
clone64 if it exists, otherwise use clone".
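
A sketch of that libc-side pattern, for illustration.  Since clone64()
is only a proposal in this thread, the two "system calls" are passed in
as function pointers and the fake_* functions are stand-ins for a
kernel with and without the new call; none of these names are real
libc or kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef long (*clone64_fn)(uint64_t flags);
typedef long (*clone32_fn)(unsigned long flags);

/*
 * Try the wide call first; if the kernel does not have it (-ENOSYS),
 * fall back to the old call, provided the flags fit in 32 bits.
 */
long clone_with_fallback(uint64_t flags, clone64_fn try64, clone32_fn try32)
{
	assert(try32 != NULL);

	if (try64 != NULL) {
		long ret = try64(flags);
		if (ret != -ENOSYS)
			return ret;
	}

	/* The old call cannot express any of the high flag bits. */
	if (flags >> 32)
		return -EINVAL;

	return try32((unsigned long)flags);
}

/* Stand-ins for the kernel, used only to exercise the wrapper. */
long fake_clone64_absent(uint64_t flags)  { (void)flags; return -ENOSYS; }
long fake_clone64_present(uint64_t flags) { (void)flags; return 64; }
long fake_clone32(unsigned long flags)    { (void)flags; return 32; }
```

With this shape the application sees one clone() interface on both
32-bit and 64-bit systems; which system call ends up servicing it is
an implementation detail of the library.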

-hpa

[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-09 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
>>
>> If you're going to make it a 64-bit pass it in as a 64-bit number, instead 
>> of breaking it into two numbers.
> 
> Maybe I am missing your point. The glibc interface could take a 64bit
> parameter, but don't we need to pass 32-bit values into the system call 
> on 32 bit systems ?

Not as such, no.  The ABI handles that.  To make the ABI clean on some 
architectures, it's good to consider a 64-bit value only in positions 
where they map to an even:odd register pair once slotted in.
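
To make the register-pair point concrete, here is a small model of an
argument-passing rule in which a 64-bit value must start in an
even-numbered 32-bit register (32-bit ARM EABI is one ABI that behaves
this way, as far as I know).  The function names are invented; this is
arithmetic about argument slots, not code from any ABI implementation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the index of the next free 32-bit argument register, return
 * the index at which a 64-bit argument would start: the next even one.
 */
unsigned int u64_arg_first_reg(unsigned int next_free)
{
	return (next_free + 1u) & ~1u;
}

/*
 * Count how many 32-bit argument registers an argument list consumes.
 * arg_bits[i] is 32 or 64 for each argument, in order.
 */
unsigned int arg_regs_used(const unsigned char *arg_bits, unsigned int nargs)
{
	unsigned int reg = 0;

	assert(arg_bits != NULL || nargs == 0);

	for (unsigned int i = 0; i < nargs; i++) {
		assert(arg_bits[i] == 32 || arg_bits[i] == 64);
		if (arg_bits[i] == 64) {
			reg = u64_arg_first_reg(reg);	/* may skip one register */
			reg += 2;
		} else {
			reg += 1;
		}
	}
	return reg;
}
```

Under this model (u64, u32) uses three registers, while (u32, u64)
uses four because one register is skipped as padding.  That is why the
position of a 64-bit argument in the prototype matters when there are
only six registers to go around.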

> Yes, this was discussed before in the context of Pavel Emelyanov's patch
> 
>   http://lkml.org/lkml/2008/1/16/109
> 
> along with sys_indirect().  While there was no consensus, it looked like
> adding a new system call was better than open ended interfaces.

That's not really an open-ended interface, it's just an expandable bitmap.

-hpa

[Devel] Re: [PATCH 0/3] clone64() and unshare64() system calls

2008-04-09 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> This is a resend of the patch set Cedric had sent earlier. I ported
> the patch set to 2.6.25-rc8-mm1 and tested on x86 and x86_64.
> ---
> 
> We have run out of the 32 bits in clone_flags !
> 
> This patchset introduces 2 new system calls which support 64bit clone-flags.
> 
>  long sys_clone64(unsigned long flags_high, unsigned long flags_low,
>   unsigned long newsp);
> 
>  long sys_unshare64(unsigned long flags_high, unsigned long flags_low);
> 
> The current version of clone64() does not support CLONE_PARENT_SETTID and 
> CLONE_CHILD_CLEARTID because we would exceed the 6 registers limit of some 
> arches. It's possible to get around this limitation but we might not
> need it as we already have clone()
> 

I really dislike this interface.

If you're going to make it 64-bit, pass it in as a 64-bit number, 
instead of breaking it into two numbers.  Better yet, IMO, would be to 
pass a pointer to a structure like:

struct shared {
	unsigned long nwords;
	unsigned long flags[];
};

... which can be expanded indefinitely.
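
A sketch of how a consumer of such a structure might test a flag, in
userspace C.  The helper names are made up; the one design assumption
worth calling out is that any bit beyond nwords is treated as "not
set", which is what lets old callers with short arrays keep working
when new flags are added.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct shared {
	unsigned long nwords;
	unsigned long flags[];
};

/* Allocate a zeroed flag set with room for nwords words (caller frees). */
struct shared *shared_alloc(unsigned long nwords)
{
	struct shared *s = calloc(1, sizeof(*s) + nwords * sizeof(unsigned long));

	if (s != NULL)
		s->nwords = nwords;
	return s;
}

/* Test bit number 'bit'; bits past the end of the array read as clear. */
bool shared_test_flag(const struct shared *s, unsigned long bit)
{
	assert(s != NULL);

	unsigned long word = bit / BITS_PER_WORD;

	if (word >= s->nwords)
		return false;
	return (s->flags[word] >> (bit % BITS_PER_WORD)) & 1UL;
}
```

In the kernel the structure would of course have to be copied in from
user space first, with nwords bounded to something sane before the
array is read.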

-hpa

[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace

2008-04-09 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
>> I'm just worried about the accumulation of what feels like ad hoc 
>> namespaces, causing a very large combination matrix, a lot of which don't 
>> make sense.
> 
> Hmm, if we were to just call this CLONE_NEWDEV, would that (a) make
> sense and (b) suitably address your (certainly valid) concern?
> 
> Basically for now CLONE_NEWDEV wouldn't yet be fully implemented, only
> unsharing unix98 ptys...

That would make sense to me.  Also see Eric's note about uevent, 
however; and there are probably other issues like it.

-hpa

[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace

2008-04-09 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> We want to provide isolation between containers, meaning PTYs in container
> C1 should not be accessible to processes in C2 (unless C2 is an ancestor).

Yes, I certainly can understand the desire for isolation.  That wasn't 
what my question was about.

> The other reason for this in the longer term is for checkpoint/restart.
> When restarting an application we want to make sure that the PTY indices
> it was using is available and isolated.

OK, this would be the motivation for index isolation.

> A complete device-namespace could solve this, but IIUC, is being planned
> in the longer term. We are hoping this would provide the isolation in the
> near-term without being too intrusive or impeding the implementation of
> the device namespace.

I'm just worried about the accumulation of what feels like ad hoc 
namespaces, causing a very large combination matrix, a lot of which 
don't make sense.

-hpa

[Devel] Re: [RFC][PATCH 0/7] Clone PTS namespace

2008-04-08 Thread H. Peter Anvin

[EMAIL PROTECTED] wrote:
> Devpts namespace patchset
> 
> In continuation of the implementation of containers in mainline, we need to
> support multiple PTY namespaces so that the PTY index (ie the tty names) in
> one container is independent of the PTY indices of other containers.  For
> instance this would allow each container to have a '/dev/pts/0' PTY and
> refer to different terminals.
> 

Why do we "need" this?  There isn't a fundamental need for this to be a 
dense numberspace (in fact, there are substantial reasons why it's a bad 
idea; the only reason the namespace is dense at the moment is because of 
the hideously bad handling of utmp in glibc.)  Other than indices, this
seems to be a special case of device isolation across namespaces;
would that be a more useful problem to solve across the board?

hpa

[Devel] Re: Extending syscalls

2008-01-17 Thread H. Peter Anvin


Jonathan Corbet wrote:
> 
> Heh, indeed.  But we do seem to have a recurring problem of people
> wanting to extend sys_foo() beyond the confines of its original API.
> I've observed a few ways of doing that:
> 
>  - create sys_foo2() (or sys_foo64(), or sys_fooat(), or sys_pfoo(),
>    or...) and add the new stuff there.
> 
> The first approach has traditionally been the most popular.  If we have
> a consensus that this is the way to extend system calls in the future,
> it would be nice to set that down somewhere.  We could avoid a lot of
> API blind alleys that way.
> 

I would argue it is the right approach.  It lets the kernel system call 
entry dispatch directly to the system call for the "new" case, and to a 
compatibility thunk for the "old" case.  It has the following desirable 
properties:


- No overhead for the "new" case.
- Minimal overhead for the "old" case.
- Easily dealt with by tools like strace that examine system calls.
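
A toy model of that arrangement, to show the shape.  sys_foo() and
sys_foo2() are the placeholder names from the quoted text; the table,
the FOO_DOUBLE flag and the argument-array convention are invented for
this sketch and do not correspond to any real system call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FOO_DOUBLE 0x1UL	/* hypothetical flag understood only by foo2 */

/* The one real implementation, written for the new interface. */
static long do_foo2(long value, unsigned long flags)
{
	if (flags & ~FOO_DOUBLE)
		return -EINVAL;		/* unknown flags are rejected */
	return (flags & FOO_DOUBLE) ? 2 * value : value;
}

/* "New" entry point: dispatches straight to the implementation. */
static long sys_foo2(const long *args)
{
	return do_foo2(args[0], (unsigned long)args[1]);
}

/* "Old" entry point: a thin compatibility thunk supplying defaults. */
static long sys_foo(const long *args)
{
	return do_foo2(args[0], 0);
}

enum { NR_foo, NR_foo2, NR_syscalls };

typedef long (*syscall_fn)(const long *args);

static const syscall_fn sys_call_table[NR_syscalls] = {
	[NR_foo]  = sys_foo,
	[NR_foo2] = sys_foo2,
};

long dispatch(unsigned int nr, const long *args)
{
	assert(args != NULL);

	if (nr >= NR_syscalls)
		return -ENOSYS;
	return sys_call_table[nr](args);
}
```

Each call number maps to exactly one function with a fixed argument
list, so a tracer only needs the number to know how to decode the
arguments.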

-hpa

[Devel] Re: [PATCH] mark read_crX() asm code as volatile

2007-10-02 Thread H. Peter Anvin


Nick Piggin wrote:


This should work because the result gets used before reading again:

read_cr3(a);
write_cr3(a | 1);
read_cr3(a);

But this might be reordered so that b gets read before the write:

read_cr3(a);
write_cr3(a | 1);
read_cr3(b);

?


I don't see how, as write_cr3 clobbers memory.


Because read_cr3() doesn't depend on memory, and b could be stored in a 
register.


-hpa


[Devel] Re: [PATCH] mark read_crX() asm code as volatile

2007-10-02 Thread H. Peter Anvin


Arjan van de Ven wrote:
> On Tue, 02 Oct 2007 18:08:32 +0400
> Kirill Korotaev <[EMAIL PROTECTED]> wrote:
> 
>> Some gcc versions (I checked at least 4.1.1 from RHEL5 & 4.1.2 from
>> gentoo) can generate incorrect code with read_crX()/write_crX()
>> functions mix up, due to cached results of read_crX().
> 
> I'm not so sure volatile is the right answer, as compared to giving the
> asm more strict constraints
> 
> asm volatile tends to mean something else than "the result has
> changed"
> 

One of the aspects of volatility is "the result will change in ways you 
(gcc) just don't understand."


-hpa


[Devel] Re: [patch] unprivileged mounts update

2007-04-25 Thread H. Peter Anvin

Miklos Szeredi wrote:
> 
> Andrew, please skip this patch, for now.
> 
> Serge found a problem with the fsuid approach: setfsuid(nonzero) will
> remove filesystem related capabilities.  So even if root is trying to
> set the "user=UID" flag on a mount, access to the target (and in case
> of bind, the source) is checked with user privileges.
> 
> Root should be able to set this flag on any mountpoint, _regardless_
> of permissions.
> 

Right, if you're using fsuid != 0, you're not running as root (fsuid is
the equivalent to euid for the filesystem.)
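
For reference, a process can look at its own fsuid from user space.
The sketch below relies on the documented behaviour that setfsuid(2)
returns the previous fsuid, and on the common trick of passing
(uid_t)-1, which is not a valid ID and so leaves the fsuid unchanged.
The helper names are invented for this example.

```c
#include <assert.h>
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the caller's filesystem UID without changing it (Linux only). */
uid_t current_fsuid(void)
{
	/* The call fails to change anything but still returns the old value. */
	return (uid_t)setfsuid((uid_t)-1);
}

/* In a process that never called setfsuid(), fsuid follows euid. */
void check_fsuid_follows_euid(void)
{
	assert(current_fsuid() == geteuid());
}
```

A process that has done setfsuid(nonzero) would see that nonzero value
here even while its euid is still 0, which is the situation described
in the quoted text.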

I fail to see how ruid should have *any* impact on mount(2).  That seems
to be a design flaw.

-hpa

[Devel] Re: [patch 2/8] allow unprivileged umount

2007-04-21 Thread H. Peter Anvin


Andrew Morton wrote:
> On Fri, 20 Apr 2007 12:25:34 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> 
>> +static bool permit_umount(struct vfsmount *mnt, int flags)
>> +{
>> 
>> ...
>> 
>> +   return mnt->mnt_uid == current->uid;
>> +}
> 
> Yes, this seems very wrong.  I'd have thought that comparing user_struct*'s
> would get us a heck of a lot closer to being able to support aliasing of
> UIDs between different namespaces.
> 

Not to mention it should be fsuid, not uid.
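
A user-space model of that change, purely as a sketch: struct model_task,
struct model_mount and model_permit_umount() are hypothetical simplifications
invented for this example, not the kernel's task_struct or vfsmount, and the
elided part of the original function is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_task {
	uid_t uid;	/* real UID */
	uid_t fsuid;	/* UID used for filesystem permission checks */
};

struct model_mount {
	uid_t mnt_uid;	/* owner recorded on the mount */
};

/* Compare the mount owner against fsuid rather than uid. */
static bool model_permit_umount(const struct model_mount *mnt,
				const struct model_task *task)
{
	assert(mnt != NULL && task != NULL);
	return mnt->mnt_uid == task->fsuid;
}
```

The two checks only differ for a task whose uid and fsuid are not the same,
for example a server that has called setfsuid() to act on behalf of a user.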

-hpa

[Devel] Re: [patch 0/8] unprivileged mount syscall

2007-04-09 Thread H. Peter Anvin

Ram Pai wrote:
> 
> It is in FC6. I don't know the status of upstream util-linux. I did
> submit the patch many times to Adrian Bunk (the then util-linux
> maintainer) and got no response. I have not pushed the patches to the
> new maintainer(Karel Zak?) though.
> 

Well, do that, then :)

Seriously.  The whole point of util-linux-ng is to make forward progress.

-hpa

[Devel] Re: [patch 0/8] unprivileged mount syscall

2007-04-06 Thread H. Peter Anvin

Jan Engelhardt wrote:
> On Apr 6 2007 16:16, H. Peter Anvin wrote:
>>>> - users can use bind mounts without having to pre-configure them in
>>>> /etc/fstab
>>>>
>> This is by far the biggest concern I see.  I think the security implications
>> of allowing anyone to do bind mounts are poorly understood.
> 
> $ whoami
> miklos
> $ mount --bind / ~/down_under
> 
> later that day:
> # userdel -r miklos
> 
> So both the source (/) and target (~/down_under) directory must be owned 
> by the user before --bind may succeed.
> 
> There may be other implications hpa might want to fill us in.

Consider backups, for example.

-hpa

[Devel] Re: [patch 0/8] unprivileged mount syscall

2007-04-06 Thread H. Peter Anvin

>>
>> - users can use bind mounts without having to pre-configure them in
>>   /etc/fstab
>>

This is by far the biggest concern I see.  I think the security
implications of allowing anyone to do bind mounts are poorly understood.

-hpa

[Devel] Re: [PATCH 3/3] cpuid: switch to cpuid_on_cpu()

2007-04-02 Thread H. Peter Anvin


Alexey Dobriyan wrote:
> Now that cpuid_on_cpu() is in core, cpuid driver can be shrunk.
>
> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>


Hi Alexey,

This, and your other changes in this area, conflict with the work
that I've been doing on extending the usability of the CPUID and MSR
drivers (which is part of why this work has dragged out seemingly forever.)

I would really appreciate it if we could work together on this; there
need to be new paravirtualization entry points for this.  Consequently,
I just updated and uploaded a git tree with the current status.  It
still needs porting to x86-64, however.


The current cpuid/msr work is at:

http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-cpuidmsr.git;a=summary

-hpa


[Devel] Re: [PATCH 1/3] Introduce cpuid_on_cpu() and cpuid_eax_on_cpu()

2007-04-02 Thread H. Peter Anvin


Andi Kleen wrote:
> On Monday 02 April 2007 13:38, Alexey Dobriyan wrote:
>
>> They will be used by cpuid driver and powernow-k8 cpufreq driver.
>>
>> With these changes powernow-k8 driver could run correctly on OpenVZ kernels
>> with virtual cpus enabled (SCHED_VCPU).
>
> This means openvz has multiple virtual CPU levels? One for cpuid/rdmsr and one
> for the rest of the kernel? Both powernow-k8 and cpuid attempt to schedule
> to the target CPU so they should already run there. But it is some other CPU,
> but when they ask your _on_cpu() functions they suddenly get a "real" CPU?
> Where is the difference between these levels of virtualness?



The CPUID and MSR drivers do not schedule to the target CPU; instead, on 
hardware, they rely on IPI'ing the target processor if it is not the one 
that's currently running.
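
From user space the caller never has to migrate either.  A minimal sketch,
assuming the cpuid character device is loaded and readable at
/dev/cpu/N/cpuid: the file offset selects the CPUID level and each read
returns 16 bytes (eax, ebx, ecx, edx).  read_cpuid_leaf() is a helper name
made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read one CPUID level as seen by logical CPU `cpu` through the cpuid
 * character device.  regs receives eax, ebx, ecx, edx in that order.
 * Returns 0 on success, -1 if the device is missing or the read fails.
 * Levels at or above 0x80000000 need a 64-bit off_t.
 */
static int read_cpuid_leaf(unsigned int cpu, uint32_t leaf, uint32_t regs[4])
{
	char path[64];
	ssize_t n;
	int fd;

	assert(regs != NULL);

	snprintf(path, sizeof(path), "/dev/cpu/%u/cpuid", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The driver, not the caller, gets the query onto the target CPU. */
	n = pread(fd, regs, 4 * sizeof(uint32_t), (off_t)leaf);
	close(fd);

	return n == (ssize_t)(4 * sizeof(uint32_t)) ? 0 : -1;
}
```

Calling read_cpuid_leaf(1, 0, regs) from a task running on CPU 0 leaves the
task where it is; how the query reaches CPU 1 is the driver's business.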


There was a lot of discussion back then about which was the better
solution.  Alan Cox, in particular, really preferred the interrupt
solution as being less likely to cause implicit deadlock.


I do want to add that it's been on my list for some time -- in fact, I 
keep implementing it half-way and then having other things to do -- to 
add MSR and CPUID ioctls() that allow the full register file to be set 
and read back, in order to support architecturally broken MSR and CPUID 
levels.


-hpa


[Devel] Re: [PATCH 3/3] lutimesat: actual syscall and wire-up on i386

2007-01-28 Thread H. Peter Anvin


Alexey Dobriyan wrote:
> +asmlinkage long sys_lutimesat(int dfd, char __user *filename, struct timeval __user *utimes)


Could we get these to take struct timespec instead of struct timeval?

Right now we have a real problem in that the interfaces that *set* times 
take struct timeval (microsecond granularity) but the interfaces that 
*get* times return struct timespec (nanosecond granularity), which means 
information loss on any setting operations.
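
A short sketch of that loss, using only the standard struct definitions.
The helper names timespec_to_timeval(), timeval_to_timespec() and
roundtrip_loss_ns() are made up for this example.

```c
#include <assert.h>
#include <sys/time.h>	/* struct timeval */
#include <time.h>	/* struct timespec */

/* What a "set" interface taking struct timeval can represent. */
static struct timeval timespec_to_timeval(struct timespec ts)
{
	struct timeval tv;

	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
	tv.tv_sec = ts.tv_sec;
	tv.tv_usec = ts.tv_nsec / 1000;	/* sub-microsecond part is dropped */
	return tv;
}

/* What a "get" interface returning struct timespec then reports. */
static struct timespec timeval_to_timespec(struct timeval tv)
{
	struct timespec ts;

	ts.tv_sec = tv.tv_sec;
	ts.tv_nsec = tv.tv_usec * 1000L;
	return ts;
}

/* Nanoseconds lost by passing a timestamp through struct timeval. */
static long roundtrip_loss_ns(struct timespec ts)
{
	struct timespec back = timeval_to_timespec(timespec_to_timeval(ts));

	return ts.tv_nsec - back.tv_nsec;
}
```

A timestamp read back as 1.123456789 s and set again through a timeval
interface becomes 1.123456000 s, so copying times between files is lossy.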


-hpa
