Re: Kernel SCM saga..
Linus Torvalds writes:

> NOTE! I detest the centralized SCM model, but if push comes to shove,
> and we just _can't_ get a reasonable parallel merge thing going in
> the short timeframe (ie month or two), I'll use something like SVN
> on a trusted site with just a few committers, and at least try to
> distribute the merging out over a few people rather than making _me_
> be the throttle.
>
> The reason I don't really want to do that is once we start doing
> it that way, I suspect we'll have a _really_ hard time stopping.
> I think it's a broken model. So I'd much rather try to have some
> pain in the short run and get a better model running, but I just
> wanted to let people know that I'm pragmatic enough that I realize
> that we may not have much choice.

I think you at least instinctively know this, but...

Centralized SCM means you have to grant and revoke commit access,
which means that Linux gets the disease of ugly BSD politics.

Under both the old pre-BitKeeper patch system and under BitKeeper,
developer rank is fuzzy. Everyone knows that some developers are
more central than others, but it isn't fully public and well-defined.
You can change things day by day without having to demote anyone.
While Linux development isn't completely without jealousy and pride,
few have stormed off (mostly IDE developers AFAIK) and none have
forked things as severely as OpenBSD and DragonflyBSD.

You may rank developer X higher than developer Y, but they have
only a guess as to how things are. Perhaps developer X would be
a prideful jerk if he knew. Perhaps developer Y would quit in
resentment if he knew.

Whatever you do, please avoid the BSD-style politics.

(the MAINTAINERS file is bad enough; it has caused problems)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH] Simple privacy enhancement for /proc/
On Sun, 2005-04-10 at 17:38 +0200, Rene Scharfe wrote:

> Albert, allowing access based on tty sounds nice, but it _is_ expansive.
> More importantly, perhaps, it would "virtualize" /proc: every user would
> see different permissions for certain files in there. That's too complex
> for my taste.

If you really can't allow access based on tty, then at least allow
access if any UID value matches any UID value. Without this, a user
can not always see a setuid program they are running.

> First, configuring via kernel parameters is sufficient. It simplifies
> implementation a lot because we know the settings cannot change. And we
> don't need the added flexibility of sysctls anyway -- I assume these
> parameters are set at installation time and never touched again.

This means mucking with boot parameters, which can be a pain. The
various boot loaders do not all use the same config file.

> Then I suppose we don't need to be able to fine-tune the permissions for
> each file in /proc//. All that we need is a distinction between
> "normal" users (which are to be restricted) and admins (which need to
> see everything).

The /proc/*/maps file sure is different from the /proc/*/status file.
The same for all the others, really.

> This patch introduces two kernel parameters: proc.privacy and proc.gid.
> The group ID attribute of all files below /proc/ is set to
> proc.gid, but only if you activate the feature by setting proc.privacy
> to a non-zero value.

This is very bad. Please do not change the GID as seen by the stat()
call. This value is used.
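One way to read the "any UID value matches any UID value" rule above is sketched below in plain user-space C. This is only an illustration: the names `uid_set` and `any_uid_matches` are hypothetical, and treating the real, effective, saved and filesystem UIDs as the four values to compare is an assumption, not something taken from the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical container for the UID values a task carries. */
struct uid_set {
    uid_t ruid;   /* real UID */
    uid_t euid;   /* effective UID */
    uid_t suid;   /* saved UID */
    uid_t fsuid;  /* filesystem UID */
};

/* Return true if any UID of the viewer equals any UID of the target. */
bool any_uid_matches(const struct uid_set *viewer, const struct uid_set *target)
{
    assert(viewer != NULL && target != NULL);

    const uid_t v[4] = { viewer->ruid, viewer->euid, viewer->suid, viewer->fsuid };
    const uid_t t[4] = { target->ruid, target->euid, target->suid, target->fsuid };

    for (size_t i = 0; i < 4; i++)
        for (size_t j = 0; j < 4; j++)
            if (v[i] == t[j])
                return true;
    return false;
}
```

Under this reading, a user with UID 1000 who starts a setuid-root program still matches it through the real UID, which is the case the message is worried about.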
Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree
On Nov 28, 2007 6:31 AM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> Ingo Molnar <[EMAIL PROTECTED]> writes:
> > * Albert Cahalan <[EMAIL PROTECTED]> wrote:
> >> On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> In a lot of ways if you access /proc/self and you get back information
> that does not correspond to yourself the result is nonsense. Which
> is a fairly mighty problem.

In general, this is not a problem we have. /proc/self points
to the process, not the task group leader. They are different.

Look at /proc/*/stat, where the per-process info is summary data.
The per-thread stat file is not summary data. This is intended
to be true for all files in /proc; there may be some with bugs.

Some of the data can not be summed up and will not always be
shared. This is legacy crud. Don't use it, and don't try to
"fix" it. It's there so that old programs can continue to work
as long as weird threading isn't in use.

Note that it was intended that non-legacy additions
would normally be added to either the process directory
or the thread directory, not both. I think somebody may
have ripped out the ability to do this; at the very least
there have been numerous illogical additions.

> I'm still trying to understand which will break user space more,
> adding /proc/task or changing /proc/self.

Changing /proc/self makes you get per-thread data
when you asked for per-process data. That's bad.

> >> This one is probably best:
> >> /proc/task -> 123/task/456
> >> (with both numbers showing)
> >
> > this sounds good to me. If it's a symlink then there's not much other
> > choice because the thread PIDs do not even show up under /proc anymore.
>
> The name sounds good to me.
>
> I am not certain the two components make sense as we have a possible
> permission problem where it is remotely possible that a task will
> have permission to access /proc/ but not /proc/.

If it hurts, don't do that. We allow foot shooting.

> The reason I care is that we need to fix /proc/mounts. So once we
> have /proc/task we can also change /proc/mounts to
> be a symlink to /proc/task/mounts.
>
> Once we get the /proc/mounts thing sorted out. There are several
> other entries in /proc that need to follow in its wake
> as they also become per namespace. /proc/net and /proc/sysvipc for
> starters.

As I predicted, the container bloat would be a never-ending
source of bugs. You're discovering bugs where there were none.
You'll never run out of this sort of problem. Keeping Linux
lean and simple would be far better.
Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree
On Nov 28, 2007 5:46 AM, Ingo Molnar <[EMAIL PROTECTED]> wrote: > * Albert Cahalan <[EMAIL PROTECTED]> wrote: > > On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote: > > > [EMAIL PROTECTED] wrote: > > > > > > > We may be stuck with the current broken behavior for backwards > > > > compatibility reasons but lets try fixing our ancient bug for the 2.6.25 > > > > time frame and see if anyone screams. > > > > It's not broken. It's just not the feature you're looking for. > > well it's quite broken at the moment and we are looking for solutions > not for a blame game :-) You might have read the thread where i describe > what i had to go through to do something fairly trivial. In some ways that is NOT trivial, given that a high-level language is free to use N:M threading. If we assume that isn't allowed though, blaming the library for not using native Linux thread IDs is entirely reasonable. Linus picked sane ID numbering, not Solaris-style. Normal app developers are unable to take advantage of Linus' wise decision. > > Changing /proc/self is somewhat risky, and probably > > undesirable anyway. That file has always been used > > to represent the process; at one time this also meant > > the task. Documentation everywhere says "process". > > in Linux we never truly had a notion of "process" when your change was > done - "process" always meant the task itself. That's why all the > task_struct parameters/variables used to be named 'p', not 't'. So when > NTPL came around this became a poorly defined notion. We were sort of settling on "struct signal" as the process. > > This one is probably best: > > /proc/task -> 123/task/456 > > (with both numbers showing) > > this sounds good to me. If it's a symlink then there's not much other > choice because the thread PIDs do not even show up under /proc anymore. > > > The problem with /proc/self/task/self is that it > > makes /proc/789/task/self be ill-defined when > > the observer is not tgid 789. 
If the directory can > > only show up in the observer's own task directory, > > then this solution is good. > > agreed. > > > I really don't want to see anything that would encourage > > more use of the gdb backdoor. For those that don't > > remember, gdb broke when access to threads via the > > top-level /proc directory was temporarily removed. > > We need that back door, unfortunately, but having it > > show up in symlink targets is quite nasty. > > > > As for the history: > > > > I left it out. At the time it would have been fairly useless. > > Back then, glibc didn't make things painful by pulling > > phony thread IDs out of its ass. Shell scripts sure didn't > > deal in threads. Monitoring tools like "ps" didn't need it. > > If nothing needs it, well, why have it? > > sound, future-proof API design, with a little bit of foresight? Yes, in a way. Adding stuff is usually easier than removing stuff. I couldn't decide between /proc/self/task/self and /proc/task, so I left the decision for later. I wasn't sure that I'd thought of all the issues. > I am > faced with incidents on an almost daily basis that show how much we > kernel folks suck at defining new APIs. The only luck is that the set of > system calls is fairly complete already - but in the rare case where we > touch an API it's a catastrophy most of the time. With such an API track > record we'd probably never survive as a user-space project. Most of user-space is worse. What shocks me is that people keep designing ABIs with structs that contain holes. (data leaks, waste, portability trouble, etc.) This happens in kernel ABIs all the time. It ought to be blocked by some sort of build tool. (with a whitelist for old stuff) Another shocker is /proc/*/smaps, which should make you cry. At the time I was working too much overtime to post about it, and I figured that nobody would allow that into the kernel anyway... Speaking of which, that's one that has no need to be in the task directories. 
I put a maps file there to make porting old code easier, but neither one really belongs. It's per-mm, which was in a 1:n relationship with struct signal last I checked.
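On the complaint above about ABI structs that contain holes, here is a minimal sketch of what a hole is and of the sort of build-time check being asked for. The struct names are hypothetical and match no real kernel interface, and how much padding a compiler inserts depends on the target ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ABI struct with a hole: on common ABIs the compiler
 * pads after 'flags' so that 'count' is suitably aligned. Those
 * padding bytes are unnamed, so nothing forces them to be zeroed. */
struct holey {
    uint8_t  flags;
    uint32_t count;
};

/* Same payload with the padding spelled out as a named field. */
struct explicit_pad {
    uint8_t  flags;
    uint8_t  reserved[3];
    uint32_t count;
};

/* Bytes of struct holey not covered by a named field. */
size_t holey_padding(void)
{
    return sizeof(struct holey) - (sizeof(uint8_t) + sizeof(uint32_t));
}

/* Bytes of struct explicit_pad not covered by a named field. */
size_t explicit_padding(void)
{
    return sizeof(struct explicit_pad)
           - (sizeof(uint8_t) + 3 * sizeof(uint8_t) + sizeof(uint32_t));
}

/* Build-time guard: fails to compile if a hole ever appears. */
_Static_assert(sizeof(struct explicit_pad) ==
               sizeof(uint8_t) + 3 * sizeof(uint8_t) + sizeof(uint32_t),
               "explicit_pad must not contain compiler-inserted holes");
```

A build tool with a whitelist, as suggested in the message, would amount to running a check like the `_Static_assert` over every struct exported to user space. That tool is not shown here.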
Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree
On Nov 29, 2007 4:40 PM, Eric W. Biederman <[EMAIL PROTECTED]> wrote: > "Albert Cahalan" <[EMAIL PROTECTED]> writes: > > > On Nov 28, 2007 6:31 AM, Eric W. Biederman <[EMAIL PROTECTED]> wrote: > >> Ingo Molnar <[EMAIL PROTECTED]> writes: > >> > * Albert Cahalan <[EMAIL PROTECTED]> wrote: > >> >> On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote: > Linux tasks when used in one particular way can fulfill the posix > requirements for single threaded processes. > > Linux task groups when used in one particular way can fulfill the > posix requirements for processes. Right. Once you leave this, weirdness happens. POSIX defines things in terms of processes and threads. POSIX defines many of our interfaces. That includes kernel behavior, the C library, and numerous programs. > As for where /proc/self points given that procps seems to read > files like /proc/self/stat. It looks to me like we have a clear > case of a user space application that cares about the current > behavior and would break if we changed things. I wasn't saying procps would break, though it would if /proc/self/task went away. I'm more concerned about multi-threaded things that look in their own /proc/self directory. The procps programs are single-threaded. In procps, the self link is used: a. to see if the wchan file exists b. to see if the task directory exists c. to find the tty number (that last one: there might not be a file descriptor for the tty, and anyway I need it with the bits in all the same places as what I get for the other processes) I'll bet that something reads /proc/self/stat to see CPU usage. > > Note that it was intended that non-legacy additions > > would normally be added to either the process directory > > or the thread directory, not both. I think somebody may > > have ripped out the ability to do this; at the very least > > there have been numerous illogical additions. 
> > The rationale was not conveyed and the policy you describe > seems like deprecating the /proc/ directory in favor > of the /proc//task//. Which was a pattern > never established and it doesn't seem to make anything better > so I don't see the point there. For the stuff that is logically per-task, yes. For the rest, no. Oh well... It does make things better because redundant info is a source of confusion. > >> I'm still trying to understand which will break user space more, > >> adding /proc/task or changing /proc/self. > > > > Changing /proc/self makes you get per-thread data > > when you asked for per-process data. That's bad. > > /proc/self used to ask for per task data. Which is why there > is some confusion. Heh. Well, /proc/self used to ask for per process data. It was all the same. I think it matters that /proc/self was always documented as being per-process. > >> >> This one is probably best: > >> >> /proc/task -> 123/task/456 > >> >> (with both numbers showing) > >> > > >> > this sounds good to me. If it's a symlink then there's not much other > >> > choice because the thread PIDs do not even show up under /proc anymore. > >> > >> The name sounds good to me. > > I will see about writing the patch for this in a bit and sending > it to Andrew. Nice. > Nope. /proc/mounts was a symlink to /proc/self/mounts long before > /proc/self was modified to stop pointing at the task directory and > changed it point at the new task group directory. Having the filesystem namespace be per-process is wild enough. We really don't need it to be per-thread. (and yes, I'm using the POSIX terms on purpose) > Frankly from what I have seen of the code the task-group work > seems to be a larger source of bugs, and complications, because > people have a darn hard time wrapping their head around how it > is supposed to behave, and all of the corner cases were not > resolved at the time it was developed. 
People look at me like I have two heads when I explain to them that
the Linux kernel source uses "pid" to mean a thread. The bad
terminology probably promotes bad thinking. It would be lovely if
that could somehow get fixed.

> My favorite ongoing issue is what is needed to allow a threaded
> init to actually function properly. I think enough fixes have
> gone in that it might even work.

My "favorite" is the multi-threaded debugger. By this I mean the
debugger itself wants to be multi-threaded, issuing ptrace commands
from multiple threads.
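Item (c) in the list of procps uses of the self link earlier in this message, finding the tty number, comes down to pulling one field out of the stat file. A minimal sketch of that parse follows. The function name `parse_tty_nr` is hypothetical and this is not the procps source; the field positions follow the proc(5) description of /proc/<pid>/stat, where tty_nr is field 7.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract tty_nr (field 7) from the text of a /proc/<pid>/stat file.
 * The command name (field 2) is wrapped in parentheses and may itself
 * contain spaces or ')' characters, so parsing starts after the LAST ')'.
 * Returns 0 on success, -1 on malformed input. */
int parse_tty_nr(const char *stat_text, int *tty_nr)
{
    assert(stat_text != NULL && tty_nr != NULL);

    const char *p = strrchr(stat_text, ')');
    if (p == NULL)
        return -1;

    char state;
    int ppid, pgrp, session, tty;

    /* Fields 3..7: state, ppid, pgrp, session, tty_nr. */
    if (sscanf(p + 1, " %c %d %d %d %d",
               &state, &ppid, &pgrp, &session, &tty) != 5)
        return -1;

    *tty_nr = tty;
    return 0;
}
```

Because the same parse works on any process's stat text, the value for the current process comes back with its bits laid out exactly as for every other process, which is the property the message asks for.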
Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree
On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote: > [EMAIL PROTECTED] wrote: > > > We may be stuck with the current broken behavior for backwards > > compatibility reasons but lets try fixing our ancient bug for the 2.6.25 > > time frame and see if anyone screams. It's not broken. It's just not the feature you're looking for. > I'm not screaming because of this change, but I screamed when I > discovered I could not have a replacement for gettid() in Java, or any > other high level environment. Java is so high-level that it seems inappropriate to touch /proc. It is allowed for Java to do N:M threading you know. > So, instead of making /proc/self an unstable interface that changed in > 2.6.0 and 2.6.25, I'll vote for /proc/self/task/self. A new interface > that can trivially be detected for existence, and programs relying on > this interface will loudly break on older kernels, unlike with the > proposed interface change. > > Ccing Albert Cahalan as he made the change to /proc/self in the first > place: Changing /proc/self is somewhat risky, and probably undesirable anyway. That file has always been used to represent the process; at one time this also meant the task. Documentation everywhere says "process". This one is probably best: /proc/task -> 123/task/456 (with both numbers showing) The problem with /proc/self/task/self is that it makes /proc/789/task/self be ill-defined when the observer is not tgid 789. If the directory can only show up in the observer's own task directory, then this solution is good. I really don't want to see anything that would encourage more use of the gdb backdoor. For those that don't remember, gdb broke when access to threads via the top-level /proc directory was temporarily removed. We need that back door, unfortunately, but having it show up in symlink targets is quite nasty. As for the history: I left it out. At the time it would have been fairly useless. 
Back then, glibc didn't make things painful by pulling phony thread IDs out of its ass. Shell scripts sure didn't deal in threads. Monitoring tools like "ps" didn't need it. If nothing needs it, well, why have it?

Regarding some of the discussion on LKML, I don't see how unshare matters. If you unshare to the point where you get a new TGID, then /proc/self must reflect that.
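For the "/proc/task -> 123/task/456 (with both numbers showing)" form proposed above, a user-space consumer would read the link target and split it into the two IDs. The sketch below covers only that split. The function name `parse_task_link` is hypothetical, and it does not assume such a link exists on any given kernel, since the message only proposes it.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a link target of the proposed form "TGID/task/TID",
 * e.g. "123/task/456". Returns 0 on success, or -1 if the text does
 * not parse as two positive numbers separated by "/task/" or has
 * trailing characters. */
int parse_task_link(const char *target, long *tgid, long *tid)
{
    assert(target != NULL && tgid != NULL && tid != NULL);

    long a, b;
    int used = 0;

    /* %n records how many characters were consumed, so that trailing
     * junk after the second number can be rejected. */
    if (sscanf(target, "%ld/task/%ld%n", &a, &b, &used) != 2)
        return -1;
    if (target[used] != '\0')
        return -1;
    if (a <= 0 || b <= 0)
        return -1;

    *tgid = a;
    *tid = b;
    return 0;
}
```

Having both numbers in the target is what lets a caller recover the process ID and the thread ID from one readlink() result.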
Re: [PATCH][RFC] Make /proc/ chmod'able
> OK, folks, another try to enhance privacy by hiding
> process details from other users. Why not simply use
> chmod to set the permissions of /proc/ directories?
> This patch implements it.
>
> Children processes inherit their parents' proc
> permissions on fork. You can only set (and remove)
> read and execute permissions, the bits for write,
> suid etc. are not changeable. A user would add
>
> chmod 500 /proc/$$
>
> or something similar to his .profile to cloak his processes.
>
> What do you think about that one?

This is a bad idea. Users should not be allowed to
make this decision. This is rightly a decision for
the admin to make.

Note: I'm the procps (ps, top, w, etc.) maintainer.

Probably I'd have to make /bin/ps run setuid root
to deal with this. (minor changes needed) The same
goes for /usr/bin/top, which I know is currently
unsafe and difficult to fix.

Let's not go there, OK?

If you restricted this new ability to root, then I'd
have much less of an objection. (not that I'd like it)
Re: [PATCH][RFC] Make /proc/ chmod'able
On Mon, 2005-03-14 at 10:42 +0100, Rene Scharfe wrote: > Albert Cahalan wrote: > > This is a bad idea. Users should not be allowed to > > make this decision. This is rightly a decision for > > the admin to make. > > Why do you think users should not be allowed to chmod their processes' > /proc directories? Isn't it similar to being able to chmod their home > directories? They own both objects, after all (both conceptually and as > attributed in the filesystem). This is, to use your own word, "cloaking". This would let a bad user or even an unauthorized user hide from the admin. Why should someone be able to hide a suspicious CPU hog? Maybe they are cracking passwords or selling your CPU time. Note that the admin hopefully does not normally run as root. The admin should be using a normal user account most of the time, to reduce the damage caused by his accidents. Even if the admin were not running as a normal user, it is expected that normal users can keep tabs on each other. The admin may be sleeping. Social pressure is important to prevent one user from sucking up all the memory and CPU time. > > Note: I'm the procps (ps, top, w, etc.) maintainer. > > > > Probably I'd have to make /bin/ps run setuid root > > to deal with this. (minor changes needed) The same > > goes for /usr/bin/top, which I know is currently > > unsafe and difficult to fix. > > > > Let's not go there, OK? > > I have to admit to not having done any real testing with those > utilities. My excuse is this isn't such a new feature, Openwall had > something similar for at least four years now and GrSecurity contains > yet another flavour of it. Openwall also provides one patch for > procps-2.0.6, so I figured that problem (whatever their patch is about) > got fixed in later versions. If I haven't seen that patch, to Hell with 'em. It appears that Openwall is using procps-2.0.7 now. Oooh, they've upgraded to something that's only 4.5 years old! 
Anybody using a 4-year-old procps is uninterested in security.

> Why do ps and top need to be setuid root to deal with a restricted /proc?
> What information in /proc/ needs to be available to any and all
> users?

Anything provided by traditional UNIX and BSD systems
should be available. Users who want privacy can get their
own computer. So, these need to work:

ps -ef
ps -el
ps -ej
ps axu
ps axl
ps axj
ps axv
w
top

Note that /proc does provide a bit more info than required.
This could be changed; it requires new /proc files or a
non-proc source of data.

> > If you restricted this new ability to root, then I'd
> > have much less of an objection. (not that I'd like it)
>
> How about a boot parameter or sysctl to enable the chmod'ability of
> /proc/, defaulting to off? But I'd like to resolve your more
> general objections above first, if possible. :)

This at least avoids breaking the traditional ability
of non-root users to spot abuse.
Re: [RFC][PATCH] new timeofday core subsystem (v. A3)
On Mon, 2005-03-14 at 12:27 -0800, Matt Mackall wrote:
> On Mon, Mar 14, 2005 at 12:04:07PM -0800, john stultz wrote:
> > > > > > > > +static inline cycle_t read_timesource(struct timesource_t* ts)
> > > > > > > > +{
> > > > > > > > +	switch (ts->type) {
> > > > > > > > +	case TIMESOURCE_MMIO_32:
> > > > > > > > +		return (cycle_t)readl(ts->mmio_ptr);
> > > > > > > > +	case TIMESOURCE_MMIO_64:
> > > > > > > > +		return (cycle_t)readq(ts->mmio_ptr);
> > > > > > > > +	case TIMESOURCE_CYCLES:
> > > > > > > > +		return (cycle_t)get_cycles();
> > > > > > > > +	default:/* case: TIMESOURCE_FUNCTION */
> > > > > > > > +		return ts->read_fnct();
> > > > > > > > +	}
> > > > > > > > +}
> > > Well where we'd read an MMIO address, we'd simply set read_fnct to
> > > generic_timesource_mmio32 or so. And that function just does the read.
> > > So both that function and read_timesource become one-liners and we
> > > drop the conditional branches in the switch.
> >
> > However the vsyscall/fsyscall bits cannot call in-kernel functions (as
> > they execute in userspace or a pseudo-userspace). As it stands now in my
> > design TIMESOURCE_FUNCTION timesources will not be usable for
> > vsyscall/fsyscall implementations, so I'm not sure if that's doable.
> >
> > I'd be interested you've got a way around that.
>
> We can either stick all the generic mmio timer functions in the
> vsyscall page (they're tiny) or leave the vsyscall using type/ptr but
> have the kernel internally use only the function pointer. Someone
> who's more familiar with the vsyscall timer code should chime in here.

When the vsyscall page is created, copy the one needed function
into it. The kernel is already self-modifying in many places; this
is nothing new.
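The function-pointer variant suggested in the quoted discussion can be sketched in user space as below. This is an illustration only, not code from the patch. The struct layout is a stand-in, passing the timesource to `read_fnct` is an assumption (the quoted code calls it with no argument), and a plain volatile load stands in for readl()/readq(). The name `generic_timesource_mmio32` comes from the quoted suggestion; `generic_timesource_mmio64` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cycle_t;

/* Stand-in for the patch's struct timesource_t: every source is read
 * through a single function pointer instead of a type switch. */
struct timesource {
    cycle_t (*read_fnct)(const struct timesource *ts);
    const volatile void *mmio_ptr;   /* used only by the MMIO helpers */
};

/* Generic helper for a 32-bit memory-mapped counter. */
cycle_t generic_timesource_mmio32(const struct timesource *ts)
{
    return (cycle_t)*(const volatile uint32_t *)ts->mmio_ptr;
}

/* Generic helper for a 64-bit memory-mapped counter. */
cycle_t generic_timesource_mmio64(const struct timesource *ts)
{
    return (cycle_t)*(const volatile uint64_t *)ts->mmio_ptr;
}

/* With dispatch moved into the pointer, the reader is a one-liner. */
cycle_t read_timesource(const struct timesource *ts)
{
    assert(ts != NULL && ts->read_fnct != NULL);
    return ts->read_fnct(ts);
}
```

The sketch leaves the vsyscall question open. An indirect call like this is exactly what vsyscall code cannot make into the kernel, which is why the thread discusses copying the needed function into the vsyscall page.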
Re: [PATCH][RFC] Make /proc/ chmod'able
On Tue, 2005-03-15 at 00:08 +0100, Bodo Eggert wrote: > On Mon, 14 Mar 2005, Albert Cahalan wrote: > > On Mon, 2005-03-14 at 10:42 +0100, Rene Scharfe wrote: > > > Albert Cahalan wrote: > > > > Why do you think users should not be allowed to chmod their processes' > > > /proc directories? Isn't it similar to being able to chmod their home > > > directories? They own both objects, after all (both conceptually and as > > > attributed in the filesystem). > > > > This is, to use your own word, "cloaking". This would let > > a bad user or even an unauthorized user hide from the admin. > > NACK, the admin (and with the new inherited capabilities all users with > cap_???_override) can see all processes. Only users who don't need to know > won't see the other user's processes. Capabilities are too broken for most people to use. Normal users do not get CAP_DAC_OVERRIDE by default anyway, for good reason. > > Note that the admin hopefully does not normally run as root. > > su1 and sudo exist. This is a pain. Now every user will need sudo access, and the sudoers file will have to disable requesting passwords so that scripts will work without hassle. > > Even if the admin were not running as a normal user, it is > > expected that normal users can keep tabs on each other. > > The admin may be sleeping. Social pressure is important to > > prevent one user from sucking up all the memory and CPU time. > > Privacy is important, too. Imagine each user can see the CEO (or the > admin) executing "ee nakedgirl.jpg". Obviously, he likes to have users see him do this. He'd use a private machine if he wanted privacy. > > > > Note: I'm the procps (ps, top, w, etc.) maintainer. > > > > > > > > Probably I'd have to make /bin/ps run setuid root > > > > to deal with this. (minor changes needed) The same > > > > goes for /usr/bin/top, which I know is currently > > > > unsafe and difficult to fix. > > I used unpatched procps 3.1.11, and it worked for me, except pstree. 
It does not work correctly. Look, patches with this "feature" are
called rootkits. Think of the headlines: "Linux now with built-in
rootkit".

> > > Why do ps and top need to be setuid root to deal with a restricted /proc?
> > > What information in /proc/ needs to be available to any and all
> > > users?
> >
> > Anything provided by traditional UNIX and BSD systems
> > should be available.
>
> e.g. the buffer overflow in sendmail? Or all the open relays? :)
>
> The demands to security and privacy have increased. Linux should be able
> to provide the requested privacy.

This really isn't about security. Privacy may be undesirable.
With privacy comes anti-social behavior. Supposing that the
users do get privacy, perhaps because they have paid for it:

Xen, UML, VM, VMware, separate computers

Going with separate computers is best. Don't forget to use
network traffic control to keep users from being able to
detect the network activity of other users.

> > Users who want privacy can get their
> > own computer. So, these need to work:
> >
> > ps -ef
> > ps -el
> > ps -ej
> > ps axu
> > ps axl
> > ps axj
> > ps axv
> > w
> > top
>
> Works as intended. Only pstree breaks, if init isn't visible.

They work like they do with a rootkit installed.
Traditional behavior has been broken.
Re: [RFC][PATCH] new timeofday core subsystem (v. A3)
On Mon, 2005-03-14 at 19:22 -0800, Christoph Lameter wrote:
> On Mon, 14 Mar 2005, Albert Cahalan wrote:
>
> > When the vsyscall page is created, copy the one needed function
> > into it. The kernel is already self-modifying in many places; this
> > is nothing new.
>
> AFAIK this will only work on ia32 and x86_64 and definitely not
> on ia64. Who knows about the other platforms

I'll bet it does work fine on IA-64. If it didn't, you would be
unable to load the kernel or load an executable.

I know it works for PowerPC. You'll need an isync instruction
of course. You may also want a sync instruction and some code
to invalidate the cache. Setting up the page content should be
a 1-time operation done at boot. Check your processor manuals
as needed.
Re: [PATCH][RFC] Make /proc/ chmod'able
On Tue, 2005-03-15 at 15:31 +0100, Bodo Eggert wrote: > (snipped the CC list - hope that's ok) > > On Mon, 14 Mar 2005, Albert Cahalan wrote: > > On Tue, 2005-03-15 at 00:08 +0100, Bodo Eggert wrote: > > > On Mon, 14 Mar 2005, Albert Cahalan wrote: > > This really isn't about security. > > Information leakage is a security aspect. If you will go to such extremes, Linux is poorly suited. A user can detect activity on the computer by examining the performance of their own activity. > > Privacy may be undesirable. > > May. That's why I suggested the min/max sysctl. > > > With privacy comes anti-social behavior. > > With anti-social behavior comes the admin and his LART. > > BTW: If the users want to be anti-social, they'll just rename setiathome > to something like -bash or soffice. This does not matter: "Rene, your soffice program is eating too much CPU time. Find some other place to run it." > > Supposing that the > > users do get privacy, perhaps because the have paid for it: > > Vservers, > > Xen, UML, VM, VMware, separate computers > > > > Going with separate computers is best. > > If you like wasting space and energy. If the user's demands don't exceed > one percent of a historic PC, there is no point in buying more hardware. Sure there is: a. info leakage (way more than just /proc) b. admin control c. budget control d. downtime hits fewer users > > Don't forget to use > > network traffic control to keep users from being able to > > detect the network activity of other users. > > Like that:? > > $ netstat > Active Internet connections (w/o servers) > Proto Recv-Q Send-Q Local Address Foreign Address State > /proc/net/tcp: Permission denied Nope. If you really care about information leakage, you'll be concerned about the ability to detect network congestion. Example #1 A spy sends packets from time to time. He measures the delay and packet loss to determine if the network is busy. 
When the network suddenly becomes busy, he can guess that you have
started some operation that requires heavy network traffic.

Example #2

A spy sends packets from time to time. He measures the delay and
packet loss to determine if the network is busy. Over time, he learns
when workers are busy. From this he can determine an appropriate time
to sneak into your building.

Hey, if you're going to be paranoid about %CPU and %MEM, you have to
be paranoid about %NET too. This requires traffic control unless you
have separate networks. Assign a fixed portion of bandwidth to any
user that you wish to hide info from. Be sure to consider latency as
well.

> > > > Users who want privacy can get their
> > > > own computer. So, these need to work:
> > > >
> > > > ps [...]
> > > > w
> > > > top
> > >
> > > Works as intended. Only pstree breaks, if init isn't visible.
> >
> > They work like they do with a rootkit installed.
> > Traditional behavior has been broken.
>
> They are as broken as finger or ls are if the home directory is chmodded.

Probably something should be done to deal with the problem of a
chmodded home directory. It's not ls that matters though. It's du
that matters. On a normal shared system, a user should be able to see
where all the disk blocks and inodes are going. Filenames need not be
visible. Then: "Rene, you're being kind of greedy about the disk
space aren't you? You're using 666 GB."
Re: Capabilities across execve
Russell King, the latest person to notice defects, writes:

> However, the way the kernel is setup today, this seems
> impossible to achieve, which tends to make the whole
> idea of capabilities completely and utterly useless.
>
> How is this stuff supposed to work? Are my ideas of
> what's supposed to be achievable completely wrong,
> although they look completely reasonable to me.
>
> Don't get me wrong - the capability system seems great at
> permanently revoking capabilities via /proc/sys/kernel/cap-bound,
> and dropping them within an application provided it remains UID0.
> Apart from that, capabilities seem completely useless.
...
> it seems to be something of a lost cause.
...
> my goal of running the script with minimal capabilities
> was completely *impossible* to achieve.

Uh huh.

First, some history. Capability bits were implemented in
DG-UX and IRIX. The two systems did not agree on operation.
The draft POSIX standard, withdrawn for good reason, greatly
changed between draft 16 and draft 17. Settings that work for
one draft are horribly insecure on the other.

Linux capabilities were partly done by the IRIX crew, working
from draft 16. Everyone else had draft 17 or even draft 13.
(and DG-UX had a better system anyway) Tytso put things well
when he wrote:

"A lot of innocent bits have been deforested while trying work
out the differences between what Linux is doing (which is
basically following Draft 17), and what Trusted Irix is doing
(which apparently is following Draft 16)."

Then along comes a sendmail exploit. An emergency fix was
produced, breaking an already-defective capability design.

Note that, unlike DG-UX, our IRIX-inspired design did not
reserve any capability bits for non-kernel use. This causes
an inconsistent security model, with things like the X server
relying on UID. Inconsistency is bad.

OK, so that's how we got into this mess. Now, how do we get out?

We will always have to deal with old-style apps. Those few apps
that handle capabilities can handle the bad system we have now,
and can handle a system without the capability syscalls. (for
old kernels) These apps can not handle a changed setup though;
to change things we must make the old syscalls return failure.
ANYTHING ELSE IS VERY UNSAFE.

There is exactly one capability system in popular use. That
would be the one that comes with Solaris. Moving toward that,
via a kernel config option, appears to be a sane way to get
ourselves unstuck from this big mess.

An added advantage is that the Solaris-style method instantly
becomes the standard, especially if Linux is strongly compatible.
This helps with admin training and portable software.

See if you can find any holes:
http://docs.sun.com/app/docs/doc/816-5175/6mbba7f39?a=view
Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/ chmod'able]
On Wed, 2005-03-16 at 03:39 +0100, Rene Scharfe wrote:

> So, I gather from the feedback I've got that chmod'able /proc/
> would be a bit over the top. 8-) While providing the easiest and most
> intuitive user interface for changing the permissions on those
> directories, it is overkill. Paul is right when he says that such a
> feature should be turned on or off for all sessions at once, and that's
> it.
>
> My patch had at least one other problem: the contents of each
> /proc/ directory became chmod'able, too, which was not intended.
>
> Instead of fixing it up I took two steps back, dusted off the umask
> kernel parameter patch and added the "special gid" feature I mentioned.
>
> Without the new kernel parameters behaviour is unchanged. Add
> proc.umask=077 and all /proc/ will get a permission mode of 500.
> This breaks pstree (no output), as Bodo already noted, because this
> program needs access to /proc/1. It also breaks w -- it shows the
> correct number of users but it lists X even for sessions owned
> by the user running it.
>
> Use proc.umask=007 and proc.gid=50 instead and all /proc/ dirs
> will have a mode of 550 and their group attribute will be set to 50
> (that's "staff" on my Debian system). Pstree will work for all members
> of that special group (just like top, ps and w -- which also show
> everything in that case). Normal users will still have a restricted
> view.
>
> Albert, would you take fixes for w even though you despise the feature
> that makes them necessary?

I will take patches if they are not too messy and they do not
cause tools to report garbage output. For example, I do not wish
to have tools reporting -1, 0, or uninitialized data in place of
correct data.

Distinct controls for the various files could be useful. I might
want to make /proc/*/cmdline be public, or make /proc/*/maps
be private. This is particularly helpful if a low-security file
is added for bare-bones ps operation.

You might make a special exception for built-in kernel tasks
and init.
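The mode arithmetic behind the proc.umask examples quoted above can be sketched as follows. This is only an illustration: it assumes the default mode of a per-process /proc directory is 0555 (dr-xr-xr-x) and that the proposed parameter is applied like an ordinary umask. The helper names are made up here and are not taken from the patch.

```c
#include <stdio.h>

/* Assumed default mode of a per-process /proc directory (dr-xr-xr-x). */
#define PROC_DIR_DEFAULT_MODE 0555u

/* Hypothetical helper: clear every permission bit named in the umask. */
unsigned proc_dir_mode(unsigned umask_bits)
{
    return PROC_DIR_DEFAULT_MODE & ~umask_bits & 0777u;
}

/* Print the resulting mode for one umask value, in octal. */
void show_proc_dir_mode(unsigned umask_bits)
{
    printf("proc.umask=%03o -> mode %03o\n",
           umask_bits, proc_dir_mode(umask_bits));
}
```

Calling show_proc_dir_mode(077) and show_proc_dir_mode(007) reproduces the two cases discussed in the quoted text.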
Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/ chmod'able]
Better interface:

/sbin/sysctl -w proc.maps=0440
/sbin/sysctl -w proc.cmdline=0444
/sbin/sysctl -w proc.status=0444

The /etc/sysctl.conf file can be used to set these at boot time.
Re: [RFC][PATCH] new timeofday core subsystem (v. A3)
On Thu, 2005-03-17 at 16:55 +0000, Russell King wrote:
> On Tue, Mar 15, 2005 at 10:23:54AM -0500, Albert Cahalan wrote:
> > On Mon, 2005-03-14 at 19:22 -0800, Christoph Lameter wrote:
> > > On Mon, 14 Mar 2005, Albert Cahalan wrote:
> > >
> > > > When the vsyscall page is created, copy the one needed function
> > > > into it. The kernel is already self-modifying in many places; this
> > > > is nothing new.
> > >
> > > AFAIK this will only works on ia32 and x86_64 and not definitely not
> > > on ia64. Who knows about the other platforms
> >
> > I'll bet it does work fine on IA-64. If it didn't, you would
> > be unable to load the kernel or load an executable.
> >
> > I know it works for PowerPC. You'll need an isync instruction
> > of course. You may also want a sync instruction and some code
> > to invalidate the cache.
> >
> > Setting up the page content should be a 1-time operation done
> > at boot. Check your processor manuals as needed.
>
> Won't work on ARM. We have XIP kernels, which prevents the use of
> self-modifying code.

Does the ARM kernel provide a special page of code for
apps to execute? If not, then ARM is irrelevant.

Doesn't ARM always have an MMU? If you have an MMU, then
it is no problem to have one single page of non-XIP code
for this purpose.

Supposing that you do support the vsyscall hack and you
don't have an MMU, you can just place the tiny code fragment
on the stack (or anywhere else) when an exec is performed.

So, as far as I can see, ARM is fully capable of supporting this.
Re: [PATCH][0/6] Change proc file permissions with sysctls
On Sun, 2005-03-20 at 01:22 +0100, Rene Scharfe wrote:

> The permissions of files in /proc/1 (usually belonging to init) are
> kept as they are. The idea is to let system processes be freely
> visible by anyone, just as before. Especially interesting in this
> regard would be instances of login. I don't know how to easily
> discriminate between system processes and "normal" processes inside
> the kernel (apart from pid == 1 and uid == 0 (which is too broad)).
> Any ideas?

The ideal would be to allow viewing:

1. killable processes (that is, YOU can kill them)
2. processes sharing a tty with a killable process

Optionally, add:

3. processes controlling a tty master of a killable process
4. ancestors of all of the above
5. children of killable processes

This is of course expensive, but maybe you can get some of it
cheaply. For example, allow viewing a process if the session
leader, group leader, parent, or tpgid process is killable.
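The cheap approximation described above (visible if the task itself, its session leader, group leader, parent, or tpgid process is killable) might look roughly like this. It is a userspace toy model, not kernel code: struct task and its fields are invented for illustration, and "killable" is reduced to a same-UID-or-root check, which is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a task; field names are illustrative, not kernel structures. */
struct task {
    unsigned uid;                       /* owner of the task */
    const struct task *parent;          /* parent process, or NULL */
    const struct task *group_leader;    /* process group leader, or NULL */
    const struct task *session_leader;  /* session leader, or NULL */
    const struct task *tty_fg_leader;   /* tpgid process of the tty, or NULL */
};

/* Simplifying assumption: you can kill what you own; root can kill anything. */
bool killable(unsigned viewer_uid, const struct task *t)
{
    return t != NULL && (viewer_uid == 0 || viewer_uid == t->uid);
}

/* Visible if the task or one of its close relatives is killable. */
bool may_view(unsigned viewer_uid, const struct task *t)
{
    if (t == NULL)
        return false;
    return killable(viewer_uid, t)
        || killable(viewer_uid, t->parent)
        || killable(viewer_uid, t->group_leader)
        || killable(viewer_uid, t->session_leader)
        || killable(viewer_uid, t->tty_fg_leader);
}
```

With this rule a root-owned program started from your own shell stays visible through its parent, while an unrelated user's process does not.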
Re: [PATCH] A new entry for /proc
[quoting various people...]

> Here is a new entry developed for /proc that prints for each process
> memory area (VMA) the size of rss. The maps from original kernel is
> able to present the virtual size for each vma, but not the physical
> size (rss). This entry can provide an additional information for tools
> that analyze the memory consumption. You can know the physical memory
> size of each library used by a process and also the executable file.
>
> Take a look the output:
> # cat /proc/877/smaps
> 08048000-08132000 r-xp /usr/bin/xmms
> Size: 936 kB
> Rss: 788 kB
> 08132000-0813a000 rw-p /usr/bin/xmms
> Size: 32 kB
> Rss: 32 kB
> 0813a000-081dd000 rw-p
> Size: 652 kB
> Rss: 616 kB

The most important thing about a /proc file format is that it has
a documented means of being extended in the future. Without such
documentation, it is impossible to write a reliable parser.

The "Name: value" stuff is rather slow. Right now procps (ps, top, etc.)
is using a perfect hash function to parse the /proc/*/status files.
("man gperf") This is just plain gross, but needed for decent performance.

Extending the /proc/*/maps file might be possible. It is commonly used
by debuggers I think, so you'd better at least verify that gdb is OK.
The procps "pmap" tool uses it too. To satisfy the procps parser:

a. no more than 31 flags
b. no '/' prior to the filename
c. nothing after the filename
d. no new fields inserted prior to the inode number

> If there were a use for it, that use might want to distinguish between
> the "shared rss" of pagecache pages from a file, and the "anon rss" of
> private pages copied from file or originally zero - would need to get
> the struct page and check PageAnon. And might want to count swap
> entries too. Hard to say without real uses in mind.
...
> It's a mixture of two different styles, the /proc//maps
> many-hex-fields one-vma-per-line style and the /proc/meminfo
> one-decimal-kB-per-line style. I think it would be better following
> the /proc//maps style, but replacing the major,minor,ino fields
> by size and rss (anon_rss? swap?) fields (decimal kB? I suppose so).

The more info the better. See the pmap "-x" option, currently missing
some data that the kernel does not supply. There are numerous other
pmap options that are completely unimplemented because of the lack
of info. See the Solaris 10 man page for pmap, available on Sun's
web site.
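To make the parser constraints above concrete, here is a minimal sketch of reading one /proc/*/maps line with sscanf. It is not the procps parser; the struct and function names are invented, and it assumes the usual field order (address range, flags, offset, major:minor, inode, filename). Note how it depends on the fields before the inode staying fixed and on nothing following the filename.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/PID/maps; the struct itself is illustrative. */
struct map_entry {
    unsigned long start, end, offset, inode;
    unsigned dev_major, dev_minor;
    char perms[5];    /* e.g. "r-xp" */
    char path[256];   /* empty for anonymous mappings */
};

/* Returns 0 on success, -1 if the fixed fields do not parse. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    int consumed = 0;
    int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %n",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, &consumed);
    if (n < 7)
        return -1;

    /* Everything after the inode is the filename; it may contain spaces. */
    const char *rest = consumed > 0 ? line + consumed : "";
    size_t len = strcspn(rest, "\n");
    if (len >= sizeof e->path)
        len = sizeof e->path - 1;
    memcpy(e->path, rest, len);
    e->path[len] = '\0';
    return 0;
}
```

The virtual size that smaps reports as "Size:" is simply (end - start) / 1024 kB, so only the Rss figure is new information.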
Re: [PATCH] audit: handle loginuid through proc
Assuming you'd like ps to print the LUID, how about
putting it with all the others? There are "Uid:"
lines in the /proc/*/status files.
Re: [PATCH] audit: handle loginuid through proc
On Thu, 2005-02-24 at 22:49 -0800, Chris Wright wrote:
> * Albert Cahalan ([EMAIL PROTECTED]) wrote:
> > Assuming you'd like ps to print the LUID, how about
> > putting it with all the others? There are "Uid:"
> > lines in the /proc/*/status files.
>
> It's also set (written) via /proc, so it should probably stay separate.

Gross. Please rip this out before it hits the streets.
(it's an interface change that might need eternal support)

Consider that:

1. Every other UID is handled by system calls:
   getuid, setuid, geteuid, setreuid, setresuid, getresuid, setfsuid

2. HP's Tru64 has getluid() and setluid() system calls
   that Linux should be compatible with.

SecureWare has a version too, which looks more-or-less
compatible with what HP is offering. (the descriptions
do not conflict, but one has more details)

It looks like ssh, apache, and sendmail (huh?) already
know to use these system calls even.

The header is used. Prototypes are the obvious.
The setuid() call returns 0 on success.

Tru64 notes that the login UID is sometimes called the
audit UID (AUID) because it is recorded with most audit
events. getluid() returns an error if the LUID (AUID)
is unset.

SecureWare additionally notes that setuid() and setgid()
will also fail when the luid is unset, to ensure that
the LUID is set before any other identity changes.
(probably Linux should just disable setting LUID after
that point)

Just to be complete, here's what Sun did:

Sun has getauid() and setauid() syscalls which are
somewhat similar. They take pointers to the ID, and
they require privilege (PRIV_SYS_AUDIT and PRIV_PROC_AUDIT
for setauid, or just PRIV_PROC_AUDIT for getauid)
These calls have been superseded by getaudit_addr()
and setaudit_addr(), which use structs containing:

    au_id_t ai_auid;          // audit user ID
    au_mask_t ai_mask;        // preselection mask
    au_tid_addr_t ai_termid;  // terminal ID
    au_asid_t ai_asid;        // audit session ID

(the terminal ID is variable length, containing a
network address and a length value for it)
Re: [patch] inotify for 2.6.11
Christoph Hellwig writes:
> On Sat, Mar 05, 2005 at 07:40:06PM -0500, Robert Love wrote:
>> On Sun, 2005-03-06 at 00:04 +0000, Christoph Hellwig wrote:

>>> The user interface is still bogus.
>>
>> I presume you are talking about the ioctl. I have tried to engage you
>> and others on what exactly you prefer instead. I have said that moving
>> to a write interface is fine but I don't see how ut is _any_ better than
>> the ioctl. Write is less typed, in fact, since we lose the command
>> versus argument delineation.
>>
>> But if it is a anonymous decision, I'll switch it. Or take patches. ;-)
>> It isn't a big deal.
>
> See the review I sent. Write is exactly the right interface for that kind
> of thing. For comment vs argument either put the number first so we don't
> have the problem of finding a delinator that isn't a valid filename, or
> use '\0' as such.

That's just putrid. You've proposed an interface that
combines the worst of ASCII with the worst of binary.

It is now well-established that ASCII interfaces are
horribly slow. This one will be no exception... but
with the '\0' in there, you have a binary interface.
So, it's an evil hybrid.

An ioctl() is a syscall with scope restricting it to
a single fd. This is a fine user interface, not a bogus one.
(keep 32-on-64 operation in mind to be polite)

If you'd rather have a normal (global) system call though,
that'll do too, likely leading to a bit more type checking
in the glibc-provided headers.

Adding plain old syscalls is rather nice actually.
It's only a pain at first, while waiting for glibc
to catch up.
Re: binary drivers and development
Lennart Sorensen writes:

> You forgot the very important:
>    - Only works on architecture it was compiled for. So anyone not
>      using i386 (and maybe later x86-64) is simply out of luck. What do
>      nvidia users that want accelerated nvidia drivers for X DRI do
>      right now if they have a powerpc or a sparc or an alpha? How about
>      porting Linux to a new architecture. With binary drivers you now
>      start out with no drivers on the new architecture except for the
>      ones you have source for. Not very productive.

Rik van Riel writes:

> No, it wouldn't. I can use a source code driver on x86,
> x86-64 and PPC64 systems, but a binary driver is only
> usable on the architecture it was compiled for.
>
> Source code is way more portable than binary anything.

The kernel already has an AML interpreter for ACPI. **duck**

As for portability, AML would do the job. It beats
typical vendor source code IMHO, because endianness and
integer size are well-defined. (like the Java VM and .net)

For the x86 and ia64 users, the AML interpreter is
probably already compiled into the kernel. Most people
need it to set up SMP or power management. So, no
added bloat even.

AML code is fairly well controlled and isolated.
There is of course the backdoor via DMA for the truly
determined evil author, but such paranoia is rather extreme.
AML is really designed for this sort of task.

As with any interpreter, there are ways (JIT) to make
the AML interpreter go faster if need be.
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
Peter Chubb writes:

> There are three new system calls:
>
> long usr_pci_open(int bus, int slot, int function, __u64 dma_mask);
>       Returns a filedescriptor for the PCI device described
>       by bus,slot,function. It also enables the device, and sets it
>       up as a bus-mastering DMA device, with the specified dma mask.

You forgot the PCI domain (a.k.a. hose, phb...) number.
Also, you might encode bus,slot,function according to
the PCI spec. So that gives:

long usr_pci_open(unsigned pcidomain, unsigned devspec, __u64 dmamask);

(with the user library returning an int instead of long)
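One plausible reading of "encode bus,slot,function according to the PCI spec" is the conventional 16-bit packing: 8 bits of bus, 5 bits of device (slot), 3 bits of function. The sketch below assumes that layout for the devspec argument; the helper names are invented for illustration and are not part of the proposed syscall interface.

```c
#include <stdint.h>

/* Pack bus (8 bits), slot (5 bits) and function (3 bits) into 16 bits. */
uint16_t pci_devspec(unsigned bus, unsigned slot, unsigned function)
{
    return (uint16_t)(((bus & 0xffu) << 8) |
                      ((slot & 0x1fu) << 3) |
                      (function & 0x07u));
}

/* Unpack the three fields again. */
unsigned pci_devspec_bus(uint16_t d)      { return d >> 8; }
unsigned pci_devspec_slot(uint16_t d)     { return (d >> 3) & 0x1fu; }
unsigned pci_devspec_function(uint16_t d) { return d & 0x07u; }
```

A caller would then pass something like pci_devspec(1, 2, 3) as the devspec, together with a separate domain number.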
Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)
On Fri, 2005-03-11 at 19:15 +0000, Alan Cox wrote:
> > You forgot the PCI domain (a.k.a. hose, phb...) number.
> > Also, you might encode bus,slot,function according to
> > the PCI spec. So that gives:
> >
> > long usr_pci_open(unsigned pcidomain, unsigned devspec, __u64 dmamask);
>
> Still insufficient because the device might be hotplugged on you. You
> need a file handle that has the expected revocation effects on unplug
> and refcounts

I was under the impression that a file handle would be returned.

I'm not so sure that is a sane way to handle hot-plug though.
First of all, in general, it's going to be like this:

Fan, meet shit. Shit, meet fan.

Those who care might best be served by SIGBUS with si_code
and si_info set appropriately. Perhaps a revoke() syscall
that handled mmap() would work the same way.
Re: Can't use SYSFS for "Proprietry" driver modules !!!.
greg k-h writes:
> On Sat, Mar 26, 2005 at 10:30:20PM -0500, Lee Revell wrote:

>> That's the problem, it's not spelled out explicitly anywhere.
>> That file does not address the issue of whether a driver is
>> a "derived work". This is the part he should talk to a lawyer
>> about, right?
>
> How about the fact that when you load a kernel module, it is
> linked into the main kernel image? The GPL explicitly states
> what needs to be done for code linked in.

This probably fails. Obviously, it's not over until the
courts say so, but...

First of all, the GPL might not be as infectious as you
and RMS wish it to be. There is a limit to what can be
a derived work in copyright law.

Second of all, module loading is not the same as "linking"
in the traditional sense. The GPL was written before Linux
had kernel modules. Don't be so sure a court would rule as
you would like it to rule.

> Also, realize that you have to use GPL licensed header files
> to build your kernel module...

Um, like the printer cartridges and game cartridges with
code in them? Courts have held that it was OK to copy because
it was needed to implement an interface.

Whatever your lawyer may have said was undoubtedly influenced
by your biased attempt to describe the technical issues.

Not that I care for proprietary stuff, being a PowerPC user
myself, but spreading unjustified FUD isn't proper behavior.
Neither is it proper to be marking key driver interfaces as
GPL-only. It's far better to just ignore the proprietary stuff.
Re: util-linux: orphan
On 12/20/06, Jan Engelhardt <[EMAIL PROTECTED]> wrote:

>> I've originally thought about util-linux upstream fork,
>> but as usually an fork is bad step. So.. I'd like to start
>> some discussion before this step.
> ...
>> after few weeks I'm pleased to announce a new "util-linux-ng"
>> project. This project is a fork of the original util-linux (2.13-pre7).
>
> Well, how about giving me a chunk of it? I'd like /bin/kill please.
> I already ship a nicer one in procps anyway, so you can just delete
> the files and call that done. (just today I was working on a Fedora
> system and /bin/kill annoyed me)

How can you ship a "nicer" kill, given that its sole purpose is to accept
kill { -l | -t | {-s SIGNUM | -SIGNAME } somepid [morepids] } ?

I checked compatibility with Solaris, Tru64, probably a few BSDs,
and man pages of many others.

Fedora Core 5 doesn't seem to like this command:

/bin/kill -l 17 19

(which reminds me, I need to add sigqueue support and maybe
tgkill support)

What about merging util-linux and procps?

How? Which way?

As I mentioned before, I was twice disappointed in missing
announcements of util-linux maintainership being up for grabs.
I certainly have a track record for keeping things stable.

Prior to me, procps has a history of being abandoned and broken.
Procps is a fork of the long-dead kmem-ps project. Procps was then
passed to someone who added color and then disappeared. The prior
maintainer picked up the old code again, no doubt under influence
of his employer Red Hat. I rewrote much of it then, but had trouble
getting in all of my changes. Debian started using my code, which
slowly turned into a fork. Maintainership was passed to somebody
else, without even telling me. That person and his immediate successor
added numerous serious bugs. Inexperience with the code and the lack
of a test suite soon led to that group being bogged down in problems.
One by one, the various Linux distributions switched over to my
version of the code.

So as you may imagine, I'd be rather nervous about letting procps get
into that situation again. Bugs are yucky. Having multiple committers
and no testing is a sure path to ruin.
Re: [PATCH] procfs: export context switch counts in /proc/*/stat
On 12/20/06, David Wragg <[EMAIL PROTECTED]> wrote:
"Albert Cahalan" <[EMAIL PROTECTED]> writes:
> On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:
>> This patch (against 2.6.19/2.6.19.1) adds the four context
>> switch values (voluntary context switches, involuntary
>> context switches, and the same values accumulated from
>> terminated child processes) to the end of /proc/*/stat,
>> similarly to min_flt, maj_flt and the time used values.
>
> Hmmm, OK, do people have a use for these values?

My reason for writing the patch was to track which processes are
active (i.e. got scheduled to run) by polling these context switch
values. The time used values are not a reliable way to detect process
activity on fast machines. So for example, when sorting by %CPU, top
often shows many processes using 0% CPU, despite the fact that these
processes are running occasionally. If top sorted by (%CPU, context
switch count delta), it might give a more useful display of which
processes are active on the system.

Oh, that'd be great.

The cumulative ones are still not justified though, and I fear they
may be 64-bit even on i386. It turns out that an i386 procps spends
much of its time doing 64-bit division to parse the damn ASCII crap.
I suppose I could just skip those fields, but generating them isn't
too cheap and probably I'd get stuck parsing them for some other
reason -- having them separate is probably a good idea.
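The "(%CPU, context switch count delta)" ordering suggested above could be sketched as a qsort comparator like the one below. This is not top's code: the struct and its fields are hypothetical, and it assumes the tool keeps the previous and current context switch totals for each process between polls.

```c
#include <stdlib.h>

/* Hypothetical per-process sample kept by a top-like tool between polls. */
struct proc_sample {
    int pid;
    unsigned pcpu;             /* %CPU over the interval, rounded */
    unsigned long ctxsw_prev;  /* voluntary + involuntary switches, last poll */
    unsigned long ctxsw_now;   /* same counter at this poll */
};

static unsigned long ctxsw_delta(const struct proc_sample *p)
{
    return p->ctxsw_now - p->ctxsw_prev;
}

/* qsort comparator: highest %CPU first, ties broken by most context switches. */
int by_activity(const void *a, const void *b)
{
    const struct proc_sample *x = a, *y = b;
    unsigned long dx, dy;

    if (x->pcpu != y->pcpu)
        return x->pcpu > y->pcpu ? -1 : 1;
    dx = ctxsw_delta(x);
    dy = ctxsw_delta(y);
    if (dx != dy)
        return dx > dy ? -1 : 1;
    return 0;
}
```

Processes that all round to 0% CPU are then ordered by how often they were actually scheduled, which is the effect described in the quoted text.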
Re: [BUG] daemon.c blows up on OSX
Linus Torvalds writes:

So it would appear that for OS X, the

#define _XOPEN_SOURCE_EXTENDED 1 /* AIX 5.3L needs this */
#define _GNU_SOURCE
#define _BSD_SOURCE

sequence actually _disables_ those things.

Yes, of course. The odd one here is glibc.

Normal systems enable everything by default. As soon as you
specify a feature define, you get ONLY what you asked for.
I'm not sure why glibc is broken, but I suspect that somebody
wants to make everyone declare their code to be GNU source.
(despite many "GNU" things not working on the HURD at all)

Define _APPLE_C_SOURCE to make MacOS X give you everything.
nasty thread-related bugs, maybe in exit
There are big nasty bugs related to threaded processes exiting,
especially when involving: zombie leaders, clone w/o SIGCHLD, and
ptrace. I can make tasks that remain until reboot. I've seen things
stuck in "X" state. I've seen pending SIGKILL and even blocked SIGKILL.
I've seen "D" state pretending to dump core for eternity, despite
having core dumps disabled. Does this not bother anybody?

I posted this twice already:
http://lkml.org/lkml/2006/12/18/312
http://lkml.org/lkml/2006/12/19/335

Killing the parent does NOT always clear these zombies.
Well, perhaps it would, but PID 1 is protected.

The source code included below is cloninator.c minus SIGCHLD.
Run it in a loop, periodically sending it SIGKILL, like this:

gcc -m32 -O2 -std=gnu99 -o foo foo.c
while true; do killall -9 foo; ./foo; sleep 1; done

Note: it's NOT an unlimited fork bomb.

The original has SIGCHLD in the clone flags. Things go very badly if
you rapidly SIGKILL things while ptracing. You can cause this with
"strace" and "killall", but a more reliable method is to have the
ptracer use tgkill to SIGKILL all the tasks as fast as possible.
Tested: both mainline 2.6.19 and the latest Fedora Core 5 kernel

///

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
	unsigned long eax = __NR_write;
	/* push and pop because -fPIC probably needs ebx for the GOT base pointer */
	__asm__ __volatile__(
		"push %%ebx ; "
		"push %1 ; pop %%ebx ; int $0x80"
		"; pop %%ebx"
		:"=a"(eax)
		:"r"(fd),"c"(buf),"d"(count),"0"(eax)
		:"memory"
	);
#endif
}

static void p_str(char *s)
{
	size_t count = strlen(s);
	early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
	char buf[9];
	char x[] = "0123456789abcdef";
	char *s = buf;
	s[8] = '\0';
	int i = 8;
	while(i--)
		buf[7-i] = x[(u>>(i*4))&15];
	early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
	char buf[11];
	char *s = buf+10;
	*s-- = '\0';
	int count = 0;
	while(u || !count) {
		*s-- = u%10 + '0';
		u /= 10;
		count++;
	}
	early_write(STDERR_FILENO,s+1,count);
}

#define FUTEX_WAIT 0
#define FUTEX_WAKE 1

typedef int lock_t;

#define LOCK_INITIALIZER 0

static inline void init_lock(lock_t* l)
{
	*l = 0;
}

// lock_add performs an atomic add
// and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
	int result = val;
	__asm__ __volatile__ (
		"lock; xaddl %1, %0;"
		: "=m" (*l), "=r" (result)
		: "1" (result), "m" (*l)
		: "memory");
	return result + val; // Returns the value written to memory
}

// lock_bts_high_bit atomically tests and
// sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
	bool result;
	__asm__ __volatile__ (
		"lock; btsl $31, %0;\n\t"
		"setnc %1;"
		: "=m" (*l), "=q" (result)
		: "m" (*l)
		: "memory");
	return result;
}

static int futex(int* uaddr, int op, int val, const struct timespec*timeout, int*uaddr2, int val3)
{
	(void)timeout;
	(void)uaddr2;
	(void)val3;
	int eax = __NR_futex;
	__asm__ __volatile__(
		"push %%ebx ; push %1 ; pop %%ebx"
		" ; int $0x80; pop %%ebx"
		:"=a"(eax)
		:"r"(uaddr),"c"(op),"d"(val),"0"(eax)
		:"memory"
	);
	return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
	// Check the mutex and set held bit
	if (lock_bts_high_bit(l)) {
		// Got the mutex
		return;
	}

	// Increment wait count
	lock_add(l, 1);

	while (true) {
		// Check the mutex and set held bit
		if (lock_bts_high_bit(l)) {
			// Got mutex, decrement wait count
			lock_add(l, -1);
			return;
		}

		int val = *l;
		// Ensure mutex not given up since check
		if (!(val & 0x80000000))
			continue;

		// Wait for the mutex
		futex(l, FUTEX_WAIT, val, NULL, NULL, 0);
	}
}

// unlock will release a mutex
static void unlock(lock_t* l)
{
	// Turn off lock held bit and check for waiters
	if (lock_add(l, 0x80000000) == 0) {
		// No waiters
		return;
Re: kernel + gcc 4.1 = several problems
Linus Torvalds writes:
[probably Mikael Pettersson] writes:

The suggestions I've had so far which I have not yet tried:

- Select a different x86 CPU in the config.
  - Unfortunately the C3-2 flags seem to simply tell GCC
    to schedule for ppro (like i686) and enabled MMX and SSE
  - Probably useless

Actually, try this one. Try using something that doesn't like "cmov".
Maybe the C3-2 simply has some internal cmov bugginess.

Of course that changes register usage, register spilling,
and thus ultimately even the stack layout. :-(

Adjusting gcc flags to eliminate optimizations is another way to go.
Adding -fwrapv would be an excellent start. Lack of this flag breaks
most code which checks for integer wrap-around. The compiler "knows"
that signed integers don't ever wrap, and thus eliminates any code
which checks for values going negative after a wrap-around. I could
imagine this affecting a switch() or other jump table.
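As an illustration of the -fwrapv point above, here is a small sketch of the kind of check that is at risk, next to a form that does not depend on wrap-around. The function names are invented for this example; nothing here is taken from the kernel.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Relies on signed wrap-around (caller passes b >= 0). Signed overflow is
 * undefined in ISO C, so without -fwrapv the compiler is allowed to assume
 * a + b never wraps and reduce this test to "false".
 */
bool add_overflows_fragile(int a, int b)
{
    return a + b < a;
}

/* Compares against the limits before adding; valid with any flags. */
bool add_overflows_portable(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```

Code written in the first style is what -fwrapv protects; the second style is the usual way to avoid needing the flag at all.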
Re: PID entries in /proc sorted by number, not start time in 2.6.19
On 2/28/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
Chuck Ebbert <[EMAIL PROTECTED]> writes:

> Starting with kernel 2.6.19, the process directories in
> /proc are sorted by number. They were sorted by process
> start time in 2.6.18 and earlier. This makes the output
> of procps come out in that order too, pissing off users
> who are used to the old way.

ps --sort=start_time

I've always just assumed the order to be random.
Re: new procfs memory analysis feature
David Singleton writes:

Add variation of /proc/PID/smaps called /proc/PID/pagemaps. Shows reference counts for individual pages instead of aggregate totals. Allows more detailed memory usage information for memory analysis tools. An example of the output shows the shared text VMA for ld.so and the share depths of the pages in the VMA.

a7f4b000-a7f65000 r-xp 00:0d 19185826 /lib/ld-2.5.90.so
11 11 11 11 11 11 11 11 11 13 13 13 13 13 13 13 8 8 8 13 13 13 13 13 13 13

Arrrgh! Not another ghastly maps file!

The original was mildly defective. Somebody thought " (deleted)" was a reserved filename extension. Somebody thought "/SYSV*" was also some kind of reserved namespace. Nobody ever thought to bother with a properly specified grammar; it's more fun to blame application developers for guessing as best they can. The use of %08lx is quite a wart too, looking ridiculous on 64-bit systems.

Now we have /proc/*/smaps, which should make decent programmers cry. Really now, WTF? It has compact non-obvious parts, which would be a nice choice for performance if not for being MIXED with wordy bloated parts of a completely different nature. Parsing is terribly painful.

Supposedly there is a NUMA version too.

Along the way, nobody bothered to add support for describing the page size (IMHO your format ***severely*** needs this) or for the various VMA flags to indicate if memory is locked, randomized, etc.

There can be a million pages in a mapping for a 32-bit process. If my guess (since you too failed to document your format) is right, you propose to have one decimal value per page. In other words, the lines of this file can be megabytes long without even getting to the issue of 64-bit hardware. This is no text file!

How about a proper system call? Enough is enough already. Take a look at the mincore system call. Imagine it taking a PID. The 7 available bits probably won't do, so expand that a bit. Just take the user-allowed parts of the VMA and/or PTE (both variants are good to have) and put them in a struct. There may be some value in having both low-privilege and high-privilege versions of this.

BTW, you might wish to ensure that Wine can implement VirtualQueryEx perfectly based on this.
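For reference, this is roughly what the existing mincore() interface looks like from userspace. It is only a sketch of today's call on the caller's own address space, not of the proposed PID-taking variant; the function name count_resident is invented here.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map npages anonymous pages, touch the first `touched` of them, and
 * ask mincore() which pages are resident.
 * Returns the number of resident pages, or -1 on error. */
static long count_resident(size_t npages, size_t touched)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * psz;
    unsigned char *vec = malloc(npages);   /* one status byte per page */
    long resident = -1;
    char *p;

    if (!vec)
        return -1;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < touched && i < npages; i++)
        p[i * psz] = 1;                    /* fault the page in */
    if (mincore(p, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;        /* only bit 0 carries meaning */
    }
    munmap(p, len);
    free(vec);
    return resident;
}
```

Each page gets a whole byte of which one bit is used, which is where the "7 available bits" remark comes from.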
unreapable zombies, maybe futex+ptrace+exit
I have a fun little test program for people to try. It creates zombies that persist until reboot, despite being reparented to init. Sometimes it creates processes that block SIGKILL, sit around with pending SIGKILL, or both.

You'll want:

a. either assembly skills or the ability to run 32-bit x86 code
b. the procps-3.2.7 release, so you can easily view the results
c. the strace program, or some other ptrace-based debugger
d. a recent kernel -- updated Fedora 5 or mainline 2.6.19 will do

Compile like this:
gcc -m32 -std=gnu99 -O2 -o cloninator cloninator.c

Run like this:
strace -f -F ./cloninator

Let the program run for a bit, then do one of a few fun things:

a. hit ^C to stop it
b. run "killall -9 cloninator" to stop it
c. send SIGKILL to the process group (the negative as PID)
d. send SIGKILL to all your processes (use -1 as PID)

View the results:
ps -Ccloninator -mwostat,ppid,pid,tid,nlwp,pending,sigmask,sigignore,caught,wch

I suggest trying other debuggers. Under a debugger I can't share, thousands of messed-up zombies get created in under a minute. With strace, you'll probably get a half dozen after a couple tries. You might try gdb, fenris, nightview, and anything else which uses ptrace to observe something. (Ideas?) Be sure to specify any options needed to follow child processes; you may need to comment out the CLONE_VFORK case for wimpy debuggers.

BTW, we can probably now answer this question:

$ egrep -i 'todo.*safe' kernel/*.c
kernel/exit.c: // TODO: is this safe?
kernel/exit.c: // TODO: is this safe?
///
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
    unsigned long eax = __NR_write;
    // push and pop because -fPIC probably needs ebx for the GOT base pointer
    __asm__ __volatile__(
        "push %%ebx ; push %1 ; pop %%ebx ; int $0x80; pop %%ebx"
        :"=a"(eax)
        :"r"(fd),"c"(buf),"d"(count),"0"(eax)
        :"memory"
    );
#endif
}

static void p_str(char *s)
{
    size_t count = strlen(s);
    early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
    char buf[9];
    char x[] = "0123456789abcdef";
    char *s = buf;
    s[8] = '\0';
    int i = 8;
    while(i--)
        buf[7-i] = x[(u>>(i*4))&15];
    early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
    char buf[11];
    char *s = buf+10;
    *s-- = '\0';
    int count = 0;
    while(u || !count) {
        *s-- = u%10 + '0';
        u /= 10;
        count++;
    }
    early_write(STDERR_FILENO,s+1,count);
}

#define FUTEX_WAIT 0
#define FUTEX_WAKE 1

typedef int lock_t;
#define LOCK_INITIALIZER 0

static inline void init_lock(lock_t* l)
{
    *l = 0;
}

// lock_add performs an atomic add and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
    int result = val;
    __asm__ __volatile__ (
        "lock; xaddl %1, %0;"
        : "=m" (*l), "=r" (result)
        : "1" (result), "m" (*l)
        : "memory");
    return result + val; // Return the value written to memory
}

// lock_bts_high_bit atomically tests and sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
    bool result;
    __asm__ __volatile__ (
        "lock; btsl $31, %0;\n\t"
        "setnc %1;"
        : "=m" (*l), "=q" (result)
        : "m" (*l)
        : "memory");
    return result;
}

static int futex(int* uaddr, int op, int val, const struct timespec*timeout, int*uaddr2, int val3)
{
    (void)timeout;
    (void)uaddr2;
    (void)val3;
    int eax = __NR_futex;
    __asm__ __volatile__(
        "push %%ebx ; push %1 ; pop %%ebx ; int $0x80; pop %%ebx"
        :"=a"(eax)
        :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
        :"memory"
    );
    return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
    // Check the mutex and set held bit
    if (lock_bts_high_bit(l)) {
        // Got the mutex
        return;
    }
    // Increment wait count
    lock_add(l, 1);
    while (true) {
        // Check the mutex and set held bit
        if (lock_bts_high_bit(l)) {
            // Got the mutex, decrement wait count
            lock_add(l, -1);
            return;
        }
        int val = *l;
        // Ensure the mutex wasn't given up since the check
        if (!(val & 0x8000))
            continue;
BUG: wedged processes, test program supplied
Somebody PLEASE try this...

Normally, when a process dies it becomes a zombie. If the parent dies (before or after the child), the child is adopted by init. Init will reap the child.

The program included below DOES NOT get reaped.

Do like so:

gcc -m32 -O2 -std=gnu99 -o foo foo.c
while true; do killall -9 foo; ./foo; sleep 1; done

BTW, it gets even better if you start playing with ptrace. Use the "strace" program (following children) and/or start sending rapid-fire SIGKILL to all the various _threads_ in the processes. You can get processes wedged in a wide variety of interesting states. I've seen "X" state, processes sitting around with pending SIGKILL, a process stuck in "D" state supposedly core dumping despite ulimit 0 on the core size, etc.

/
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
    unsigned long eax = __NR_write;
    /* push and pop because -fPIC probably needs ebx for the GOT base pointer */
    __asm__ __volatile__(
        "push %%ebx ; "
        "push %1 ; pop %%ebx ; int $0x80"
        "; pop %%ebx"
        :"=a"(eax)
        :"r"(fd),"c"(buf),"d"(count),"0"(eax)
        :"memory"
    );
#endif
}

static void p_str(char *s)
{
    size_t count = strlen(s);
    early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
    char buf[9];
    char x[] = "0123456789abcdef";
    char *s = buf;
    s[8] = '\0';
    int i = 8;
    while(i--)
        buf[7-i] = x[(u>>(i*4))&15];
    early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
    char buf[11];
    char *s = buf+10;
    *s-- = '\0';
    int count = 0;
    while(u || !count) {
        *s-- = u%10 + '0';
        u /= 10;
        count++;
    }
    early_write(STDERR_FILENO,s+1,count);
}

#define FUTEX_WAIT 0
#define FUTEX_WAKE 1

typedef int lock_t;
#define LOCK_INITIALIZER 0

static inline void init_lock(lock_t* l)
{
    *l = 0;
}

// lock_add performs an atomic add
// and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
    int result = val;
    __asm__ __volatile__ (
        "lock; xaddl %1, %0;"
        : "=m" (*l), "=r" (result)
        : "1" (result), "m" (*l)
        : "memory");
    return result + val; // Returns the value written to memory
}

// lock_bts_high_bit atomically tests and
// sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
    bool result;
    __asm__ __volatile__ (
        "lock; btsl $31, %0;\n\t"
        "setnc %1;"
        : "=m" (*l), "=q" (result)
        : "m" (*l)
        : "memory");
    return result;
}

static int futex(int* uaddr, int op, int val, const struct timespec*timeout, int*uaddr2, int val3)
{
    (void)timeout;
    (void)uaddr2;
    (void)val3;
    int eax = __NR_futex;
    __asm__ __volatile__(
        "push %%ebx ; push %1 ; pop %%ebx"
        " ; int $0x80; pop %%ebx"
        :"=a"(eax)
        :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
        :"memory"
    );
    return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
    // Check the mutex and set held bit
    if (lock_bts_high_bit(l)) {
        // Got the mutex
        return;
    }
    // Increment wait count
    lock_add(l, 1);
    while (true) {
        // Check the mutex and set held bit
        if (lock_bts_high_bit(l)) {
            // Got mutex, decrement wait count
            lock_add(l, -1);
            return;
        }
        int val = *l;
        // Ensure mutex not given up since check
        if (!(val & 0x8000))
            continue;
        // Wait for the mutex
        futex(l, FUTEX_WAIT, val, NULL, NULL, 0);
    }
}

// unlock will release a mutex
static void unlock(lock_t* l)
{
    // Turn off lock held bit and check for waiters
    if (lock_add(l, 0x8000) == 0) {
        // No waiters
        return;
    }
    // Waiters found, wake up one of them
    futex(l, FUTEX_WAKE, 1, NULL, NULL, 0);
}

unsigned toomany = 42;

struct data {
    unsigned nprocs;
    lock_t lock;
    unsigned count;
};

struct data *data;

static struct data *get_shm(void)
{
    void *addr;
    int shmid;
    // create
    shmid = shmget(IPC_PRIVATE,42,IPC_CREAT|0666);
    // attach
    addr = shmat(shmid, NULL, 0);
    // don't want it to
Re: [PATCH] procfs: export context switch counts in /proc/*/stat
David Wragg writes:
Benjamin LaHaise <[EMAIL PROTECTED]> writes:
On Mon, Dec 18, 2006 at 11:50:08PM +, David Wragg wrote:

This patch (against 2.6.19/2.6.19.1) adds the four context switch values (voluntary context switches, involuntary context switches, and the same values accumulated from terminated child processes) to the end of /proc/*/stat, similarly to min_flt, maj_flt and the time used values.

Hmmm, OK, do people have a use for these values?

Please put these into new files, as the stat files in /proc are horribly overloaded and have always been somewhat problematic when it comes to changing how things are reported due to internal changes to the kernel. Cheers,

No thanks. Yours truly, the maintainer of "ps", "top", "vmstat", etc.

The delay accounting value was added to the end of /proc/pid/stat back in July without discussion, so I assumed this approach was still considered satisfactory.

/proc/*/stat is the very best place in /proc for any per-process data that will be commonly needed. Unlike /proc/*/status, few people are tempted to screw with the formatting and/or spelling. Unlike the /sys crap, it doesn't take 3 syscalls PER VALUE to get at the data.

The things to ask are of course: will this really be used, and does it really belong in /proc at all?

Putting just these four values into a new file would seem a little odd, since they have a lot in common with the other getrusage values that are already in /proc/pid/stat. One possibility is to add /proc/pid/rusage, mirroring the full struct rusage in text form, since struct rusage is already part of the kernel ABI (though Linux doesn't fill in half of the values).

Since we already have a struct defined and all...

sys_get_rusage(int pid)

Or perhaps it makes sense to reorganize all the values from /proc/pid/stat and its siblings into a sysfs-like one-value-per-file structure, though that might introduce atomicity and efficiency issues (calculating some of the values involves iterating over the threads in the process; with everything in one file, these loops are folded together).

Yeah, big time. Things are quite bad in /proc, but /sys is a joke.
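For comparison, the existing getrusage() call already returns the two per-process context switch counters in struct rusage, but only for the caller or its reaped children. The sketch below shows that existing call; the struct ctx_switches wrapper and its function are made up for illustration, and the sys_get_rusage(int pid) idea above remains hypothetical.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Illustrative container for the two counters discussed in the thread. */
struct ctx_switches {
    long voluntary;
    long involuntary;
};

/* Fill *out with the calling process's context switch counts.
 * Returns 0 on success, -1 on failure. */
static int get_ctx_switches(struct ctx_switches *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->voluntary = ru.ru_nvcsw;      /* gave up the CPU, e.g. blocked on I/O */
    out->involuntary = ru.ru_nivcsw;   /* preempted by the scheduler */
    return 0;
}
```

Passing RUSAGE_CHILDREN instead would give the values accumulated from terminated, waited-for children, which matches the other pair of counters in the patch description.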
Re: BUG: wedged processes, test program supplied
On 12/20/06, Mike Galbraith <[EMAIL PROTECTED]> wrote:
On Tue, 2006-12-19 at 21:46 -0500, Albert Cahalan wrote:
> Somebody PLEASE try this...

I was having enough fun with cloninator (which was whitespace munged btw).

Anything stuck? Besides refusing to die, that beast slays debuggers left and right. I just need to add execve of /proc/self/exe and a massive storm of signals on the alternate stack.

In the original post, I also mangled the recommended ps command:
ps -Ccloninator -mwostat,ppid,pid,tid,nlwp,pending,sigmask,sigignore,caught,wchan

Leave out pid,tid,nlwp if you need to save screen space, like so:
ps -Ccloninator -mwostat,ppid,pending,sigmask,sigignore,caught,wchan

(note: procps versions prior to 3.2.7 are mostly fine, but will mess up the PENDING column for any single-threaded processes you get)

This is fun to look at:
watch ps -Ccloninator fostat,ppid,wchan:9,comm

> Normally, when a process dies it becomes a zombie.
> If the parent dies (before or after the child), the child
> is adopted by init. Init will reap the child.
>
> The program included below DOES NOT get reaped.

While true wasn't a great test recommendation :)

Oh. I wanted to be sure you'd see the problem. Did you have some... difficulty? A plain old ^C should make things stop.

The second test program is like the first, but missing SIGCHLD from the clone flags, and hopefully not whitespace-mangled.

Note that the test program is not normally a fork bomb. It limits itself to 42 tasks via a lock in shared memory. If things are working OK, you should see no more than about 60 tasks.
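For contrast with the wedged processes, this is the normal life cycle the quoted text describes: a child exits, sits as a zombie, and goes away once its parent (or init, after reparenting) waits for it. A minimal sketch, with an invented function name:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then reap it.
 * Returns the child's exit status, or -1 on error. */
static int spawn_and_reap(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);        /* child: a zombie from here until waited for */
    if (waitpid(pid, &status, 0) != pid)
        return -1;          /* parent: waitpid() is what removes the zombie */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The bug report is about tasks for which this last step never completes, even with init as the parent.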
Re: util-linux: orphan
Karel Zak writes:

I've originally thought about util-linux upstream fork, but as usually an fork is bad step. So.. I'd like to start some discussion before this step.

... after few weeks I'm pleased to announce a new "util-linux-ng" project. This project is a fork of the original util-linux (2.13-pre7).

Aw damn, I missed it again. LKML gets about 300 posts/day. The last time util-linux was offered, I missed out. Bummer.

Well, how about giving me a chunk of it? I'd like /bin/kill please. I already ship a nicer one in procps anyway, so you can just delete the files and call that done. (just today I was working on a Fedora system and /bin/kill annoyed me)

VERY STRONG SUGGESTION: build a full test suite before you mess with the source. This isn't some cute toy like xeyes or a silly game. This is util-linux, which MUST work.
Re: kernel + gcc 4.1 = several problems
On 1/4/07, Segher Boessenkool <[EMAIL PROTECTED]> wrote:
> Adjusting gcc flags to eliminate optimizations is another way to go.
> Adding -fwrapv would be an excellent start. Lack of this flag breaks
> most code which checks for integer wrap-around.

Lack of the flag does not break any valid C code, only code making unwarranted assumptions (i.e., buggy code).

Right, if "C" means "strictly conforming ISO C" to you. (in which case, nearly all real-world code is broken)

FYI, the kernel also assumes that a "char" is 8 bits. Maybe you should run away screaming.

> The compiler "knows"
> that signed integers don't ever wrap, and thus eliminates any code
> which checks for values going negative after a wrap-around.

You cannot assume it eliminates such code; the compiler is free to do whatever it wants in such a case.

You should typically write such a computation using unsigned types, FWIW.

Anyway, with 4.1 you shouldn't see frequent problems due to

Right, it gets much worse with the current gcc snapshots. IMHO you should play such games with "g++ -O9", but that's a discussion for a different mailing list.

"not using -fwrapv while my code is broken WRT signed overflow" yet; and if/when problems start to happen, the "correct" action to take is not to add the compiler flag, but to fix the code.

Nope, unless we decide that the performance advantages of a language change are worth the risk and pain.
Re: JIT emulator needs
On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

Right now, Linux isn't all that friendly to JIT emulators. Here are the problems and suggestions to improve the situation.

There is an SE Linux execmem restriction that enforces W^X. Assuming you don't wish to just disable SE Linux, there are two ugly ways around the problem. You can mmap a file twice, or you can abuse SysV shared memory. The mmap method requires that you know of a filesystem mounted rw,exec where you can write a very large temporary file. This arbitrary filesystem, rather than swap space, will be the backing store. The SysV shared memory method requires an undocumented flag and is subject to some annoying size limits. Both methods create objects that will fail to be deleted if the program dies before marking the objects for deletion.

If the policy forbidding self-modifying code lacks a method of exempting programs such as JIT interpreters (which I doubt) then it's a problem. I'm with Alan on this one.

It does and it doesn't. There is not a reasonable way for a user to mark an app as needing full self-modifying ability. It's not like the executable stack, which can be set via the ELF note markings on the executable. (ELF note markings are ideal because they can not be used via a ret-to-libc attack)

With admin privs, one can change SE Linux settings. Mark the executable, disable the protection system-wide, generate a completely new SE Linux policy, or just turn SE Linux off.

Normally we don't expect/require admin privs to install an executable in one's own ~/bin directory. This is broken.

It ought to be easier to get a JIT working well without enabling arbitrary mprotect. This would allow a JIT to partially benefit from the recent security enhancements. (think of all the buggy browser-based JIT things!)

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

Processors often have annoying limits on the immediate values in instructions. An x86 or x86_64 JIT can go a bit faster if all allocations are kept to the low 2 GB of address space. There are also reasons for a 32bit-to-x86_64 JIT to choose a nearly arbitrary 2 GB region that lies above 4 GB. Other archs have other limits, such as 32 MB or 256 MB.

This sort of logic might be appropriate for a sort of parametrized and specialized vma allocator setting the policy in /proc/ along with various sorts of limits. There are limits to such and at some point things will have to manually manage their own process address spaces in a platform-specific fashion. If kernel assistance here is rejected they may have to do so in all cases.

I prefer ELF notes (for start-up allocations) and prctl, plus a mmap flag for per-allocation behavior.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

Additions to better support JIT emulators:
a. sysctl to set IPC_RMID by default

This is a bad idea. The standard semantics are needed for programs relying upon them.

I didn't mean that the default default :-) setting would change. I meant that people could change the behavior from a boot script. Things that break are really foul and nasty anyway, probably with serious problems that ought to get fixed.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

c. open() flag to unlink a file before returning the fd

You probably want a tmpfile(3) -like affair which never has a pathname to begin with. It could be useful for security purposes more generally.

Yes, exactly. I think there are some possible optimizations available too, particularly with the cifs filesystem.

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

d. mremap() flag to always keep the old mapping

This sounds vaguely like another syscall, like mdup(). This is particularly meaningful in the context of anonymous memory, for which there is no method of replicating mappings within a single process address space.

Yes, mdup() and probably mdup2(). It could be mremap flags or not. JIT emulators generally need a second mapping so that they can have both read/write and execute for the same physical memory.

It is somewhat tolerable to have SE Linux enforce that the second mapping be randomized. (it helps security greatly, but slows the emulator by a tiny bit)

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

e. mremap() flag to get a read/write mapping of a read/exec one
f. mremap() flag to get a read/exec mapping of a read/write one

Presumably to be used in conjunction with keeping the old mapping. A composite mdup()/mremap() and mprotect(), presumably saving a TLB flush or other sorts of overhead, may make some sort of sense here. Odds are it'll get rejected as the sequence of syscalls is a rather precise equivalent, though it would optimize things (as would other composite syscalls, e.g. ones combining fork() and execve() etc.).
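The "mmap a file twice" workaround mentioned at the top of this message can be sketched roughly as below. Everything here is an assumption for illustration: the struct and function names are invented, /tmp is just a guess at a writable filesystem, and the second mapping is read-only so the sketch runs anywhere. A real JIT would ask for PROT_READ | PROT_EXEC on the second mapping, which is exactly where the need for a filesystem mounted rw,exec comes in.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Two views of the same pages: one for writing, one for the consumer. */
struct dual_map {
    void *rw;
    void *ro;
    size_t len;
};

/* Create a temporary file, unlink it at once, and map it twice.
 * Returns 0 on success, -1 on failure. */
static int dual_map_create(struct dual_map *m, size_t len)
{
    char path[] = "/tmp/jitmapXXXXXX";   /* assumed location */
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                        /* nothing is left behind if we die later */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    m->rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    m->ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                           /* the mappings keep the file alive */
    if (m->rw == MAP_FAILED || m->ro == MAP_FAILED) {
        if (m->rw != MAP_FAILED)
            munmap(m->rw, len);
        if (m->ro != MAP_FAILED)
            munmap(m->ro, len);
        return -1;
    }
    m->len = len;
    return 0;
}
```

There is still a short window between mkstemp() and unlink() in which a crash leaves the file behind, which is the gap the proposed open() flag is meant to close.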
Re: JIT emulator needs
On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
William Lee Irwin III wrote:
> I presumed an ELF note or extended filesystem attributes were already
> in place for this sort of affair. It may be that the model implemented
> is so restrictive that users are forbidden to create new executables,
> in which case using a different model is certainly in order. Otherwise
> the ELF note or attributes need to be implemented.

Another thing to keep in mind, since we're talking about security policies in the first place, is that anything like this *MUST* be "opt-in" on the part of the security policy, because what we're talking about is circumventing an explicit security policy just based on a user-provided binary saying, in effect, "don't worry, I know what I'm doing." Changing the meaning of an established explicit security policy is not acceptable.

Not in this case. If an attacker can CHANGE THE BINARY then it's already game over.

Putting this into the security policy was an error born of laziness to begin with. Abuse of the security mechanism was easier than hacking the toolchain, ELF loader, etc.

Either a binary needs self-modification, or it doesn't. This is determined by the author of the code. If you don't trust an executable that needs this ability, then you simply can not run it in a useful way.
Re: JIT emulator needs
On 6/20/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

If the policy forbidding self-modifying code lacks a method of exempting programs such as JIT interpreters (which I doubt) then it's a problem. I'm with Alan on this one.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

It does and it doesn't. There is not a reasonable way for a user to mark an app as needing full self-modifying ability. It's not like the executable stack, which can be set via the ELF note markings on the executable. (ELF note markings are ideal because they can not be used via a ret-to-libc attack)

With admin privs, one can change SE Linux settings. Mark the executable, disable the protection system-wide, generate a completely new SE Linux policy, or just turn SE Linux off.

Normally we don't expect/require admin privs to install an executable in one's own ~/bin directory. This is broken.

It ought to be easier to get a JIT working well without enabling arbitrary mprotect. This would allow a JIT to partially benefit from the recent security enhancements. (think of all the buggy browser-based JIT things!)

I presumed an ELF note or extended filesystem attributes were already in place for this sort of affair. It may be that the model implemented is so restrictive that users are forbidden to create new executables, in which case using a different model is certainly in order. Otherwise the ELF note or attributes need to be implemented.

Users can create executables. Some will be non-functional unless specially marked by an admin. What is the goal here? I see no reasonable goal that would result in such a policy.

On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This sort of logic might be appropriate for a sort of parametrized and specialized vma allocator setting the policy in /proc/ along with various sorts of limits. There are limits to such and at some point things will have to manually manage their own process address spaces in a platform-specific fashion. If kernel assistance here is rejected they may have to do so in all cases.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

I prefer ELF notes (for start-up allocations) and prctl, plus a mmap flag for per-allocation behavior.

Beware that the kernel (upstream of me) will likely refuse to support exotic mmap() placement policies. At that point userspace will have to implement them itself with a front-end to mmap(). Userspace can actually live without kernel placement support for everything but the executable itself, which is already implemented via ELF loading standards. This is not to downplay the tremendous amounts of pain involved for moving the stack, getting ld.so to land in the right place, and so on. Actually I'm less sure about .interp placement. In any event, exotic virtualspace allocation policies are largely yet another "simple matter of programming" implementable entirely in userspace.

When you go that route, you may need to abandon libc. I've done exactly that for one emulator. It was not easy. Nearly nobody will want to go down that path.

Things improve a bit if MAP_ANONYMOUS and SysV shared mem allocations can be made to ignore the available memory checking. If I could allocate a 2 GB chunk on a system with 1 GB total swap+RAM, then I could use that as an area in which to perform MAP_FIXED allocations. As of now this would require either adding the swap space or disabling the available memory checking system-wide via sysctl.

On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This is a bad idea. The standard semantics are needed for programs relying upon them.

On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

I didn't mean that the default default :-) setting would change. I meant that people could change the behavior from a boot script. Things that break are really foul and nasty anyway, probably with serious problems that ought to get fixed.

It's actually not a good idea to make it the default even via sysctl. People won't realize something will break until it does, and what will break is likely to be a database responsible for data integrity. The IPC_RMID creation flag should suffice.

It's highly unlikely that such breakage would cause corruption. Most likely it would cause the database to exit with an error about failing to attach to a SysV shared memory segment.

I believe that a major cause of reboots is that admins are unaware of SysV shared memory cruft left behind by apps that crashed at the wrong moment or had other bugs. If something is eating memory and you don't know what it is, you reboot.

On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This is MADV_REMOVE, though most filesystems don't support it. Do you need it for more than tmpfs?

On Tue, Jun 19, 2007 at 11:
Re: JIT emulator needs
On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
Albert Cahalan wrote:
> Putting this into the security policy was an error born of
> laziness to begin with. Abuse of the security mechanism
> was easier than hacking the toolchain, ELF loader, etc.
>
> Either a binary needs self-modification, or it doesn't. This is
> determined by the author of the code. If you don't trust an
> executable that needs this ability, then you simply can not
> run it in a useful way.

That's fine. That's a policy decision. That's what a security policy *is*. The owner of the system has decided, by security policy, that that is not allowed. Bypassing that is not acceptable.

Fixing a bug should be acceptable.

Look, let's back up a bit here. At a high level, what exactly do you imagine that this behavior was intended for? I suggest you list some examples of the attacks that are blocked.

Can you come up with a reasonable argument that the current behavior is the least painful restriction required to block those attacks? Does the current behavior block any attack that the proposed behavior would not? (list the attacks please)
Re: JIT emulator needs
On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
Albert Cahalan wrote:
> Look, let's back up a bit here. At a high level, what exactly do
> you imagine that this behavior was intended for? I suggest you
> list some examples of the attacks that are blocked.
>
> Can you come up with a reasonable argument that the current behavior
> is the least painful restriction required to block those attacks?
> Does the current behavior block any attack that the proposed behavior
> would not? (list the attacks please)

See above.

Nope. I asked you to justify the existing behavior. Apparently you are unable to do so. This should be a hint.
Re: JIT emulator needs
On 6/21/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
>
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.

and these methods also destroy yourself on any machine with a looser cache coherency between I and D-cache

for all but x86 you pretty much have to do the mprotect() between the two states to deal with the cache flushing properly...

If the instructions to force data write-back and/or to invalidate the instruction cache are privileged, yes. AFAIK, only ARM is that lame. For example, PowerPC lets unprivileged code run the required instructions.
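From userspace, the portable way to ask for that write-back and invalidate step is the GCC/Clang builtin __builtin___clear_cache(). As far as I know it does nothing on x86 and expands to the needed instructions or to a library or kernel call on other targets. The wrapper below is only a sketch; publish_code is an invented name, and it assumes dst is the address the code will actually be executed from.

```c
#include <stddef.h>
#include <string.h>

/* Copy freshly generated machine code into place, then make sure the
 * instruction fetch path sees it before anything jumps there. */
static void publish_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    /* Compiler builtin: synchronizes the caches for [begin, end). */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

Whether this alone is enough when writing through one mapping and executing through another depends on the architecture, which is the question being argued here.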
Re: [TOMOYO 5/9] Memory and pathname management functions.
On 6/21/07, Pavel Machek <[EMAIL PROTECTED]> wrote: > >> It's really not worth getting bothered by. Truth is, big > >> giant > >> pathnames break lots of stuff already, both kernel and > >> userspace. > > > >> Just look in /proc for some nice juicy kernel breakage: > >> cwd, exe, fd/*, maps, mounts, mountstats, root, smaps > > > >Well, but we should be fixing that, not adding more. And /proc is > >info-only, while this is security related code. > > Security tools read from /proc, so /proc is security-related. If some tool relies on pathnames in /proc, that tool is broken... as is /proc. We should be fixing that. Running TOMOYO or AppArmor fixes the bug. :-) You can't get long paths that break /proc if you are running either. Therefore, one of those is required. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: JIT emulator needs
On 6/22/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-06-22 at 01:56 -0400, Albert Cahalan wrote:
> > On 6/21/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > > On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> > > > Right now, Linux isn't all that friendly to JIT emulators.
> > > > Here are the problems and suggestions to improve the situation.
> > > >
> > > > There is an SE Linux execmem restriction that enforces W^X.
> > > > Assuming you don't wish to just disable SE Linux, there are
> > > > two ugly ways around the problem. You can mmap a file twice,
> > > > or you can abuse SysV shared memory. The mmap method requires
> > > > that you know of a filesystem mounted rw,exec where you can
> > > > write a very large temporary file. This arbitrary filesystem,
> > > > rather than swap space, will be the backing store. The SysV
> > > > shared memory method requires an undocumented flag and is
> > > > subject to some annoying size limits. Both methods create
> > > > objects that will fail to be deleted if the program dies
> > > > before marking the objects for deletion.
> > >
> > > and these methods also destroy yourself on any machine with a looser
> > > cache coherency between I and D-cache
> > >
> > > for all but x86 you pretty much have to do the mprotect() between the
> > > two states to deal with the cache flushing properly...
> >
> > If the instructions to force data write-back and/or to
> > invalidate the instruction cache are privileged, yes.
> > AFAIK, only ARM is that lame.
>
> and your program executes this on all the cpus in the system?

I'll remember that if I ever run a JIT on the SMP ARM box.
(there's like one, at the manufacturer, right?)

I don't recall seeing such code in the libgcc trampoline
setup for PowerPC. Either it's not required, or this is
a rather popular bug.

Perhaps ARM needs syscalls for this, or emulation for the
privileged instructions. This may already exist; it sure
is required. So this would be another need for properly
supporting JIT emulators.
Re: JIT emulator needs
On 6/22/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > > > > and these methods also destroy yourself on any machine with a looser
> > > > > cache coherency between I and D-cache
> > > > >
> > > > > for all but x86 you pretty much have to do the mprotect() between the
> > > > > two states to deal with the cache flushing properly...
> > > >
> > > > If the instructions to force data write-back and/or to
> > > > invalidate the instruction cache are privileged, yes.
> > > > AFAIK, only ARM is that lame.
> > >
> > > and your program executes this on all the cpus in the system?
>
> no I meant that you had to call your userspace instruction on all cpus,
> so on all-but-arm (from the Intel side I know IA64 needs such a flush,
> but I'm pretty sure PPC does too)

I understood. AFAIK, it is common to propagate this via a
special bus cycle. Section 5.1.5.2.1 of the PowerPC manual
states that this is so. Section 5.1.5.2 lists the requirements
for both uniprocessor and multiprocessor. Note that Linux uses
the coherent memory model for PowerPC SMP. See also the "icbi"
instruction description, where the use of an address-only
broadcast is mentioned.

> > I don't recall seeing such code in the libgcc trampoline
> > setup for PowerPC. Either it's not required, or this is
> > a rather popular bug.
>
> I suspect it'll be playing under the assumption that going from "no
> code" to "code" is fine since the icache is cold.

A previous trampoline would ruin that.

Fortunately, PowerPC is not as brain-dead as ARM and IA64.
(not that I'm writing code for any of these)
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Neat! It's great to see somebody else waking up to the idea that
storage media is NOT to be trusted.

Judging by the design paper, it looks like your structs have some
alignment problems.

The usual wishlist:

* inode-to-pathnames mapping
* a subvolume that is a single file (disk image, database, etc.)
* directory indexes to better support Wine and Samba
* secure delete via destruction of per-file or per-block random crypto keys
* fast (seekless) access to normal-sized SE Linux data
* atomic creation of copy-on-write directory trees
* immutable bits like UFS has
* hole punch ability
* insert/delete ability (add/remove a chunk in the middle of a file)
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote: On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: > The usual wishlist: > > * inode-to-pathnames mapping This one I'll code, it will help with inode link count verification. I want to be able to detect at run time that an inode with a link count of zero is still actually in a directory. So there will be back pointers from the inode to the directory. Great, but fsck improvement wasn't on my mind. This is a desirable feature for the NFS server, and for regular users. Think about a backup program trying to maintain hard links. Also, the incremental backup code will be able to walk the btree to find inodes that have changed, and the backpointers will help make a list of file names that need to be rsync'd or whatever. > * a subvolume that is a single file (disk image, database, etc.) subvolumes can be made that have a single file in them, but they have to be directories right now. Doing otherwise would complicate mounts and other management tools (inside the btree, it doesn't really matter). Bummer. As I understand it, ZFS provides this. :-) > * directory indexes to better support Wine and Samba > * secure delete via destruction of per-file or per-block random crypto keys I'd rather keep secure delete as a userland problem (or a layered FS problem). When you take backups and other copies of the file into account, it's a bigger problem than btrfs wants to tackle right now. It can't be a userland problem if you allow disk blocks to move. Volume resizing, logging/journalling, etc. -- they combine to make the userland solution essentially impossible. (one could wipe the whole partition, or maybe fill ALL space on the volume) I think it needs to be per-extent. At each level in the btree, you place a randomly generated key for the more leafward nodes. This means that secure deletion is merely the act of wiping the key... which can itself occur by wiping the key of the more rootward node. 
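The key-per-level idea can be sketched in a few lines. This is a toy model of one possible reading of the proposal, not btrfs code: every name here (Node, seal, unseal, keystream_xor) is invented for illustration, and the XOR "cipher" built from SHA-256 is NOT secure, it only stands in for a real one so the example stays in the standard library.

```python
import hashlib
import os


def keystream_xor(key, data):
    """Toy XOR stream cipher from SHA-256 in counter mode (illustration only)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class Node:
    """A tree node whose random key is stored wrapped by its parent's key."""

    def __init__(self, parent=None):
        self.parent = parent
        key = os.urandom(32)
        # Only the root key is stored as-is; every other key is wrapped.
        self.stored_key = key if parent is None else keystream_xor(parent.unwrap(), key)

    def unwrap(self):
        """Recover this node's key by walking rootward."""
        if self.stored_key is None:
            raise KeyError("key wiped")
        if self.parent is None:
            return self.stored_key
        return keystream_xor(self.parent.unwrap(), self.stored_key)

    def wipe(self):
        """Secure delete: destroy only this node's stored key."""
        self.stored_key = None


def seal(node, plaintext):
    return keystream_xor(node.unwrap(), plaintext)


def unseal(node, ciphertext):
    return keystream_xor(node.unwrap(), ciphertext)
```

Wiping one interior node makes every extent below it unreadable without touching the extents themselves, no matter where the allocator has moved their blocks. That is the property a userland overwrite cannot give you.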
> * atomic creation of copy-on-write directory trees Do you mean something more fine grained than the current snapshotting system? I believe so. Example: I have a linux-2.6 directory. It's not a mount point or anything special like that. I want to copy it to a new directory called wip, without actually copying all the blocks. To all the normal POSIX API stuff, this copy should look like the result of "cp -a", not hard links. > * insert/delete ability (add/remove a chunk in the middle of a file) The disk format makes this O(extent records past the chunk). It's possible to code but it would not be optimized. That's understandable, but note that Reiserfs can support this. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote: On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote: > On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote: > >On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: > >> * secure delete via destruction of per-file or per-block random crypto > >keys > > > >I'd rather keep secure delete as a userland problem (or a layered FS > >problem). When you take backups and other copies of the file into > >account, it's a bigger problem than btrfs wants to tackle right now. > > It can't be a userland problem if you allow disk blocks to move. > Volume resizing, logging/journalling, etc. -- they combine to make > the userland solution essentially impossible. (one could wipe the > whole partition, or maybe fill ALL space on the volume) Right about here is where I would insert a long story about ecryptfs, or encryption solutions that happen all in userland. At any rate, it is outside the scope of v1.0, even though I definitely agree it is an important problem for some people. I'm sure you do have a nice long story, and I'm sure it seems correct, but there is something not quite right about the add-on hacks. BTW, I'm suggesting that this be about deletion, not protection of data you wish to keep. It covers more than just file bodies. It covers inode data, block allocations, etc. > >> * atomic creation of copy-on-write directory trees > > > >Do you mean something more fine grained than the current snapshotting > >system? > > I believe so. Example: I have a linux-2.6 directory. It's not > a mount point or anything special like that. I want to copy > it to a new directory called wip, without actually copying > all the blocks. To all the normal POSIX API stuff, this copy > should look like the result of "cp -a", not hard links. This would be a snapshot, which has to be done on a subvolume right now. 
It is not as nice as being able to pick a random directory, but I've only been able to get this far by limiting the feature scope significantly. What I did do was make subvolumes very cheap...just make a bunch of them. Can a regular user create and use a subvolume? If not, then this doesn't work. (if so, then I have other concerns...) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [TOMOYO 5/9] Memory and pathname management functions.
Christoph Hellwig writes: On Thu, Jun 14, 2007 at 04:36:09PM +0900, Kentaro Takeda wrote: We limit the maximum length of any string data (such as domainname and pathnames) to TOMOYO_MAX_PATHNAME_LEN (which is 4000) bytes to fit within a single page. Userland programs can obtain the amount of RAM currently used by TOMOYO from /proc interface. Same NACK for this as for AppArmor, on exactly the same grounds. Please stop wasting your time on pathname-based non-solutions. This issue is a very very small wart on an otherwise fine idea. It's really not worth getting bothered by. Truth is, big giant pathnames break lots of stuff already, both kernel and userspace. Just look in /proc for some nice juicy kernel breakage: cwd, exe, fd/*, maps, mounts, mountstats, root, smaps So, is that a NACK for the /proc filesystem too? :-) We even limit filenames to 255 chars; just the other day a Russian guy was complaining that his monstrous filenames on a vfat filesystem could not be represented in UTF-8 mode. Both TOMOYO and AppArmor are good ideas. At minimum, one of them ought to be accepted. My preference would be TOMOYO, having origins untainted by Novell's Microsoft dealings. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [TOMOYO 5/9] Memory and pathname management functions.
On 6/15/07, Pavel Machek <[EMAIL PROTECTED]> wrote: [Albert Cahalan] > It's really not worth getting bothered by. Truth is, big > giant > pathnames break lots of stuff already, both kernel and > userspace. > Just look in /proc for some nice juicy kernel breakage: > cwd, exe, fd/*, maps, mounts, mountstats, root, smaps Well, but we should be fixing that, not adding more. And /proc is info-only, while this is security related code. Security tools read from /proc, so /proc is security-related. The limit imposed by TOMOYO (or AppArmor) is fine, despite being security-related. It just needs to fail in the safe direction: access denied. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
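"Fail in the safe direction" can be stated as code. A minimal sketch, assuming a hypothetical check_access helper and a plain set of allowed paths; this is not how TOMOYO or AppArmor is implemented, and whether the real limit is inclusive is a guess. The only number taken from the thread is the 4000-byte limit.

```python
TOMOYO_MAX_PATHNAME_LEN = 4000  # limit quoted earlier in the thread


def check_access(pathname, allowed):
    """Return True only if the path fits the limit and is explicitly allowed."""
    if len(pathname.encode("utf-8")) > TOMOYO_MAX_PATHNAME_LEN:
        # Too long to represent: deny, never truncate and match a prefix.
        return False
    return pathname in allowed
```

The point is the first branch: an over-long name is refused outright, so the limit can cost you an access but cannot grant one.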
partially mounted cifs filesystem
I had one share mounted, from XP to Linux, and wanted another. At first I had an incorrect setting on the XP box, almost certainly related to permissions. The mount failed of course. Running "mount" showed that the filesystem was not mounted, but apparently it didn't remain fully unmounted either. There was also nothing under the mount point, and the "ls -l" data (directory size and link count) looked like ext3. I changed settings on the XP box numerous times. After many frustrating attempts, I ran "umount" on the mount point and then successfully mounted the filesystem. I'll guess that the kernel returned an error for my early attempts at mounting, but left open a CIFS connection. I suppose the cifs error handling is buggy. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: partially mounted cifs filesystem
On 7/7/07, Satyam Sharma <[EMAIL PROTECTED]> wrote: On 7/7/07, Albert Cahalan <[EMAIL PROTECTED]> wrote: I had one share mounted, from XP to Linux, and wanted another. At first I had an incorrect setting on the XP box, almost certainly related to permissions. The mount failed of course. Running "mount" showed that the filesystem was not mounted, but apparently it didn't remain fully unmounted either. There was also nothing under the mount point, and the "ls -l" data (directory size and link count) looked like ext3. That means nothing was mounted there ... I changed settings on the XP box numerous times. After many frustrating attempts, I ran "umount" on the mount point and then successfully mounted the filesystem. ... but still umount succeeded? Didn't it complain about nothing being mounted there in the first place? Surprising that it actually resolved the problem ... It complained, and it resolved the problem. I'll guess that the kernel returned an error for my early attempts at mounting, but left open a CIFS connection. I suppose the cifs error handling is buggy. Yes, that could be the case. Could you please: 1. Tell us which kernel version was it? .config? 2. Was there some dmesg output from the failed mount(2) attempt? 3. What was the mount command line / options? Server: Windows XP service pack 2, recently updated Client: Fedora kernel 2.6.20-1.3094.fc7, mount.cifs version 1.10 My xterm still had the commands in the scrollback buffer. I added a few, grepping dmesg and /etc/fstab, and chopped out the unrelated stuff. Note that the number in my command prompt is the exit code of the previous command; these are all correct despite editing out the unrelated commands. There are some interesting error messages, plus a lock order warning that mentions cifs. Note that I have numerous cifs shares mounted, so not every log message relates to this one. Then: 1. Rebuild kernel with CIFS_DEBUG2. 2. 
Revert back (on the XP share export side) to the buggy / incorrect settings -- so that you can try and reproduce the problem.
3. Let us know if you could reproduce, if so, any debug output / etc?

I probably spent a week messing with Windows settings. I switched back and forth between simple file sharing and not, adjusted many registry settings related to anonymous/guest treatment, redid the ACLs more times than I care to think about...

There really isn't any hope I could get back to the original settings. My best guess would be something related to an ACL for guest, everybody, SYSTEM, or anonymous, or something related to the checkboxes for client permissions in the file sharing dialog. At one time I had a deny ACL.

Here you go. The fstab lines will be word wrapped in this email, but are not word wrapped in the file.

--
proc 0 # mount /mnt/vm/sc
Password:
mount error 11 = Resource temporarily unavailable
Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)
proc 255 # smbclient -L //192.168.1.141
Password:
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]

        Sharename       Type      Comment
        ---------       ----      -------
        IPC$            IPC       Remote IPC
        sourcecode      Disk
        ADMIN$          Disk      Remote Admin
        C$              Disk      Default share
        homedir         Disk
session request to 192.168.1.141 failed (Called name not present)
session request to 192 failed (Called name not present)
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]

        Server               Comment
        ---------            -------

        Workgroup            Master
        ---------            -------
proc 0 # smbclient //192.168.1.141/sourcecode
Password:
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]
smb: \> ls
  .                D        0  Wed Dec  6 18:12:30 2006
  ..               D        0  Wed Dec  6 18:12:30 2006
  development      D        0  Mon Jul  2 15:10:15 2007
  legacy           D        0  Wed Dec  6 22:29:42 2006
  libraries        D        0  Mon Jul  2 16:03:25 2007
  mmm              D        0  Mon Jul  2 16:53:27 2007
  re               D        0  Mon Jul  2 17:39:34 2007
  s                D        0  Mon Jul  2 17:46:23 2007
  thirdparty       D        0  Mon Jul  2 18:05:05 2007

                40931 blocks of size 524288.
18955 blocks available
smb: \> q
proc 0 # mount /mnt/vm/sc
Password:
mount error 11 = Resource temporarily unavailable
Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)
proc 255 # ls -l /mnt/vm/sc
total 0
proc 0 # ls -l /mnt/vm
total 2
drwxr-xr-x 1 root root    0 2007-07-03 17:43 homedir
drwxr-xr-x 2 root root 1024 2007-07-03 13:30 sc
proc 0 # ls -al /mnt/vm/sc
total 4
drwxr-xr-x 2 root ro
Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK
Robin Holt writes: On Mon, Apr 09, 2007 at 08:36:21AM -0600, Eric W. Biederman wrote: Robin Holt <[EMAIL PROTECTED]> writes: I would say this is more a benefit than a problem. With a couple of these systems we are testing, the number of kernel threads is far greater than the number of user processes and having pstree not normally show them, but maybe have an option we add later to show them again would be beneficial. Sure. Robin how many kernel thread per cpu are you seeing? 10. This has long been rotten. Mind fixing it for us? :-) We have N types of thread on M CPUs. Pick something, N or M, to be at the top level in /proc. The other goes below, in the per-process task directories. You then have either N or M things showing up in ps, not N*M. Note that both ps and top can print the CPU number just fine. Abusing the task name for this is just retarded. This suggests that the top level should be the type of task, with the lower level in /proc/*/task being per-CPU and not needing distinct naming at all. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
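To make the N-versus-N*M point concrete, here is a small sketch that folds per-CPU thread names into one entry per thread type. It assumes the current "name/cpu" naming convention that the message complains about; group_by_type is a made-up helper, not anything in procps.

```python
from collections import defaultdict


def group_by_type(thread_names):
    """Group per-CPU kernel thread names such as 'events/0' by thread type."""
    groups = defaultdict(list)
    for name in thread_names:
        kind, sep, cpu = name.partition("/")
        if sep and cpu.isdigit():
            groups[kind].append(int(cpu))
        else:
            groups[name].append(None)  # not a per-CPU thread
    # Sort CPU numbers; None (no CPU) sorts first.
    return {kind: sorted(cpus, key=lambda c: -1 if c is None else c)
            for kind, cpus in groups.items()}
```

With the type at the top level of /proc and the per-CPU instances under /proc/*/task, ps would show one line per key of that dictionary instead of one per (type, CPU) pair.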
Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK
Jan Engelhardt writes:
On Apr 10 2007 17:47, Jan Engelhardt wrote:
On Apr 8 2007 20:57, Oleg Nesterov wrote:

Anyway, re-parenting to swapper breaks pstree, it doesn't show kernel threads. And if ->parent == /sbin/init, we can't remove us from ->children (unless we forbid sub-thread-of-init exec). So the only safe change is set ->exit_state = -1.

Then we have to fix pstree and all that. (In fact, I'm trying to patch `ps f` to DTRT ;p)

Done that and the result is that `ps afwx` now looks like:

  PID TTY      STAT   TIME COMMAND
 2722 ?        S      0:00 [lockd]
...
    3 ?        S<     0:00 [events/0]
    2 ?        SN     0:00 [ksoftirqd/0]
    1 ?        Ss     0:02 init [3]
  537 ?        S ...

-if(self_pid==1 && ADOPTED(processes[i]) && forest_type!='u')
+if(ADOPTED(processes[i]) && forest_type!='u')

That's not compatible because init's children are now in the logical place. Since the days of procps-1.x.x or earlier, such processes have been listed at top level.

BTW, what does "ps -ejH" do for you, with and without the patch?

I'd be a lot happier about breaking compatibility in this area if I could get a functional adoption flag. That is, I really would like to show a process as child of init if it naturally was created as a child of init. It's less informative to have fake children showing up the same as real ones. The original parent PID would do. (BTW, the original parent name and/or grandparent PID would be great to have)

As a bonus, the kernel could reap these processes more quickly than init can... and then maybe we can stop caring if init is alive.
Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK
On 5/29/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> "Albert Cahalan" <[EMAIL PROTECTED]> writes:
> > Jan Engelhardt writes:
> > > -if(self_pid==1 && ADOPTED(processes[i]) && forest_type!='u')
> > > +if(ADOPTED(processes[i]) && forest_type!='u')
> >
> > That's not compatible because init's children are now
> > in the logical place. Since the days of procps-1.x.x or
> > earlier, such processes have been listed at top level.
> >
> > BTW, what does "ps -ejH" do for you, with and without the patch?
>
> ps -ejH displays everything.

That's not what I mean. (the "-e" causes that of course)
I'm asking about the parent-child relationships shown.
The "-H" option is a bit different from the "f" option.

> > I'd be a lot happier about breaking compatibility in this area
> > if I could get a functional adoption flag. That is, I really would
> > like to show a process as child of init if it naturally was created
> > as a child of init. It's less informative to have fake children
> > showing up the same as real ones. The original parent PID would do.
> > (BTW, the original parent name and/or grandparent PID would be
> > great to have)
> >
> > As a bonus, the kernel could reap these processes more quickly
> > than init can... and then maybe we can stop caring if init is alive.
>
> Having the kernel not reparent user processes to init is an interesting
> idea, especially when those processes have not existed. I'm not certain
> that is POSIX compliant and otherwise backwards compatible.

I'm not suggesting that this be visible via POSIX APIs. It's almost certainly a given that getppid() must return 1, and probably /proc needs to show this as well.

Without question, any process created by init must be reaped by init. Processes NOT created by init could be silently reaped by the kernel. They need to see their own PPID as 1, but there need not be any parent-child relationship in the kernel data structures.
The kernel can fake the whole thing, which is nice because then the kernel isn't depending on userspace to correctly perform the pointless action of playing with zombies. (might setting the death signal to 0 be useful here?) For "ps fax" and such, I'd like to distinguish between init's real and adopted children. Right now the adopted children look like they were created by init, which is not true. I only need a simple boolean flag, set upon reparenting, to tell me. Such a flag may also be useful for optimizing away the whole wait/waitpid/wait4/waitid/wait3 nonsense when an adopted child dies. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
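Here is roughly what ps could do with such a flag. A minimal sketch: the per-process "adopted" boolean is exactly the thing being requested, so it is hypothetical, and build_forest is an illustrative name rather than procps code.

```python
def build_forest(procs):
    """Build a process forest from a list of (pid, ppid, adopted) tuples.

    'adopted' is the hypothetical flag set by the kernel on reparenting.
    Returns (children, roots): a ppid -> [pid] map and the top-level pids.
    """
    pids = {pid for pid, _, _ in procs}
    children = {}
    roots = []
    for pid, ppid, adopted in procs:
        if adopted or ppid not in pids or ppid == pid:
            roots.append(pid)  # adopted processes are listed at top level
        else:
            children.setdefault(ppid, []).append(pid)
    return children, roots
```

For example, with init (pid 1), a getty it really spawned (100) and an orphaned daemon it merely inherited (200), only 100 is drawn under init; 200 starts its own tree, which matches the traditional top-level listing while keeping init's real children in the logical place.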
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar writes: looking over the list of our new generic APIs (see further below) i think there are three important things that are needed for an API to become widely used: 1) it should solve a real problem (ha ;-), it should be intuitive to humans and it should fit into existing things naturally. 2) it should be ubiquitous. (if it's about IO it should cover block IO, network IO, timers, signals and everything) Even if it might look silly in some of the cases, having complete, utter, no compromises, 100% coverage for everything massively helps the uptake of an API, because it allows the user-space coder to pick just one paradigm that is closest to his application and stick to it and only to it. 3) it should be end-to-end supported by glibc. 4) At least slightly portable. Anything supported by any similar OS is already ahead, even if it isn't the perfect API of our dreams. This means kqueue and doors. If it's not on any BSD or UNIX, then most app developers won't touch it. Worse yet, it won't appear in programming books, so even the Linux-only app programmers won't know about it. Running ideas by the FreeBSD and OpenSolaris developers wouldn't be a bad idea. Agreement leads to standardization, which leads to interfaces getting used. BTW, wrapper libraries that bury the new API under a layer of gunk are not helpful. One might as well just use the old API. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] remove PAGE_SIZE from headers_install
Olaf Hering writes:
On Sat, Jul 14, H. Peter Anvin wrote:
Olaf Hering wrote:

Declare PAGE_SIZE as getpagesize() for userspace. PAGE_SIZE is used in resource.h and shm.h

I would think it would be better to not define it at all. Several architectures already don't have PAGE_SIZE visible to userspace in any way.

i386 has it, so everyone uses it.

Since i386 was the first architecture and is still probably the most common architecture (x86_64 being 30% AFAIK), i386 sets the standard for the Linux API. Several architectures are broken and thus suffering from incompatibility.

A real constant-value PAGE_SIZE is useful and doable.

It's useful because a getpagesize() can't be used for numerous things, such as setting the size of an array.

It's doable, even on architectures that support multiple page sizes, because ABIs specify alignment requirements. There are two alignments of interest here:

a. the smallest that mmap() will ever naturally return on any correct implementation of the architecture's ABI ("naturally" meaning that MAP_FIXED was not used)

b. the smallest that mprotect() will tolerate on all correct implementations of the architecture

Pick either to be the Linux definition of PAGE_SIZE.

For example, if an architecture is specified to have a page size of at least 4 K but no more than 64 K, then mprotect() will only tolerate 64 K on all correct implementations of the architecture. The ABI might allow mmap() to naturally return 4 K aligned data, but might instead require 64 K alignment. Assuming 4 K, then the mmap() value doesn't match the mprotect() value. Either one will do as the value of PAGE_SIZE, as long as this is standardized in the way that breaks the least code.
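The arithmetic behind definition (b) is easy to check. A small sketch, assuming an architecture whose ABI allows page sizes of 4 K, 16 K and 64 K (the 16 K entry is my own addition for illustration; the 4 K and 64 K bounds come from the example above). Both helper names are made up.

```python
def align_up(length, alignment):
    """Round length up to a multiple of alignment (a power of two)."""
    return (length + alignment - 1) & ~(alignment - 1)


def valid_for_all(length, page_sizes):
    """True if length is page-aligned under every page size the ABI permits."""
    return all(length % size == 0 for size in page_sizes)
```

A length rounded up to the largest permitted page size is acceptable to mprotect() on every conforming kernel, because every smaller power-of-two page size divides it. A length rounded to the smallest size is not, which is why a single compile-time constant is workable as long as it is the ABI maximum.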
Re: [PATCH] remove PAGE_SIZE from headers_install
On 7/14/07, David Miller <[EMAIL PROTECTED]> wrote:
> From: "Albert Cahalan" <[EMAIL PROTECTED]>
> Date: Sat, 14 Jul 2007 22:48:57 -0400
>
> > A real constant-value PAGE_SIZE is useful and doable.
>
> It's bogus to use it. The kernel can get recompiled to arbitrary
> page sizes on some architectures, so a constant page size
> assumption cannot work.

Sure it can work. The ABI specifies limits on such things. Probably the most appropriate size is the one specified for alignment of ELF sections. If I remember right, it's 64 K for the PowerPC ABI. This allows for 64 K pages, even though many chips offer 4 K pages.
Re: [PATCH] LogFS take three
Please don't forget the immutable bit. ("man lsattr") Having both, BSD-style, would be even better. The immutable bit is important for working around software bugs and "features" that damage files. I also can't find xattr support.
RE: slow open() calls and o_nonblock
David Schwartz writes:
[Aaron Wiebe]

open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

How could they make any difference? I can't think of any conceivable way they could.

Now, I'm a userspace guy so I can be pretty dense, but shouldn't a call with a nonblocking flag return EAGAIN if it's going to take anywhere near 415ms? Is there a way I can force opens to EAGAIN if they take more than 10ms?

There is no way you can re-try the request. The open must either succeed or not return a handle. It is not like a 'read' operation that has an "I didn't do anything, and you can retry this request" option.

If 'open' returns a file handle, you can't retry it (since it must succeed in order to do that, failure must not return a handle). If 'open' doesn't return a file handle, you can't retry it (because, without a handle, there is no way to associate a future request with this one, if it creates a file, the file must not be created if you don't call 'open' again).

The 'open' function must, at minimum, confirm that the file exists (or doesn't exist and can be created, or whatever). This takes however long it takes on NFS.

This is not the case, though we might need to allocate a new flag to avoid breaking things. Let open() with O_UNCHECKED always return a file descriptor, except perhaps when failure can be identified without doing IO. The "real" open then proceeds in the background.

From poll() or select(), you can see that the file descriptor is not ready for anything. Eventually it becomes ready for IO or reports an error condition. Both select() and poll() are capable of reporting errors.

If the "real" (background) open() fails, then the only valid operation is close(). Attempts to do anything else get EBADFD or ESTALE.

You'll also need a background close().
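O_UNCHECKED does not exist; it is only the proposal above. What can be done today is a userspace approximation: run the blocking open() in a worker thread and hand the caller something select() can watch. A rough sketch, with BackgroundOpen and wait_ready as made-up names:

```python
import os
import select
import threading


class BackgroundOpen:
    """Run open() in a worker thread; poll this object to learn when it is done."""

    def __init__(self, path, flags, mode=0o644):
        self._ready_r, self._ready_w = os.pipe()
        self.fd = None
        self.error = None
        self._thread = threading.Thread(target=self._run, args=(path, flags, mode))
        self._thread.start()

    def _run(self, path, flags, mode):
        try:
            self.fd = os.open(path, flags, mode)  # may block for a long time on NFS
        except OSError as exc:
            self.error = exc
        os.write(self._ready_w, b"x")  # wake anyone polling

    def fileno(self):
        return self._ready_r  # lets select()/poll() watch this object

    def result(self):
        """Call once the object polls readable; returns the fd or raises OSError."""
        self._thread.join()
        os.close(self._ready_r)
        os.close(self._ready_w)
        if self.error is not None:
            raise self.error
        return self.fd


def wait_ready(pending, timeout):
    """Return the pending opens that have completed within timeout seconds."""
    readable, _, _ = select.select(pending, [], [], timeout)
    return readable
```

This differs from the proposal in one important way: the caller gets a completion handle, not the final descriptor, so the association problem David describes is solved by the wrapper object rather than by the kernel. It also costs a thread and a pipe per pending open, which is the overhead an in-kernel flag would avoid.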
Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid
Eric W. Biederman writes: Badari Pulavarty <[EMAIL PROTECTED]> writes:

Your recent cleanup to shm code, namely

[PATCH] shm: make sysv ipc shared memory use stacked files

took away one of the debugging feature for shm segments. Originally, shmid were forced to be the inode numbers and they show up in /proc/pid/maps for the process which mapped this shared memory segments (vma listing). That way, it's easy to find out who all mapped this shared memory segment. Your patchset, took away the inode# setting. So, we can't easily match the shmem segments to /proc/pid/maps easily. (It was really useful in tracking down a customer problem recently). Is this done deliberately ? Anything wrong in setting this back ?

Theoretically it makes the stacked file concept more brittle, because it means the lower layers can't care about their inode number.

We do need something to tie these things together.

So I suspect what makes most sense is to simply rename the dentry SYSVID

Please stop breaking things in /proc. The pmap command relies on the old behavior.

It's time to revert. Put back the segment ID where it belongs, and leave the key where it belongs too.

Containers are NOT worth breaking our ABIs left and right. We don't need to leap off that bridge just because Solaris did, unless you can explain why complexity and bloat are desirable. We already have SE Linux, chroot, KVM, and several more!
Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid
On 6/6/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
On Wed, 6 Jun 2007 23:27:01 -0400 "Albert Cahalan" <[EMAIL PROTECTED]> wrote:

> Eric W. Biederman writes:
> > Badari Pulavarty <[EMAIL PROTECTED]> writes:
>
> >> Your recent cleanup to shm code, namely
> >>
> >> [PATCH] shm: make sysv ipc shared memory use stacked files
> >>
> >> took away one of the debugging feature for shm segments.
> >> Originally, shmid were forced to be the inode numbers and
> >> they show up in /proc/pid/maps for the process which mapped
> >> this shared memory segments (vma listing). That way, it's easy
> >> to find out who all mapped this shared memory segment. Your
> >> patchset, took away the inode# setting. So, we can't easily
> >> match the shmem segments to /proc/pid/maps easily. (It was
> >> really useful in tracking down a customer problem recently).
> >> Is this done deliberately ? Anything wrong in setting this back ?
> >
> > Theoretically it makes the stacked file concept more brittle,
> > because it means the lower layers can't care about their inode
> > number.
> >
> > We do need something to tie these things together.
> >
> > So I suspect what makes most sense is to simply rename the
> > dentry SYSVID
>
> Please stop breaking things in /proc. The pmap command relies
> on the old behavior.

What effect did this change have upon the pmap command? Details, please.

> It's time to revert.

Probably true, but we'd need to understand what the impact was.

Very simply, pmap reports the shmid.
albert 0 ~$ pmap `pidof X` | egrep -2 shmid
3005        16384K rw-s-  /dev/fb0
3105          152K rw---    [ anon ]
31076000      384K rw-s-    [ shmid=0x3f428000 ]
310d6000      384K rw-s-    [ shmid=0x3f430001 ]
31136000      384K rw-s-    [ shmid=0x3f438002 ]
31196000      384K rw-s-    [ shmid=0x3f440003 ]
311f6000      384K rw-s-    [ shmid=0x3f448004 ]
31256000      384K rw-s-    [ shmid=0x3f450005 ]
312b6000      384K rw-s-    [ shmid=0x3f460006 ]
31316000      384K rw-s-    [ shmid=0x3f870007 ]
31491000      140K r  /usr/share/fonts/type1/gsfonts/n021003l.pfb
3150e000     9496K rw---    [ anon ]
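For readers who have not looked at how a tool gets from /proc/pid/maps to a shmid, here is a rough sketch of the parsing involved. It is not the actual pmap source. It assumes the usual maps line layout (address range, permissions, offset, device, inode, path), assumes shm segments show up with a "/SYSV" name followed by the key in hex, and assumes device major 0 marks them as not being regular files. The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct shm_map {
    unsigned long start, end;          /* mapped address range */
    unsigned int dev_major, dev_minor; /* device from the maps line */
    unsigned long inode;               /* doubles as the shmid (old behavior) */
    unsigned int key;                  /* hex key taken from the /SYSV name */
};

/*
 * Parse one line of /proc/<pid>/maps.
 * Returns 1 if it looks like a SysV shm segment, 0 otherwise.
 */
static int parse_shm_line(const char *line, struct shm_map *m)
{
    char perms[8], path[256];
    unsigned long offset;
    int n;

    assert(line && m);
    n = sscanf(line, "%lx-%lx %7s %lx %x:%x %lu %255[^\n]",
               &m->start, &m->end, perms, &offset,
               &m->dev_major, &m->dev_minor, &m->inode, path);
    if (n < 8)
        return 0;                      /* anonymous mapping or short line */
    if (m->dev_major != 0)
        return 0;                      /* assumed: a real file named /SYSV... */
    return sscanf(path, "/SYSV%8x", &m->key) == 1;
}
```

Fed a line such as "31076000-310d6000 rw-s 00000000 00:09 1061322752 /SYSV00000000 (deleted)" (the device numbers are made up), the inode field is 0x3f428000, which is the first shmid in the pmap output above. That is the link that breaks when the inode stops being the shmid.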
Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid
On 6/7/07, Badari Pulavarty <[EMAIL PROTECTED]> wrote:

BTW, I agree with Eric that it would be nice to use shmid as part of name instead of forcing to be as inode number. It should be possible for pmap to work out shmid from "key" or name. Isn't it ?

It is not at all nice.

1. it's incompatible ABI breakage
2. where will you put the key then, in the inode? :-)

Changing to "SYSVID%d" is no good either. Look, people are ***parsing*** this stuff in /proc. The /proc filesystem is not some random sandbox to be playing in.

Before you go messing with it, note that the device number also matters. (it's per-boot dynamic, but that's OK) That's how one knows that /SYSV is not just a regular file; sadly these didn't get a non-/ prefix. (and no you can't fix that now; it's way too late)

Next time you feel like breaking an ABI, mind putting "LET'S BREAK AN ABI!" in the subject of your email?

BTW, I suspect this kind of thing also breaks:

a. fuser, lsof, and other resource usage display tools
b. various obscure emulators (similar to valgrind)
Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid
On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

So it looks to me like we need to do three things:
- Fix the inode number
- Fix the name on the hugetlbfs dentry to hold the key
- Add a big fat comment that user space programs depend on this behavior of both the dentry name and the inode number.

Assuming that this proposed fix goes in:

Since the inode number is the shmid, and this is a number that the kernel randomly chooses AFAIK, there should be no need to have different shm segments sharing the same inode number.

The situation with the key is a bit more disturbing, though we already hit that anyway when IPC_PRIVATE is used. (why anybody would NOT use IPC_PRIVATE is a mystery) So having the key in the name doesn't make things worse.

I have some concern about the device minor number. This should be the same for all shm mappings; I do not know if the behavior changed.
JIT emulator needs
Right now, Linux isn't all that friendly to JIT emulators. Here are the problems and suggestions to improve the situation.

There is an SE Linux execmem restriction that enforces W^X. Assuming you don't wish to just disable SE Linux, there are two ugly ways around the problem. You can mmap a file twice, or you can abuse SysV shared memory. The mmap method requires that you know of a filesystem mounted rw,exec where you can write a very large temporary file. This arbitrary filesystem, rather than swap space, will be the backing store. The SysV shared memory method requires an undocumented flag and is subject to some annoying size limits. Both methods create objects that will fail to be deleted if the program dies before marking the objects for deletion.

Processors often have annoying limits on the immediate values in instructions. An x86 or x86_64 JIT can go a bit faster if all allocations are kept to the low 2 GB of address space. There are also reasons for a 32bit-to-x86_64 JIT to choose a nearly arbitrary 2 GB region that lies above 4 GB. Other archs have other limits, such as 32 MB or 256 MB.

Sometimes it is very helpful to have the read/write mapping be a fixed offset from the read/exec mapping. A power of 2 can be especially desirable.

Emulators often need a cheap way to change page permissions. One VMA per page is no good. Besides taking up space and making many things generally slower, having one VMA per page causes a huge performance loss for snapshot roll-back operations. Just tearing down all those VMAs takes a good while.

Additions to better support JIT emulators:

a. sysctl to set IPC_RMID by default
b. shmget() flag to set IPC_RMID by default
c. open() flag to unlink a file before returning the fd
d. mremap() flag to always keep the old mapping
e. mremap() flag to get a read/write mapping of a read/exec one
f. mremap() flag to get a read/exec mapping of a read/write one
g. mremap() flag to make the 5th arg (new addr) be the upper limit
h. 6-bit wide mremap() "flag" to set the upper limit above given base
i. support the prot argument to remap_file_pages
j. a documented way (madvise?) to punch same-VMA zero-page holes
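The "mmap a file twice" workaround described above can be sketched roughly as follows. This is only an illustration of the technique, not code from any particular emulator: the /tmp path is an assumption (it must be on a filesystem mounted rw,exec, which is exactly the annoyance being complained about), and the struct and function names are invented.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

struct dual_map {
    unsigned char *rw;   /* read/write view, used to emit code */
    unsigned char *rx;   /* read/exec view of the same pages */
    size_t len;
};

/* Returns 0 on success, -1 on failure (for example a noexec mount). */
static int dual_map_create(struct dual_map *d, size_t len)
{
    char path[] = "/tmp/jit-XXXXXX";  /* assumed rw,exec location */
    int fd;

    assert(d && len > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                     /* a crash before this line leaves debris */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    /* Two shared mappings of one file: each is born with its final
     * permissions, so no mprotect() transition is ever requested. */
    d->rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    d->rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);                        /* the mappings keep the file alive */
    if (d->rw == MAP_FAILED || d->rx == MAP_FAILED) {
        if (d->rw != MAP_FAILED)
            munmap(d->rw, len);
        if (d->rx != MAP_FAILED)
            munmap(d->rx, len);
        return -1;
    }
    d->len = len;
    return 0;
}
```

Note what the sketch cannot do: it has no say over where the two views land relative to each other, the backing store is the filesystem rather than swap, and there is a window between mkstemp() and unlink() where a crash leaves the file behind. Items c through h in the list are aimed at those gaps.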
Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid
On 6/8/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
"Albert Cahalan" <[EMAIL PROTECTED]> writes:

> On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>> So it looks to me like we need to do three things:
>> - Fix the inode number
>> - Fix the name on the hugetlbfs dentry to hold the key
>> - Add a big fat comment that user space programs depend on this
>> behavior of both the dentry name and the inode number.
>
> Assuming that this proposed fix goes in:
>
> Since the inode number is the shmid, and this is a number
> that the kernel randomly chooses AFAIK, there should be
> no need to have different shm segments sharing the same
> inode number.

Where we run into inode number confusion is that all of these shm segments are actually files on a tmpfs filesystem somewhere, and by making the inode number the shmid we lose the tmpfs inode number. So it is possible we get tmpfs inode number conflicts. However the inode number is not used for anything, and the files are not visible in any other way except as shm segments so it doesn't matter.

Eh, the kernel chooses both shmid and tmpfs inode number. You could set a high bit in one or the other.

There is another case with ipc namespaces where we ultimately need to support duplicate shmids on the same machine (so migration is a possibility). However by and large the user space processes with duplicate ids should be invisible to each other.

On the bright side, this only screws up people who get the crazy idea that processes can be migrated.

> The situation with the key is a bit more disturbing, though
> we already hit that anyway when IPC_PRIVATE is used.
> (why anybody would NOT use IPC_PRIVATE is a mystery)
> So having the key in the name doesn't make things worse.

Having "SYSV" in the name appears mandatory. Otherwise you don't even know it is a shm file. Although I may be confused.

It's mandatory for a different reason: to satisfy parsers. It is nearly useless for identifying shm files. Look what I can do:

touch /SYSV
touch '/SYSV (deleted)'

(so pmap creates a shm, looks for the address in /proc/self/maps, determines the device major/minor in use, and then uses that)

Hmm. Thinking about this I have just realized that we may want to approach this a little differently. Currently I am reusing the dentry and inode structure that hugetlbfs and tmpfs return me, and simply have a distinct struct file for each shm mapping.

There is a little more cost but it may actually make sense to have a dentry and inode that is specific to shm.c so we can do whatever we need to without adding requirements to the normal tmpfs or hugetlb code.

Piggybacking on tmpfs has always seemed a bit dirty to me.
Re: JIT emulator needs
On 6/8/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:
Albert Cahalan wrote:

> Additions to better support JIT emulators:
>
> a. sysctl to set IPC_RMID by default

Not very good, this will break some apps.

As a sysctl, the admin gets to choose between compatibility and sanity. I can see such a sysctl also being really helpful for a shared computer used for an Operating Systems or System Programming course.

> b. shmget() flag to set IPC_RMID by default

This is better :)

Both are good. This one requires that all apps using SysV shared memory be modified to use the flag. The other requires that a very few apps be modified to tolerate a behavior change.

> c. open() flag to unlink a file before returning the fd

Well, I assume you would like

fd = open("/path/somefile", O_RDWR | O_CREAT | O_UNLINK, 0644)

(ie allocate a file handle but no name ?)

Yes.

Quite difficult to implement this atomically with current vfs, maybe a new syscall would be better. (Linus will kill me for that :) )

(We don't need to insert "somefile" in one directory, then unlink it, we only need to allocate an unnamed inode to get some backing store)

I suspect that SMB/CIFS has a native call for this. There is some sort of tmpfile flag defined over in that world.
Re: JIT emulator needs
On 6/8/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> There is an SE Linux execmem restriction that enforces W^X.

This depends on whatever SELinux rulesets you are running. It's just a good rule to have present that most programs shouldn't be self patching, and then label those that do differently.

A marking in the executable would have made more sense. It is really broken having an unprivileged user being able to create whole new executables but unable to lift this restriction on those executables.

In any case, the restriction is common and troublesome.

> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.

mmap MAP_FIXED can do this but you need to know a lot about the memory layout of the system so it gets a bit platform specific.

Yes. There are unportable programs, and UNPORTABLE ones. Memory layout can vary between vendor kernels, between normal and 32-on-64 situations, between two different C libraries...

> Emulators often need a cheap way to change page permissions.

mprotect(, range) rather than a page at a time. The kernel will do merging.

Nope. This can happen rapidly and repeatedly to pages that are essentially random. The median length of a range will be a page or two. Merging won't do very much at all.

> a. sysctl to set IPC_RMID by default
> b. shmget() flag to set IPC_RMID by default

Use POSIX shared memory

That appears to have the exact same problem.

> c. open() flag to unlink a file before returning the fd

Is it really that costly to create a blank file, why do you need to do it a lot in a JIT ?

This part isn't about cost. It's about not leaving around debris when the JIT crashes.

> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one
> g. mremap() flag to make the 5th arg (new addr) be the upper limit

This is all mprotect and munmap.

That won't get me a second mapping. Supposing that I had a second mapping, SE Linux would deny the mprotect. I'm looking for a mapping that is born executable or a mapping that is born writable, as needed, so that no transition is needed.

> h. 6-bit wide mremap() "flag" to set the upper limit above given base
> i. support the prot argument to remap_file_pages
> j. a documented way (madvise?) to punch same-VMA zero-page holes

mmap (although you get more VMAs from that) so memset() is probably genuinely cheaper if the permissions are not changing.

Well cost is the problem here. I sure can find some way to get the operation done, but it isn't cheap. For some usages, the current setup is costly enough that one must consider abandoning the hardware MMU in favor of a software one emitted as part of the JIT. :-(
Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver
Sergei Shtylyov writes: Kumar Gala wrote: [Sergei Shtylyov] Kumar Gala wrote:

I haven't looked at all the new clock/timer code, is there any utility in having support for more than one clock source?

Of course, you may register as many as you like.

Sure, but is there any utility in registering more than the decrementer on PPC?

Not yet. I'm not sure I know any other PPC CPU facility fitting for clockevents. In theory, FIT could be used -- but its period is measured in powers of 2, IIRC.

I'd really like to have that as an option. It would allow oprofile to safely use hardware events on the MPC74xx "G4" processors. Alternately it would allow thermal events. It is safe to use at most one of the three (decrementer,profiling,thermal) interrupts. If two were to hit at the same time, badness happens.

It's possible to wrap the interrupt in something that divides down, calling the normal code only some of the time. I think one of the FIT choices is about 4 kHz on my system, which would be OK.

Full oprofile functionality would be wonderful.
Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver
On 5/18/07, Sergei Shtylyov <[EMAIL PROTECTED]> wrote:
Albert Cahalan wrote:

>>> Sure, but is there any utility in registering more than the
>>> decrementer on PPC?

>> Not yet. I'm not sure I know any other PPC CPU facility fitting
>> for clockevents. In theory, FIT could be used -- but its period
>> is measured in powers of 2, IIRC.

> I'd really like to have that as an option. It would allow oprofile
> to safely use hardware events on the MPC74xx "G4" processors.
> Alternately it would allow thermal events. It is safe to use at
> most one of the three (decrementer,profiling,thermal) interrupts.
> If two were to hit at the same time, badness happens.

Unfortunately, FIT exists only on Book E CPUs and MPC74xx aren't Book E, IIUC.

By the name "FIT" perhaps, but MPC74xx has essentially the same thing.

> It's possible to wrap the interrupt in something that divides
> down, calling the normal code only some of the time. I think one
> of the FIT choices is about 4 kHz on my system, which would be OK.

Erm, are you sure you have FIT (or is your system not MPC74xx based)?

Set MMCR0[TBEE], set MMCR0[PMXE], and choose a TBL bit via MMCR0[TBSEL].

TBSEL is a 2-bit field which selects a timebase bit to use. The timebase bits that can be chosen are numbered 15, 19, 23, and 31. In the notation used by every other CPU vendor those would be bits 0, 8, 12, and 16.

Example: My system uses a TBL frequency of 24907667. This gives choices of 12453833, 48648, 3040, and 190 Hz. The lowest three of those could be useful, with 48648 only for profiling and extreme real-time.

It's also possible to trigger on the CPU cycle counter, but this would cost one of the performance counters. MPC7400 has 4, later CPUs have 6 or more, and I think xPC7x0 had only 2. This method is a bit nicer, since then one could trigger interrupts on arbitrary clock cycles without needing to write the timebase register.
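The arithmetic behind those four rates is easy to check. The sketch below assumes one event per 0-to-1 transition of the selected TBL bit, so a bit at position n (counting from the least significant bit) fires once every 2^(n+1) timebase ticks; that assumption reproduces the figures in the message. The function names are invented, and integer division truncates, so the second rate comes out as 48647 rather than the rounded 48648 quoted above.

```c
#include <assert.h>

/* Convert an IBM-style (bit 0 = MSB) bit number within the 32-bit
 * TBL to the LSB-0 numbering used by other vendors. */
static unsigned int tbl_lsb_bit(unsigned int ibm_bit)
{
    assert(ibm_bit <= 31);
    return 31 - ibm_bit;
}

/* Event rate in Hz for a given timebase frequency and IBM bit number,
 * assuming one event per 0->1 transition of that bit. */
static unsigned long long tbl_event_hz(unsigned long long tb_hz,
                                       unsigned int ibm_bit)
{
    return tb_hz >> (tbl_lsb_bit(ibm_bit) + 1);
}
```

For tb_hz = 24907667, IBM bits 31, 23, 19, and 15 map to LSB-0 bits 0, 8, 12, and 16 and give 12453833, 48647, 3040, and 190 Hz.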
Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver
On 5/19/07, Segher Boessenkool <[EMAIL PROTECTED]> wrote:
[Albert Cahalan]

> Set MMCR0[TBEE], set MMCR0[PMXE], and choose a TBL bit via
> MMCR0[TBSEL].

That's the performance monitor, which could very well be in use already (for performance monitoring stuff, who would have guessed).

It is the performance monitor, which sadly can not be used very well unless the decrementer is disabled. The hardware is buggy. As long as we use the decrementer for timekeeping, we can not safely generate performance monitor interrupts.

I'd like to have the performance monitor available. It's NOT available unless we use part of it for timekeeping. That's the choice the hardware gives us.

We can get TBL bit flip interrupts for free. We don't even need to give up one of the event counters. If we do give up one of the event counters (a rather reasonable idea), then we can count one of those TBL bit flips or the cycle counter.
setting all 3 file times
Why can we still not do this? It's a stupid restriction.

Security isn't a reason; we have SE Linux policy and auditing to take care of any issues. Heck, SE Linux policy could even deny this feature for the truly paranoid. Writing to /dev/* to update timestamps is surely a worse security situation. (see "dump" program)

Ideally we'd have atomic update in some way. That might mean feeding the old times into the system call, so that the kernel can fail it if any changes have happened meanwhile. Maybe the syscall could take a pair of "struct stat" even, making the operation really easy and powerful.
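For anyone unsure which restriction is meant, here is a small sketch using futimens(), the POSIX call for setting timestamps on an open file. It takes exactly two times, atime and mtime; there is no slot for ctime, and the kernel stamps ctime with the current time as a side effect of the call. The scratch path and the function name are assumptions made for the demo.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Create a scratch file, set its atime and mtime to the given seconds,
 * and return what stat reports afterwards. Returns 0 on success.
 */
static int stamp_scratch_file(time_t atime_s, time_t mtime_s, struct stat *st)
{
    char path[] = "/tmp/times-XXXXXX";     /* assumed writable location */
    struct timespec ts[2] = {
        { .tv_sec = atime_s, .tv_nsec = 0 },   /* [0] is atime */
        { .tv_sec = mtime_s, .tv_nsec = 0 },   /* [1] is mtime */
    };
    int fd, rc = -1;

    assert(st);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (futimens(fd, ts) == 0 && fstat(fd, st) == 0)
        rc = 0;
    unlink(path);
    close(fd);
    return rc;
}
```

After a call with two old timestamps, st_atime and st_mtime hold the requested values while st_ctime holds the time of the call. A backup tool restoring a file therefore cannot reproduce the third time through this interface.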
Re: Ext3 vs NTFS performance
Andrew Morton writes: "Cabot, Mason B" <[EMAIL PROTECTED]> wrote:

I've been testing the NAS performance of ext3/Openfiler 2.2 against NTFS/WinXP and have found that NTFS significantly outperforms ext3 for video workloads. The Windows CIFS client will attempt a poor-man's pre-allocation of the file on the server by sending 1-byte writes at 128K-byte strides, breaking block allocation on ext3 and leading to fragmentation and poor performance. This will happen for many applications (including iTunes) as the CIFS client issues these pre-allocates under the application layer.

Oh my gawd, what a stupid hack. Now we know what the MS interoperability lab has been working on.

Stupid or not, this is their protocol. The cifs filesystem driver needs a patch to do this. Probably that'll help get better performance when Linux is writing to a Windows server.
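The write pattern described above is simple to reproduce locally, which may help anyone wanting to see what it does to block allocation. This sketch is an assumption about the pattern's shape, not a capture of what the Windows client sends: it writes one zero byte at the last offset of each 128K stride, and the helper names and scratch path are invented.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIDE (128 * 1024)

/* Open an unlinked scratch file; returns an fd or -1. */
static int make_scratch(void)
{
    char path[] = "/tmp/stride-XXXXXX";   /* assumed writable location */
    int fd = mkstemp(path);

    if (fd >= 0)
        unlink(path);
    return fd;
}

/*
 * Touch one byte per 128K stride up to `size`.
 * Returns the number of 1-byte writes issued, or -1 on error.
 */
static long strided_touch(int fd, off_t size)
{
    long writes = 0;
    off_t pos;

    assert(fd >= 0 && size > 0);
    for (pos = STRIDE; pos <= size; pos += STRIDE) {
        if (pwrite(fd, "", 1, pos - 1) != 1)   /* writes a single NUL byte */
            return -1;
        writes++;
    }
    return writes;
}
```

For a 1 MB file this issues 8 writes and leaves a sparse file of the full length. The filesystem has to place one block per stride before it has seen any of the data that will surround it.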
Re: Broken process startup times after suspend (regression)
john stultz writes:

Indeed. The monotonic clock's behavior around suspend and resume is poorly defined. When we increased it, folks didn't like the fact that uptime would increase while a system was suspended.

The uptime really does need to increase during suspend. Otherwise, things get really weird with devices like the OLPC XO which will be sleeping between keystrokes. You could run the device for hours, yet get an uptime of only a few minutes.

Suspended time should get counted as stolen time, same as when a hypervisor takes away time.
Re: Long file names in VFAT broken with iocharset=utf8
Andrey Borzenkov writes:

This was posted in one of Russian forums. It was not possible to archive (under Linux, using tar) vfat directory where files had long Russian names (really long - over 150 - 170 characters) - tar returned stat failure. When looking with plain ls, file names appeared truncated.

I have an idea to deal with this, but first a rant...

At two bytes per character, you get 127 characters in a filename. That's wider than the standard 80-column display, and far wider than the 28 or 29 characters that an "ls -l" has room for. In a GUI file manager or file dialog box, you'll have to scroll sideways. In a web browser directory listing, you'll almost certainly have to scroll sideways. Much of this even applies to Windows tools.

In other words, this is user error. Somebody thought that a filename was a place to store a document, probably a README file. What next, shall we MIME-encode an icon into the filename?

Fix: the vfat driver should use the 8.3 name for such files.
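The 127-character figure follows from the 255-byte limit on a single filename (NAME_MAX on Linux) and the fact that each Cyrillic letter takes two bytes in UTF-8. A small sketch of the check, with invented helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LIMIT_BYTES 255   /* NAME_MAX on Linux, in bytes, not characters */

/* Count UTF-8 characters by skipping continuation bytes (10xxxxxx). */
static size_t utf8_chars(const char *s)
{
    size_t n = 0;

    assert(s);
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Does the encoded name fit in one directory entry name? */
static int fits_name_limit(const char *s)
{
    return strlen(s) <= NAME_LIMIT_BYTES;
}
```

A name of 127 two-byte letters is 254 bytes and fits; 128 letters is 256 bytes and does not. That is why the 150 to 170 character names from the report cannot be passed through as UTF-8, even though they are legal on the vfat side.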
Re: [PATCH 0/2] LogFS take two
You seem to be missing the immutable bit. This is really useful for dealing with buggy or badly-designed things running as root. I've used it to protect /dev/null from becoming a normal file filled with junk, and to protect /etc/resolv.conf from "helpful" network management daemons that don't know my DNS servers.

Anything else missing?

BTW, BSD offers an unprivileged immutable bit as well. I'm sure it's useful for the apps that trash their own config files. Actually, this bit alone would do fine, and we could really use a way to protect writable device files from deletion or permission bit changes.
Re: Long file names in VFAT broken with iocharset=utf8
On 5/8/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
On May 8 2007 00:43, Albert Cahalan wrote:

> Fix: the vfat driver should use the 8.3 name for such files.

Or the 31-character ISO Level 1(?).

That might be appropriate for a similar problem on CD-ROM filesystems. (when the CD is rockridge KOI8 and you want UTF-8) It may even be appropriate for Joliet, though 8.3 may be the better choice in that case.

It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those have built-in support for 8.3 aliases. Normally the 8.3 names are like hidden hard links, except that deletion of either name will wipe out the other. (same as case differences too) So the names are there, and they should already work. They just need to be reported for directory listings when the long names would be too long.
Re: Long file names in VFAT broken with iocharset=utf8
On 5/9/07, Andrey Borzenkov <[EMAIL PROTECTED]> wrote:
On Wednesday 09 May 2007, Albert Cahalan wrote: ...
On May 8 2007 00:43, Albert Cahalan wrote:

Fix: the vfat driver should use the 8.3 name for such files. ...

It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those have built-in support for 8.3 aliases. Normally the 8.3 names are like hidden hard links, except that deletion of either name will wipe out the other. (same as case differences too) So the names are there, and they should already work. They just need to be reported for directory listings when the long names would be too long.

several problems associated with it

1. those names are rather meaningless. How do you find out which file they refer to? It is OK for trivial cases but not in a directory full of long names; nor am I sure how many unique short names can be generated.

If a short name can not be generated, then no OS could create the file at all. The vfat and iso9660 filesystems require short names. Any OS writing to such a filesystem MUST generate short names in addition to any long names. Mount your vfat as filesystem type "msdos" to see. By default, Windows will also generate short names on NTFS.

Note that you can't put your files on a CD-ROM in a way that Windows could read the filenames. Windows limits CD-ROM filenames to 63 characters; you get at most 103 if you violate the spec.

2. directory contents is effectively invalidated upon backup and restore (tar c; rm -rf; tar x). It is impossible to infer long names from short ones.

It may be that tar fails to use the vfat ioctl calls to save and restore short names. You could try using Wine to run a Windows-native backup program.

This shouldn't really matter though; you'd only be getting short names for files that had truly unreasonable long names anyway.

I suppose somebody should check to see if there is a danger of overwrite when the short-named files get written back. The safest thing might be to mount the filesystem as type "msdos".

3. this still does not answer how can I *create* long name from within Linux.

WTF? These names are too annoying to use, even if there weren't this limit. Anything over about 29 characters is in need of a rename. (that'd be 58 bytes for you, which is OK) The limit is already 4 times larger than what is reasonable.
Re: [PATCH] Only send pdeath_signal when getppid changes.
On 4/10/07, Roland McGrath <[EMAIL PROTECTED]> wrote:

> Does a parent death signal make most sense between separately written programs?

I don't think it does. It has always seemed an utterly cockamamy feature to me, and I've never understood what actually motivated it.

It's useful, but the other case is more important.

> Does a parent death signal make most sense between processes that are part of
> a larger program.

That is the only way I can really see it being used. The only actual example of use I know is what Albert Cahalan reported. To my mind, the only semantics that matter for pdeath_signal are what previous uses expected in the past and still need for compatibility. If we started with a fresh rationale from the ground up on what the feature is good for, I am rather skeptical it would pass muster to be added today.

Until inotify and dnotify work on /proc/12345/task, there really isn't an alternative for some of us. Polling is unusable.

Ideally one could pick any container, session, process group, mm, task group, or task for notification of state change. State change means various things like destruction, addition of something new, exec, etc. (stuff one can see in /proc) With appropriate privs, having the debug-related stuff would be good as well.
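For context, this is roughly how the feature gets used inside a larger program: a worker asks for a signal on parent death right after fork(), then compares getppid() against the pid it expected, since the parent may already have died before the request took effect. The function name and return convention are invented for this sketch; only the prctl() options are real.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Request `sig` on parent death. `parent` is the pid the caller recorded
 * for its parent before fork(). Returns 0 when armed, 1 if the parent
 * had already gone away, -1 if prctl() failed.
 */
static int arm_parent_death(int sig, pid_t parent)
{
    assert(sig > 0);
    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) < 0)
        return -1;
    /* Close the race: if we were reparented before the prctl() call,
     * no signal will ever arrive, so the caller must handle it now. */
    if (getppid() != parent)
        return 1;
    return 0;
}
```

A child would typically call arm_parent_death(SIGTERM, parent_pid) first thing after fork() and exit immediately on a return of 1.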
console font limits
I'm having problems with a font I just created. It's a rather big one, intended for a framebuffer console in UTF-8 mode. The strace program reports that /bin/setfont fails on a KDFONTOP ioctl with EINVAL. In reading the kernel code, I find this:

vt.c:static int con_font_set(struct vc_data *vc, struct console_font_op *op)
vt.c-{
vt.c-	struct console_font font;
vt.c-	int rc = -EINVAL;
vt.c-	int size;
vt.c-
vt.c-	if (vc->vc_mode != KD_TEXT)
vt.c-		return -EINVAL;
vt.c-	if (!op->data)
vt.c-		return -EINVAL;
vt.c-	if (op->charcount > 512)
vt.c-		return -EINVAL;

Ouch. Why is the old VGA limit being applied to the framebuffer console? Could this just get removed? I dearly hope we aren't still storing the framebuffer data as two bytes per character+attribute pair.

I nearly hit the 32-pixel height limit as well, yet another relic from the VGA hardware. I also nearly hit the 64 KB font size limit.

Currently I'm doing a 15x30 font with 870 glyphs to represent 978 different Unicode code points. This is for a 200 DPI display with an anti-aliasing filter, so fonts need to be big. I'm considering 15x36 so that I'll have more room for double-accented letters, but clearly the kernel would block that too.

BTW, the PSF font format documentation seems to suggest that there is a way to make the kernel handle combining accents: http://www.win.tue.nl/~aeb/linux/kbd/font-formats-1.html Does anybody know if that really works? I could sure use that.
Re: console font limits
On 5/1/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> Antonino A. Daplas wrote:
> >
> > And this will entail a lot of work to change (Is it worth it to rework
> > the code and remove the limitation?). The linux-console project
> > (http://linuxconsole.sourceforge.net/) might have , but I don't know its
> > current status.
>
> Well, I think the consensus is that anything beyond that should be done
> in userspace; the main such console daemon was Kon2 last I checked.

Font size is not a sane place to draw the line. Features are. The levels of support go something like this:

0. 7-bit ASCII
1. Simple direct-to-font VGA characters.
2. UTF-8 and large fonts, but no compositing or wide characters.
3. Simple compositing and double-wide characters. (like xterm)
4. Right-to-left. (like Kermit95)
5. Complex shaping, glyph substitution, and vertical text.

Without large fonts, UTF-8 is 90% pointless bloat.

Userspace console daemons are rotten to the core. There is no safe and reliable way to make kernel messages pass through the userspace console. You'd either be in graphics mode or you'd still be subject to the limit of 256 simultaneous glyphs while normal VGA attributes are in use. This is so defective that one might as well just run X with a fullscreen xterm. If userspace is your answer, then let's rip out the UTF-8 code.

Personally I don't even need #1, but I think anything less than #3 is really rude toward people outside of Europe+Americas. I especially hate to hear Europeans argue against this when they have 100% precomposed characters for themselves and appear to have played a role (via ISO votes) in denying stuff like the mere 12 precomposed characters needed to use the Yoruba language with simple renderers.
Re: console font limits
On 5/2/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> On May 1 2007 11:49, Albert Cahalan wrote:
> >>
> >> Well, I think the consensus is that anything beyond that should be done
> >> in userspace; the main such console daemon was Kon2 last I checked.
> >
> > Font size is not a sane place to draw the line. Features are.
> > The levels of support go something like this:
> >
> > 0. 7-bit ASCII
> > 1. Simple direct-to-font VGA characters.
> > 2. UTF-8 and large fonts, but no compositing or wide characters.
> > 3. Simple compositing and double-wide characters. (like xterm)
> > 4. Right-to-left. (like Kermit95)
> > 5. Complex shaping, glyph substitution, and vertical text.
> >
> > Without large fonts, UTF-8 is 90% pointless bloat.
> > Personally I don't even need #1, but I think anything less than #3 is
> > really rude toward people outside of Europe+Americas. I especially hate
> > to hear Europeans argue against this when they have 100% precomposed
> > characters for themselves and appear to have played a role (via ISO votes)
> > in denying stuff like the mere 12 precomposed characters needed to use
> > the Yoruba language with simple renderers.

Note: I never suggested going beyond #3.

> 0. yes we want that
> 1. can't tell
> 2. utf8 yes, many text files are in that encoding.
>    large fonts - can't tell, I am fine with the regular vga font
>    infrastructure (8x16, 8x8)

Those sizes are unreadable on the 200 dpi OLPC XO screen, and kind of icky on some of the really big desktop displays when in native (framebuffer) mode. 200 dpi may be in your future. Even the 32-pixel height limit is starting to be a problem.

> 3. compositing - no, don't need that, wide characters - does not even
>    work in vga. just display a '??' and everything is fine.

It's been shown to be workable, and it allows support for some additional languages.

> 4. I do not really think this has a future on VC. You would also 'need'
>    kerning and that serif combiner thing (complex shaping?) for Arabic.
>    At best, Arabic would look as horrible on VC as it does in xterm today
>    (no RTL, no serif combiner)

I agree. Hebrew is more doable, but probably not worth the effort because of the rarity and because of the general lack of support in text mode apps for such odd behavior. Very few emulators support this; kermit95 is one of the few.

> 5. Vertical text - who else supports this please? Webpages in languages
>    that want to do TTB(top-to-bottom) scripting use html workarounds -
>    probably because TTB availability it's not even guaranteed in a
>    webbrowser.

I hope you didn't think I was suggesting this. It's quite absurd. "Complex shaping, glyph substitution, and vertical text." was the full item listed. Vertical is the least troublesome of those issues, and as far as I know has never been implemented.

> In short, the current console is very much OK.

I wouldn't say that. We suffer the bloat of all this UTF-8 stuff without being able to load a decent-sized font to go with it. We're stuck at 256 characters really, with the very lame option of trading foreground color intensity control for an extra 256.

I think one could make a reasonable argument that all the internationalization is bloat, and that thus UTF-8 should go. Given that we do support UTF-8 though, allowing a font with more than 256 characters (with foreground intensity control) is obviously sensible.
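The 256-versus-512 trade mentioned above comes from the classic VGA text cell: 16 bits per cell, with the character code in the low byte and the attribute in the high byte. In the 512-glyph setup, attribute bit 3 stops meaning "bright foreground" and instead selects the second bank of 256 glyphs. The sketch below decodes a cell both ways; the struct and function names are made up for illustration, and treating bit 7 as blink is an assumption about the mode in use.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one 16-bit VGA text cell (illustrative names). */
struct vga_cell {
    unsigned glyph;  /* index into the loaded font */
    unsigned fg;     /* foreground colour bits */
    unsigned bg;     /* background colour, attribute bits 4-6 */
    unsigned blink;  /* attribute bit 7, assuming blink is enabled */
};

/*
 * Decode a cell. With font512 == 0, attribute bit 3 is foreground
 * intensity (16 foreground colours, 256 glyphs). With font512 == 1,
 * bit 3 selects the upper glyph bank (8 foreground colours, 512 glyphs).
 */
struct vga_cell vga_decode(uint16_t cell, int font512)
{
    unsigned ch   = cell & 0xFFu;
    unsigned attr = (cell >> 8) & 0xFFu;
    struct vga_cell out;

    assert(font512 == 0 || font512 == 1);
    out.bg    = (attr >> 4) & 0x7u;
    out.blink = (attr >> 7) & 0x1u;
    if (font512) {
        out.glyph = ch | ((attr & 0x08u) ? 0x100u : 0u);
        out.fg    = attr & 0x07u;
    } else {
        out.glyph = ch;
        out.fg    = attr & 0x0Fu;
    }
    return out;
}
```

The same cell value 0x0F41 is glyph 0x41 in bright white in the first mode, and glyph 0x141 in plain white in the second, which is the loss of intensity control being complained about.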
Re: console font limits
On 5/3/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> On May 3 2007 02:17, Albert Cahalan wrote:
> > Those sizes are unreadable on the 200 dpi OLPC XO screen,
>
> Hm that should have read, for you: I don't object implementing support
> for larger sizes. (But I wonder how that should work without
> FB/CVIDIX/SVGA/VESA extensions.) Note that I was assuming that no FB
> is used:

I'm assuming that the FB is used. Neither of my two computers can do VGA text mode. Even for computers which can do VGA text mode, if you want large fonts (either by number of characters or by character width) you need to use FB. That's just a requirement; anything else would be insane.

> For everything beyond Latin, fbiterm should work a lot better.

Then, as with X, you have problems with kernel messages. Reliably sending printk through a userspace console is not even possible. (consider a panic, OOM, or runaway RT task)
18-year-old bug
This bug was introduced with SE Linux, 18 years ago. People have been adding hacks to work around it as the bug bites them, but really the bug ought to be fixed.

Signals related to a tty are supposed to come from the kernel. This got broken for pty devices. We now act as if the signal is sent from the process on the master side of the pty. That isn't right; the signal is supposed to come from the tty itself and thus have a kernel identity.

How to reproduce:

Copy /bin/sleep to /tmp/work and /tmp/fail. Start up xterm, run /tmp/work in the window, close the window, and see the process gone. Now repeat that for /tmp/fail, but run "su -" in the window first. (Give each copy a sleep time, so like "/tmp/fail 100" or however much time you need.) Meanwhile, to view the problem, run this in another window:

ps -Cwork -Cfail -o tty,pid,ppid,tpgid,pgid,sid,ruid,euid,comm

I first saw the problem when I was maintaining top. People would run top as root, close the window, and then find that top got stuck spinning on select. Eventually top was hacked up to work around the kernel bug, but really we shouldn't have userspace trying to work around kernel bugs.

I tried to fix it back then, but got a bit lost in the then-new code. Sorry. Since then, I've become insanely busy with ten kids. I'd really appreciate it if somebody could take a shot at fixing this bug. It seems to have hit a coworker a few months back, and he is just living with it. (ouch)
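One way to look at the "identity" of a signal from userspace is to install the handler with SA_SIGINFO and inspect si_code and si_pid. This does not reproduce the bug above and makes no claim about what values a pty hangup reports; it is just a small sketch of the standard sigaction interface, with hypothetical names for the helper and the globals. A signal sent with kill(2) shows up as SI_USER with the sender's pid, while kernel-generated signals carry a different si_code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Filled in by the handler; names are illustrative only. */
static volatile sig_atomic_t seen_signal;
static volatile sig_atomic_t seen_code;   /* si_code of the last signal */
static volatile sig_atomic_t seen_sender; /* si_pid, meaningful for SI_USER */

static void record_sender(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    seen_code   = info->si_code;
    seen_sender = (sig_atomic_t)info->si_pid;
    seen_signal = 1;
}

/* Install the recording handler for `signo`. Returns 0 on success. */
int watch_signal(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_sender;
    sa.sa_flags = SA_SIGINFO;   /* request the siginfo_t argument */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

A test program could call watch_signal(SIGHUP), sleep, and log seen_code and seen_sender after the terminal window is closed.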
page table isolation alternative mechanism
We got into the current situation for performance reasons, avoiding the costly reload of CR3 that a hardware task switch would cause. It seems we'll be loading CR3 now anyway, so it might be time to reconsider hardware task switches. The recent patches leave kernel entry/exit code mapped. Hardware task switches wouldn't need that. All they need is a single entry in a reduced-size IDT, for the doublefault, and a minimal GDT, and a TSS. Taking the fault switches CR3. That then gets you a proper IDT and GDT because those are virtually mapped. Not a single byte of kernel code would need to be mapped while user code runs.
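For reference, the CR3 switch relied on above is part of the 32-bit task-state segment: the TSS has a CR3 slot at offset 0x1C that the CPU loads on a hardware task switch. The sketch below lays out the 32-bit TSS as documented in the Intel manuals, assuming 32-bit protected mode. The struct and field names are chosen for this example and are not the kernel's own identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit task-state segment layout (104 bytes); names are illustrative. */
struct tss32 {
    uint16_t prev_task_link, reserved0;
    uint32_t esp0;
    uint16_t ss0, reserved1;
    uint32_t esp1;
    uint16_t ss1, reserved2;
    uint32_t esp2;
    uint16_t ss2, reserved3;
    uint32_t cr3;               /* page directory base for this task */
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint16_t es, reserved4;
    uint16_t cs, reserved5;
    uint16_t ss, reserved6;
    uint16_t ds, reserved7;
    uint16_t fs, reserved8;
    uint16_t gs, reserved9;
    uint16_t ldt_selector, reserved10;
    uint16_t debug_trap;        /* bit 0 is the T (debug trap) flag */
    uint16_t iomap_base;        /* offset of the I/O permission bitmap */
};

/* Compile-time checks that the layout matches the documented offsets. */
_Static_assert(offsetof(struct tss32, cr3) == 0x1C, "CR3 slot at 0x1C");
_Static_assert(offsetof(struct tss32, eip) == 0x20, "EIP slot at 0x20");
_Static_assert(sizeof(struct tss32) == 104, "32-bit TSS is 104 bytes");

/* The address space a task gate into this TSS would switch to. */
uint32_t tss_target_cr3(const struct tss32 *tss)
{
    assert(tss != NULL);
    return tss->cr3;
}
```

In the scheme described above, the double-fault entry in the reduced IDT would be a task gate pointing at a TSS like this one, with cr3 set to the kernel's page tables.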