Re: [RFC] LZO1X de/compression support

2007-05-19 Thread Bill Rugolsky Jr.
On Fri, May 18, 2007 at 11:14:57PM +0200, Krzysztof Halasa wrote:
> I'm certainly missing something but what are the advantages of this
> code (over current gzip etc.), and what will be using it?

Richard's patchset added it to the crypto library and wired it into
the JFFS2 file system.  We recently started using LZO in a userland UDP
proxy to do stateless per-packet payload compression over a WAN link.
With ~1000 octet packets, our particular data stream sees 60% compression
with zlib, and 50% compression with (mini-)LZO, but LZO runs at ~5.6x
the speed of zlib.  IIRC, that translates into > 700Mbps on the input
side on a 2GHz Opteron, without any further tuning.

Once LZO is in the kernel, I'd like to see it wired into IPComp.
Unfortunately, last I checked only the "deflate" algorithm had an
assigned compression parameter index (CPI), so one will have to use a
private index until an official one is assigned.
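To make the framing concrete, here is a userland sketch of the stateless per-packet scheme: each output packet carries a one-byte flag, and a payload that doesn't shrink is sent raw, so every packet decodes on its own. All names here are hypothetical, and compress_payload() is a toy run-length encoder standing in for lzo1x_1_compress() so the sketch stays self-contained:

```c
/*
 * Sketch of stateless per-packet payload compression, as in the UDP
 * proxy described above.  Hypothetical: compress_payload() is a toy
 * run-length encoder standing in for lzo1x_1_compress(), and the
 * one-byte RAW/COMPRESSED flag is an assumed wire format.
 */
#include <stddef.h>
#include <string.h>

enum { RAW = 0, COMPRESSED = 1 };

/* Toy run-length encoder; output is at most 2 * len bytes. */
static size_t compress_payload(const unsigned char *in, size_t len,
                               unsigned char *out)
{
    size_t i = 0, o = 0;
    while (i < len) {
        size_t run = 1;
        while (i + run < len && in[i + run] == in[i] && run < 255)
            run++;
        out[o++] = (unsigned char)run;   /* run length */
        out[o++] = in[i];                /* repeated byte */
        i += run;
    }
    return o;
}

/*
 * Build one wire packet: flag byte followed by the compressed payload,
 * or by the raw payload when compression would not shrink it.  Because
 * no state is shared between packets, loss or reordering on the WAN
 * link never corrupts later packets.
 */
static size_t frame_packet(const unsigned char *payload, size_t len,
                           unsigned char *wire /* >= 1 + 2 * len bytes */)
{
    size_t clen = compress_payload(payload, len, wire + 1);
    if (clen < len) {
        wire[0] = COMPRESSED;
        return 1 + clen;
    }
    wire[0] = RAW;
    memcpy(wire + 1, payload, len);
    return 1 + len;
}
```

A receiver inspects wire[0] and either expands the runs or copies the bytes straight through; with real LZO the structure would be identical, only compress_payload() changes.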

Regards,

Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Feature Request?] Inline compression of process core dumps

2007-04-12 Thread Bill Rugolsky Jr.
On Thu, Apr 12, 2007 at 11:52:38AM -0400, Christopher S. Aker wrote:
> I've been trying to find a method for compressing process core dumps 
> before they hit disk.
> 
> I ask because we've got some fairly large UML processes (1GB for some), 
> and we're trying to capture dumps to help Jeff debug an evasive bug. 
> Our systems use a small root partition and most of the other disk 
> resources on the host are allocated towards the UMLs.
> 
> There are userspace solutions to this problem:  allowing the 
> uncompressed core dump to spin out to disk and then coming in afterwards 
> and doing the compression, or maybe even a compressed filesystem where 
> the core dumps land, but I just thought I'd throw this out there since 
> it seems it would be a useful feature :)

See Documentation/sysctl/kernel.txt for kernels >= 2.6.19:

core_pattern:

core_pattern is used to specify a core dumpfile pattern name.
. max length 128 characters; default value is "core"
. core_pattern is used as a pattern template for the output filename;
  certain string patterns (beginning with '%') are substituted with
  their actual values.
. backward compatibility with core_uses_pid:
If core_pattern does not include "%p" (default does not)
and core_uses_pid is set, then .PID will be appended to
the filename.
. corename format specifiers:
%<NUL>  '%' is dropped
%%  output one '%'
%p  pid
%u  uid
%g  gid
%s  signal number
%t  UNIX time of dump
%h  hostname
%e  executable filename
%<OTHER> both are dropped
. If the first character of the pattern is a '|', the kernel will treat
  the rest of the pattern as a command to run.  The core dump will be
  written to the standard input of that program instead of to a file.
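To illustrate the '|' form, here is a minimal sketch of a helper that could sit behind the pipe, assuming core_pattern were set to something like "|/usr/local/sbin/core-catcher" (a hypothetical path and name). The kernel runs the helper and writes the dump to its standard input; a real helper would stream into gzip or lzop rather than copy verbatim:

```c
/*
 * Sketch of a core_pattern pipe helper.  Hypothetical: assumes
 * core_pattern is set to "|/usr/local/sbin/core-catcher".  The kernel
 * executes the helper and feeds the core dump to its standard input;
 * a real helper would pipe the stream through gzip or lzop instead of
 * copying it verbatim.
 */
#include <stdio.h>

/* Copy 'in' to 'out' in chunks; returns bytes copied, or -1 on error. */
static long copy_stream(FILE *in, FILE *out)
{
    char buf[8192];
    size_t n;
    long total = 0;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    return total;
}

/* Drain the kernel-fed dump from 'in' into 'path'; 0 on success. */
static int save_core(FILE *in, const char *path)
{
    FILE *out = fopen(path, "w");
    long n;

    if (!out)
        return -1;
    n = copy_stream(in, out);
    if (fclose(out) != 0 || n < 0)
        return -1;
    return 0;
}
```

main() would then just be `return save_core(stdin, "/var/tmp/core.out");` with a fixed destination, since the pipe form does not pass the substituted % values to the helper as arguments.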


Regards,

Bill Rugolsky


Re: [Feature Request?] Inline compression of process core dumps

2007-04-12 Thread Bill Rugolsky Jr.
On Thu, Apr 12, 2007 at 05:28:45PM +0100, Alan Cox wrote:
> > There are userspace solutions to this problem:  allowing the 
> > uncompressed core dump to spin out to disk and then coming in afterwards 
> > and doing the compression, or maybe even a compressed filesystem where 
> > the core dumps land, but I just thought I'd throw this out there since 
> > it seems it would be a useful feature :)
> 
> Indeed. So useful that in current kernels you can set the core dump path
> to be
> 
>   "|application"
> 
> and it will call out to the helper. Take care with the helper as it will
> get run for setuid apps, roots core dumps etc.

The current functionality doesn't parse the command line into argv, nor
provide the % variable replacements in the environment, so it is somewhat
less useful than it could be.  I suppose that parsing the command line
introduces potential problems with file names that include whitespace.
It would probably be better to split the command line on whitespace first,
and then replace the variables within each argv[] element.

fs/exec.c:
1507         if (corename[0] == '|') {
1508                 /* SIGPIPE can happen, but it's just never processed */
1509                 if (call_usermodehelper_pipe(corename+1, NULL, NULL, &file)) {
1510                         printk(KERN_INFO "Core dump to %s pipe failed\n",
1511                                corename);
1512                         goto fail_unlock;
1513                 }
1514                 ispipe = 1;
1515         } else
1516                 file = filp_open(corename,
1517                                  O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
1518                                  0600);


Regards,

Bill Rugolsky


Re: RFD: Kernel release numbering

2005-03-03 Thread Bill Rugolsky Jr.
On Thu, Mar 03, 2005 at 02:33:58PM -0500, Dave Jones wrote:
> If you accelerate the merging process, you're lowering the review process.
> The only answer to get regressions fixed up as quickly as possible
> (because prevention is nigh on impossible at the current rate, so
>  any faster is just absurd), would be more regular releases, so that
> they got spotted quicker.

Right.  My point, and I think Jeff's, was that being extra careful for
the 'even' releases and waiting around N days to see whether someone will
finally test the -rc and see that it is broken impedes the whole process.
Getting more people to test is not necessarily a function of the wait
duration.

>  > Dave has been building "unstable" bleeding-edge Fedora kernels from
>  > 2.6.x-rcM-bkN, as well as "test" kernels for Fedora updates;
> 
> Actually only rawhide (FC4-to-be) has been getting -rc-bk patches.

When Arjan started testing 2.6, he set up a repo, and FC1 users at
the time could just pull rpms from that repo.  I and many others did it.
Currently I'm rolling my own, so I haven't been consistently testing
your Fedora kernels, rawhide or update, but I occasionally do the "grab
the SRPM from Rawhide and rebuild it on FC3"-thing for my laptop.  I
think that we should institutionalize that.

> The FC2/FC3 updates have been release versions only, with -ac patches.
> (and also some additional patches backing out bits of the -ac)

I've watched you periodically announce "hey, I'm doing an update for
FC3/FC2, please test" on the mail list, and a handful of people go test.
If we could convince many of the less risk-averse but lazy users to
grab kernels automatically from updates/3/testing/ or updates/3/unstable/
as part of "yum update", and have a way to manage the plethora of (even
daily) kernel updates by removing old unused kernels, then we'd only
have to convince them *once* to set up their YUM repos, and then get them
to poweroff or reboot [or use a Xen domain] occasionally. :-)

> This is part of the problem with rebasing the existing releases to
> new kernels. It's shoving a largely untested codebase into a release
> that was never tested in that combination. It's expected that some
> stuff will break, but the volume of breakage is increasing as time goes on.
> Even if _I_ stopped rebasing the Fedora kernel, some of our users
> will still want to build and run the latest kernel.org kernel on their
> FC2 boxes. We shouldn't be expecting them to have to rebuild half of
> their userspace just because we've been sloppy and broke interfaces.
 
Yes, this is miserable, and exacerbated by the inability of almost all
distros to deal with multiple installed versions of packages, or easily
roll back changes, which crimps my argument w.r.t. wider testing, since
the typical user willing to test while otherwise doing useful work wants to
be able to roll back easily if there is a serious problem.

The LVM packaging situation between 2.4 and 2.6 illustrated the problem
well; one couldn't boot back into 2.4 unless LVM1 and LVM2 userland
could coexist; I spent time rolling packages back then to do just that.
In any case, kernel helper packages (udev, device-mapper, iptables, etc.)
need to be added to a "kernel+related" package repo.

Users could be encouraged to do more testing if they are provided
with a simple mechanism to snapshot, update, test, and then either
keep the changes or roll back.  I do this in UML with the COW tools.
Currently LVM2 has writable snapshots, but no easy way (at system boot)
to reintegrate the changes into the base from a snapshot, or "fallback"
to a snapshot.  Still, using Xen/UML/QEMU, it is not difficult to take
advantage of copy-on-write to update the kernel and other packages, then
boot the image to start a shell or web/ftp/mail/... daemon(s).  At least
that would exercise the non-hardware-related, (and for now, non-SMP)
parts of the kernel.

Bill Rugolsky


Re: RFD: Kernel release numbering

2005-03-03 Thread Bill Rugolsky Jr.
On Thu, Mar 03, 2005 at 02:15:06AM -0800, Andrew Morton wrote:
> If we were to get serious with maintenance of 2.6.x.y streams then that is
> a 100% productisation activity.  It's a very useful activity, and there is
> demand for it.  But it is a very different activity.  And a lot of this
> discussion has been getting these two activities confused.
 
IMHO, Jeff Garzik has made two very useful points in this thread:

1. The number of changesets flowing towards the Linus kernel is accelerating,
   so the kernel developers should be trying to accelerate the merging process,
   not introducing delays.  Having an extended -rc period that stuffs up merging
   just creates back pressure and causes changesets that could be getting
   reviewed, merged, and booted somewhere to instead lie dormant.

2. No matter what one calls it, -rc1, .<odd>, or just 2.6.X these days,
   intelligent consumers know a "dot-zero" release when they see one.
   [I've had experience of several boneheaded corporate policies dictating
an unpatched kernel.org kernel, but they are uninteresting users.]  The
   class of users that want to use the kernel in production are going to
   wait days to weeks, no matter what.  The trick is in encouraging everyone
   else to overcome inertia and test new releases.

As part of a solution to the "production kernel" problem, Jeff suggested a
2.6.x.y tree that gets pulled to 2.6.x+1.  Neil Brown made a similar point:

   For the kernel, I am the "distribution" for my employer and I choose
   which kernel to use, with which patches.  I really don't want to hunt
   around for all those stabilisation patches, or sift through the patches
   in 2.6.X+1-pre to find things to apply to 2.6.X.  I would be really
   happy there was a central place where maintainers can put suitably  
   reviewed "important bug fix"es for recent releases, and from where
   kernel maintainers for any distribution (official or not) could pull
   them.

I'm in the same boat with Neil. Determined to stay reasonably close
to mainline, I started in the 2.6.9-bk series to try to nail down a
stable production kernel. I spent about two months reading lkml and
bk-commits-head, picking through -mm for patches that might be important
for my workloads (e.g., vmtrunc), and spending my days with "quilt",
merging up a new -bk kernel every few days, backing out "dangerous
changes", and retesting. At 2.6.10, I stopped revving up and started
to just merge fixes from 2.6.11-bk.

I'm sure Neil and I are not alone.  I perceive four groups of users of
kernel.org kernels, with differing requirements:

1. Developers.  For them, the Linus kernel is a synchronization
   point for merging, as well as their personal test environment.

2. "Casual" end-users who like to build their own kernels, and for 
   whom a kernel oops, crash, or driver failure is not a big
   hassle; they just reboot into their previous kernel.  They are
   content if a new kernel doesn't corrupt their data.

3. "Production" end-users, who need a kernel that is going to run
   stably, usually on many servers, indefinitely [until a bug or
   desired feature forces an upgrade/reboot].  Rolling out a new
   kernel is a hassle, and is usually done to fix a serious kernel
   bug or driver problem.

4. Vendors, who need a long period of stabilization and testing,
   as well as a (vendor-internal) mechanism for determining what
   features, drivers, etc.  to support.

As individuals, many of us live in multiple categories, e.g., I'm a (3) at work,
and a mix of (2) [laptop] and (3) [file server] at home.

Greg KH complained:

   Bug fixes for what?  Kernel api changes that fix bugs?  That's pretty
   big.  Some driver fixes, but not others?  Driver fixes that are in the
   middle of bigger, subsystem reworks as a series of patches?  All of this
   currently happens today in the main tree in a semi-cohesive manner.  To
   try to split it out is a very difficult task.

Opinions will differ, but I think things are a lot more clear-cut than
Greg allows.  I certainly don't expect to download, build, and deploy
a kernel devoid of patches without expecting at least a few problems.  It's
the incredible duplication of effort to sort through thousands of changesets in
order to cull dozens to hundreds, with the result that everyone is running
a subtly different kernel core.  And most of us are far less qualified
than subsystem maintainers to evaluate the risk of individual changesets.

Folks in categories (3) and (4) care very deeply about subtle corruption
[like the recent pty lost bytes], even if rare, as well as easily
triggerable oopses, races, deadlock, livelock, resource leaks, massive
performance regressions, and serious breakage in the (rapidly evolving)
networking stack.  These belong in 2.6.x.y.  API changes do not, unless
they are required to fix one of the above.

Sure, this is going to 


Re: i8042 access timings

2005-02-13 Thread Bill Rugolsky Jr.
On Sun, Feb 13, 2005 at 09:22:46AM +0100, Vojtech Pavlik wrote:
> And I suppose it was running just fine without the patch as well?
 
Correct.

> The question was whether the patch helps, or whether it is not needed.
 
If you look again at the patch I posted, it only borrowed a few lines
of the patch from Dmitry that started this thread; I eliminated Alan's
recent udelay(50) addition, reduced the loop delay, and added debug
printks to the *_wait routines to determine whether the loop is ever taken.

At least so far, those debugging statements have produced no output.
I'll use the machine a bit and report back if I trigger anything.

Regards,

Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: i8042 access timings

2005-02-12 Thread Bill Rugolsky Jr.
On Thu, Jan 27, 2005 at 05:37:14PM +0100, Vojtech Pavlik wrote:
> On Thu, Jan 27, 2005 at 11:34:31AM -0500, Bill Rugolsky Jr. wrote:
> > I have a Digital HiNote collecting dust which had this keyboard problem
> > with the RH 6.x 2.2.x boot disk kernels, IIRC.  I can test if you like,
> > but I won't be able to get to it until the weekend.
>  
> That'd be very nice indeed.
 
Sorry for the long delay in replying; the HiNote needed some effort to get
the thing up and running again.  [Various bits of hardware are broken;
the power switch, floppy, and CD-ROM are busted/flakey.]  I've now got
Fedora Core 3 running on it. I was pleasantly surprised that the 2.6.9
i82365 PCMCIA module loads, and the internal Xircom CEM56 network/modem works.
[Broken with 2.6.10+ though; the fix is probably trivial.]

I wasn't sure exactly what to test.  I applied the following patch
to 2.6.11-rc3-bk9, and booted with i8042_debug=1.  So far, it works
as expected, and there is nothing of interest in the kernel log.
[Also worked with the FC3 2.6.9 kernel and this patch+DEBUG.]

Now that things are up and running, I will apply any patches that you
would like tested.

Bill Rugolsky

--- linux/drivers/input/serio/i8042.c.udelay-backout2005-02-12 
16:22:48.647851998 -0500
+++ linux/drivers/input/serio/i8042.c   2005-02-12 16:23:39.963997609 -0500
@@ -131,9 +131,10 @@
 {
int i = 0;
while ((~i8042_read_status() & I8042_STR_OBF) && (i < 
I8042_CTL_TIMEOUT)) {
-   udelay(50);
+   udelay(I8042_STR_DELAY);
i++;
}
+   if (i > 0) dbg("i8042_wait_read: looped %d times",i);
return -(i == I8042_CTL_TIMEOUT);
 }
 
@@ -141,9 +142,10 @@
 {
int i = 0;
while ((i8042_read_status() & I8042_STR_IBF) && (i < 
I8042_CTL_TIMEOUT)) {
-   udelay(50);
+   udelay(I8042_STR_DELAY);
i++;
}
+   if (i > 0) dbg("i8042_wait_write: looped %d times",i);
return -(i == I8042_CTL_TIMEOUT);
 }
 
@@ -161,7 +163,6 @@
spin_lock_irqsave(&i8042_lock, flags);
 
while ((i8042_read_status() & I8042_STR_OBF) && (i++ < 
I8042_BUFFER_SIZE)) {
-   udelay(50);
data = i8042_read_data();
dbg("%02x <- i8042 (flush, %s)", data,
i8042_read_status() & I8042_STR_AUXDATA ? "aux" : 
"kbd");
--- linux/drivers/input/serio/i8042.h.udelay-backout2005-02-12 
16:22:48.647851998 -0500
+++ linux/drivers/input/serio/i8042.h   2005-02-12 16:23:39.964997456 -0500
@@ -30,12 +30,18 @@
 #endif
 
 /*
- * This is in 50us units, the time we wait for the i8042 to react. This
+ * The time (in us) that we wait for the i8042 to react.
+ */
+
+#define I8042_STR_DELAY20
+
+/*
+ * This is in units of the time we wait for the i8042 to react. This
  * has to be long enough for the i8042 itself to timeout on sending a byte
  * to a non-existent mouse.
  */
 
-#define I8042_CTL_TIMEOUT  10000
+#define I8042_CTL_TIMEOUT  25000
 
 /*
  * When the device isn't opened and it's interrupts aren't used, we poll it at
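[Editorial note: the patch above polls more finely (20 µs instead of 50 µs) while scaling the iteration count so the worst-case wait is unchanged. A standalone sanity check of that budget follows; the old I8042_CTL_TIMEOUT of 10000 is an assumption here, since the archived mail truncates the value.]

```c
#include <assert.h>

/* Poll-loop parameters, taken from the patch above.  OLD_TIMEOUT is an
 * assumed reconstruction of the pre-patch I8042_CTL_TIMEOUT. */
#define OLD_DELAY_US 50    /* old fixed udelay() */
#define OLD_TIMEOUT  10000 /* assumed old I8042_CTL_TIMEOUT (50us units) */
#define NEW_DELAY_US 20    /* I8042_STR_DELAY */
#define NEW_TIMEOUT  25000 /* new I8042_CTL_TIMEOUT (I8042_STR_DELAY units) */

/* Worst-case time spent spinning before the controller is declared dead. */
static long budget_us(long delay_us, long iterations)
{
	return delay_us * iterations;
}
```

Both budgets come out to 500 ms, consistent with the header comment that the loop must outlast the i8042's own timeout on a byte sent to a non-existent mouse.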
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: i8042 access timings

2005-01-27 Thread Bill Rugolsky Jr.
On Thu, Jan 27, 2005 at 03:14:36PM +, Alan Cox wrote:
> Myths are not really involved here. The IBM PC hardware specifications
> are fairly well defined and the various bits of "we glued a 2Mhz part
> onto the bus" stuff is all well documented. Nowdays its more complex
> because most kbc's aren't standalone low end microcontrollers but are
> chipset integrated cells or even software SMM emulations.
> 
> The real test is to fish out something like an old Digital Hi-note
> laptop or an early 486 board with seperate kbc and try it.
 
I have a Digital HiNote collecting dust which had this keyboard problem
with the RH 6.x 2.2.x boot disk kernels, IIRC.  I can test if you like,
but I won't be able to get to it until the weekend.

Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch, 2.6.11-rc2] sched: /proc/sys/kernel/rt_cpu_limit tunable

2005-01-25 Thread Bill Rugolsky Jr.
On Tue, Jan 25, 2005 at 02:03:02PM -0800, Chris Wright wrote:
> * Ingo Molnar ([EMAIL PROTECTED]) wrote:
> > did that thread go into technical details? There are some rlimit users
> > that might not be prepared to see the rlimit change under them. The
> > RT_CPU_RATIO one ought to be safe, but generally i'm not so sure.
> 
> Not really.   I mentioned the above, as well as the security concern.
> Right now, at least the task_setrlimit hook would have to change to take
> into account the task.  And I never convinced myself that async changes
> would be safe for each rlimit.

As was mentioned, but not discussed, in the /proc/<pid>/rlimit thread,
it is not difficult to envision conditions where setrlimit() on another
process could make exploiting an application bug much easier, by, e.g.,
setting the number of open files ridiculously low.  So IMHO, it ought to
require privileges similar to ptrace() to change some, if not all, of the
rlimits.
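[Editorial note: a minimal demonstration of the hazard described above, lowering RLIMIT_NOFILE in a forked child so the caller is unaffected; the thread's actual concern is doing this to *another* process, which this sketch does not attempt.]

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Once RLIMIT_NOFILE is dropped, every further open() fails with EMFILE --
 * an error path that many programs handle badly.  Returns 1 if open()
 * failed as expected, 0 or -1 otherwise. */
static int open_fails_after_lowering(void)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct rlimit r = { 0, 0 }; /* no file descriptors at all */
		if (setrlimit(RLIMIT_NOFILE, &r) != 0)
			_exit(2);
		int fd = open("/dev/null", O_RDONLY);
		_exit(fd < 0 && errno == EMFILE ? 0 : 1);
	}
	int status;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```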

Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] /proc/<pid>/rlimit

2005-01-20 Thread Bill Rugolsky Jr.
On Thu, Jan 20, 2005 at 03:43:58PM +0100, Pavel Machek wrote:
> It would be nice if you could make it "value-per-file". That way,
> it could become writable in future. If "max nice level" ever becomes rlimit,
> this would be very usefull.

Agreed, though write support presents difficulties.

My principal concern is that we don't want users changing resource limits
of privileged processes.  If we want an ordinary user to be allowed to
change limits, the rules would have to be similar to those allowed for
ptrace(), e.g., no-setuid processes, etc.  [With ptrace(), one can of
course attach to the process and invoke the setrlimit() syscall directly].
Additionally, sys_setrlimit() has an LSM hook:

security_task_setrlimit(unsigned int resource, struct rlimit *)

One would need to take account of changing the limit from a different
context.  It's a bit of a mess, and outside of the standard API; that's
why I didn't bother.

Anyway, for Jan, here's my incomplete and unmergeable cut-n-paste hack
to implement write on top of my previous patch.  Format is as was
suggested by Jan:

 <name|[rlimit-]%u> <%u|unlimited> <%u|unlimited>

E.g.,
echo  memlock 65536 65536 > /proc/1/rlimit

Writing is limited to root (i.e. CAP_SYS_PTRACE), though see
fs/proc/base.c:may_ptrace_attach() for an idea of how to change that.

-Bill


--- linux-2.6.11-rc1-bk6/fs/proc/base.c.proc-pid-rlimit-write
+++ linux-2.6.11-rc1-bk6/fs/proc/base.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/file.h>
 #include <linux/string.h>
+#include <linux/ctype.h>
 #include <linux/seq_file.h>
 #include <linux/namei.h>
 #include <linux/namespace.h>
@@ -127,7 +128,7 @@
E(PROC_TGID_ROOT,  "root",S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE,   "exe", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_MOUNTS,"mounts",  S_IFREG|S_IRUGO),
-   E(PROC_TGID_RLIMIT,"rlimit",  S_IFREG|S_IRUGO),
+   E(PROC_TGID_RLIMIT,"rlimit",  S_IFREG|S_IRUGO|S_IWUSR),
 #ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR,  "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -153,7 +154,7 @@
E(PROC_TID_ROOT,   "root",S_IFLNK|S_IRWXUGO),
E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO),
E(PROC_TID_MOUNTS, "mounts",  S_IFREG|S_IRUGO),
-   E(PROC_TID_RLIMIT, "rlimit",  S_IFREG|S_IRUGO),
+   E(PROC_TID_RLIMIT, "rlimit",  S_IFREG|S_IRUGO|S_IWUSR),
 #ifdef CONFIG_SECURITY
E(PROC_TID_ATTR,   "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -595,9 +596,99 @@
return single_release(inode, file);
 }
 
+static inline char *skip_ws(char *s)
+{
+   while (isspace(*s))
+   s++;
+   return s;
+}
+
+static inline char *find_ws(char *s)
+{
+   while (!isspace(*s) && *s != '\0')
+   s++;
+   return s;
+}
+
+#define MAX_RLIMIT_WRITE 79
+static ssize_t rlimit_write(struct file * file, const char * buf,
+ size_t count, loff_t *ppos)
+{
+   struct task_struct *task = proc_task(file->f_dentry->d_inode);
+   struct rlimit new_rlim, *old_rlim;
+   unsigned int i;
+   char *s, *t, kbuf[MAX_RLIMIT_WRITE+1];
+
+   /* changing resources limits can crash or subvert a process */
+   if (!capable(CAP_SYS_PTRACE) || security_ptrace(current,task))
+   return -ESRCH;
+
+if (count > MAX_RLIMIT_WRITE)
+return -EINVAL;
+if (copy_from_user(kbuf, buf, count))
+return -EFAULT;
+kbuf[MAX_RLIMIT_WRITE] = '\0'; 
+
+   /* parse the resource id */
+   s = skip_ws(kbuf);
+   t = find_ws(s);
+   if (*t == '\0')
+   return -EINVAL;
+   *t++ = '\0';
+   for (i = 0 ; i < RLIM_NLIMITS ; i++)
+   if (rlim_name[i] && !strcmp(s,rlim_name[i]))
+   break;
+   if (i >= RLIM_NLIMITS) {
+   if (!strncmp(s, "rlimit-",7))
+   s += 7;
+   if (sscanf(s, "%u", &i) != 1 || i >= RLIM_NLIMITS)
+   return -EINVAL;
+   }
+
+   /* parse the soft limit */
+   s = skip_ws(t);
+   t = find_ws(s);
+   if (*t == '\0')
+   return -EINVAL;
+   *t++ = '\0';
+   if (!strcmp(s, "unlimited")) 
+   new_rlim.rlim_cur = RLIM_INFINITY;
+   else if (sscanf(s, "%lu", &new_rlim.rlim_cur) != 1)
+   return -EINVAL;
+
+   /* parse the hard limit */
+   s = skip_ws(t);
+   t = find_ws(s);
+   *t = '\0';
+   if (!strcmp(s, "unlimited")) 
+   new_rlim.rlim_max = RLIM_INFINITY;
+   else if (sscanf(s, "%lu", &new_rlim.rlim_max) != 1)
+   return -EINVAL;
+
+   /* validate the values; copied from sys_setrlimit() */
+   if (new_rlim.rlim_cur > new_rlim.rlim_max)
+   return -EINVAL;
+old_rlim = task->signal->rlim + i;
+   if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+   !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+   if (i == RLIMIT_NOFILE && new_rlim.rlim_max > NR_OPEN)
+   return -EPERM;
+
+   /* 

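[Editorial note: for readers puzzling through the mangled patch hunk above, here is an illustrative userspace re-implementation of its tokenizing approach — skip_ws()/find_ws() plus sscanf(). It is a sketch only: it handles numeric limits and omits the "unlimited" keyword and the permission checks of the real code.]

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Whitespace helpers, mirroring skip_ws()/find_ws() in the patch. */
static char *skip_ws(char *s) { while (isspace((unsigned char)*s)) s++; return s; }
static char *find_ws(char *s) { while (*s && !isspace((unsigned char)*s)) s++; return s; }

/* Split "name soft hard" into three fields; 0 on success, -1 on bad input. */
static int parse_rlimit_line(char *line, char **name,
			     unsigned long *soft, unsigned long *hard)
{
	char *s = skip_ws(line), *t = find_ws(s);
	if (*t == '\0')
		return -1;
	*t++ = '\0';
	*name = s;

	s = skip_ws(t); t = find_ws(s);
	if (*t == '\0')
		return -1;
	*t++ = '\0';
	if (sscanf(s, "%lu", soft) != 1)
		return -1;

	s = skip_ws(t); t = find_ws(s); *t = '\0';
	if (sscanf(s, "%lu", hard) != 1)
		return -1;
	return 0;
}
```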
Re: [RFC][PATCH] /proc//rlimit

2005-01-19 Thread Bill Rugolsky Jr.
On Wed, Jan 19, 2005 at 11:38:03AM -0800, Chris Wright wrote:
> * Jan Knutar ([EMAIL PROTECTED]) wrote:
> > A "cool feature" would be if you could do
> > echo nofile 8192 8192 >/proc/`pidof thatserverproess`/rlimit
> > :-)
> 
> This is security sensitive, and is currently only expected to be changed
> by current.

Sure, I had thought of implementing it, paused to consider the security
implications, and then punted.

Chris, on the other point that you made regarding UGO read access to "rlimit",
the same is true of "maps" (at least sans SELinux policy), so I don't
see an issue.  Certainly the map information is more security sensitive.

Regards,

-Bill
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] consolidate arch specific resource.h headers

2005-01-18 Thread Bill Rugolsky Jr.
On Tue, Jan 18, 2005 at 04:10:56PM -0800, Chris Wright wrote:
> +#define INIT_RLIMITS \
> +{\
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + {  _STK_LIM, _STK_LIM_MAX  },   \
> + { 0, RLIM_INFINITY },   \
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + { 0, 0 },   \
> + {  INR_OPEN, INR_OPEN  },   \
> + {   MLOCK_LIMIT,   MLOCK_LIMIT },   \
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + { RLIM_INFINITY, RLIM_INFINITY },   \
> + { MAX_SIGPENDING, MAX_SIGPENDING }, \
> + { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
> +}

While you are rooting around in there, perhaps this block
should be converted to C99 initializer syntax, to avoid
problems if arch-specific changes are later introduced?
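[Editorial note: a self-contained sketch of the C99 designated-initializer form suggested above. The types, constants, and the handful of entries shown are stand-ins so the sketch compiles standalone; they are not the kernel's real definitions.]

```c
#include <assert.h>

/* Stand-ins for the kernel definitions assumed by INIT_RLIMITS. */
#define RLIM_INFINITY (~0UL)
enum { RLIMIT_CPU, RLIMIT_FSIZE, RLIMIT_DATA, RLIMIT_STACK, RLIMIT_CORE,
       RLIMIT_RSS, RLIMIT_NPROC, RLIMIT_NOFILE, RLIMIT_MEMLOCK, RLIMIT_AS,
       RLIMIT_LOCKS, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE, RLIM_NLIMITS };
struct rlimit_pair { unsigned long rlim_cur, rlim_max; };

/* C99 designated initializers: each slot names its index, so an
 * arch-specific insertion or reordering cannot silently shift the rest. */
#define INIT_RLIMITS {                                  \
	[RLIMIT_CPU]    = { RLIM_INFINITY, RLIM_INFINITY }, \
	[RLIMIT_STACK]  = { 8192 * 1024, RLIM_INFINITY },   \
	[RLIMIT_CORE]   = { 0, RLIM_INFINITY },             \
	[RLIMIT_NOFILE] = { 1024, 1024 },                   \
}

static const struct rlimit_pair init_rlimits[RLIM_NLIMITS] = INIT_RLIMITS;
```

Entries not named in the initializer default to { 0, 0 }, which is one more reason the designated form is safer than a position-dependent list.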

Regards,

Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH] /proc/<pid>/rlimit

2005-01-18 Thread Bill Rugolsky Jr.
This patch against 2.6.11-rc1-bk6 adds /proc/<pid>/rlimit to export
per-process resource limit settings.  It was written to help analyze
daemon core dump size settings, but may be more generally useful.
Tested on 2.6.10. Sample output:

[EMAIL PROTECTED] ~ # cat /proc/$$/rlimit
cpu unlimited unlimited
fsize unlimited unlimited
data unlimited unlimited
stack 8388608 unlimited
core 0 unlimited
rss unlimited unlimited
nproc 111 111
nofile 1024 1024
memlock 32768 32768
as unlimited unlimited
locks unlimited unlimited
sigpending 1024 1024
msgqueue 819200 819200

Feedback welcome.

Signed-off-by: Bill Rugolsky <[EMAIL PROTECTED]>

--- linux-2.6.11-rc1-bk6/fs/proc/base.c.rlimit  2005-01-18 15:01:10.120960254 
-0500
+++ linux-2.6.11-rc1-bk6/fs/proc/base.c 2005-01-18 15:07:28.102661832 -0500
@@ -32,6 +32,7 @@
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/ptrace.h>
+#include <linux/resource.h>
 #include "internal.h"
 
 /*
@@ -61,6 +62,7 @@
PROC_TGID_MAPS,
PROC_TGID_MOUNTS,
PROC_TGID_WCHAN,
+   PROC_TGID_RLIMIT,
 #ifdef CONFIG_SCHEDSTATS
PROC_TGID_SCHEDSTAT,
 #endif
@@ -87,6 +89,7 @@
PROC_TID_MAPS,
PROC_TID_MOUNTS,
PROC_TID_WCHAN,
+   PROC_TID_RLIMIT,
 #ifdef CONFIG_SCHEDSTATS
PROC_TID_SCHEDSTAT,
 #endif
@@ -124,6 +127,7 @@
E(PROC_TGID_ROOT,  "root",S_IFLNK|S_IRWXUGO),
E(PROC_TGID_EXE,   "exe", S_IFLNK|S_IRWXUGO),
E(PROC_TGID_MOUNTS,"mounts",  S_IFREG|S_IRUGO),
+   E(PROC_TGID_RLIMIT,"rlimit",  S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR,  "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -149,6 +153,7 @@
E(PROC_TID_ROOT,   "root",S_IFLNK|S_IRWXUGO),
E(PROC_TID_EXE,"exe", S_IFLNK|S_IRWXUGO),
E(PROC_TID_MOUNTS, "mounts",  S_IFREG|S_IRUGO),
+   E(PROC_TID_RLIMIT, "rlimit",  S_IFREG|S_IRUGO),
 #ifdef CONFIG_SECURITY
E(PROC_TID_ATTR,   "attr",S_IFDIR|S_IRUGO|S_IXUGO),
 #endif
@@ -496,6 +501,107 @@
.release= mounts_release,
 };
 
+const char * const rlim_name[RLIM_NLIMITS] = {
+#ifdef RLIMIT_CPU
+   [RLIMIT_CPU] = "cpu",
+#endif
+#ifdef RLIMIT_FSIZE
+   [RLIMIT_FSIZE] = "fsize",
+#endif
+#ifdef RLIMIT_DATA
+   [RLIMIT_DATA] =  "data",
+#endif
+#ifdef RLIMIT_STACK
+   [RLIMIT_STACK] = "stack",
+#endif
+#ifdef RLIMIT_CORE
+   [RLIMIT_CORE] = "core",
+#endif
+#ifdef RLIMIT_RSS
+   [RLIMIT_RSS] = "rss",
+#endif
+#ifdef RLIMIT_NPROC
+   [RLIMIT_NPROC] = "nproc",
+#endif
+#ifdef RLIMIT_NOFILE
+   [RLIMIT_NOFILE] = "nofile",
+#endif
+#ifdef RLIMIT_MEMLOCK
+   [RLIMIT_MEMLOCK] = "memlock",
+#endif
+#ifdef RLIMIT_AS
+   [RLIMIT_AS] = "as",
+#endif
+#ifdef RLIMIT_LOCKS
+   [RLIMIT_LOCKS] = "locks",
+#endif
+#ifdef RLIMIT_SIGPENDING
+   [RLIMIT_SIGPENDING] = "sigpending",
+#endif
+#ifdef RLIMIT_MSGQUEUE
+   [RLIMIT_MSGQUEUE] = "msgqueue",
+#endif
+};
+
+static int rlimit_show(struct seq_file *s, void *v)
+{
+   struct rlimit *rlim = (struct rlimit *) s->private;
+   int i;
+
+   for (i = 0 ; i < RLIM_NLIMITS ; i++) {
+   if (rlim_name[i] != NULL)
+   seq_puts(s, rlim_name[i]);
+   else
+   seq_printf(s, "rlimit-%d", i);
+
+   if (rlim[i].rlim_cur == RLIM_INFINITY)
+   seq_puts(s, " unlimited");
+   else
+   seq_printf(s, " %lu", (unsigned long)rlim[i].rlim_cur);
+
+   if (rlim[i].rlim_max == RLIM_INFINITY)
+   seq_puts(s, " unlimited\n");
+   else
+   seq_printf(s, " %lu\n", (unsigned 
long)rlim[i].rlim_max);
+   }
+   return 0;
+}
+
+static int rlimit_open(struct inode *inode, struct file *file)
+{
+   struct task_struct *task = proc_task(inode);
+   struct rlimit *rlim = kmalloc(RLIM_NLIMITS * sizeof (struct rlimit), 
GFP_KERNEL);
+   int ret;
+
+   if (!rlim)
+   return -ENOMEM;
+
+   task_lock(task->group_leader);
+   memcpy(rlim, task->signal->rlim, RLIM_NLIMITS * sizeof (struct rlimit));
+   task_unlock(task->group_leader);
+
+   ret = single_open(file, rlimit_show, rlim);
+
+   if (ret)
+   kfree(rlim);
+
+   return ret;
+}
+
+static int rlimit_release(struct inode *inode, struct file *file)
+{
+   struct seq_file *s = file->private_data;
+   kfree(s->private);
+   return single_release(inode, file);
+}
+
+static struct file_operations proc_rlimit_operations = {
+   .open   = rlimit_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= rlimit_release,
+};
+
 #define PROC_BLOCK_SIZE(3*1024)/* 4K page size but our 
output routines use some slack for overruns */
 
 static ssize_t proc_info_read(struct 

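[Editorial note: the same "name soft hard" report can be produced from userspace for one's own process via getrlimit(2). A minimal sketch of the formatting follows; format_limit() is an illustrative helper mirroring the patch's output layout, not part of the patch.]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Format one rlimit pair in the patch's "name soft hard" style. */
static int format_limit(char *buf, size_t n, const char *name,
			const struct rlimit *r)
{
	char soft[32], hard[32];

	if (r->rlim_cur == RLIM_INFINITY)
		snprintf(soft, sizeof soft, "unlimited");
	else
		snprintf(soft, sizeof soft, "%lu", (unsigned long)r->rlim_cur);

	if (r->rlim_max == RLIM_INFINITY)
		snprintf(hard, sizeof hard, "unlimited");
	else
		snprintf(hard, sizeof hard, "%lu", (unsigned long)r->rlim_max);

	return snprintf(buf, n, "%s %s %s", name, soft, hard);
}
```

At runtime one would fill the struct with, e.g., getrlimit(RLIMIT_STACK, &r) and print one line per resource, reproducing the sample output shown above.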
Re: Disturbing news..

2001-03-28 Thread Bill Rugolsky Jr.

On Wed, Mar 28, 2001 at 04:32:44PM +0200, Romano Giannetti wrote:
> But with the new VFS semantics, wouldn't be possible for a MUA to make a
> thing like the following: 
> 
> spawn a process with a private namespace. Here a minimun subset of the
> "real" tree (maybe all / except /dev) is mounted readonly. The private /tmp
> and /home/user are substituted by read-write directory that are in the
> "real" tree /home/user/mua/fakehome and /home/user/mua/faketmp. In this
> private namespace, run the "untrusted" binary. 

Possible and desirable.  You have to turn off access to all the other
dangerous namespaces though, like socket() and shmat(), and make sure
that nosuid and devices are handled properly. Done right, the only thing
that untrusted code can do is consume a little memory, CPU, and disk,
but that's why there are limits and a scheduler. :-)

One might even want to add back limited access to those other namespaces
by implementing a filesystem interface, ala Plan-9/Inferno.

Regards,

   Bill Rugolsky
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: v2.4.0test9 NFSv3 server woes Linux-->Solaris

2000-10-05 Thread Bill Rugolsky Jr.

On Thu, Oct 05, 2000 at 04:58:39PM +0200, David Weinehall wrote:
> Using the NFSv3 server in the v2.4.0test9 kernel (I haven't tested any
> earlier v2.3.xx or v2.4.0testx kernels) I'm having problems with
> (for instance) compiling glib.
> 
> The setups I've tried are:
> 
> wsize = rsize = 1kB
> Linux NFSv3 server --> Linux NFSv3 client (UDP mounted) -- WORKS
> 
> wsize = rsize = 32kB
> Linux NFSv3 server --> Solaris NFSv3 client (UDP mounted) -- BROKEN!
> Linux NFSv3 server --> Solaris NFSv3 client (TCP mounted) -- BROKEN!
> 
> wsize = rsize = 2kB
> Linux NFSv3 server --> Solaris NFSv3 client (UDP mounted) -- BROKEN!
> Linux NFSv3 server --> Solaris NFSv3 client (TCP mounted) -- BROKEN!

What do you mean by "BROKEN"?  Anything in syslog?  tcpdumps?
 
Why not test wsize=rsize=1K for the Linux/Solaris combo?
Also, I was unaware that TCP server was supposed to work in 2.4.0-test9.
(It isn't in the 2.2.x patches.)  Are you sure that Solaris is not falling
back to UDP mounts?
 
> Oh, by the way, is there ANY sane reason whatsoever behind the decision
> that the Linux NFSv3 client in the v2.2.18pre15 kernel defaults to wsize
> = rsize = 1kB and the NFSv3 client in v2.4.0test9 defaults to
> wsize = rsize = 4kB?! Every (?) other implementation of NFSv3 defaults
> to 32kB... At least when mounting Solaris NFSv3 server --> Linux NFSv3
> client, 32kB rsize & wsize works perfectly fine (at least for
> v2.2.18pre15, but I hope that v2.4.0test9 isn't worse in this regard.)

The conservatism is similar to that for IDE tuning: many folks have
broken hardware/drivers/networks, and sizes above 1K result in
fragmentation/potential packet loss/RPC timeouts/write errors/corruption.
So for 2.2.x, Alan has decreed 1K size.  2.4.0-test is a bit more aggressive.

32K is fine, if you are using TCP.  But I just went through a day-long
session with NetApp after they updated their default UDP size from 8K
to 32K.  32K UDP == 23 fragments.  On a switched network that may be
fine, but with a hub and a router it spells near-certain death.  Our Solaris
clients were generating numerous RPC timeouts on writes.  After setting
the NetApp F720 server default back to 8K, the timeouts went away.

You may want to take this over to [EMAIL PROTECTED]; also, tcpdumps
are helpful.

Regards,

   Bill Rugolsky
   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/




Re: Linux 2.2.18pre1

2000-09-01 Thread Bill Rugolsky Jr.

On Fri, Sep 01, 2000 at 12:05:03PM +0100, Alan Cox wrote:
> People would appreciate lots of things, but stability happens to come first.
> That's why it's primarily focused on driver stuff, not on revamping the
> internals.  Right now I'm not happy with the NFSv3 stuff I last looked at, and
> it seems to still contain things Linus rejected a while back.
 
Alan, would you please describe in a few words which items are
problematic?  Are the changes simply too extensive for your comfort?
Are there still userland compatibility issues that you want ironed out?

If the problem is more specific than that, is it Trond's client code:
the SunRPC rewrite?  Caching?  Credentials?

This is not a *me too!* request that you put it in; those of us with
environments heavily dependent on NFS have been patching for so long,
it hardly matters any more, especially now that the patches have been
consolidated by Trond and Dave Higgens, and 2.4 is shaping up.  I'm just
curious as to what the perceived problems are, since you inevitably see
lots of reports of breakage that never find their way onto these lists.

Lately I just point everybody who asks me about NFS breakage to H.J. Lu's
rpms, and they go away happy.

Regards,

   Bill Rugolsky
   [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


