date:20050128

Re: Patch 4/6 randomize the stack pointer

2005-01-28 Thread John Richard Moser

Alright, I'll bite.

Someone told me to bring this up after reading all the complaints about
breakage, so again we get back to PaX.  I'm more interested in "this
patch is bad" than "PaX is better" for this argument, but whatever.


Compatibility has been repeatedly mentioned here.  Breaking things has
been mentioned.  Things inside the distro won't break because the distro
maintainers mark them; third party vendors should mark them too.  If
they don't, things STILL won't break, provided the distribution is
crafted to do the really ugly things (which I hate) needed to enter
ultimate-global-super-compatibility-mode.

Last year I started on my master's thesis for computer security.
Granted, next semester I'm *hoping* to get my AS in computer science,
but I wanted to start writing early.  So out of the 18 pages, I'll pull
one little bit from section 4 (Deployment) subsection 2 (Executable
Space Protections).  This'll be a big read.

---CUT---
2.  Executable Space Protections

Executable Space Protections can be deployed on many architectures using
PaX.  A number of methods of deployment could be used, each with its
own ratio of security vs. compatibility.  The recommended course of
action is to allow the administrator to control how protections are
applied, either by setting an automatic default method or by being asked
where protections should be applied on a case by case basis.

Any binary which may function under full restrictions should be set to
function under full restrictions automatically, without asking.  There
may be an option to ask the administrator in every case including those
where the greatest security is used by default; but in most cases, the
administrator will not want to be bothered unless a security concern is
raised.

There are three states for restrictions.  In the Default state, the
restriction is not explicitly enabled or disabled; PaX decides whether
to use the restriction based on the Softmode setting.  If the system is
in Softmode, PaX does not enable restrictions in the Default state; if
the system is not in Softmode, PaX enables restrictions in the Default
state.  Contrastingly, restrictions in the Enabled state are enabled
under PaX regardless of Softmode, while restrictions in the Disabled
state are disabled under PaX regardless of Softmode.
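
As a rough illustration of the three states, the decision can be written
as one small function.  This is only a sketch: the enum and function
names are made up here and are not PaX identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; PaX itself stores per-binary flags differently. */
enum restriction_state { STATE_DEFAULT, STATE_ENABLED, STATE_DISABLED };

/* Returns true when the restriction is actually enforced. */
bool restriction_active(enum restriction_state state, bool softmode)
{
    switch (state) {
    case STATE_ENABLED:
        return true;        /* enabled regardless of Softmode */
    case STATE_DISABLED:
        return false;       /* disabled regardless of Softmode */
    case STATE_DEFAULT:
        return !softmode;   /* Softmode decides */
    }
    assert(!"unknown restriction state");
    return false;
}
```

Only the Default state ever consults the Softmode setting; the other two
states ignore it.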

Here, the term "compatibility" is used to indicate how much software
continues to work.  A system with low compatibility will have software
that does not run due to security restrictions, while a system with high
compatibility will run most if not all software, including third party
software.

There are four basic methods of PaX flag control, each detailed briefly
below.  As stated above, the administrator should choose which method to
employ.

A.  Manual Control

  Manual Control is not recommended as a default.  Under Manual Control,
  all restrictions remain in the Default state on all binaries at
  installation time.  This imposes the most added administrative duty
  and the least compatibility.

B.  Selective Disable

  Selective Disable is the most basic form of control, allowing the
  implementation to ship with everything working.  Under Selective
  Disable, binaries known to break due to PaX restrictions have those
  restrictions set to the Disabled state when installed, leaving the
  rest in the Default state.  This relieves most administrative duty and
  increases compatibility, although third party binaries may not come
  marked.

C.  Inheritive Selective Disable

  Inheritive Selective Disable is similar to Selective Disable, except
  that libraries are also marked and tabs are kept on these.  When
  software is installed which uses a library, the Disabled features of
  the executable and each library are masked together to come up with
  the final mask to apply to the executable.  These masks can later be
  generated for third party programs with an administrative tool in
  order to enhance compatibility further, although third party programs
  and libraries that need markings of their own, beyond those already
  needed by the libraries they use, will still break.
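
  The masking step might look roughly like the following.  The flag
  names echo PaX features, but the bit values and the function are
  assumptions made for this sketch, not the real on-disk encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; not the real PaX flag encoding. */
#define R_PAGEEXEC 0x1u
#define R_MPROTECT 0x2u
#define R_RANDMMAP 0x4u

/*
 * OR together the Disabled mask of the executable and the Disabled
 * mask of every library it uses; the result is applied to the executable.
 */
unsigned inherited_disable_mask(unsigned exe_mask,
                                const unsigned *lib_masks, size_t nlibs)
{
    unsigned mask = exe_mask;

    assert(lib_masks != NULL || nlibs == 0);
    for (size_t i = 0; i < nlibs; i++)
        mask |= lib_masks[i];
    return mask;
}
```

  An executable with nothing disabled that links against one library
  needing MPROTECT off and another needing RANDMMAP off ends up with
  both disabled.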

D.  Selective Enable

  Selective Enable is the only method leveraging Softmode to enhance
  compatibility.  It is also the only method which will leave third
  party binaries completely exposed with no reason aside from that they
  are not explicitly packaged with a set of listed restrictions.  Under
  Selective Enable, executable binaries have all restrictions except
  those known to break them set to Enabled, leaving the rest in the
  Default state.  Third party binaries which come with no markings will
  have no restrictions in Softmode, and so full compatibility is reached
  with the maximum justifiable trade-off in the range of executables
  protected by PaX.

The above methods become progressively more compatible, but at the same
time less secure.  Both the standard and Inheritive variations of the
Selective Disable method are about on par in principle;

Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread linux

> It adds support for advanced networking-related randomization, in
> concrete it adds support for TCP ISNs randomization

Er... did you read the existing Linux TCP ISN generation code?
Which is quite thoroughly randomized already?

I'm not sure how the OpenBSD code is better in any way.  (Notice that it
uses the same "half_md4_transform" as Linux; you just added another copy.)
Is there a design note on how the design was chosen?


I don't wish to be *too* discouraging to someone who's *trying* to help,
but could you *please* check a little more carefully in future to
make sure it's actually an improvement?

I fear there's some ignorance of what the TCP ISN does, why it's chosen
the way it is, and what the current Linux algorithm is designed to do.
So here's a summary of what's going on.  But even as a summary, it's
pretty long...


First, a little background on the selection of the TCP ISN...

TCP is designed to work in an environment where packets are delayed.
If a packet is delayed enough, TCP will retransmit it.  If one of
the copies floats around the Internet for long enough and then arrives
long after it is expected, this is a "delayed duplicate".

TCP connections are between (host, port, host, port) quadruples, and
packets that don't match some "current connection" in all four fields
will have no effect on the current connection.  This is why systems try
to avoid re-using source port numbers when making connections to
well-known destination ports.

However, sometimes the source port number is explicitly specified and
must be reused.  The problem then arises, how do we avoid having any
possible delayed packets from the previous use of this address pair show
up during the current connection and confuse the heck out of things by
acknowledging data that was never received, or shutting down a connection
that's supposed to stay open, or something like that?

First of all, protocols assume a maximum packet lifetime in the Internet.
The "Maximum Segment Lifetime" was originally specified as 120 seconds,
but many implementations optimize this to 60 or 30 seconds.  The longest
time that a response can be delayed is 2*MSL - one delay for the packet
eliciting the response, and another for the response.

In truth, there are few really-hard guarantees on how long a packet can
be delayed.  IP does have a TTL field, and a requirement that a packet's
TTL field be decremented for each hop between routers *or each second of
delay within a router*, but that latter portion isn't widely implemented.
Still, it is an identified design goal, and is pretty reliable in
practice.


The solution is twofold: First, refuse to accept packets whose
acks aren't in the current transmission window.  That is, if the
last ack I got was for byte 1000, and I have sent 1100 bytes
(numbers 0 through 1099), then if the incoming packet's ack isn't
somewhere between 1000 and 1100, it's not relevant.  If it's
950, it might be an old ack from the current connection (which
doesn't include anything interesting), but in any case it can be
safely ignored, and should be.
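
A minimal sketch of that window test, using the 1000/1100 numbers from
the example.  The parameter names are borrowed from RFC 793's SND.UNA
and SND.NXT; the function itself is illustrative and is not taken from
any kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * snd_una: oldest byte not yet acknowledged (1000 in the example)
 * snd_nxt: next byte we will send           (1100 in the example)
 * Unsigned subtraction keeps the comparison correct when the 32-bit
 * sequence space wraps around.
 */
bool ack_acceptable(uint32_t snd_una, uint32_t snd_nxt, uint32_t ack)
{
    return (uint32_t)(ack - snd_una) <= (uint32_t)(snd_nxt - snd_una);
}
```

With snd_una = 1000 and snd_nxt = 1100, an ack of 1050 is accepted and
an ack of 950 is ignored.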

The only remaining issue is, how to choose the first sequence number
to use in a connection, the Initial Sequence Number (ISN)?

If you start every connection at zero, then you have the risk that
packets from an old connection between the same endpoints will
show up at a bad time, with in-range sequence numbers, and confuse
the current connection.

So what you do is, start at a sequence number higher than the
last one used in the old connection.  Then there can't be any
confusion.  But this requires remembering the last sequence number
used on every connection ever.  And there are at least 2^48 addresses
allowed to connect to each port on the local machine.  At 4 bytes
per sequence number, that's a Petabyte of storage...
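
The arithmetic behind that figure, as a small sketch (the function name
is made up for illustration): a 32-bit address plus a 16-bit port gives
2^48 possible peers, and 4 bytes each comes to 2^50 bytes.

```c
#include <stdint.h>

/* Bytes needed to remember one sequence number per possible remote endpoint. */
uint64_t isn_table_bytes(unsigned addr_bits, unsigned port_bits,
                         unsigned bytes_per_isn)
{
    return ((uint64_t)1 << (addr_bits + port_bits)) * bytes_per_isn;
}
```

isn_table_bytes(32, 16, 4) is 2^50, about 1.1 * 10^15 bytes, and that is
per local port.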

Well, first of all, after 2*MSL, you can forget about it and use
whatever sequence number you like, because you know that there won't
be any old packets floating around to crash the party.

But still, it can be quite a burden on a busy web server.  And you might
crash and lose all your notes.  Do you want to have to wait 2*MSL before
rebooting?


So the TCP designers (I'm now on page 27 of RFC 793, if you want to follow
along) specified a time of day based ISN.  If you use a clock to generate
an ISN which counts up faster than your network connection can send
data (and thus crank up its sequence numbers), you can be sure that your
ISN is always higher than the last one used by an old connection without
having to remember it explicitly.

RFC 793 specifies a 250,000 bytes/second counting rate.  Most
implementations since Ethernet used a 1,000,000 byte/second counting
rate, which matches the capabilities of 10base5 and 10base2 quite well,
and is easy to get from the gettimeofday() call.
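
As a sketch of the plain clock-driven scheme only (no randomization, so
this is not what Linux does), with function names invented here:

```c
#include <stdint.h>

/* RFC 793 rate: the ISN clock ticks once every 4 microseconds (250,000/s). */
uint32_t isn_rfc793(uint64_t usec)
{
    return (uint32_t)(usec / 4);
}

/* The later common choice: one tick per microsecond (1,000,000/s). */
uint32_t isn_1mhz(uint64_t usec)
{
    return (uint32_t)usec;
}

/* Whole seconds until the 32-bit ISN space wraps at a given tick rate. */
uint32_t isn_wrap_seconds(uint32_t ticks_per_second)
{
    return (uint32_t)(((uint64_t)1 << 32) / ticks_per_second);
}
```

The usec argument is assumed to be a microsecond timestamp such as one
derived from gettimeofday().  At 1,000,000 ticks per second the counter
wraps after 4294 seconds, a little over 71 minutes; at the RFC 793 rate
it takes four times as long.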

Note that there are two risks with this.  First, if the connection actually
manages to go faster than the ISN clock, the next connection's ISN will
be in the middle of the space the

help me to know when ethernet header added to packet by eth_header function

2005-01-28 Thread linux lover


Hello,
  Can anybody explain to me how the ethernet header is
added to every outgoing packet? I checked the eth.c file and
found eth_header, which is used for adding the ethernet
header to an skbuff packet. But does each packet call
this function? I think not, as there is a header cache
function that caches the ethernet header entry.
  So my main question is: when my machine
first contacts any other PC on the LAN, does it call
eth_header, and when it needs to send any type of packet
to the same machine again, is eth_header_cache used? Is
that right?
  Then is it possible that, if my machine contacts
a machine that has no cached entry, eth_header
will be called again to build the eth header for that
machine?
  All in all, when will the functions in eth.c
(eth_header, eth_header_cache, eth_header_parse,
eth_header_cache_update) be called or not called?
  Please kindly help me to identify it.
Thanks in advance.
regards,
linux_lover.




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: userspace vs. kernelspace address

2005-01-28 Thread Om

On Fri, Jan 28, 2005 at 01:40:51PM -0800, Rock Gordon wrote:
> Hi everbody,
> 
> Thanks for your replies.
> 
> However I think my copy_to_user and copy_from_user are
> failing since the kernel-mode thread is copying data
> into another process's address space, and I am not
> sure how to do this. Do the get_fs() and set_fs()
> combinations let you do that? If not, then how do I do
My knowledge of kernel threads is limited, but I think it is not possible to
access any userspace address from a kernel thread, because kernel threads do
not have a user address space: their task_struct->mm field is NULL.
I am not sure whether set_fs and get_fs help in this case.

HTH,
Om

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-01-28 Thread Jack O'Quin

Peter Williams <[EMAIL PROTECTED]> writes:
>
> If the average usage rate is estimated over longer periods it will be
> lower allowing lower limits to be used.  Also if the task's own usage
> rate estimates are used to test the limits then the limit can be lower.
>
> If the default limits can be made sufficiently small then the
> temptation to use this feature by "ordinary" applications will
> disappear.
>
> I'm not an expert but I imagine that the CPU usage rates of most RT
> tasks taken over reasonably long time intervals is quite low and
> therefore the default limits could also be quite low without adversely
> affecting the programs that this mechanism is meant to help.

True for some, but definitely not for all.  

When a system was purchased specifically to do some realtime job, it
often makes sense to dedicate large chunks of the main processor to
realtime number crunching.  Mass-produced general-purpose processors
have excellent price/performance ratios.  There's no good reason not
to take advantage of that.

People commonly run heavy Fast Fourier Transform or reverb
calculations in realtime threads.  They may use up as much of the CPU
as the user/owner is willing to allocate.  With soft realtime, it's
hard to push this reliably beyond about 70-80%.  But, those numbers
are definitely practical.
-- 
  joq

Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread Andi Kleen

Stephen Hemminger <[EMAIL PROTECTED]> writes:

> On Fri, 28 Jan 2005 12:45:17 -0800
> "David S. Miller" <[EMAIL PROTECTED]> wrote:
>
>> On Fri, 28 Jan 2005 21:34:52 +0100
>> Lorenzo Hernández García-Hierro <[EMAIL PROTECTED]> wrote:
>> 
>> > Attached the new patch following Arjan's recommendations.
>> 
>> No SMP protection on the SBOX, better look into that.
>> The locking you'll likely need to add will make this
>> routine serialize many networking operations which is
>> one thing we've been trying to avoid.
>> 
>
> per-cpu would be the way to go here.

I don't think so, no - with just per-cpu counters you
risk nearby duplicates, which can cause even easier data corruption
(e.g. during ip fragment reassembly - it is already very weak
and making it weaker is probably not a good idea)

If you want SMP performance for ipids you can resurrect
the old "cookie jar" approach I used in the 2.4 time frame to get
rid of inetpeers. The idea was that you have global state,
and each CPU would regenerate some numbers from the state,
then store them in a private "jar" and use them up, then
look at the global state with locking again etc.

This can be tuned by how big the jar is - the bigger the jar,
the smaller the sequence space (risky for 16bit ipids), but
the better the scalability.
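
A very rough userspace sketch of the jar idea.  All names are invented
here, a C11 atomic counter stands in for the locked global state, and a
real version would draw from a randomized generator rather than a plain
counter.

```c
#include <stdatomic.h>
#include <stdint.h>

#define JAR_SIZE 8               /* tunable: bigger jar, fewer global accesses */

static atomic_uint global_next;  /* stand-in for the locked global state */

struct id_jar {                  /* one per CPU, touched only by its owner */
    uint16_t ids[JAR_SIZE];
    unsigned left;               /* IDs not yet handed out */
};

static void jar_refill(struct id_jar *jar)
{
    /* Reserve JAR_SIZE IDs in one step; this is the only shared access. */
    unsigned base = atomic_fetch_add(&global_next, JAR_SIZE);

    for (unsigned i = 0; i < JAR_SIZE; i++)
        jar->ids[i] = (uint16_t)(base + i);
    jar->left = JAR_SIZE;
}

/* Hand out the next ID from this CPU's jar, refilling when it is empty. */
uint16_t jar_take(struct id_jar *jar)
{
    if (jar->left == 0)
        jar_refill(jar);
    return jar->ids[JAR_SIZE - jar->left--];
}
```

Each jar holds a disjoint batch, so two CPUs never hand out the same ID
within a cycle, and the global state is touched once per JAR_SIZE IDs.
The cost is that IDs sitting unused in jars shrink the effective
distance before a 16-bit value comes around again.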

But before doing anything like this I would recommend
that someone skilled in cryptography evaluates the security
of these functions carefully and sees if they actually have any
advantages. I remember that Andrey S. broke
some of the "cool" "secure" openbsd IDs easily some years ago.

At least for ipids I'm utterly sceptical. 16bits are just
hopeless.

-Andi

Re: compat ioctl for submiting URB

2005-01-28 Thread Andi Kleen

Christopher Li <[EMAIL PROTECTED]> writes:

> VMware is a big user of the usbdevfs, we translate guest USB
> IO to usbdevfs, by submitting URB. On the x86_64 system, we
> need those compatible ioctl for submitting URBs. For now we
> make a hack to submit it through the vmmon driver. But that
> is very ugly. 
>
> I do want this problem get fixed in the linux kernel eventually.
> I have been toying with two different ways to solve it. It seems
> that it is unavoidable to get hands dirty in the usbdevfs internals.
> The first one is just educate the usbdevfs to know about the 32 bit
> URB ioctls. So it don't need to keep around a bounce buffer.

Looks reasonable from a first look.

Issues:
- Should use CONFIG_COMPAT, not x86-64 specific symbols
- Why can't you set URB_COMPAT transparently in the emulation
layer?  Then existing applications would hopefully work without
changes, right?
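
To make the compat point concrete, here is a heavily simplified
userspace sketch of what such a translation does.  Both structs and the
function are hypothetical and are not the real usbdevfs definitions; in
the kernel this kind of code sits under #ifdef CONFIG_COMPAT and widens
user pointers with compat_ptr().

```c
#include <stdint.h>

/* Hypothetical, simplified layouts; not the real usbdevfs structs. */
struct urb32 {                 /* as a 32-bit process lays it out */
    uint32_t buffer;           /* user pointer, only 32 bits wide */
    int32_t  buffer_length;
};

struct urb_native {            /* as the 64-bit kernel expects it */
    void    *buffer;
    int32_t  buffer_length;
};

/* Widen the 32-bit user pointer and copy the rest field by field. */
struct urb_native urb_from_compat(const struct urb32 *u32)
{
    struct urb_native u;

    u.buffer = (void *)(uintptr_t)u32->buffer;
    u.buffer_length = u32->buffer_length;
    return u;
}
```

Nothing in this is specific to x86-64, which is why the generic symbol
fits better than an architecture-specific one.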

You may also want to preserve the __user casts, otherwise
Al Viro and other sparse users will be unhappy.

Thanks for attacking this long standing problem.

-Andi


Re: Patch 4/6 randomize the stack pointer

2005-01-28 Thread John Richard Moser

Rik van Riel wrote:
> On Thu, 27 Jan 2005, John Richard Moser wrote:
> 
>> Arjan van de Ven wrote:
> 
> 
>>>> Is this one any worse?
>>>
>>> yes.
>>>
>>> oracle, db2 and similar like to mmap 2Gb or more *in one chunk*.
>>
>>
>> Special case?
> 
> 
> Absolutely, but ...
> 
>> Can I get this put into perspective?  How much more important is "Good"
>> randomization versus "not breaking Oracle," which becomes "No
>> randomization"
> 
> 
> 1) quite a lot of Linux users do use Oracle, DB2 or do
>    scientific calculations - distributions cannot afford
>    to break those applications, the default has to work
>    for everybody
> 

So package oracle marked to not use the randomization.

> 2) "weaker" randomization (2MB) is still effective if the
>    stack is non-executable, so the "load a bunch of NOPs"
>    approach won't work - this is what Fedora and RHEL use
> 

"In some cases, this does nothing, so we'll leverage those cases as an
argument for why this should go in, even though we're effectively saying
'please add useless junk to the kernel'"

No dear, please, real ASLR has a point, try not to castrate it.

> 3) it is not as theoretically strong as what you propose,
>    but having the "weaker" scheme enabled is definitely
>    more secure than having the "stronger" scheme disabled
>    because it breaks applications
> 

*takes the glass pipe away*

Well, I'm going to give random constructive criticism on red hat as a
whole now, so try learning something from it instead of taking it as
flamebait.  I just ate and feel particularly like talking for no reason
about half-relevant topics.

I actually just tried to paxtest a fresh Fedora Core 3, unadulterated,
that I installed, and it FAILED every test.  After a while, spender
reminded me about PT_GNU_STACK.  It failed everything but the Executable
Stack test after execstack -c *.  The randomization tests gave
13(heap-etexec), 16(heap-etdyn), 17(stack), and none for main exec
(etexec,et_dyn) or shared library randomization.

Also, before you say it, I read, comprehended, and analyzed the source.
This was PaXtest 0.9.6, and I did specific traces (after changing
body.c to prevent it from forking) to look for mprotect() and mmap()
calls and find out what they do (I saw probably glibc getting mmap()ed
in, there wasn't anything in the source doing the mmap() calls I saw).
There were no dirty tricks to mprotect() a high area of memory, which is
something Ingo called foul on in 0.9.5.

My point isn't that ES failed (the above discourse was to preempt Ingo
calling a technical foul on paxtest again); but that I forgot about
PT_GNU_STACK.  How many vendors are going to forget about PT_GNU_STACK
and its automatic markings and think they're protected?  Do they even
know/care?  "it works so we'll just keep doing what we're doing, if we
break the protection it'll adjust to let us" is pretty good strategy to
a lot of people who don't want to be assed with your security crap.

Another concern of mine, execstack gives X for PT_GNU_STACK and - for
cleared PT_GNU_STACK.  With many binaries I get shipped (flash and java
plug-ins), there's a ? when I check them, so I clear the flag and they
work.  Note that I'm referring to the Java PLUG-IN, not the JRE itself;
you can have full PaX restrictions on Firefox and have working Java in
Firefox, because java_vm is a separate process :) (you have to chpax
java itself).  Firefox happens to be a high-risk application too IMHO
(it's pointed at the net and exposes Gecko bugs for HTML and Javascript
parsing, libjpeg and libpng bugs, and God knows what else), and I don't
want it accidentally getting an executable stack.

Finally, although an NX stack is nice, you should probably take into
account IBM's stack smash protector, ProPolice.  Any attack that can
evade SSP reliably can evade an NX stack; but ProPolice protects from
other overflows.  Now I'm sure RH is over there inventing something that
detects buffer overflows at compile time and misses or warns about the
ones it can't identify:

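if (strlen(a) > 4)
  a[4] = '\0';   /* truncate to 4 chars + NUL so it fits in b[5] */
foo(a);

void foo(char *a) {
   char b[5];
   strcpy(b,a);  /* no bounds check here; relies on the caller above */
}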

This code is safe, but you can't tell from looking at foo().  You don't
get a look at every other object being compiled against this one that
may call foo() either.  So compile time buffer overflow detection is a
best-effort at best.

ProPolice protects local variables with 0 overhead; passed arguments
with a few instructions; and the return pointer and stack frame pointer
with a couple instructions.  At runtime.  Want to impress me?  Actually
deploy ProPolice instead of showing up 3 years from now waving around
your own patch that you wrote that half-implements half of it.  If you
want "something better," it's GPL, so grab it and start hacking.

Anyway, that's my far-far-far offtopic rant for the day.
- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly

Re: compat ioctl for submiting URB

2005-01-28 Thread Andi Kleen

Christopher Li <[EMAIL PROTECTED]> writes:

> This patch is for the case that running 32 bit application on
> a 64 bit kernel. So far only x86_64 allow you to do that.
>
> I am not aware of other 64bit architecture need the 32bit
> emulation.

A lot of them do. Just use CONFIG_COMPAT instead.

-Andi

Re: [Discuss][i386] Platform SMIs and their interferance with tsc based delay calibration

2005-01-28 Thread Andi Kleen

Venkatesh Pallipadi <[EMAIL PROTECTED]> writes:
> +
> + /*
> +  * If the upper limit and lower limit of the tsc_rate is more than
> +  * 12.5% apart.
> +  */
> + if (pre_start == 0 || pre_end == 0 ||
> + (tsc_rate_max - tsc_rate_min) > (tsc_rate_max >> 3)) {
> + printk(KERN_WARNING "TSC calibration may not be precise. " 
> +"Too many SMIs? "
> +"Consider running with \"lpj=\" boot option\n");
> + return 0;
> + }

I think it would be better to rerun it a few times automatically
before giving up. This way it would hopefully work transparently but slower
for most users. The message is also too obscure to be usable and needs
more explanation.

And also in case the platforms in questions support EM64T 
x86-64 would need to be changed too :)
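
A rough sketch of the retry idea, reusing the 12.5% test from the quoted
patch.  The retry count, the callback type and the fake measurement are
all assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define CALIBRATE_RETRIES 5   /* assumed retry budget, not from the patch */

/* Same test as the patch: a spread wider than 12.5% (max >> 3) is suspect. */
bool tsc_rates_consistent(unsigned long rate_min, unsigned long rate_max)
{
    assert(rate_max >= rate_min);
    return (rate_max - rate_min) <= (rate_max >> 3);
}

/* Hypothetical callback: takes one sample, returns false if it failed. */
typedef bool (*measure_fn)(unsigned long *rate_min, unsigned long *rate_max);

/* Returns an accepted rate, or 0 if every attempt looked disturbed. */
unsigned long calibrate_with_retry(measure_fn measure)
{
    for (int attempt = 0; attempt < CALIBRATE_RETRIES; attempt++) {
        unsigned long lo, hi;

        if (measure(&lo, &hi) && tsc_rates_consistent(lo, hi))
            return lo + (hi - lo) / 2;
    }
    return 0;   /* only now fall back to the warning */
}

/* Stand-in measurement: the first two samples are disturbed, the third is clean. */
static bool fake_measure(unsigned long *rate_min, unsigned long *rate_max)
{
    static int calls;

    *rate_max = 1000;
    *rate_min = (++calls < 3) ? 700 : 990;
    return true;
}
```

With fake_measure the first two attempts are rejected and the third is
accepted, so the caller never sees the warning unless all attempts fail.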

-Andi

Re: Compactflash (Sandisk 512) hangs on access

2005-01-28 Thread Prashant Viswanathan

> > I have been trying unsuccessfully over the last 2 weeks to get
> > compactflash working on my Linux system based on mini-ITX (Via CL
> > motherboard, pentium compatible).
> >
> > I use a CF->IDE adapter to access it just like a IDE hard disk. My
> > compactflash is Sandisk SDCFH-512. Linux can detect it. I can even
> > mount it and do a fdisk on it. However, the moment I try to do
> > anything substantial like copy multiple files or copy 1000 blocks
> > using dd, I lose access to it. Linux loses access to it totally. I
> > can't even do a fdisk on it. I get an error like "Unable to open
> > /dev/hdc".


On Thu, 27 Jan 2005 22:07:35 +0100, Willy Tarreau <[EMAIL PROTECTED]> wrote:
> Have you checked that the power connector really provides 5V to the
> IDE-CF adapter ? I had the exact same behaviour 5 years ago with a power
> wire cut. Signal lines were powerful enough to bring power to the cheap
> flash (16 MB), I could even read it, most times. The kernel almost always
> booted from it, and when it turned to mount the ext2 fs R/W, it hanged. I
> finally partially destroyed it this way, and it got several defects which
> could not be cleaned with a simple write or format.
> 
> Other than that, I have lots of CF cards on IDE adapters (some on motherboard,
> some hand-made, some bought to serious makers), and never ran into such
> problems since.
> 
> Willy

The power connector is fine.

I also disabled DMA (some suggestion on this newsgroup to a similar
error) and now I can't turn it back on.


everest root # hdparm -d1 /dev/hdc

/dev/hdc:
setting using_dma to 1 (on)

Re: Patch 4/6 randomize the stack pointer

2005-01-28 Thread John Richard Moser

Ingo Molnar wrote:
> * Paulo Marques <[EMAIL PROTECTED]> wrote:
> 
> 
>>I really shouldn't feed the trolls, but this must be the most silly
>>piece of code I saw on this mailing list in a very long time (and
>>there have been some good examples over time).
> 
> 
> yeah.
> 
> 
>>The stack randomization doesn't prevent some sort of attacks (like
>>return into libc, etc.) and given a small randomization it might be
>>possible to write an exploit with a long sequence of NOP's and a
>>return address somewhere in there (the attacker wouldn't know exactly
>>where, but it wouldn't matter anyway). If we are able to write 'N'
>>NOP's then we get a 'N'/64k chance that the exploit works.
> 
> 
> yeah. NOP techniques can always be used to 'chop off bits' from any
> randomization, in case the address of the payload is not known. Both
> instruction NOPs for shellcode and 'parameter NOPs' ("././././" strings
> or "/bin/sh\0/bin/sh\0" strings) can be used.
> 
> but there is no fundamental theoretical difference between a 256 MB
> randomization (as PaX uses) and a 2 MB randomization (Fedora) in terms
> of characteristics: what is brute-force in one is brute-force in the
> other as well, with a factor of overhead difference of 128. (which makes
> the attack 128 times longer, but has no real difference to security.)
> 

You said:

  yeah. NOP techniques can always be used to 'chop off bits' from any
  randomization, in case the address of the payload is not known. Both
  instruction NOPs for shellcode and 'parameter NOPs' ("././././"
  strings or "/bin/sh\0/bin/sh\0" strings) can be used.

Bear with me here, I'm out of things I've studied and researched, so now
we're going to go into "junk coming out of my head."  It's either going
to be very painful, or very funny, or both at the same time.  No, I
don't care that I'm about to look like an ass.

You're starting with 64K of randomization, and moving to 2M later.  The
stack is how big?  4-8M?  I don't know, I'm guessing; I saw earlier some
code that said that the stack was defined as having at least 8M in some
header, which "should be enough for most people" so I assume it's almost
if not over 2M.

Cut off however much data you know is going to be pushed already (which
is what we've been calling 'the size of the stack'), compare that with
the randomization, if it's bigger than the randomization period, you
have chopped off all randomization.  If not, you've probably got better
than a 50-50.

Because the size of a 'bit' grows as your entropy grows, chopping 2 megs
off the randomization at 256M is significantly less than 1 bit (128M is
1 bit), while it's about 9 bits when considering 2 megs of randomization.

Short version:  I've got a better chance of finding an exploit that lets
me just knock-off a couple megs of randomization than I do of brute
forcing it.  I've got a WAY better chance of brute-forcing in one or two
tries if I can knock most of the randomization off.
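
To put rough numbers on that, a small sketch (the function name is made
up here): if one try covers `window` bytes out of `range` bytes of
randomization, each try hits with probability window/range, so on the
order of range/window tries are needed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * range:  bytes of randomization (e.g. 2 MB or 256 MB)
 * window: bytes the attacker can cover per try (NOP sled, known slack)
 * Returns the approximate number of tries; a window at least as large
 * as the range needs just one.
 */
uint64_t tries_needed(uint64_t range, uint64_t window)
{
    assert(window > 0);
    if (window >= range)
        return 1;
    return (range + window - 1) / window;   /* round up */
}
```

Covering 2 MB per try against 256 MB of randomization needs about 128
tries; against 2 MB of randomization it needs one.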

> so the attempt of our beloved troll to paint 2 MB of randomization as
> 'weak' and 256 MB randomization as 'strong' is i believe misguided: both
> are 'weak' in most of the threat models! (and even for threat types
> where they might be considered 'strong' (e.g. whether a hole is suitable
> to feed a destructive worm), they'll both be considered 'strong'.)
> 

Let's look at GrSecurity's brute force deterrence real quick.  I know
you don't want to hear it, but maybe you should.

The basic idea, and it's an ugly one but you have to forgive people for
trying to do stupid shit like LET BROKEN CODE RUN SAFELY, is to detect a
segfault (jump into unmapped ram, probably miss due to ASLR) or PaX kill
(should also detect a SIGILL) and then flag the highest parent (who is
found via magic I won't get into here).

When flagged with this particular flag, all fork() calls are queued so
that one fork() occurs every 30 seconds.  This is annoying and ugly as
shit, but we're trying to do the unspeakable:  Make broken,
security-hole ridden code safe to run in a hostile environment.

Suddenly the 216 second cycle to brute force PaX's ASLR becomes something
like 3 weeks!  :)

This randomization, after accounting for knocking off all the bits we
can, may take two or three, maybe ten or twenty tries.  This is what,
300-600 oh hell TEN MINUTES.  Yes, you did better than 216 seconds.
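
The arithmetic, as a sketch.  The 30 second interval is from the
description above; the figure of about 2^16 guesses for the full brute
force is an assumption on my part that happens to reproduce the three
week number.

```c
#include <stdint.h>

#define FORK_INTERVAL_SEC 30u   /* one queued fork() per 30 seconds */

/* Wall-clock seconds for `attempts` guesses at one throttled fork() each. */
uint64_t throttled_seconds(uint64_t attempts)
{
    return attempts * FORK_INTERVAL_SEC;
}
```

throttled_seconds(65536) is 1,966,080 seconds, a bit under 23 days,
while 10 to 20 tries cost only 300 to 600 seconds.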

When Brad first tried to bash the concept of his brute force deterrence
through my head, I kept poking at the 30 second interval and the idea of
making about 200 connections BEFORE slamming the server.  The server
will wait about a minute or two before timing you out, so this is fine,
as it takes 3-4 seconds.  He eventually got it through my skull that you
can do the first 200 hits; but then every fork() afterwards is QUEUED,
not executed in batch every 30 seconds.

This makes a difference.  It means you get a little boost with huge
randomization, but not that much.  In your model, however, that "little

Re: compat ioctl for submiting URB

2005-01-28 Thread Al Viro

On Fri, Jan 28, 2005 at 08:33:05PM -0500, Christopher Li wrote:
> This patch is for the case that running 32 bit application on
> a 64 bit kernel. So far only x86_64 allow you to do that.
> 
> I am not aware of other 64bit architecture need the 32bit
> emulation.

Huh???
a) ppc64 runs ppc32 userland
b) sparc64 runs sparc32 userland (as the matter of fact, very
few userland programs are normally built 64bit there - no benefits in
doing that for most applications, it only bloats the memory footprint)
c) mips64 runs mips32 userland
d) itanic, IIRC, runs i386 userland
e) s390x runs s390 userland
f) parisc64 runs parisc32 userland

It's normal situation, not an exception.  The only pair I'm not sure about
is sh64/sh.  AFAICS, the only other supported 64bit platform without 32bit
emulation is alpha - and in that case there's no corresponding 32bit
processor to emulate.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: compat ioctl for submiting URB

2005-01-28 Thread Roland Dreier

Christopher> This patch is for the case that running 32 bit
Christopher> application on a 64 bit kernel. So far only x86_64
Christopher> allow you to do that.

Actually, at least ia64, mips, parisc, ppc64, s390 and sparc64 also
support 32-bit applications on a 64-bit kernel.  All of those
architectures except s390 can use USB.  I guess vmware doesn't run on
most of those architectures but any solution in the mainline kernel
should be generic enough to handle them all.
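
To illustrate why a 32-bit caller needs its own entry point on any 64-bit kernel, here is a minimal user-space sketch.  The names demo_urb32, demo_urb and demo_urb_from_compat are hypothetical and are not the usbdevfs definitions; the point is only that pointer-sized fields arrive as 32-bit values and have to be widened field by field, which is the same work on every architecture.

```c
#include <stdint.h>

/* Layout as a 32-bit process would pass it: pointers are 32-bit integers. */
struct demo_urb32 {
    uint8_t  type;
    uint8_t  endpoint;
    int32_t  status;
    uint32_t flags;
    uint32_t buffer;         /* user pointer, 32 bits wide */
    int32_t  buffer_length;
    uint32_t usercontext;    /* user pointer, 32 bits wide */
};

/* Native layout: the same fields with native pointers. */
struct demo_urb {
    uint8_t  type;
    uint8_t  endpoint;
    int32_t  status;
    uint32_t flags;
    void    *buffer;
    int32_t  buffer_length;
    void    *usercontext;
};

/* Widen each field from the 32-bit layout into the native one. */
void demo_urb_from_compat(struct demo_urb *dst, const struct demo_urb32 *src)
{
    dst->type          = src->type;
    dst->endpoint      = src->endpoint;
    dst->status        = src->status;
    dst->flags         = src->flags;
    dst->buffer        = (void *)(uintptr_t)src->buffer;
    dst->buffer_length = src->buffer_length;
    dst->usercontext   = (void *)(uintptr_t)src->usercontext;
}
```

Nothing in the conversion is x86_64-specific, which is the argument for not tying it to an IA32-only config option.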

 - R.

Re: compat ioctl for submiting URB

2005-01-28 Thread Christopher Li

This patch is for the case that running 32 bit application on
a 64 bit kernel. So far only x86_64 allow you to do that.

I am not aware of other 64bit architecture need the 32bit
emulation.

Chris

On Sat, Jan 29, 2005 at 04:29:51AM +0000, Gianni Tedesco wrote:
> On Fri, 2005-01-28 at 16:23 -0500, Christopher Li wrote:
> > +#ifdef CONFIG_IA32_EMULATION
> > +
> > +   case USBDEVFS_SUBMITURB32:
> > +   snoop(&dev->dev, "%s: SUBMITURB32\n", __FUNCTION__);
> > +   ret = proc_submiturb_compat(ps, p);
> > +   if (ret >= 0)
> > +   inode->i_mtime = CURRENT_TIME;
> > +   break;
> > +#endif
> 
> Why don't other 64bit architectures need this chunk?
> 
> -- 
> // Gianni Tedesco (gianni at scaramanga dot co dot uk)
> lynx --source www.scaramanga.co.uk/scaramanga.asc | gpg --import
> 

Re: [Discuss][i386] Platform SMIs and their interferance with tsc based delay calibration

2005-01-28 Thread Andrew Morton


Please don't send emails which contain 500-column lines?

Venkatesh Pallipadi <[EMAIL PROTECTED]> wrote:
>
> Current tsc based delay_calibration can result in significant errors in
> loops_per_jiffy count when the platform events like SMIs (System Management
> Interrupts that are non-maskable) are present.

This seems like an unsolvable problem.

>  Solution:
>  The patch below makes the calibration routine aware of asynchronous events
> like SMIs. We increase the delay calibration time and also identify any
> significant errors (greater than 12.5%) in the calibration and notify it
> to user. Like to know your comments on this.

I find calibrate_delay_tsc() quite confusing.  Are you sure that the
variable names are correct?

 +  tsc_rate_max = (post_end - pre_start) / DELAY_CALIBRATION_TICKS;
 +  tsc_rate_min = (pre_end - post_start) / DELAY_CALIBRATION_TICKS;

that looks strange.  I'm sure it all makes sense if one understands the
algorithm, but it shouldn't be this hard.  Please reissue the patch with
adequate comments which describe what the code is doing.
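
One possible reading of those two lines, as a sketch and not the patch author's explanation: each jiffy transition is bracketed by a TSC read taken just before it was observed (pre_*) and one taken just after (post_*).  The widest possible span between the two transitions then gives the upper bound on the TSC rate, and the narrowest span gives the lower bound.  The function name and the explicit `ticks` parameter are made up for illustration.

```c
/*
 * Each transition lies somewhere in [pre, post].  Return the upper-bound
 * rate per tick, or 0 if a bracket is missing or the two bounds are more
 * than 12.5% apart (for instance because an SMI stretched one bracket).
 */
unsigned long tsc_rate_from_brackets(unsigned long pre_start,
                                     unsigned long post_start,
                                     unsigned long pre_end,
                                     unsigned long post_end,
                                     unsigned long ticks)
{
    /* Widest span: earliest possible start to latest possible end. */
    unsigned long rate_max = (post_end - pre_start) / ticks;
    /* Narrowest span: latest possible start to earliest possible end. */
    unsigned long rate_min = (pre_end - post_start) / ticks;

    if (pre_start == 0 || pre_end == 0 ||
        (rate_max - rate_min) > (rate_max >> 3))
        return 0;

    return rate_max;
}
```

With tight brackets (1000, 1010) and (101000, 101010) over 10 ticks the bounds are 9999 and 10001 and the result is accepted; if the first bracket is stretched to (1000, 20000) the lower bound drops to 8100 and the result is rejected.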


Shouldn't calibrate_delay_tsc() be __devinit?  (That may generate warnings
from reference_discarded.pl, but they're false positives)


From a maintainability POV it's not good that x86 is no longer using the
generic calibrate_delay() code.  Can you rework the code so that all
architectures must implement arch_calibrate_delay(), then provide stubs for
all except x86?  After all, other architectures/platforms may have the same
problem.
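
A rough user-space sketch of that arrangement, under the assumption that a hook returning 0 means "no answer, use the generic code".  Every name and value below is hypothetical; only arch_calibrate_delay() is named above, and its real signature is not given there.

```c
/* Hypothetical hook type: returns loops_per_jiffy, or 0 for "no answer". */
typedef unsigned long (*calibrate_fn)(void);

/* Stub that an architecture without a better method would provide. */
unsigned long stub_arch_calibrate_delay(void)
{
    return 0;
}

/* Stand-in for an x86-style hook that trusts its own measurement. */
unsigned long demo_tsc_calibrate_delay(void)
{
    return 5000000UL;   /* made-up value */
}

/* Stand-in for the generic loop-based calibration. */
unsigned long demo_generic_calibrate_delay(void)
{
    return 4000000UL;   /* made-up value */
}

/* Ask the architecture first and fall back to the generic code. */
unsigned long pick_loops_per_jiffy(calibrate_fn arch_hook, calibrate_fn generic)
{
    unsigned long lpj = arch_hook();

    return lpj ? lpj : generic();
}
```

The generic path stays in use everywhere the hook is a stub, so x86 would only diverge when its hook actually produces a value.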


Re: [RFC] shared subtrees

2005-01-28 Thread raven

On Fri, 28 Jan 2005, Mike Waychison wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Al Viro wrote:
>> OK, here comes the first draft of proposed semantics for subtree
>> sharing.  What we want is being able to propagate events between
>> the parts of mount trees.  Below is a description of what I think
>> might be a workable semantics; it does *NOT* describe the data
>> structures I would consider final and there are considerable
>> areas where we still need to figure out the right behaviour.
>
> Okay, I'm not convinced that shared subtrees as proposed will work well
> with autofs.

OK. I've read the thread but haven't digested it so you'll have to put up
with some stupid questions.

> The idea discussed off-line was this:
>
> When you install an autofs mountpoint, on say /home, a daemon is started
> to service the requests.  As far as the admin is concerned, an fs is
> mounted in the current namespace, call it namespaceA.  The daemon
> actually runs in its own private namespace: call it namespaceB.
> namespaceB receives a new autofs filesystem: call it autofsB.  autofsB
> is in its own p-node.  namespaceA gets an autofsA on /home as well, and
> autofsA is 'owned' by autofsB's p-node.
>
> So:
>
> autofsB -> autofsB
> and
> autofsB -> autofsA
>
> Effectively, namespaceA has a private instance of autofsB in its tree.
>
> The problem is this:
>
> Assume /home/mikew is accessed in namespaceA.  The daemon running in
> namespaceB gets the event, and mounts an nfs vfsmount on autofsB.  This
> event is propagated back to autofsA.

Which condition (or action) in the definition implies
autofsB -> autofsA

> (Problem 1: how do you block access to /home/mikew in namespaceA?)
>
> Next, a CLONE_NS is done in namespaceA, creating namespaceA'.  the
> homedir on /home/mikew is also copied.
>
> Now, in namespaceA', what happens when a user umount's /home/mikew?  We
> haven't yet determined how to handle umount event propagation, but it
> appears likely that it will be *a hard thing to do*.

No, I haven't spent enough time on the RFC to buy into this one.
So I'll just say it looks like something is missing in this argument.
Perhaps the latter is namespaceC?

> Assuming the nfs umount succeeds, /home/mikew is accessed again in
> namespaceA'.

namespaceC?

> (Problem 2: The daemon in namespaceB will see the event, but it already
> has something mounted on its version of /home/mikew.  How does it
> 'send' a mountpoint to namespaceB.)
>
> - ---
>
> Shared subtrees may help in some administrative situations, but don't
> look like the right solution for autofs.
>
> Autofs will work with namespaces if the following functionality is added
> to the kernel:  The ability to perform mount(2) operations on a
> directory fd.
>
> This has been discussed before and quickly vetoed, citing that it is a
> security risk.  I still fail to understand how allowing a mount to
> happen cross-namespace given a dirfd target is any worse than what is
> already possible given a dirfd.  If you don't want someone to play with
> your namespace, don't give them a dirfd.
>
> Thoughts?
>
> - --
> Mike Waychison
> Sun Microsystems, Inc.
> 1 (650) 352-5299 voice
> 1 (416) 202-8336 voice
>
> ~~
> NOTICE:  The opinions expressed in this email are held by me,
> and may not represent the views of Sun Microsystems, Inc.
> ~~
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.2.5 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
>
> iD8DBQFB+r1OdQs4kOxk3/MRAmSpAJ96ix25fjze6o7viCq2DCET9J/AlQCfYlC1
> CoLKusJXjL+fYxgwggOCW+w=
> =8bTv
> -END PGP SIGNATURE-

Kernel oops on integrating a module with obj-y option

2005-01-28 Thread selvakumar nagendran

Hello everyone,

I am using Fedora Core 1 and doing my project on Linux kernel 2.4.28.
In my project I intercept system calls, and I do all of this from a
module. I installed the module alongside the main kernel and it worked
fine when I loaded it with 'modprobe'.

Then I changed obj-m to obj-y for my module and compiled its object file
together with the core kernel files such as fs.o, net.o and kernel.o, so
the target kernel binary contains my module. When I boot this kernel, it
sometimes oopses, and sometimes it prompts for a disk check and mounts
the file system read-only.

To integrate my module, I created a new subdirectory named 'rsched'
under the kernel source directory and wrote my own makefile for it. The
makefile contains the following lines:

  obj-y := rsched.o    (previously obj-m := rsched.o)
  include $(TOPDIR)/Rules.make

Then I changed the following lines in the top-level makefile:

  SUBDIRS := fs net kernel rsched
  CORE_FILES := kernel/kernel.o fs/fs.o rsched/rsched.o

How can I rectify this error so that I can integrate my module with the
main kernel image?

Thanks in advance and regards,
selva





Re: Possible bug in keyboard.c (2.6.10)

2005-01-28 Thread Al Viro

On Fri, Jan 28, 2005 at 11:59:37AM +0100, Vojtech Pavlik wrote:
> I'm very sorry about the locking, but the thing grew up in times of
> kernel 2.0, which didn't require any locking. There are a few possible

Incorrect.  You have blocking allocations in critical areas and they
required locking all the way back.

> races with device registration/unregistration, and it's on my list to
> fix that, however under normal operation there shouldn't be any need for
> locks, as there are no complex structures built that'd become
> inconsistent. 

Um-hm...  Vojtech, meet USB mouse; USB mouse, meet Vojtech.  Now watch
a disconnect and reconnect happening when luser suddenly gets overexcited
and jerks the wrong hand a bit too hard while browsing the most profitable
sort of website...

> If you find scenarios which will lead to trouble in the event delivery
> system, please tell me, and I'll try to fix that as soon as possible.

See above.  Devices appearing and disappearing *are* normal.  

Re: [PATCH] Restrict procfs permissions

2005-01-28 Thread Al Viro

On Sat, Jan 29, 2005 at 03:45:42AM +0100, Rene Scharfe wrote:
> The patch is inspired by the /proc restriction parts of the GrSecurity
> patch.  The main difference is the ability to configure the restrictions
> dynamically.  You can change the umask setting by running
> 
># mount -o remount,umask=007 /proc
> 
> Testing has been *very* light so far -- it compiles and boots.  Patch is
> against 2.6.11-rc2-bk6.
> 
> Comments are very welcome.

It leaves already existing inodes with whatever mode they used to have.
_IF_ you want to do that sort of thing, do it right - add ->permission()
that would apply that umask before checks and if you want it to be seen
in results of stat(2) - add ->getattr() and do the same there.
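
A user-space sketch of the difference, with hypothetical names throughout (this is not the VFS ->permission() interface): the stored mode is left alone and the mask is applied at check time, so a remount that changes the mask immediately covers inodes that already exist.

```c
#include <stdbool.h>

/* Mount-wide mask; changing it affects every later check. */
unsigned int demo_proc_umask = 0;

/*
 * `mode` holds the stored permission bits (for example 0555), `want` is
 * a 3-bit rwx request (4 = read, 2 = write, 1 = execute) and `shift`
 * selects the owner (6), group (3) or other (0) triplet.
 */
bool demo_permission(unsigned int mode, unsigned int want, unsigned int shift)
{
    unsigned int effective = mode & ~demo_proc_umask;

    return ((effective >> shift) & want) == want;
}
```

With a stored mode of 0555, "other" can read while the mask is 0; after setting the mask to 077 the same inode refuses "other" and still allows the owner, without its stored mode being touched.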

Re: [PATCH 1/2] pci: Arch hook to determine config space size

2005-01-28 Thread Greg KH

On Fri, Jan 28, 2005 at 06:52:34PM +0000, Christoph Hellwig wrote:
> > +int __attribute__ ((weak)) pcibios_exp_cfg_space(struct pci_dev *dev) { 
> > return 1; }
> 
>  - prototypes belong to headers
>  - weak linkage is the perfect way for total obfuscation
> 
> please make this a regular arch hook

I agree.  Also, when sending PCI related patches, please cc the
linux-pci mailing list.

thanks,

greg k-h

Re: compat ioctl for submiting URB

2005-01-28 Thread Gianni Tedesco

On Fri, 2005-01-28 at 16:23 -0500, Christopher Li wrote:
> +#ifdef CONFIG_IA32_EMULATION
> +
> +   case USBDEVFS_SUBMITURB32:
> +   snoop(&dev->dev, "%s: SUBMITURB32\n", __FUNCTION__);
> +   ret = proc_submiturb_compat(ps, p);
> +   if (ret >= 0)
> +   inode->i_mtime = CURRENT_TIME;
> +   break;
> +#endif

Why don't other 64bit architectures need this chunk?

-- 
// Gianni Tedesco (gianni at scaramanga dot co dot uk)
lynx --source www.scaramanga.co.uk/scaramanga.asc | gpg --import


Disabling IRQ #xx, because nobody cared!

2005-01-28 Thread William Park

I'm running 2.6.10 SMP.  I usually use APM, but I decided to try ACPI.
On my machine, USB (integrated) and Audio (PCI card) share an IRQ:

           CPU0       CPU1
  0:   19281733   19952671    IO-APIC-edge  timer
  1:      51751      53105    IO-APIC-edge  i8042
  4:    1613559    1503569    IO-APIC-edge  serial
  7:          0          0    IO-APIC-edge  parport0
  8:          2          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 11:     149496     150504   IO-APIC-level  uhci_hcd, uhci_hcd
 12:      54518      50376    IO-APIC-edge  i8042
 14:      63398      63535    IO-APIC-edge  ide0
 15:          1          1    IO-APIC-edge  ide1
169:      11440      11565   IO-APIC-level  ide2
177:     456415     456480   IO-APIC-level  eth0
185:      50307      49693   IO-APIC-level  Ensoniq AudioPCI
NMI:          0          0
LOC:   39235997   39236069
ERR:          1
MIS:          0

After a while, I get

irq 185: nobody cared!
 [] __report_bad_irq+0x22/0x90
 [] note_interrupt+0x58/0x90
 [] __do_IRQ+0x128/0x130
 [] do_IRQ+0x1a/0x30
 [] common_interrupt+0x1a/0x20
 [] default_idle+0x0/0x40
 [] default_idle+0x2a/0x40
 [] cpu_idle+0x40/0x70
handlers:
[] (snd_audiopci_interrupt+0x0/0xc0 [snd_ens1371])
Disabling IRQ #185

Then, after some more time, I get

irq 11: nobody cared!
 [] __report_bad_irq+0x22/0x90
 [] note_interrupt+0x58/0x90
 [] __do_IRQ+0x128/0x130
 [] do_IRQ+0x1a/0x30
 [] common_interrupt+0x1a/0x20
 [] default_idle+0x0/0x40
 [] default_idle+0x2a/0x40
 [] cpu_idle+0x40/0x70
 [] start_kernel+0x147/0x170
handlers:
[] (usb_hcd_irq+0x0/0x60)
[] (usb_hcd_irq+0x0/0x60)

At which point, USB is dead.

Do you know if 'acpi' is responsible for this?

-- 
William Park <[EMAIL PROTECTED]>, Toronto, Canada
Slackware Linux -- because I can type.

USB HID events and Microsoft wheel mouse

2005-01-28 Thread Jon Smirl

Something changed in the Linus BK kernel in the last few days so that
I get "drivers/usb/input/hid-input.c: event field not found" in dmesg
every time I move my MS Wheel mouse. Any ideas on how to get rid of
this?

The events are EV_MISC:
type 4 code 4 value 65585
type 4 code 4 value 65584
type 4 code 4 value 589825

-- 
Jon Smirl
[EMAIL PROTECTED]

HID warning messages fills the logs

2005-01-28 Thread Marcel Holtmann

Hi,

when running 2.6.11-rc2-bk6 with my USB HID v1.00 Mouse [Microsoft
Microsoft Wheel Mouse Optical®] the logs get filled with this message:

kernel: drivers/usb/input/hid-input.c: event field not found
last message repeated 459 times
last message repeated 1157 times

Regards

Marcel



Impossible to renice threaded NPTL programs on 2.6.10

2005-01-28 Thread Jan Knutar

For those times when a threaded program runs amok, and I still have
some hope that it will eventually stop being a pig, but would like to
actually use my computer in the meanwhile, the idea of renicing this
runaway program to nice 19 comes to mind.

Except, it doesn't actually work. Only the main thread seems to get
reniced, and the threads created with pthread_create seem to merrily
go on with their plundering of CPU cycles.

Test code at the end of the mail.

To reproduce this, I start the test program, and observe in top that it
is indeed consuming all CPU like it was intended to. Then I renice it
in top, to nice 19. The effect is that the '% ni' value in top stays the same, and
these hog threads are still consuming nearly all CPU and not sharing
with other nice 19 processes, indicating that they were not reniced
to 19.
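
As a possible workaround, here is a minimal sketch that renices each thread individually.  It assumes that on Linux with NPTL the nice value is a per-thread attribute, that setpriority(PRIO_PROCESS, ...) accepts a thread ID, and that thread IDs are listed under /proc/<pid>/task.  The function name renice_all_threads is made up for illustration.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Apply `nice_value` to every thread of process `pid` by walking
 * /proc/<pid>/task.  Returns the number of threads changed, or -1 if
 * the task directory cannot be opened.
 */
int renice_all_threads(pid_t pid, int nice_value)
{
    char path[64];
    struct dirent *entry;
    int changed = 0;
    DIR *dir;

    snprintf(path, sizeof(path), "/proc/%ld/task", (long)pid);
    dir = opendir(path);
    if (!dir)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(entry->d_name, &end, 10);

        if (*end != '\0' || tid <= 0)
            continue;   /* skip "." and ".." */
        if (setpriority(PRIO_PROCESS, (id_t)tid, nice_value) == 0)
            changed++;
    }
    closedir(dir);
    return changed;
}
```

Calling renice_all_threads(getpid(), 19) from inside the hog, or with the hog's PID from another process owned by the same user, should lower every thread.  Threads created after the walk are not covered unless they inherit the new value from a reniced creator.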

Tested on Fedora Core 2's  kernel 2.6.10-1.9 + procps 3.2.5 from sf.net
Tested on kernel 2.6.10-ck4 + procps 3.2.1

Here is the test case:

#include 
#include 
#include 
#include 
#define THREADS 10
void *hog(void*p);
int main(int argc, char** argv)
{
  pthread_t *threads;
  int i;
  threads = malloc(sizeof(pthread_t) * THREADS);
  for(i=0;i

slab BUG in FC devel kernel, x86-64

2005-01-28 Thread Bill Nottingham

The kernel in question is based on 2.6.11-rc2-bk4, FWIW.

Transcribed by hand. Happened when rsyncing data onto
a LVM-on-RAID1, sata_via controller. (root FS is on generic
VIA IDE).

slab: double free detected in cache 'size-128', objp 81000340bba8.
Kernel BUG at slab:2188
invalid operand:  [1]
CPU 0
Modules linked in: md5 ipv6 parport_pc lp parport sunrpc ipt_REJECT ipt_state
 ip-contrack iptable_filter ip_tables dm_mod video button battery ac raid1
 ohci1394 ieee1394 uhci_hcd ehci_hcd i2c_viapro i2c_core snd_via82xx
 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc
 gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore via_rhine
 mii floppy ext3 jbd sata_via libata sd_mod scsi_mod
Pid: 161, comm: kswapd0 Not tainted 2.6.10-1.1115_FC4
RIP: 0010:[] {free_block+208}
RSP: 0018:81003bde9cd8  EFLAGS: 00010092
RAX: 004a RBX: 81000340b000 RCX: 8042e010
RDX: 8042e010 RSI: 0001 RDI: 81003a82a7d0
RBP: 81003bfef640 R08: 8042e010 R09: 81001dafdd78
R10: 0001 R11: 8044bd20 R12: 81000340bba8
R13: 0013 R14: 81000340b028 R15: 0013
FS:  2aaba3a0() GS:8050d880() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 2aaac000 CR3: 0b36d000 CR4: 06e0
Process kswapd0 (pid: 161, threadinfo 81003bde8000, task 81003bd5d070)
Stack: 0001 81003bfe9698 00101a285a78 81003bfef640
   0010 81003bfe9688 81003bfe9698 
   0080 801690c1
Call Trace:{cache_flusharray+242} 
{kfree+156}
   {destroy_inode+41} {dispose_list+95}
   {shrink_icache_memory+993} 
{shrink_slab+188}
   {balance_pgdat+547} {kswapd+260}
   {autoremove_wake_function+0} 
{autoremove_wake_function+0}
   {autoremove_wake_function+0} 
{schedule_tail+11}
   {child_rip+8} {kswapd+0}
   {child_rip+0}

Code: 0f 0b 21 f1 36 80 ff ff ff ff 8c 08 0f b7 43 24 48 89 de 48
RIP {free_block+208} RSP 

Re: [RFC PATCH 2.4] ata_piix on ich6r in RAID mode

2005-01-28 Thread Martins Krikis


--- Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Martins Krikis wrote:
> > Without this patch, if the BIOS of an ICH6R box has IDE set to "RAID"
> > mode then ata_piix will not find any SATA disks because it incorrectly
> > tries the legacy mode. With the patch all 4 SATA drives become visible.
> > I don't think it would break any other vendor's SATA, but you can be
> > the judge of that. If so, perhaps we can restrict the test some more
> > by checking vendor/device IDs.
> 
> > --- linux-2.4.29/drivers/scsi/libata-core.c 2005-01-28 12:07:56.0 -0500
> > +++ linux-2.4.29-iswraid/drivers/scsi/libata-core.c 2005-01-28 12:14:43.0 -0500
> > @@ -3605,6 +3605,9 @@ int ata_pci_init_one (struct pci_dev *pd
> > legacy_mode = (1 << 3);
> > }
> >  
> > +   if ((pdev->class >> 8) == PCI_CLASS_STORAGE_RAID)
> > +   legacy_mode = 0;
> > +
> > /* FIXME... */
> > if ((!legacy_mode) && (n_ports > 1)) {
> > printk(KERN_ERR "ata: BUG: native mode, n_ports > 1\n");
> 
> 
> hmm.  Maybe "!= PCI_CLASS_STORAGE_IDE" instead?

Yes, that's much better. No need to even read the programming IF
byte unless the class code identifies it as an IDE controller.
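
A small sketch of that check, under stated assumptions: the 24-bit class value is laid out as base class, sub-class and programming interface; bits 0 and 2 of the programming interface report native mode for the primary and secondary channel.  The function name and the DEMO_ constant are hypothetical, and the real driver reads the programming interface byte from config space, which this sketch replaces with the low byte of the class value.

```c
#include <stdint.h>

#define DEMO_PCI_CLASS_STORAGE_IDE 0x0101u

/*
 * Return the legacy-mode flag (1 << 3) only for IDE-class devices whose
 * programming interface does not report both channels in native mode.
 * Any other class (RAID, for example) is never treated as legacy.
 */
int demo_legacy_mode(uint32_t class_code)
{
    uint8_t prog_if = class_code & 0xff;
    uint8_t native_mask = (1 << 2) | (1 << 0);

    if ((class_code >> 8) != DEMO_PCI_CLASS_STORAGE_IDE)
        return 0;
    if ((prog_if & native_mask) != native_mask)
        return 1 << 3;
    return 0;
}
```

A RAID-class value such as 0x010400 skips the programming interface test entirely, while 0x01018a (IDE, both channels legacy) still yields the legacy flag.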

> Overall, however, I am worried about your report of the driver's
> behavior based on that BIOS's configuration.  The driver follows the PCI
> IDE standard (previously SFF 8038i), where a register indicates whether
> it's in legacy or native mode.  As I see it, either
> a) the driver logic for reading that register is wrong, or
> b) the BIOS is incorrectly configuring the device, or
> c) that register is only applicable for PCI_CLASS_STORAGE_IDE devices.
> 
> Comments either way?

I'd say "c". I don't have the spec, but my PCI course-book
seems to imply so. I could send a new patch but I can't
verify it just yet---the board decided to stop booting...

  Martins





[Discuss][i386] Platform SMIs and their interferance with tsc based delay calibration

2005-01-28 Thread Venkatesh Pallipadi



Issue: 
Current tsc based delay_calibration can result in significant errors in the
loops_per_jiffy count when platform events like SMIs (System Management
Interrupts, which are non-maskable) are present. This could lead to a
kernel panic(). This issue is becoming more visible with the 2.6 kernel (as
the default HZ is 1000) and on platforms with higher SMI handling latencies.
During boot, SMIs are mostly used by the BIOS (for things like legacy
keyboard emulation).

Description:
The pseudocode for the current delay calibration with tsc based delay looks like
(0) Estimate a value for loops_per_jiffy
(1) While (loops_per_jiffy estimate is not accurate enough)
(2)   wait for jiffy transition (jiffy1)
(3)   Note down current tsc (tsc1)
(4)   loop until tsc becomes tsc1 + loops_per_jiffy
(5)   check whether jiffy changed since jiffy1 or not and refine
      loops_per_jiffy estimate

Consider the following cases
Case 1:
If SMIs happen between (2) and (3) above, we can end up with a loops_per_jiffy
value that is too low. This results in shortened delays, and the kernel can
panic() during boot (mostly at IOAPIC timer initialization in
timer_irq_works(), as we don't have enough timer interrupts in a specified
interval).

Case 2:
If SMIs happen between (3) and (4) above, then we can end up with a
loops_per_jiffy value that is too high. And with current i386 code, too high
an lpj value (greater than 17M) can result in an overflow in
delay.c:__const_udelay(), again resulting in a shorter delay and panic().
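
The 17M figure can be checked with a short sketch.  It assumes that the udelay path multiplies loops_per_jiffy by HZ/4 in 32-bit arithmetic; that is an assumption about the i386 delay code of this era, which the text above does not spell out.  The function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if lpj * (hz / 4) no longer fits in an unsigned 32-bit value. */
bool lpj_scale_overflows(uint32_t lpj, uint32_t hz)
{
    uint64_t scaled = (uint64_t)lpj * (hz / 4);

    return scaled > UINT32_MAX;
}
```

With HZ = 1000 the factor is 250, so the product overflows once lpj exceeds 4294967295 / 250, roughly 17.18 million.  With HZ = 100 the same lpj is nowhere near the limit, which fits the remark that the problem is more visible on 2.6.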


Solution:
The patch below makes the calibration routine aware of asynchronous events like
SMIs. We increase the delay calibration time and also identify any significant
errors (greater than 12.5%) in the calibration and report them to the user. I
would like to hear your comments on this.

Thanks,
Venki

Signed-off-by: Venkatesh Pallipadi <[EMAIL PROTECTED]>

--- linux-2.6.10/./arch/i386/kernel/timers/timer_tsc.c.org  2005-01-05 16:06:52.0 -0800
+++ linux-2.6.10/./arch/i386/kernel/timers/timer_tsc.c  2005-01-19 12:38:20.0 -0800
@@ -552,6 +552,7 @@ static struct timer_opts timer_tsc = {
.get_offset = get_offset_tsc,
.monotonic_clock = monotonic_clock_tsc,
.delay = delay_tsc,
+   .calibrate_delay = calibrate_delay_tsc,
 };
 
 struct init_timer_opts __initdata timer_tsc_init = {
--- linux-2.6.10/./arch/i386/kernel/timers/common.c.org 2005-01-11 17:51:28.0 -0800
+++ linux-2.6.10/./arch/i386/kernel/timers/common.c 2005-01-19 12:38:20.0 -0800
@@ -158,3 +158,49 @@ void __init init_cpu_khz(void)
}
}
 }
+
+unsigned long calibrate_delay_tsc(void)
+{
+        unsigned long pre_start, start, post_start;
+        unsigned long pre_end, end, post_end;
+        unsigned long start_jiffies;
+        unsigned long tsc_rate_min, tsc_rate_max;
+
+        if (!cpu_has_tsc)
+                return 0;
+
+#define DELAY_CALIBRATION_TICKS        ((HZ < 100) ? 1 : (HZ/100))
+        pre_start = 0;
+        rdtscl(start);
+        start_jiffies = jiffies;
+        while (jiffies <= (start_jiffies + 1)) {
+                pre_start = start;
+                rdtscl(start);
+        }
+        rdtscl(post_start);
+        pre_end = 0;
+        end = post_start;
+        while (jiffies <= (start_jiffies + 1 + DELAY_CALIBRATION_TICKS)) {
+                pre_end = end;
+                rdtscl(end);
+        }
+        rdtscl(post_end);
+
+        tsc_rate_max = (post_end - pre_start) / DELAY_CALIBRATION_TICKS;
+        tsc_rate_min = (pre_end - post_start) / DELAY_CALIBRATION_TICKS;
+
+        /*
+         * If the upper limit and lower limit of the tsc_rate is more than
+         * 12.5% apart.
+         */
+        if (pre_start == 0 || pre_end == 0 ||
+            (tsc_rate_max - tsc_rate_min) > (tsc_rate_max >> 3)) {
+                printk(KERN_WARNING "TSC calibration may not be precise. "
+                       "Too many SMIs? "
+                       "Consider running with \"lpj=\" boot option\n");
+                return 0;
+        }
+
+        return tsc_rate_max;
+}
+
--- linux-2.6.10/./arch/i386/kernel/timers/timer_hpet.c.org 2005-01-11 17:52:31.0 -0800
+++ linux-2.6.10/./arch/i386/kernel/timers/timer_hpet.c 2005-01-19 12:38:20.0 -0800
@@ -183,6 +183,7 @@ static struct timer_opts timer_hpet = {
.get_offset =   get_offset_hpet,
.monotonic_clock =  monotonic_clock_hpet,
.delay =delay_hpet,
+   .calibrate_delay =  calibrate_delay_tsc,
 };
 
 struct init_timer_opts __initdata timer_hpet_init = {
--- linux-2.6.10/./arch/i386/kernel/timers/timer_pm.c.org   2005-01-11 17:55:55.0 -0800
+++ linux-2.6.10/./arch/i386/kernel/timers/timer_pm.c   2005-01-19 12:38:20.0 -0800
@@ -246,6 +246,7 @@ static struct timer_opts timer_pmtmr = {
.get_offset = get_offset_pmtmr,
.monotonic_clock= monotonic_clock_pmtmr,
.delay  = delay_pmtmr,
+

Re: Patch 4/6 randomize the stack pointer

2005-01-28 Thread Rik van Riel

On Thu, 27 Jan 2005, John Richard Moser wrote:
> Arjan van de Ven wrote:
>
>>> Is this one any worse?
>> yes.
>> oracle, db2 and similar like to mmap 2Gb or more *in one chunk*.
> Special case?

Absolutely, but ...

> Can I get this put into perspective?  How much more important is "Good"
> randomization versus "not breaking Oracle," which becomes "No
> randomization"

1) quite a lot of Linux users do use Oracle, DB2 or do
   scientific calculations - distributions cannot afford
   to break those applications, the default has to work
   for everybody
2) "weaker" randomization (2MB) is still effective if the
   stack is non-executable, so the "load a bunch of NOPs"
   approach won't work - this is what Fedora and RHEL use
3) it is not as theoretically strong as what you propose,
   but having the "weaker" scheme enabled is definitely
   more secure than having the "stronger" scheme disabled
   because it breaks applications
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

[PATCH] Restrict procfs permissions

2005-01-28 Thread Rene Scharfe

Hi all,

this patch adds a umask option to the proc filesystem.  It can be used
to restrict the permission of users to view each others process
information.  E.g. on a multi-user shell server one could use a setting
of umask=077 to allow all users to view info about their own processes,
only.  It should prevent "command line snooping" and generally increases
privacy on the server.

Top and ps can cope with such restrictions, they simply are quiet about
files they cannot access.

The umask option affects permissions of the numerical directories in
/proc, only (the process info).  And root can see everything, of course,
even with a umask setting of 0777.  Default umask is 0, i.e. unchanged
permissions.
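
As a worked example of the arithmetic, here is a small sketch of the mode a per-process directory would end up with for a given umask.  The DEMO_ constants mirror the usual Linux values for the directory type bit and the read/execute-for-all bits; the function name is made up for illustration.

```c
#define DEMO_S_IFDIR 0040000u   /* directory type bit */
#define DEMO_S_IRUGO 0444u      /* read for user, group and other */
#define DEMO_S_IXUGO 0111u      /* execute (search) for user, group and other */

/* Mode of a /proc/<pid> directory after the mount-wide umask is applied. */
unsigned int proc_pid_dir_mode(unsigned int proc_umask)
{
    return (DEMO_S_IFDIR | DEMO_S_IRUGO | DEMO_S_IXUGO) & ~proc_umask;
}
```

With umask=077 the directory becomes dr-x------ (0500), so only the owner can look inside; with umask=007 it becomes dr-xr-x--- (0550); with the default of 0 it stays dr-xr-xr-x (0555).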

The patch is inspired by the /proc restriction parts of the GrSecurity
patch.  The main difference is the ability to configure the restrictions
dynamically.  You can change the umask setting by running

   # mount -o remount,umask=007 /proc

Testing has been *very* light so far -- it compiles and boots.  Patch is
against 2.6.11-rc2-bk6.

Comments are very welcome.

Thanks,
Rene


diff -rup linux-2.6.11-rc2-bk6/fs/proc/base.c l/fs/proc/base.c
--- linux-2.6.11-rc2-bk6/fs/proc/base.c 2005-01-28 23:42:44.0 +
+++ l/fs/proc/base.c 2005-01-28 23:58:38.0 +
@@ -1222,7 +1222,7 @@ static struct dentry *proc_pident_lookup
goto out;
 
ei = PROC_I(inode);
-   inode->i_mode = p->mode;
+   inode->i_mode = p->mode & ~proc_umask;
/*
 * Yes, it does not scale. And it should not. Don't add
 * new entries into /proc// without very good reasons.
@@ -1537,7 +1537,7 @@ struct dentry *proc_pid_lookup(struct in
put_task_struct(task);
goto out;
}
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+   inode->i_mode = (S_IFDIR|S_IRUGO|S_IXUGO) & ~proc_umask;
inode->i_op = &proc_tgid_base_inode_operations;
inode->i_fop = &proc_tgid_base_operations;
inode->i_nlink = 3;
@@ -1592,7 +1592,7 @@ static struct dentry *proc_task_lookup(s
 
if (!inode)
goto out_drop_task;
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
+   inode->i_mode = (S_IFDIR|S_IRUGO|S_IXUGO) & ~proc_umask;
inode->i_op = &proc_tid_base_inode_operations;
inode->i_fop = &proc_tid_base_operations;
inode->i_nlink = 3;
diff -rup linux-2.6.11-rc2-bk6/fs/proc/inode.c l/fs/proc/inode.c
--- linux-2.6.11-rc2-bk6/fs/proc/inode.c 2005-01-28 23:42:44.0 +
+++ l/fs/proc/inode.c   2005-01-28 23:56:11.0 +
@@ -22,6 +22,8 @@
 
 extern void free_proc_entry(struct proc_dir_entry *);
 
+umode_t proc_umask = 0;
+
 static inline struct proc_dir_entry * de_get(struct proc_dir_entry *de)
 {
if (de)
@@ -127,9 +129,14 @@ int __init proc_init_inodecache(void)
return 0;
 }
 
+static int parse_options(char *, uid_t *, gid_t *);
 static int proc_remount(struct super_block *sb, int *flags, char *data)
 {
+   uid_t dummy_uid;
+   gid_t dummy_gid;
+
*flags |= MS_NODIRATIME;
+   parse_options(data, &dummy_uid, &dummy_gid);
return 0;
 }
 
@@ -144,12 +151,13 @@ static struct super_operations proc_sops
 };
 
 enum {
-   Opt_uid, Opt_gid, Opt_err
+   Opt_uid, Opt_gid, Opt_umask, Opt_err
 };
 
 static match_table_t tokens = {
{Opt_uid, "uid=%u"},
{Opt_gid, "gid=%u"},
+   {Opt_umask, "umask=%o"},
{Opt_err, NULL}
 };
 
@@ -181,6 +189,11 @@ static int parse_options(char *options,u
return 0;
*gid = option;
break;
+   case Opt_umask:
+   if (match_octal(args, &option))
+   return 0;
+   proc_umask = option;
+   break;
default:
return 0;
}
diff -rup linux-2.6.11-rc2-bk6/fs/proc/internal.h l/fs/proc/internal.h
--- linux-2.6.11-rc2-bk6/fs/proc/internal.h 2005-01-28 23:42:44.0 +
+++ l/fs/proc/internal.h 2005-01-28 23:58:29.0 +
@@ -16,6 +16,8 @@ struct vmalloc_info {
unsigned long   largest_chunk;
 };
 
+extern umode_t proc_umask;
+
 #ifdef CONFIG_MMU
 #define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
 extern void get_vmalloc_info(struct vmalloc_info *vmi);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Dmitry Torokhov

On Friday 28 January 2005 19:11, Al Viro wrote:
> On Fri, Jan 28, 2005 at 06:23:26PM -0500, Dmitry Torokhov wrote:
> > On Friday 28 January 2005 17:30, Adam Belay wrote:
> > > Of course this patch is not going to be effective alone.  We also need
> > > to change the init order.  If a driver is registered early but isn't the
> > > best available, it will be bound to the device prematurely.  This would
> > > be a problem for cardbus (yenta) bridges.
> > > 
> > > I think we may have to load all in kernel drivers first, and then begin
> > > matching them to hardware.  Do you agree?  If so, I'd be happy to make a
> > > patch for that too.
> > > 
> > 
> > I disagree. The driver core should automatically unbind generic driver
> > from a device when native driver gets loaded. I think the only change is
> > that we can no longer skip devices that are bound to a driver and match
> > them all over again when a new driver is loaded.  
> 
> And what happens if we've already got the object busy?
> 

Mark it as dead and release the structures when the holder lets it go. With
hotplug pretty much everywhere, more and more systems can handle it. Plus one
could argue that if an object needs a special driver to function properly, it
is unlikely to be busy before the native driver is loaded.

Also, one can still do what Adam proposes by pre-loading native drivers in
cases where it is required, while still supporting a more flexible default scheme.

-- 
Dmitry

Re: 2.6.10 USB devices generate descriptor read error?

2005-01-28 Thread Parag Warudkar

Known one - It's non fatal. All your devices should work fine.  If you 
want you can try loading usbcore.ko with module parameter 
old_scheme_first=y and see if it goes away.
Parag
Jeff Wiegley wrote:

Is anybody else having a similar problem as the
following...
My USB keydrives used to work fine in 2.6.9.
Since I upgraded to 2.6.10 now they just
generate a device descriptor read error.
Specifically:
/var/log/kern.log.0:Jan 26 18:18:18 mail kernel: usb 4-2.1:
device descriptor read/64, error -32
Also I noticed that a new Sigmatel based USB IRDA
device also produces similar messages...
/var/log/kern.log:Jan 27 12:31:19 mail kernel: usb 2-2: device
descriptor read/64, error -71
Is this a known problem or is it just me?
I noticed that the precompiled debian 2.6.10 kernel
works with at least the usb flash drive ok.  But my
compiled version produces the above.
But I don't think I changed any relevant kernel config
items from 2.6.9 to 2.6.10 and I've compiled lots of
USB enabled kernels before so I'd like to think I'm
not an idiot but maybe I missed a new option or
something.
Please help,
- Jeff

2.6.10 USB devices generate descriptor read error?

2005-01-28 Thread Jeff Wiegley

Is anybody else having a similar problem as the
following...
My USB keydrives used to work fine in 2.6.9.
Since I upgraded to 2.6.10 now they just
generate a device descriptor read error.
Specifically:
/var/log/kern.log.0:Jan 26 18:18:18 mail kernel: usb 4-2.1:
device descriptor read/64, error -32
Also I noticed that a new Sigmatel based USB IRDA
device also produces similar messages...
/var/log/kern.log:Jan 27 12:31:19 mail kernel: usb 2-2: device
descriptor read/64, error -71
Is this a known problem or is it just me?
I noticed that the precompiled debian 2.6.10 kernel
works with at least the usb flash drive ok.  But my
compiled version produces the above.
But I don't think I changed any relevant kernel config
items from 2.6.9 to 2.6.10 and I've compiled lots of
USB enabled kernels before so I'd like to think I'm
not an idiot but maybe I missed a new option or
something.
Please help,
- Jeff

Re: Why does the kernel need a gig of VM?

2005-01-28 Thread Andy Isaacson

On Fri, Jan 28, 2005 at 03:06:15PM -0500, John Richard Moser wrote:
> Can someone give me a layout of what exactly is up there?  I got the
> basic idea
> 
> K 4G
> A 3G
> A 2G
> A 1G
> 
> App has 3G, kernel has 1G at the top of VM on x86 (dunno about x86_64).
> 
> So what's the layout of that top 1G?  What's it all used for?  Is there
> some obscene restriction of 1G of shared memory or something that gets
> mapped up there?

By default, the bottom 1G of physical memory is mapped into the 1G of
KVA.  (If you have less than 1G, it's all mapped.)  Thus, the TLB
remains valid across the user/kernel switch, which makes system calls
much faster.

The 4G/4G patches (google for the lwn.net overview) change this,
introducing a TLB flush on every syscall.  Better for some things
because you get more VA space, worse for most things because it's
slower.  (But it's "lots better for a few" versus "a little worse for
everybody", so the tradeoff is often worthwhile.) [1]

So the answer to your question is, "What's up there?  Memory.  All of it."
(Until you get to highmem.)

[1] The 4G/4G patch's *primary* goal is to increase the amount of KVA
available to allow more "struct page" entries without exhausting
lowmem.  Trying to manage 32GB or 64GB of physical memory with only
896MB of lowmem is very difficult.  It has the additional advantage
of allowing userland to mmap almost 4GB of stuff (as compared to
almost 3GB without 4G/4G) which can be a nice win for database-type
apps.

-andy

Re: [PATCH, 2.4] fix an oops in ata_to_sense_error

2005-01-28 Thread Jeff Garzik

Martins Krikis wrote:
> Jeff,
> This fixes an occasional oops in the libata-scsi code.

will apply, thanks.

Re: [PATCH, 2.4] fix an oops in ata_to_sense_error

2005-01-28 Thread Jeff Garzik

Martins Krikis wrote:
> Jeff,
> This fixes an occasional oops in the libata-scsi code.
>   Martins Krikis
> --- linux-2.4.29/drivers/scsi/libata-scsi.c 2005-01-28 12:07:56.0 -0500
> +++ linux-2.4.29-iswraid/drivers/scsi/libata-scsi.c 2005-01-28 12:14:43.0 -0500

BTW, don't forget your signed-off-by line when submitting emails...
http://linux.yyz.us/patch-format.html
Jeff

Re: [RFC PATCH 2.4] ata_piix on ich6r in RAID mode

2005-01-28 Thread Jeff Garzik

Martins Krikis wrote:
Without this patch, if the BIOS of an ICH6R box has IDE set to "RAID"
mode then ata_piix will not find any SATA disks because it incorrectly
tries the legacy mode. With the patch all 4 SATA drives become visible.
I don't think it would break any other vendor's SATA, but you can be
the judge of that. If so, perhaps we can restrict the test some more
by checking vendor/device IDs.

--- linux-2.4.29/drivers/scsi/libata-core.c	2005-01-28 12:07:56.0 -0500
+++ linux-2.4.29-iswraid/drivers/scsi/libata-core.c	2005-01-28 12:14:43.0 -0500
@@ -3605,6 +3605,9 @@ int ata_pci_init_one (struct pci_dev *pd
 			legacy_mode = (1 << 3);
 	}
 
+	if ((pdev->class >> 8) == PCI_CLASS_STORAGE_RAID)
+		legacy_mode = 0;
+
 	/* FIXME... */
 	if ((!legacy_mode) && (n_ports > 1)) {
 		printk(KERN_ERR "ata: BUG: native mode, n_ports > 1\n");

hmm.  Maybe "!= PCI_CLASS_STORAGE_IDE" instead?
Overall, however, I am worried about your report of the driver's 
behavior based on that BIOS's configuration.  The driver follows the PCI 
IDE standard (previously SFF 8038i), where a register indicates whether 
it's in legacy or native mode.  As I see it, either
a) the driver logic for reading that register is wrong, or
b) the BIOS is incorrectly configuring the device, or
c) that register is only applicable for PCI_CLASS_STORAGE_IDE devices.

Comments either way?
Jeff

[Patch] invalidate range of pages after direct IO write

2005-01-28 Thread Zach Brown


After a direct IO write only invalidate the pages that the write intersected.
invalidate_inode_pages2_range(mapping, pgoff start, pgoff end) is added and
called from generic_file_direct_IO().  This doesn't break some subtle agreement
with some other part of the code, does it?

While we're in there, invalidate_inode_pages2() was calling
unmap_mapping_range() with the wrong convention in the single page case.  It
was providing the byte offset of the final page rather than the length of the
hole being unmapped.  This is also fixed.

This was lightly tested with a 10k op fsx run with O_DIRECT on a 16MB file in
ext3 on a junky old IDE drive.  Totaling vmstat columns of blocks read and
written during the runs shows that read traffic drops significantly.  The run
time seems to have gone down a little.

Two runs before the patch gave the following user/real/sys times and total
blocks in and out:

0m28.029s 0m20.093s 0m3.166s 16673 125107 
0m27.949s 0m20.068s 0m3.227s 18426 126094

and after the patch:

0m26.775s 0m19.996s 0m3.060s 3505 124982
0m26.856s 0m19.935s 0m3.052s 3505 125279

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 include/linux/fs.h |2 ++
 mm/filemap.c   |5 -
 mm/truncate.c  |   52 ++--
 3 files changed, 44 insertions(+), 15 deletions(-)

Index: 2.6-mm-odirinv/include/linux/fs.h
===
--- 2.6-mm-odirinv.orig/include/linux/fs.h  2005-01-28 14:14:19.0 -0800
+++ 2.6-mm-odirinv/include/linux/fs.h   2005-01-28 14:14:35.0 -0800
@@ -1369,6 +1369,8 @@
invalidate_inode_pages(inode->i_mapping);
 }
 extern int invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2_range(struct address_space *mapping,
+pgoff_t start, pgoff_t end);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
Index: 2.6-mm-odirinv/mm/filemap.c
===
--- 2.6-mm-odirinv.orig/mm/filemap.c 2005-01-28 13:32:06.0 -0800
+++ 2.6-mm-odirinv/mm/filemap.c 2005-01-28 14:21:04.0 -0800
@@ -2325,7 +2325,10 @@
retval = mapping->a_ops->direct_IO(rw, iocb, iov,
offset, nr_segs);
if (rw == WRITE && mapping->nrpages) {
-   int err = invalidate_inode_pages2(mapping);
+   pgoff_t end = (offset + iov_length(iov, nr_segs) - 1)
+ >> PAGE_CACHE_SHIFT;
+   int err = invalidate_inode_pages2_range(mapping,
+   offset >> PAGE_CACHE_SHIFT, end);
if (err)
retval = err;
}
Index: 2.6-mm-odirinv/mm/truncate.c
===
--- 2.6-mm-odirinv.orig/mm/truncate.c   2005-01-28 13:32:06.0 -0800
+++ 2.6-mm-odirinv/mm/truncate.c 2005-01-28 17:03:09.783939857 -0800
@@ -99,7 +99,7 @@
 }
 
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
+ * truncate_inode_pages_range - truncate range of pages specified by start and
  * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
@@ -279,28 +279,38 @@
 EXPORT_SYMBOL(invalidate_inode_pages);
 
 /**
- * invalidate_inode_pages2 - remove all pages from an address_space
+ * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping - the address_space
+ * @start: the page offset 'from' which to invalidate
+ * @end: the page offset 'to' which to invalidate (inclusive)
  *
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
  * Returns -EIO if any pages could not be invalidated.
  */
-int invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
 {
struct pagevec pvec;
-   pgoff_t next = 0;
+   pgoff_t next;
int i;
int ret = 0;
-   int did_full_unmap = 0;
+   int did_range_unmap = 0;
 
pagevec_init(&pvec, 0);
-   while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+   next = start;
+   while (next <= end &&
+  !ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
int was_dirty;
 
+   if (page->index > end) {
+   next = page->index;
+   break;
+   }
+

[ANNOUNCE] "iswraid" (ICHxR ataraid sub-driver) for 2.4.29

2005-01-28 Thread Martins Krikis

Version 0.1.5 of the Intel Software RAID driver (iswraid) is now
available for the 2.4 series kernels at
http://prdownloads.sourceforge.net/iswraid/2.4.29-iswraid.patch.gz?download

It is an ataraid "subdriver" but uses the SCSI subsystem to find the
RAID member disks. It depends on the libata library, particularly on
either the ata_piix or the ahci driver, that enable the Serial ATA 
capabilities in ICH5/ICH6/ICH7 chipsets. More information is available
at the project's home page at http://iswraid.sourceforge.net/.

Driver documentation is included in Documentation/iswraid.txt,
which is part of the patch. The license is GPL.

The changes WRT version 0.1.4.3 are the following:
* Resource deallocation bug fixed for failed initializations.
* Read IO resubmission to mirror bug fixed.
* RAID1E (covers 4-disk RAID10) code added.
* More aggressive marking disks as bad in metadata.
* Claiming disks for RAID "feature" removed.
* Option defaults now customizable from the build configuration.
* iswraid_never_fail "feature" watered down into iswraid_resist_failing.
* iswraid_halt_degraded now prevents degraded volumes from being registered.
* Debug printouts more customizable.
* Some code cleanup and optimization.
* Documentation changes.

Please consider this driver for inclusion in the 2.4 kernel tree.

  Martins Krikis
  Storage Components Division
  Intel Massachusetts



P.S. I've CC-d directly to the potential reviewers suggested a few months ago
 by Marcelo. I'll appreciate any feedback you (and others) can provide.


Re: I need a hardware wizard... I have been beating my head on the wall..

2005-01-28 Thread David Sims

Hi Paulo!

  Your patch generated the following:

Jan 28 19:11:51 linux kernel: vsc_sata int status: 0083
Jan 28 19:11:51 linux last message repeated 19 times
Jan 28 19:11:51 linux kernel: irq 7: nobody cared!
Jan 28 19:11:51 linux kernel:  [] __report_bad_irq+0x22/0x90
Jan 28 19:11:51 linux kernel:  [] note_interrupt+0x58/0x90
Jan 28 19:11:51 linux kernel:  [] __do_IRQ+0xd8/0xe0
.
.
.
.


Thanks for helping me... I hope this is useful info

Dave Sims

On Fri, 28 Jan 2005, Paulo Marques wrote:

> David Sims wrote:
> > On Thu, 27 Jan 2005, Jeff Garzik wrote:
> >>David Sims wrote:
> >>
> >>>[...]
> >>>  You can insert the module in a running kernel and after barking as
> >>>follows (once for each disk attached) it runs just fine.
> >>
> >>Basically nobody has ever had hardware to test sata_vsc with that 
> >>hardware.  We should probably remove the PCI ID until an engineer can 
> >>fix it...
> > 
> > Hi again,
> > 
> >   I am willing to make this hardware available to any engineer that wants
> > to help me solve this problem and I will do whatever I can to make it
> > an easy job... Please help me...
> 
> Well, I don't consider myself a hardware wizard, but at least I'm an 
> engineer, so I decided to give it a go :)
> 
> It seems that the driver is not acknowledging the interrupt from the 
> controller. It would be nice to know what kind of interrupt is 
> triggering this.
> 
> Could you run the attached patch and show the output from dmesg?
> 
> -- 
> Paulo Marques - www.grupopie.com
> 
> All that is necessary for the triumph of evil is that good men do nothing.
> Edmund Burke (1729 - 1797)
> 


compat ioctl for submiting URB

2005-01-28 Thread Christopher Li

Hi,

The compat ioctl for submitting URBs from a 32-bit application on an
x86_64 system is missing. For those who need to refresh their memory,
please read the big comment after do_usbdevfs_bulk in
fs/compat_ioctl.c.

VMware is a big user of usbdevfs: we translate guest USB I/O to
usbdevfs by submitting URBs. On x86_64 systems we need those compat
ioctls for submitting URBs. For now we have a hack that submits them
through the vmmon driver, but that is very ugly.

I do want this problem to get fixed in the Linux kernel eventually.
I have been toying with two different ways to solve it. It seems
unavoidable to get our hands dirty in the usbdevfs internals.
The first is simply to teach usbdevfs about the 32-bit URB ioctls,
so that it does not need to keep a bounce buffer around.

The second idea is to have a bounce buffer, and to let the usbdevfs
internals know about this bounce buffer and free it when the async
structure is destroyed (except for reap).

I attach a patch that implements just the first approach. Any comments
are welcome.

Chris

Index: linux-2.5/include/linux/compat_ioctl.h
===
--- linux-2.5.orig/include/linux/compat_ioctl.h 2005-01-26 17:23:57.0 -0800
+++ linux-2.5/include/linux/compat_ioctl.h  2005-01-28 16:35:14.0 -0800
@@ -692,6 +692,7 @@
 COMPATIBLE_IOCTL(USBDEVFS_CONNECTINFO)
 COMPATIBLE_IOCTL(USBDEVFS_HUB_PORTINFO)
 COMPATIBLE_IOCTL(USBDEVFS_RESET)
+COMPATIBLE_IOCTL(USBDEVFS_SUBMITURB32)
 COMPATIBLE_IOCTL(USBDEVFS_CLEAR_HALT)
 /* MTD */
 COMPATIBLE_IOCTL(MEMGETINFO)
Index: linux-2.5/include/linux/usbdevice_fs.h
===
--- linux-2.5.orig/include/linux/usbdevice_fs.h 2005-01-25 12:08:02.0 -0800
+++ linux-2.5/include/linux/usbdevice_fs.h  2005-01-28 16:35:14.0 -0800
@@ -32,6 +32,7 @@
 #define _LINUX_USBDEVICE_FS_H
 
 #include 
+#include 
 
 /* - */
 
@@ -123,6 +124,22 @@
char port [127];/* e.g. port 3 connects to device 27 */
 };
 
+struct usbdevfs_urb32 {
+   unsigned char type;
+   unsigned char endpoint;
+   compat_int_t status;
+   compat_uint_t flags;
+   compat_caddr_t buffer;
+   compat_int_t buffer_length;
+   compat_int_t actual_length;
+   compat_int_t start_frame;
+   compat_int_t number_of_packets;
+   compat_int_t error_count;
+   compat_uint_t signr;
+   compat_caddr_t usercontext; /* unused */
+   struct usbdevfs_iso_packet_desc iso_frame_desc[0];
+};
+
 #define USBDEVFS_CONTROL   _IOWR('U', 0, struct usbdevfs_ctrltransfer)
 #define USBDEVFS_BULK  _IOWR('U', 2, struct usbdevfs_bulktransfer)
 #define USBDEVFS_RESETEP   _IOR('U', 3, unsigned int)
@@ -130,6 +147,7 @@
 #define USBDEVFS_SETCONFIGURATION  _IOR('U', 5, unsigned int)
 #define USBDEVFS_GETDRIVER _IOW('U', 8, struct usbdevfs_getdriver)
 #define USBDEVFS_SUBMITURB _IOR('U', 10, struct usbdevfs_urb)
+#define USBDEVFS_SUBMITURB32   _IOR('U', 10, struct usbdevfs_urb32)
 #define USBDEVFS_DISCARDURB _IO('U', 11)
 #define USBDEVFS_REAPURB   _IOW('U', 12, void *)
 #define USBDEVFS_REAPURBNDELAY _IOW('U', 13, void *)
@@ -143,5 +161,4 @@
 #define USBDEVFS_CLEAR_HALT _IOR('U', 21, unsigned int)
 #define USBDEVFS_DISCONNECT _IO('U', 22)
 #define USBDEVFS_CONNECT   _IO('U', 23)
-
 #endif /* _LINUX_USBDEVICE_FS_H */
Index: linux-2.5/include/linux/usb.h
===
--- linux-2.5.orig/include/linux/usb.h  2005-01-25 12:07:54.0 -0800
+++ linux-2.5/include/linux/usb.h   2005-01-28 16:35:14.0 -0800
@@ -608,6 +608,7 @@
 #define URB_NO_FSBR 0x0020  /* UHCI-specific */
 #define URB_ZERO_PACKET 0x0040  /* Finish bulk OUTs with short packet */
 #define URB_NO_INTERRUPT   0x0080  /* HINT: no non-error interrupt needed */
+#define URB_COMPAT 0x0100  /* compat mode */
 
 struct usb_iso_packet_descriptor {
unsigned int offset;
Index: linux-2.5/fs/compat_ioctl.c
===
--- linux-2.5.orig/fs/compat_ioctl.c 2005-01-25 12:08:12.0 -0800
+++ linux-2.5/fs/compat_ioctl.c 2005-01-28 16:35:14.0 -0800
@@ -2570,228 +2570,19 @@
 return sys_ioctl(fd, USBDEVFS_BULK, (unsigned long)p);
 }
 
-/* This needs more work before we can enable it.  Unfortunately
- * because of the fancy asynchronous way URB status/error is written
- * back to userspace, we'll need to fiddle with USB devio internals
- * and/or reimplement entirely the frontend of it ourselves. -DaveM
- *
- * The issue is:
- *
- * When an URB is submitted via usbdevicefs it is put onto an
- * asynchronous queue.  When the URB completes, it may be reaped
- * via another ioctl.  During this

[patch 1/1] fix syscallN() macro errno value checking for i386

2005-01-28 Thread blaisorblade


From: Paolo 'Blaisorblade' Giarrusso <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>

The errno values visible to userspace are actually in the range
-1 to -129, not just down to -128 (): this value was added:

#define EKEYREJECTED 129 /* Key was rejected by service */

And this would break uClibc (from what I heard).

This is just a quick fix; putting a macro inside errno.h, instead of
having it copied in two places, would probably be nicer.

However, I've heard from D. Howells that this wasn't accepted, so this is
the solution for now.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <[EMAIL PROTECTED]>
---

 linux-2.6.11-paolo/include/asm-i386/unistd.h |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff -puN include/asm-i386/unistd.h~fix-syscall-macro include/asm-i386/unistd.h
--- linux-2.6.11/include/asm-i386/unistd.h~fix-syscall-macro 2005-01-29 00:42:48.0 +0100
+++ linux-2.6.11-paolo/include/asm-i386/unistd.h 2005-01-29 00:44:51.0 +0100
@@ -298,12 +298,12 @@
 #define NR_syscalls 289
 
 /*
- * user-visible error numbers are in the range -1 - -128: see
- * 
+ * user-visible error numbers are in the range -1 - -129: see
+ *  (currently it includes )
  */
 #define __syscall_return(type, res) \
 do { \
-   if ((unsigned long)(res) >= (unsigned long)(-(128 + 1))) { \
+   if ((unsigned long)(res) >= (unsigned long)(-(129 + 1))) { \
errno = -(res); \
res = -1; \
} \
_

Re: 2.6.11-r1 freezes dual 2.5 GHz PowerMac G5

2005-01-28 Thread Maurice Volaski

The patch below works. Thanks.
Maurice Volaski writes:
 > I am running Gentoo with a fresh 2.6.11-r1. I have all the kernel
 > debugging options turned on. Occasionally, I can get past the boot
 > process, but half the time it freezes somewhere along the way. If
 > not, I do get to boot, it doesn't take very long for it to freeze.
Did 2.6.10 work Ok? Try the patch below, it fixes 2.6.11-rc1 boot
lockups on both my Beige G3 (locks up in ADB driver) and my G4 eMac
(locks up in radeonfb).
--- linux-2.6.11-rc1/init/main.c.~1~ 2005-01-15 03:30:25.0 +0100
+++ linux-2.6.11-rc1/init/main.c 2005-01-15 03:31:44.0 +0100
@@ -377,7 +377,7 @@ static void noinline rest_init(void)
 * Re-enable preemption but disable interrupts to make sure
 * we dont get preempted until we schedule() in cpu_idle().
 */
-   local_irq_disable();
+// local_irq_disable();
preempt_enable_no_resched();
unlock_kernel();
cpu_idle();

--
Maurice Volaski, [EMAIL PROTECTED]
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-01-28 Thread Lee Revell

On Fri, 2005-01-28 at 10:11 +0100, Ingo Molnar wrote:
> * Jack O'Quin <[EMAIL PROTECTED]> wrote:
> 
> > > thus after a couple of years we'd end up with lots of desktop apps
> > > running as SCHED_FIFO, and latency would go down the drain again.
> > 
> > I wonder how Mac OS X and Windows deal with this priority escalation
> > problem?  Is it real or only theoretical?
> 
> no idea. Anyone with MacOSX/Windows application writing experience? :-|
> 

Here's the description from Apple.
(from
http://developer.apple.com/documentation/Darwin/Conceptual/KernelProgramming/scheduler/chapter_8_section_4.html):

However, according to Stéphane Letz, who ported JACK to OSX, this does NOT
describe the reality of the current implementation - it's not a real
deadline scheduler.  "period" and "constraint" are ignored, RT tasks are
scheduled round robin, and the scheduler just uses "computation" as the
timeslice.  If an RT task repeatedly uses its entire timeslice without
blocking, the scheduler can demote the task to SCHED_NORMAL.

Audio apps do not normally set these parameters directly, the CoreAudio
backend handles it.

(quoting Stéphane Letz)

> For example in CoreAudio, the computation value is directly related 
> to the audio buffer size in the following way:
> 
> buffer size   computation
> 
> 64 frames     500 us
> 128           300 us
> >= 256        100 us
> 
> The idea is that threads with smaller buffer size will get a larger  
> computation slice so that there is a chance they can complete their  
> jobs. Threads with larger buffer size are more interruptible. The  
> CoreMidi thread (to handle incoming Midi events) also has a 
> computation value of 500 us.

> Other RT threads like Firewire and various system threads computation  
> value are also carefully chosen.

(This was from a private mail thread that led to Con's SCHED_ISO patches.
If all the participants agree, I will post a link to the full thread, because
it answers many questions that are sure to come up on LKML.)

So this system *requires* an app to tell the kernel in advance what its
RT constraints are, then revokes isochronous scheduling privileges if
the task lied.  This would require a new API.  Furthermore I suspect
that these "System" threads aren't subject to having their RT privileges
revoked, and that the GUI gets special treatment, etc.

The upshot is while the OSX system works in that environment, it's
largely due to Apple controlling the kernel and a lot of userspace.  OSX
is useful as a model of what a good API for soft realtime support in a
desktop OS would look like.  But we are a general purpose OS so we
certainly need a more general solution.

Lee


Re: PNP and bus association

2005-01-28 Thread Pierre Ossman

Adam Belay wrote:
Hi Pierre,
The platform bus does not show the actual physical relationship either.  For
x86, ACPI is typically needed to determine this. It would be easy to bind to
spawn pnp devices off of an ISA bridge device, attached to the pci bus, but
whether it's the actual physical parent would be very difficult to determine
without firmware assistance.
At the moment the pnp bus is only showing a logical bus relationship.  If we
were to use ACPI to aid in the generation of the physical device tree, we
could put these devices in the correct physical location.
 

So it is correct behaviour that the device shows up under /sys/bus/pnp 
when found using PNP, and /sys/bus/platform when scanned for?
I'm trying to get it to work well with HAL and it would be nice if it 
could be found in a consistent way.

Rgds
Pierre

2.6.10 ACPI on dell inspiron 8100

2005-01-28 Thread Wakko Warner

I noticed something strange with ACPI and the battery:
/proc/acpi/battery/BAT1$ cat info 
present:                 yes
design capacity:         57420 mWh
last full capacity:      57420 mWh
battery technology:      rechargeable
design voltage:          14800 mV
design capacity warning: 3000 mWh
design capacity low:     1000 mWh
capacity granularity 1:  200 mWh
capacity granularity 2:  200 mWh
model number:            LIP8084DLP
serial number:           20495
battery type:            LION
OEM info:                Sony Corp.
/proc/acpi/battery/BAT1$ cat state 
present:                 yes
capacity state:          ok
charging state:          charging
present rate:            unknown
remaining capacity:      59040 mWh
present voltage:         16716 mV
/proc/acpi/battery/BAT1$

Is my laptop messed up or is ACPI not seeing proper values?  How can I have
59040 mWh remaining capacity when the full capacity is 57420 mWh?  Also, the
system didn't display the charging light, so I know it's not charging.

I yanked the battery and I saw this:
/proc/acpi/battery/BAT1$ cat state
present:                 yes
capacity state:          ok
charging state:          charged
present rate:            unknown
remaining capacity:      0 mWh
present voltage:         0 mV
/proc/acpi/battery/BAT1$

Under BAT0, info and state show one line, present: no

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals

driver model: fix u32 vs. pm_message_t in OSS

2005-01-28 Thread Pavel Machek

Hi!

This fixes u32 vs. pm_message_t in OSS. [I tried to go through alsa
developers, but Takashi told me they do not have control over
sound/oss.] No real code changes, please apply,

Pavel
(all bugs are mine :-).


From: Bernard Blackham <[EMAIL PROTECTED]>
Signed-off-by: Pavel Machek <[EMAIL PROTECTED]>

--- clean/sound/oss/ali5455.c   2005-01-22 02:48:45.0 +0100
+++ linux/sound/oss/ali5455.c   2005-01-28 19:18:10.0 +0100
@@ -3528,7 +3528,7 @@
 }
 
 #ifdef CONFIG_PM
-static int ali_pm_suspend(struct pci_dev *dev, u32 pm_state)
+static int ali_pm_suspend(struct pci_dev *dev, pm_message_t pm_state)
 {
struct ali_card *card = pci_get_drvdata(dev);
struct ali_state *state;
--- clean/sound/oss/cs4281/cs4281_wrapper-24.c  2005-01-22 02:47:48.0 +0100
+++ linux/sound/oss/cs4281/cs4281_wrapper-24.c  2005-01-28 19:18:10.0 +0100
@@ -27,7 +27,7 @@
 #include 
 
 static int cs4281_resume_null(struct pci_dev *pcidev) { return 0; }
-static int cs4281_suspend_null(struct pci_dev *pcidev, u32 state) { return 0; }
+static int cs4281_suspend_null(struct pci_dev *pcidev, pm_message_t state) { return 0; }
 
 #define free_dmabuf(state, dmabuf) \
pci_free_consistent(state->pcidev, \
--- clean/sound/oss/cs46xx.c   2005-01-22 02:49:21.0 +0100
+++ linux/sound/oss/cs46xx.c   2005-01-28 19:18:10.0 +0100
@@ -388,7 +388,7 @@
 static int cs46xx_powerup(struct cs_card *card, unsigned int type);
 static int cs461x_powerdown(struct cs_card *card, unsigned int type, int suspendflag);
 static void cs461x_clear_serial_FIFOs(struct cs_card *card, int type);
-static int cs46xx_suspend_tbl(struct pci_dev *pcidev, u32 state);
+static int cs46xx_suspend_tbl(struct pci_dev *pcidev, pm_message_t state);
 static int cs46xx_resume_tbl(struct pci_dev *pcidev);
 
 #ifndef CS46XX_ACPI_SUPPORT
@@ -5774,7 +5774,7 @@
 #endif
 
 #if CS46XX_ACPI_SUPPORT
-static int cs46xx_suspend_tbl(struct pci_dev *pcidev, u32 state)
+static int cs46xx_suspend_tbl(struct pci_dev *pcidev, pm_message_t state)
 {
struct cs_card *s = PCI_GET_DRIVER_DATA(pcidev);
CS_DBGOUT(CS_PM | CS_FUNCTION, 2, 
--- clean/sound/oss/cs46xxpm-24.h   2005-01-22 02:48:58.0 +0100
+++ linux/sound/oss/cs46xxpm-24.h   2005-01-28 19:18:10.0 +0100
@@ -36,7 +36,7 @@
 * for now (12/22/00) only enable the pm_register PM support.
 * allow these table entries to be null.
 */
-static int cs46xx_suspend_tbl(struct pci_dev *pcidev, u32 state);
+static int cs46xx_suspend_tbl(struct pci_dev *pcidev, pm_message_t state);
 static int cs46xx_resume_tbl(struct pci_dev *pcidev);
 #define cs_pm_register(a, b, c)  NULL
 #define cs_pm_unregister_all(a) 
--- clean/sound/oss/esssolo1.c  2005-01-22 02:47:15.0 +0100
+++ linux/sound/oss/esssolo1.c  2005-01-28 19:18:10.0 +0100
@@ -2257,7 +2257,7 @@
 }
 
 static int
-solo1_suspend(struct pci_dev *pci_dev, u32 state) {
+solo1_suspend(struct pci_dev *pci_dev, pm_message_t state) {
struct solo1_state *s = (struct solo1_state*)pci_get_drvdata(pci_dev);
if (!s)
return 1;
--- clean/sound/oss/i810_audio.c   2005-01-22 02:48:35.0 +0100
+++ linux/sound/oss/i810_audio.c   2005-01-28 19:18:10.0 +0100
@@ -3457,7 +3457,7 @@
 }
 
 #ifdef CONFIG_PM
-static int i810_pm_suspend(struct pci_dev *dev, u32 pm_state)
+static int i810_pm_suspend(struct pci_dev *dev, pm_message_t pm_state)
 {
 struct i810_card *card = pci_get_drvdata(dev);
 struct i810_state *state;
--- clean/sound/oss/maestro3.c  2005-01-22 02:48:48.0 +0100
+++ linux/sound/oss/maestro3.c  2005-01-28 19:18:10.0 +0100
@@ -375,7 +375,7 @@
  * I'm not very good at laying out functions in a file :)
  */
 static int m3_notifier(struct notifier_block *nb, unsigned long event, void *buf);
-static int m3_suspend(struct pci_dev *pci_dev, u32 state);
+static int m3_suspend(struct pci_dev *pci_dev, pm_message_t state);
 static void check_suspend(struct m3_card *card);
 
 static struct notifier_block m3_reboot_nb = {
@@ -2777,12 +2777,12 @@
 
 for(card = devs; card != NULL; card = card->next) {
 if(!card->in_suspend)
-m3_suspend(card->pcidev, 3); /* XXX legal? */
+m3_suspend(card->pcidev, PMSG_SUSPEND); /* XXX legal? */
 }
 return 0;
 }
 
-static int m3_suspend(struct pci_dev *pci_dev, u32 state)
+static int m3_suspend(struct pci_dev *pci_dev, pm_message_t state)
 {
 unsigned long flags;
 int i;
--- clean/sound/oss/trident.c   2005-01-22 02:48:35.0 +0100
+++ linux/sound/oss/trident.c   2005-01-28 19:18:10.0 +0100
@@ -487,7 +487,7 @@
 static struct trident_channel *ali_alloc_pcm_channel(struct trident_card *card);
 static void ali_restore_regs(struct trident_card *card);
 static void ali_save_regs(struct trident_card *card);
-static int trident_suspend(struct pci_dev *dev, u32 unused);
+static

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Al Viro

On Fri, Jan 28, 2005 at 06:23:26PM -0500, Dmitry Torokhov wrote:
> On Friday 28 January 2005 17:30, Adam Belay wrote:
> > Of course this patch is not going to be effective alone.  We also need
> > to change the init order.  If a driver is registered early but isn't the
> > best available, it will be bound to the device prematurely.  This would
> > be a problem for cardbus (yenta) bridges.
> > 
> > I think we may have to load all in kernel drivers first, and then begin
> > matching them to hardware.  Do you agree?  If so, I'd be happy to make a
> > patch for that too.
> > 
> 
> I disagree. The driver core should automatically unbind generic driver
> from a device when native driver gets loaded. I think the only change is
> that we can no longer skip devices that are bound to a driver and match
> them all over again when a new driver is loaded.  

And what happens if we've already got the object busy?

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Adam Belay

On Fri, 2005-01-28 at 18:51 -0500, Dmitry Torokhov wrote:
> If generic driver binds to a device that it has no idea how to drive
> _at all_ then I will argue that the generic driver is broken. It should
> not bind to begin with.
> 

In the case of pci bridges, sometimes we can't really tell if we can
drive the hardware entirely.  It's a classcode match.  Generic drivers
may support a portion of hardware in a limited fashion.  It's not that
they have no idea what they're doing with the hardware.  It's more a
matter of not always doing the best or most complete thing.  For some
hardware this may work fine.  Because we don't support generic drivers
in the current driver model, we haven't had a chance to see how well
they would work, or where they could be used.

Also, consider this.  If the pci bridge driver binds to yenta, it will
(in theory, it also might explode) enumerate all of the cardbus devices.
If then later, it is discovered that there is a better driver for the
bridge, all of the bridge's children will have to be torn down.  Their
drivers will be released, and the devices removed.  This might increase
the odds of something going wrong.

Thanks,
Adam


Re: [PATCH 5/10] UML - compile fixes

2005-01-28 Thread Blaisorblade

On Friday 28 January 2005 23:10, Andrew Morton wrote:
> Blaisorblade <[EMAIL PROTECTED]> wrote:
> > On Monday 17 January 2005 08:27, Andrew Morton wrote:
> > > Jeff Dike <[EMAIL PROTECTED]> wrote:
> > > > This fixes some warnings, and changes the system call table so that
> > > > it will compile in -linus, where the vperf system calls are not yet
> > > > merged.
> > >
> > > methinks we already fixed this.
> > >
> > > > Signed-off-by: Jeff Dike <[EMAIL PROTECTED]>
> >
> > No, incorrect, this is not applied, current bitkeeper snapshots don't
> > compile for this reason too.
> >
> > Jeff, I think you should resend the patch anyway.
>
> I don't know what this is about.
Yes, it was from some days ago... so I guess either I or Jeff will have to 
resend it...

Andrew, when do you plan to release 2.6.11?

Jeff, you should send your queued fixes, and also resend this one (indeed, it 
was not applied). If I find the time I'll select the interesting ones and 
send them (with a mail to request their prompt merge).

> The only UML patch I have pending is 
>
> uml-kconfig_arch-little-cleanup-to-merge-before-2611.patch
Ok, please merge it ASAP (as the title suggests).

> From: [EMAIL PROTECTED]



-- 
Paolo Giarrusso, aka Blaisorblade
Linux registered user n. 292729
http://www.user-mode-linux.org/~blaisorblade

Re: [Bug 4081] New: OpenOffice crashes while starting due to a threading error

2005-01-28 Thread Stephen Hemminger

On Fri, 28 Jan 2005 18:46:13 -0500
Parag Warudkar <[EMAIL PROTECTED]> wrote:

> Lee Revell wrote:
> 
> >  
> >
> >>munmap(0x955838, 8192)  = -1 EINVAL (Invalid argument)
> >>munmap(0x80d7ff0, 3221222108)   = -1 EINVAL (Invalid argument)
> >>--- SIGSEGV (Segmentation fault) @ 0 (0) ---
> >>
> >>
> >
> >No, it really looks like OO tried to munmap() something incorrectly.
> >3,221,222,108 bytes at offset 0x80d7ff0?
> >
> >Lee
> >
> >  
> >
> Maybe that's another OO.o bug which gets triggered by failure to open 
> /dev/dri? Actually Stephen had OO working fine with earlier kernels, 
> where possibly /dev/dri/* permissions were appropriate and it was able 
> to open it - With new kernel the permissions seem to be improper which 
> is confirmed by strace --
> 
> open("/dev/dri/card0", O_RDWR)  = -1 EACCES (Permission denied)
> 
> Should be filed as a bug with OO.org - it shouldn't segfault due to DRI 
> permissions.
> 
> Parag

Note: on 2.6.10
/dev/dri/card0  crw-rw-rw-
on 2.6.11-rc2
/dev/dri/card0  crw-rw
/dev/dri/card1  crw-rw

Changing permissions seems to fix (it for startup), will try more and see
if udev remembers not to turn them back.

-- 
Stephen Hemminger   <[EMAIL PROTECTED]>

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Dmitry Torokhov

On Friday 28 January 2005 18:33, Adam Belay wrote:
> On Fri, 2005-01-28 at 18:23 -0500, Dmitry Torokhov wrote:
> > On Friday 28 January 2005 17:30, Adam Belay wrote:
> > > Of course this patch is not going to be effective alone.  We also need
> > > to change the init order.  If a driver is registered early but isn't the
> > > best available, it will be bound to the device prematurely.  This would
> > > be a problem for cardbus (yenta) bridges.
> > > 
> > > I think we may have to load all in kernel drivers first, and then begin
> > > matching them to hardware.  Do you agree?  If so, I'd be happy to make a
> > > patch for that too.
> > > 
> > 
> > I disagree. The driver core should automatically unbind generic driver
> > from a device when native driver gets loaded. I think the only change is
> > that we can no longer skip devices that are bound to a driver and match
> > them all over again when a new driver is loaded.  
> > 
> 
> That's another option.  My concern is that if a generic driver pokes
> around with hardware, it may fail to initialize properly when the actual
> driver is loaded.  There are other problems too.  If the system were to
> be suspended while the generic driver was loaded, the restore_state code
> may be incorrect, also rendering the device unusable.
> 

If generic driver binds to a device that it has no idea how to drive
_at all_ then I will argue that the generic driver is broken. It should
not bind to begin with.

-- 
Dmitry

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Adam Belay

On Fri, 2005-01-28 at 18:23 -0500, Dmitry Torokhov wrote:
> On Friday 28 January 2005 17:30, Adam Belay wrote:
> > Of course this patch is not going to be effective alone.  We also need
> > to change the init order.  If a driver is registered early but isn't the
> > best available, it will be bound to the device prematurely.  This would
> > be a problem for cardbus (yenta) bridges.
> > 
> > I think we may have to load all in kernel drivers first, and then begin
> > matching them to hardware.  Do you agree?  If so, I'd be happy to make a
> > patch for that too.
> > 
> 
> I disagree. The driver core should automatically unbind generic driver
> from a device when native driver gets loaded. I think the only change is
> that we can no longer skip devices that are bound to a driver and match
> them all over again when a new driver is loaded.  
> 

That's another option.  My concern is that if a generic driver pokes
around with hardware, it may fail to initialize properly when the actual
driver is loaded.  There are other problems too.  If the system were to
be suspended while the generic driver was loaded, the restore_state code
may be incorrect, also rendering the device unusable.

I'd like to leave the option of unloading generic driver open.  I just
think we need to be aware of potential problems it might cause, before
deciding to go that direction.

Thanks,
Adam


Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

2005-01-28 Thread hui

On Fri, Jan 28, 2005 at 08:45:46PM +0100, Ingo Molnar wrote:
> * Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > If you do have a highest interrupt case that causes all activity to
> > block, then rwsems may indeed fit the bill.
> > 
> > In the NFS client code we may use rwsems in order to protect stateful
> > operations against the (very infrequently used) server reboot recovery
> > code. The point is that when the server reboots, the server forces us
> > to block *all* requests that involve adding new state (e.g. opening an
> > NFSv4 file, or setting up a lock) while our client and others are
> > re-establishing their existing state on the server.
> 
> it seems the most scalable solution for this would be a global flag plus
> per-CPU spinlocks (or per-CPU mutexes) to make this totally scalable and
> still support the requirements of this rare event. An rwsem really
> bounces around on SMP, and it seems very unnecessary in the case you
> described.
> 
> possibly this could be formalised as an rwlock/rwlock implementation
> that scales better. brlocks were such an attempt.
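
A rough userspace sketch of the brlock-style idea mentioned above: readers take only their own per-CPU lock, and the rare exclusive operation has to take all of them. It assumes a fixed NR_SLOTS in place of real CPUs, uses C11 atomics instead of kernel spinlocks, and leaves out the global flag. Every name here is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4  /* stand-in for the number of CPUs */

/* One lock word per slot; static storage starts out false (unlocked). */
static atomic_bool slot_locked[NR_SLOTS];

static bool slot_trylock(int i)
{
    /* exchange returns the previous value; false means we took the lock */
    return !atomic_exchange_explicit(&slot_locked[i], true,
                                     memory_order_acquire);
}

static void slot_unlock(int i)
{
    atomic_store_explicit(&slot_locked[i], false, memory_order_release);
}

/* Reader: touches only its own slot, so nothing bounces between CPUs. */
void br_read_lock(int slot)
{
    while (!slot_trylock(slot))
        ;  /* spin */
}

void br_read_unlock(int slot)
{
    slot_unlock(slot);
}

/* Writer (the rare event): must own every slot.  Backs out and
 * returns false if any slot is currently held. */
bool br_write_trylock(void)
{
    for (int i = 0; i < NR_SLOTS; i++) {
        if (!slot_trylock(i)) {
            while (--i >= 0)
                slot_unlock(i);
            return false;
        }
    }
    return true;
}

void br_write_unlock(void)
{
    for (int i = NR_SLOTS - 1; i >= 0; i--)
        slot_unlock(i);
}
```

The read side costs one uncontended atomic per operation, while the write side pays for all slots, which fits a server-reboot style event that is expected to be very rare. Note that two readers sharing a slot would serialize in this sketch.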

From how I understand it, you'll have to have a global structure to
denote an exclusive operation and then take some additional cpumask_t
representing the set of spinlocks and use it to iterate over when doing a
PI chain operation.

Locking each individual parametric typed spinlock might require
a raw_spinlock to manipulate list structures, which, added up, is rather
heavyweight.

Not only that, you'd have to introduce a notion of it being counted
since it could also be acquired/preempted by another higher priority
thread on that same processor.  Not having this semantic would make the
thread in that specific circumstance effectively non-preemptable (PI
scheduler indeterminacy), where the multiple-readers portion of a
real read/write (shared-exclusive) lock would have permitted this.

http://people.lynuxworks.com/~bhuey/rt-share-exclusive-lock/rtsem.tgz.1208

This is our attempt at getting real shared-exclusive lock semantics in a
blocking lock; it may still be incomplete and buggy. Igor is still
working on this and this is the latest that I have of his work. Getting
comments on this approach would be a good thing as I/we (me/Igor)
believed from the start that this approach is correct.

Assuming that this is possible with the current approach, optimizing
it to avoid CPU ping-ponging is an important next step.

bill


Re: [PATCH] idle thread preemption fix

2005-01-28 Thread Ingo Molnar

* Olaf Hering <[EMAIL PROTECTED]> wrote:

> What's the purpose of local_irq_disable() here? Locks up my toys in
> atkbd_init or IP hash foo functions.

fix already posted a couple of days ago, see:

--
* Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote:

> Hi Ingo !
> 
> Could you explain me precisely what is the race you are fixing by
> adding local_irq_disable() to rest_init() ?

it can be bad for the idle task to hold the BKL and to have preemption
enabled - in such a situation the scheduler will get confused if an
interrupt triggers a forced preemption in that small window. But it's
not necessary to keep IRQs disabled after the BKL has been dropped. In
fact i think IRQ-disabling doesn't have to be done at all, the patch
below ought to solve this scenario equally well, and should solve the
PPC side-effects too.

Tested on top of 2.6.11-rc2 on x86 PREEMPT+SMP and PREEMPT+!SMP (which
IIRC were the config variants that triggered the original problem), on
an SMP and on a UP system.

Ingo

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

--- linux/init/main.c.orig
+++ linux/init/main.c
@@ -373,14 +373,9 @@ static void noinline rest_init(void)
 {
kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
numa_default_policy();
-   /*
-* Re-enable preemption but disable interrupts to make sure
-* we dont get preempted until we schedule() in cpu_idle().
-*/
-   local_irq_disable();
-   preempt_enable_no_resched();
unlock_kernel();
-   cpu_idle();
+   preempt_enable_no_resched();
+   cpu_idle();
 } 

 /* Check for early params. */

Re: [RFC][PATCH] add driver matching priorities

2005-01-28 Thread Dmitry Torokhov

On Friday 28 January 2005 17:30, Adam Belay wrote:
> Of course this patch is not going to be effective alone.  We also need
> to change the init order.  If a driver is registered early but isn't the
> best available, it will be bound to the device prematurely.  This would
> be a problem for cardbus (yenta) bridges.
> 
> I think we may have to load all in kernel drivers first, and then begin
> matching them to hardware.  Do you agree?  If so, I'd be happy to make a
> patch for that too.
> 

I disagree. The driver core should automatically unbind generic driver
from a device when native driver gets loaded. I think the only change is
that we can no longer skip devices that are bound to a driver and match
them all over again when a new driver is loaded.  
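
One way to picture the rescan described here, as a userspace sketch only. The MATCH_PRIORITY_* names are borrowed from Adam's patch; the structs, the two match callbacks and demo_driver_loaded() are invented for illustration and do not exist in the driver core. The sketch ignores what happens when the device is busy.

```c
#include <stddef.h>

enum {
    MATCH_PRIORITY_FAILURE = 0,
    MATCH_PRIORITY_GENERIC,
    MATCH_PRIORITY_NORMAL,
    MATCH_PRIORITY_VENDOR,
};

struct demo_device;

struct demo_driver {
    const char *name;
    int (*match)(const struct demo_device *dev);  /* MATCH_PRIORITY_* */
};

struct demo_device {
    int id;
    const struct demo_driver *driver;  /* NULL while unbound */
    int bound_prio;                    /* priority of the current binding */
};

/* Hypothetical generic driver: claims everything, but only weakly. */
int demo_generic_match(const struct demo_device *dev)
{
    (void)dev;
    return MATCH_PRIORITY_GENERIC;
}

/* Hypothetical native driver: knows exactly one device id. */
int demo_native_match(const struct demo_device *dev)
{
    return dev->id == 42 ? MATCH_PRIORITY_NORMAL : MATCH_PRIORITY_FAILURE;
}

/* Called when a driver is loaded: rescan every device, bound or not,
 * and take over those this driver handles better than the current one.
 * Returns the number of devices (re)bound. */
int demo_driver_loaded(const struct demo_driver *drv,
                       struct demo_device *devs, size_t ndevs)
{
    int rebound = 0;

    for (size_t i = 0; i < ndevs; i++) {
        int prio = drv->match(&devs[i]);

        if (prio > devs[i].bound_prio) {
            /* a real core would first release the old driver here */
            devs[i].driver = drv;
            devs[i].bound_prio = prio;
            rebound++;
        }
    }
    return rebound;
}
```

Loading the generic driver first binds every device; loading the native driver afterwards takes over only the device it actually knows.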

-- 
Dmitry

[PATCH] document atkbd.softraw

2005-01-28 Thread Andries.Brouwer

Document atkbd.softraw (and shorten a few long lines nearby).

diff -uprN -X /linux/dontdiff a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt   2004-12-29 03:39:42.0 +0100
+++ b/Documentation/kernel-parameters.txt   2005-01-29 00:21:07.0 +0100
@@ -222,15 +222,19 @@ running once the system is up.
 
atascsi=[HW,SCSI] Atari SCSI
 
-   atkbd.extra=[HW] Enable extra LEDs and keys on IBM RapidAccess, EzKey
-   and similar keyboards
+   atkbd.extra=[HW] Enable extra LEDs and keys on IBM RapidAccess,
+   EzKey and similar keyboards
 
atkbd.reset=[HW] Reset keyboard during initialization
 
atkbd.set=  [HW] Select keyboard code set 
Format:  (2 = AT (default) 3 = PS/2)
 
-   atkbd.scroll=   [HW] Enable scroll wheel on MS Office and similar keyboards
+   atkbd.scroll=   [HW] Enable scroll wheel on MS Office and similar
+   keyboards
+
+   atkbd.softraw=  [HW] Choose between synthetic and real raw mode
+   Format:  (0 = real, 1 = synthetic (default))

atkbd.softrepeat=
[HW] Use software keyboard repeat

Re: [Bug 4081] New: OpenOffice crashes while starting due to a threading error

2005-01-28 Thread Lee Revell

On Fri, 2005-01-28 at 09:31 -0800, Stephen Hemminger wrote:
> Here is the strace output of the part that SEGV's, looks like a DRI issue??

[snip]

> munmap(0x955838, 8192)  = -1 EINVAL (Invalid argument)
> munmap(0x80d7ff0, 3221222108)   = -1 EINVAL (Invalid argument)
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---

No, it really looks like OO tried to munmap() something incorrectly.
3,221,222,108 bytes at offset 0x80d7ff0?
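
Both addresses in the trace have nonzero low 12 bits (0x838 and 0xff0), so assuming 4 KiB pages neither is page-aligned, and Linux munmap() returns EINVAL for a misaligned address. A small sketch that shows this; the helper names are made up for illustration:

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* True if addr sits on a page boundary for the given page size. */
int is_page_aligned(uintptr_t addr, long page_size)
{
    return (addr & (uintptr_t)(page_size - 1)) == 0;
}

/* Maps one page, calls munmap() on an address 8 bytes past its start,
 * and returns the errno from that call (0 if it unexpectedly
 * succeeded, -1 if the initial mmap() failed). */
int munmap_misaligned_errno(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int err = 0;

    if (p == MAP_FAILED)
        return -1;

    errno = 0;
    if (munmap(p + 8, (size_t)page) == -1)  /* misaligned on purpose */
        err = errno;

    munmap(p, (size_t)page);                /* clean up with the aligned address */
    return err;
}
```

The length in the second call is a separate oddity: 3221222108 bytes is close to 3 GiB, which cannot be a sane mapping size here.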

Lee


Re: [Bug 4081] New: OpenOffice crashes while starting due to a threading error

2005-01-28 Thread Parag Warudkar

Stephen Hemminger wrote:
> Here is the strace output of the part that SEGV's, looks like a DRI issue??
 

Yep.. If you haven't already, just change the permissions on 
/dev/dri/card0 to give access to your user id and it should be fine. 
(Reporter of this bug had to do the same in order to get it working)

Something in the kernel changed as far as DRI goes - don't know what. And 
I know why I wasn't affected - NVIDIA driver which doesn't use 
/dev/dri/*. Though Trever seems to be having an entirely different 
problem - one oddity being the continuous -EINTRs that his OO.o gets on 
startup.

Trever - If your problem isn't solved yet - Can you run gdb 
/path/to/ooffice from commandline and then when it segfaults, post the 
backtrace?

Parag

PROBLEM: SysV semaphore race vs SIGSTOP

2005-01-28 Thread Ove Kaaven

There seems to be a race when SIGSTOP-ing a process waiting for a SysV
semaphore. Even if it could not possibly have owned the semaphore when
the signal was sent (because the sender of the signal owned it at the
time), it still occasionally happens that it both stops execution *and*
acquires the semaphore, with a deadlocked application as the result.
This is a problem for some of the high-performance stuff I'm working on.
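
For readers unfamiliar with the API involved, here is a minimal sketch of taking and releasing a SysV semaphore with semop(). It does not reproduce the race and is not the test program linked in this message; the sysv_sem_* helpers and union demo_semun are names invented for this illustration.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* The caller has to supply the semctl() argument union itself. */
union demo_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private one-element semaphore set with the given value.
 * Returns the semaphore id, or -1 on failure. */
int sysv_sem_create(int initial)
{
    union demo_semun arg;
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    arg.val = initial;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

/* delta = -1 acquires (blocks while the value is 0), +1 releases. */
int sysv_sem_adjust(int id, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };

    return semop(id, &op, 1);
}

/* Current value of the semaphore, or -1 on error. */
int sysv_sem_value(int id)
{
    return semctl(id, 0, GETVAL);
}

/* Remove the semaphore set from the system. */
int sysv_sem_remove(int id)
{
    return semctl(id, 0, IPC_RMID);
}
```

In the scenario described here, the waiting process would be blocked inside the semop() call with delta = -1 at the moment SIGSTOP arrives.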

A sample test program exhibiting the problem is available at
http://www.ping.uio.no/~ovehk/sembug.c

For me, it will show "ACQUIRE FAILED!! DEADLOCK!!" almost every time I
run it. Occasionally it will run fine; if it does for you, just try
again a couple of times.

The kernel I currently use is:

Linux version 2.4.27-1-k7 ([EMAIL PROTECTED]) (gcc
version 3.3.5 (Debian 1:3.3.5-2)) #1 Wed Dec 1 20:12:01 JST 2004

and I run it on a uniprocessor system (AMD Athlon, 1.9GHz) with Debian
"sid" installed.

I'm not a kernel hacker, but from a quick perusal of the 2.4 code, it
didn't seem to me like the semaphore code in the kernel (ipc/sem.c) even
tries to handle suspended threads (though I wouldn't know how to do so).
The 2.6 semaphore code looked almost the same to me, too, so it might be
a problem there as well.

Please Cc me on any questions or comments, since I am too wimpy to
subscribe yet.


Re: panic in raid1_end_write_request

2005-01-28 Thread Norman Gaywood

Thanks Mark,

On Fri, Jan 28, 2005 at 04:34:01PM -0600, Mark Rustad wrote:
> I used to get these running SuSE SLES 9 and also with a variety of 
> kernel.org kernels. The crash was triggered by a media error on a 
> RAID1.

Were there any media errors logged? My system does not log any such errors.

> A patch that I got from SuSE fixed it for me. The patch is below 
> your message excerpt.

That looks like the "bio clone memory corruption" patch which is
supposed to be in 2.6.10-1.747_FC3smp via 2.6.10-ac10 being included in
that kernel.

I was hoping that would solve my problem as well, but it didn't.

-- 
Norman Gaywood, Systems Administrator
School of Mathematics, Statistics and Computer Science
University of New England, Armidale, NSW 2351, Australia

[EMAIL PROTECTED]                Phone: +61 (0)2 6773 2412
http://turing.une.edu.au/~norm   Fax:   +61 (0)2 6773 3312

Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html

Re: PNP and bus association

2005-01-28 Thread Adam Belay

Hi Pierre,

The platform bus does not show the actual physical relationship either.  For
x86, ACPI is typically needed to determine this. It would be easy to
spawn pnp devices off of an ISA bridge device, attached to the pci bus, but
whether it's the actual physical parent would be very difficult to determine
without firmware assistance.

At the moment the pnp bus is only showing a logical bus relationship.  If we
were to use ACPI to aid in the generation of the physical device tree, we
could put these devices in the correct physical location.

Thanks,
Adam

On Thu, Jan 27, 2005 at 10:16:50PM +0100, Pierre Ossman wrote:
> I recently tried out adding PNP support to my driver to remove the 
> hassle of finding the correct parameters for it. This, however, causes 
> it to show up under the pnp bus, where as it previously was located 
> under the platform bus.
> 
> Is the idea that PNP devices should only reside on the PNP bus or is 
> there some magic available to get the device to appear on several buses? 
> It's a bit of a hassle to search in two different places in sysfs 
> depending on if PNP is used or not.
> 
> Also, the PNP bus doesn't really say that much about where the device is 
> physically connected. The other bus types usually give a hint about this.

It's normal for ISA devices to not tell us much about their physical
properties.

> 
> Rgds
> Pierre

Re: multiple neighbour cache tables for AF_INET

2005-01-28 Thread YOSHIFUJI Hideaki / 吉藤英明

In article <[EMAIL PROTECTED]> (at Sat, 29 Jan 2005 09:19:49 +1100), Herbert Xu 
<[EMAIL PROTECTED]> says:

> IMHO you need to give the user a way to specify which table they want
> to operate on.  If they don't specify one, then the current behaviour
> of choosing the first table found is reasonable.

We have dev. Isn't it sufficient?

--yoshfuji

Re: [PATCH] idle thread preemption fix

2005-01-28 Thread Olaf Hering

 On Sat, Jan 08, Linux Kernel Mailing List wrote:

> ChangeSet 1.2316, 2005/01/08 13:53:41-08:00, [EMAIL PROTECTED]
> 
>   [PATCH] idle thread preemption fix
>   
>   The early bootup stage is pretty fragile because the idle thread is not yet
>   functioning as such and so we need preemption disabled.  Whether the bootup
>   fails or not seems to depend on timing details so e.g.  the presence of
>   SCHED_SMT makes it go away.
>   
>   Disabling preemption explicitly has another advantage: the atomicity check
>   in schedule() will catch early-bootup schedule() calls from now on.
>   
>   The patch also fixes another preempt-bkl buglet: interrupt-driven
>   forced-preemption didnt go through preempt_schedule() so it resulted in
>   auto-dropping of the BKL.  Now we go through preempt_schedule() which
>   properly deals with the BKL.
>   
>   Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
>   Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
>   Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>

> diff -Nru a/init/main.c b/init/main.c
> --- a/init/main.c 2005-01-08 15:18:18 -08:00
> +++ b/init/main.c 2005-01-08 15:18:18 -08:00
> @@ -373,6 +373,12 @@
>  {
>   kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
>   numa_default_policy();
> + /*
> +  * Re-enable preemption but disable interrupts to make sure
> +  * we dont get preempted until we schedule() in cpu_idle().
> +  */
> + local_irq_disable();
> + preempt_enable_no_resched();
>   unlock_kernel();
>   cpu_idle();
>  } 

What's the purpose of local_irq_disable() here? Locks up my toys in
atkbd_init or IP hash foo functions.



[RFC][PATCH] add driver matching priorities

2005-01-28 Thread Adam Belay

Hi,

This patch adds initial support for driver matching priorities to the
driver model.  It is needed for my work on converting the pci bridge
driver to use "struct device_driver".  It may also be helpful for drivers
with more complex (or long id lists as I've seen in many cases) matching
criteria. 

"match" has been added to "struct device_driver".  There are now two
steps in the matching process.  The first step is a bus specific filter
that determines possible driver candidates.  The second step is a driver
specific match function that verifies if the driver will work with the
hardware, and returns a priority code (how well it is able to handle the
device).  The bus layer could override the driver's match function if
necessary (similar to how it passes *probe through its layer and then
on to the actual driver).

The current priorities are as follows:

enum {
MATCH_PRIORITY_FAILURE = 0,
MATCH_PRIORITY_GENERIC,
MATCH_PRIORITY_NORMAL,
MATCH_PRIORITY_VENDOR,
};

Let me know if any of this would need to be changed.  For example, the
"struct bus_type" match function could return a priority code.

Of course this patch is not going to be effective alone.  We also need
to change the init order.  If a driver is registered early but isn't the
best available, it will be bound to the device prematurely.  This would
be a problem for cardbus (yenta) bridges.

I think we may have to load all in kernel drivers first, and then begin
matching them to hardware.  Do you agree?  If so, I'd be happy to make a
patch for that too.
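
The two-step scheme can be tried out in userspace. The sketch below mirrors the loop this patch adds to device_attach(): it walks the priority levels from MATCH_PRIORITY_VENDOR downward and, at each level, binds the first driver whose match() reports exactly that level. The sim_* structs and the two sample match functions are hypothetical, and the bus-level filter and probe() are left out.

```c
#include <stddef.h>

enum {
    MATCH_PRIORITY_FAILURE = 0,
    MATCH_PRIORITY_GENERIC,
    MATCH_PRIORITY_NORMAL,
    MATCH_PRIORITY_VENDOR,
};

/* Hypothetical device: just a vendor id and a 16-bit class code. */
struct sim_device {
    unsigned vendor;
    unsigned class_code;
};

struct sim_driver {
    const char *name;
    /* NULL means "no opinion", treated as NORMAL as in the patch */
    int (*match)(const struct sim_device *dev);
};

/* Sample: a generic driver that claims any bridge (base class 0x06). */
int sim_generic_bridge_match(const struct sim_device *dev)
{
    return (dev->class_code >> 8) == 0x06 ? MATCH_PRIORITY_GENERIC
                                          : MATCH_PRIORITY_FAILURE;
}

/* Sample: a cardbus driver that claims only class 0x0607. */
int sim_cardbus_match(const struct sim_device *dev)
{
    return dev->class_code == 0x0607 ? MATCH_PRIORITY_NORMAL
                                     : MATCH_PRIORITY_FAILURE;
}

/* Highest priority first; registration order only breaks ties. */
const struct sim_driver *sim_device_attach(const struct sim_device *dev,
                                           const struct sim_driver *drivers,
                                           size_t ndrivers)
{
    for (int level = MATCH_PRIORITY_VENDOR;
         level > MATCH_PRIORITY_FAILURE; level--) {
        for (size_t i = 0; i < ndrivers; i++) {
            int got = drivers[i].match ? drivers[i].match(dev)
                                       : MATCH_PRIORITY_NORMAL;
            if (got == level)
                return &drivers[i];
        }
    }
    return NULL;  /* nothing matched at any level */
}
```

With the generic bridge driver listed first, a cardbus device still ends up with the cardbus driver, provided both drivers are registered before matching starts. That is the init-order concern raised above.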

Thanks,
Adam


--- a/drivers/base/bus.c   2005-01-20 17:37:46.0 -0500
+++ b/drivers/base/bus.c   2005-01-28 16:59:00.0 -0500
@@ -286,6 +286,9 @@
if (drv->bus->match && !drv->bus->match(dev, drv))
return -ENODEV;
 
+   if (drv->match && !drv->match(dev))
+   return -ENODEV;
+
dev->driver = drv;
if (drv->probe) {
int error = drv->probe(dev);
@@ -299,6 +302,42 @@
return 0;
 }
 
+/**
+ * driver_probe_device_priority - attempt to bind device & driver with a
+ *given match level priority 
+ * @drv:   driver.
+ * @dev:   device.
+ * @priority   the match level priority
+ */
+
+static int driver_probe_device_priority(struct device_driver * drv,
+   struct device * dev, int priority)
+{
+   int matchp;
+
+   if (drv->bus->match && !drv->bus->match(dev, drv))
+   return -ENODEV;
+
+   if (drv->match) {
+   matchp = drv->match(dev);
+   } else
+   matchp = MATCH_PRIORITY_NORMAL;
+
+   if (matchp != priority)
+   return -ENODEV;
+
+   dev->driver = drv;
+   if (drv->probe) {
+   int error = drv->probe(dev);
+   if (error) {
+   dev->driver = NULL;
+   return error;
+   }
+   }
+
+   device_bind_driver(dev);
+   return 0;
+}
 
 /**
  * device_attach - try to attach device to a driver.
@@ -312,17 +351,20 @@
 {
struct bus_type * bus = dev->bus;
struct list_head * entry;
-   int error;
+   int error, matchp = MATCH_PRIORITY_VENDOR;
 
if (dev->driver) {
device_bind_driver(dev);
return 1;
}
 
-   if (bus->match) {
+   if (!bus->match)
+   return 0;
+   
+   while (matchp > 0) {
list_for_each(entry, &bus->drivers.list) {
struct device_driver * drv = to_drv(entry);
-   error = driver_probe_device(drv, dev);
+   error = driver_probe_device_priority(drv, dev, matchp);
if (!error)
/* success, driver matched */
return 1;
@@ -332,6 +374,7 @@
"%s: probe of %s failed with error %d\n",
drv->name, dev->bus_id, error);
}
+   matchp--;
}
 
return 0;
--- a/include/linux/device.h   2005-01-20 17:37:26.0 -0500
+++ b/include/linux/device.h   2005-01-28 16:40:22.0 -0500
@@ -41,6 +41,13 @@
RESUME_ENABLE,
 };
 
+enum {
+   MATCH_PRIORITY_FAILURE = 0,
+   MATCH_PRIORITY_GENERIC,
+   MATCH_PRIORITY_NORMAL,
+   MATCH_PRIORITY_VENDOR,
+};
+
 struct device;
 struct device_driver;
 struct class;
@@ -108,6 +115,7 @@
 
struct module   * owner;
 
+   int (*match)(struct device * dev);
int (*probe)(struct device * dev);
int (*remove)   (struct device * dev);
void(*shutdown) (struct device * dev);



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info

[WATCHDOG] 2.6.11-rc2 watchdog patches

2005-01-28 Thread Wim Van Sebroeck

Hi Andrew,

please do a

bk pull http://linux-watchdog.bkbits.net/linux-2.6-watchdog-mm

This will update the following files:

 drivers/char/watchdog/i8xx_tco.c|   34 +++---
 drivers/char/watchdog/ixp2000_wdt.c |2 +-
 drivers/char/watchdog/ixp4xx_wdt.c  |2 +-
 drivers/char/watchdog/sa1100_wdt.c  |2 +-
 drivers/char/watchdog/scx200_wdt.c  |2 +-
 5 files changed, 31 insertions(+), 11 deletions(-)

through these ChangeSets:

<[EMAIL PROTECTED]> (05/01/28 1.1984)
   [WATCHDOG] i8xx_tco.c-ICH4/6/7-patch
   
   Added support for the ICH4-M, ICH6, ICH6R, ICH6-M, ICH6W and ICH6RW
   chipsets. Also added support for the "undocumented" ICH7.

<[EMAIL PROTECTED]> (05/01/28 1.1985)
   [WATCHDOG] correct sysfs name for watchdog devices
   
   While looking for possible candidates for our udev.rules package,
   I found a few odd ->name properties. /dev/watchdog has minor 130
   according to devices.txt. Since all watchdog drivers use the
   misc_register() call, they will end up in /sys/class/misc/$foo.
   udev may create the /dev/watchdog node if the driver is loaded.
   I don't have such a device, so I can't test it.
   The drivers below provide names with spaces and even with / in it.
   Not a big deal, but apps may expect /dev/watchdog.
   
   Signed-off-by: Olaf Hering <[EMAIL PROTECTED]>
   Signed-off-by: Wim Van Sebroeck <[EMAIL PROTECTED]>


The ChangeSets can also be looked at on:
http://linux-watchdog.bkbits.net:8080/linux-2.6-watchdog-mm

For completeness, I added the patches below.

Greetings,
Wim.


diff -Nru a/drivers/char/watchdog/i8xx_tco.c b/drivers/char/watchdog/i8xx_tco.c
--- a/drivers/char/watchdog/i8xx_tco.c  2005-01-28 23:29:58 +01:00
+++ b/drivers/char/watchdog/i8xx_tco.c  2005-01-28 23:29:58 +01:00
@@ -1,5 +1,5 @@
 /*
- * i8xx_tco 0.06:  TCO timer driver for i8xx chipsets
+ * i8xx_tco 0.07:  TCO timer driver for i8xx chipsets
  *
  * (c) Copyright 2000 kernel concepts <[EMAIL PROTECTED]>, All Rights Reserved.
  * http://www.kernelconcepts.de
@@ -22,11 +22,22 @@
  *
  * The TCO timer is implemented in the following I/O controller hubs:
  * (See the intel documentation on http://developer.intel.com.)
- * 82801AA & 82801AB  chip : document number 290655-003, 290677-004,
- * 82801BA & 82801BAM chip : document number 290687-002, 298242-005,
- * 82801CA & 82801CAM chip : document number 290716-001, 290718-001,
- * 82801DB & 82801E   chip : document number 290744-001, 273599-001,
- * 82801EB & 82801ER  chip : document number 252516-001
+ * 82801AA  (ICH)    : document number 290655-003, 290677-014,
+ * 82801AB  (ICHO)   : document number 290655-003, 290677-014,
+ * 82801BA  (ICH2)   : document number 290687-002, 298242-027,
+ * 82801BAM (ICH2-M) : document number 290687-002, 298242-027,
+ * 82801CA  (ICH3-S) : document number 290733-003, 290739-013,
+ * 82801CAM (ICH3-M) : document number 290716-001, 290718-007,
+ * 82801DB  (ICH4)   : document number 290744-001, 290745-020,
+ * 82801DBM (ICH4-M) : document number 252337-001, 252663-005,
+ * 82801E   (C-ICH)  : document number 273599-001, 273645-002,
+ * 82801EB  (ICH5)   : document number 252516-001, 252517-003,
+ * 82801ER  (ICH5R)  : document number 252516-001, 252517-003,
+ * 82801FB  (ICH6)   : document number 301473-002, 301474-007,
+ * 82801FR  (ICH6R)  : document number 301473-002, 301474-007,
+ * 82801FBM (ICH6-M) : document number 301473-002, 301474-007,
+ * 82801FW  (ICH6W)  : document number 301473-001, 301474-007,
+ * 82801FRW (ICH6RW) : document number 301473-001, 301474-007
  *
  *  2710 Nils Faerber
  * Initial Version 0.01
@@ -49,6 +60,9 @@
  *  20030921 Wim Van Sebroeck <[EMAIL PROTECTED]>
  * 0.06 change i810_margin to heartbeat, use module_param,
  *      added notify system support, renamed module to i8xx_tco.
+ *  20050128 Wim Van Sebroeck <[EMAIL PROTECTED]>
+ * 0.07 Added support for the ICH4-M, ICH6, ICH6R, ICH6-M, ICH6W and ICH6RW
+ *  chipsets. Also added support for the "undocumented" ICH7 chipset.
  */
 
 /*
@@ -73,7 +87,7 @@
 #include "i8xx_tco.h"
 
 /* Module and version information */
-#define TCO_VERSION "0.06"
+#define TCO_VERSION "0.07"
 #define TCO_MODULE_NAME "i8xx TCO timer"
 #define TCO_DRIVER_NAME   TCO_MODULE_NAME ", v" TCO_VERSION
 #define PFX TCO_MODULE_NAME ": "
@@ -360,8 +374,14 @@
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_0,   PCI_ANY_ID, PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_12,  PCI_ANY_ID, PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_0,   PCI_ANY_ID, PCI_ANY_ID, },
+

Re: panic in raid1_end_write_request

2005-01-28 Thread Mark Rustad

Norman,
I used to get these running SuSE SLES 9 and also with a variety of 
kernel.org kernels. The crash was triggered by a media error on a 
RAID1. A patch that I got from SuSE fixed it for me. The patch is below 
your message excerpt.

On Jan 28, 2005, at 3:23 PM, Norman Gaywood wrote:
> I have a Dell PE2650, Dual Xeon, 1G memory and several software raid1
> partitions, ext3. Main duties include NFS, DHCP and samba. A Fedora
> kernel 2.6.10-1.747_FC3smp which includes 2.6.10-ac10.
> This system panics frequently, between several hours to several days. It
> does not seem to be related to load. Hardware and memory tests indicate
> a good system.
>
> Panic messages are similar to:
> Unable to handle kernel NULL pointer dereference at virtual address 0038
>  printing eip:
> f882940f
> *pde = 379c9001
> Oops:  [#1]

Here is the patch:
--- linux-2.6.5/fs/bio.c~   2004-11-24 12:42:10.532343678 +0100
+++ linux-2.6.5/fs/bio.c   2004-11-24 12:46:49.308021403 +0100
@@ -98,12 +98,7 @@
BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);
-   /*
-* cloned bio doesn't own the veclist
-*/
-   if (!bio_flagged(bio, BIO_CLONED))
-   mempool_free(bio->bi_io_vec, bp->pool);
-
+   mempool_free(bio->bi_io_vec, bp->pool);
mempool_free(bio, bio_pool);
 }
@@ -212,7 +207,9 @@
  */
 inline void __bio_clone(struct bio *bio, struct bio *bio_src)
 {
-	bio->bi_io_vec = bio_src->bi_io_vec;
+	request_queue_t *q = bdev_get_queue(bio_src->bi_bdev);
+
+	memcpy(bio->bi_io_vec, bio_src->bi_io_vec, bio_src->bi_max_vecs * sizeof(struct bio_vec));

bio->bi_sector = bio_src->bi_sector;
bio->bi_bdev = bio_src->bi_bdev;
@@ -224,21 +221,9 @@
 * for the clone
 */
bio->bi_vcnt = bio_src->bi_vcnt;
-   bio->bi_idx = bio_src->bi_idx;
-   if (bio_flagged(bio, BIO_SEG_VALID)) {
-   bio->bi_phys_segments = bio_src->bi_phys_segments;
-   bio->bi_hw_segments = bio_src->bi_hw_segments;
-   bio->bi_flags |= (1 << BIO_SEG_VALID);
-   }
bio->bi_size = bio_src->bi_size;
-
-   /*
-* cloned bio does not own the bio_vec, so users cannot fiddle with
-* it. clear bi_max_vecs and clear the BIO_POOL_BITS to make this
-* apparent
-*/
-   bio->bi_max_vecs = 0;
-   bio->bi_flags &= (BIO_POOL_MASK - 1);
+   bio_phys_segments(q, bio);
+   bio_hw_segments(q, bio);
 }
 /**
@@ -250,7 +235,7 @@
  */
 struct bio *bio_clone(struct bio *bio, int gfp_mask)
 {
-   struct bio *b = bio_alloc(gfp_mask, 0);
+   struct bio *b = bio_alloc(gfp_mask, bio->bi_max_vecs);
if (b)
__bio_clone(b, bio);
--
Mark Rustad, [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] shared subtrees

2005-01-28 Thread Mike Waychison

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Al Viro wrote:

> OK, here comes the first draft of proposed semantics for subtree
> sharing.  What we want is being able to propagate events between
> the parts of mount trees.  Below is a description of what I think
> might be a workable semantics; it does *NOT* describe the data
> structures I would consider final and there are considerable
> areas where we still need to figure out the right behaviour.
> 

Okay, I'm not convinced that shared subtrees as proposed will work well
with autofs.

The idea discussed off-line was this:

When you install an autofs mountpoint, on say /home, a daemon is started
to service the requests.  As far as the admin is concerned, an fs is
mounted in the current namespace, call it namespaceA.  The daemon
actually runs in its own private namespace: call it namespaceB.
namespaceB receives a new autofs filesystem: call it autofsB.  autofsB
is in its own p-node.  namespaceA gets an autofsA on /home as well, and
autofsA is 'owned' by autofsB's p-node.

So:

autofsB -> autofsB
and
autofsB -> autofsA

Effectively, namespaceA has a private instance of autofsB in its tree.

The problem is this:

Assume /home/mikew is accessed in namespaceA.  The daemon running in
namespaceB gets the event, and mounts an nfs vfsmount on autofsB.  This
event is propagated back to autofsA.

(Problem 1: how do you block access to /home/mikew in namespaceA?)

Next, a CLONE_NS is done in namespaceA, creating namespaceA'.  The
homedir on /home/mikew is also copied.

Now, in namespaceA', what happens when a user umount's /home/mikew?  We
haven't yet determined how to handle umount event propagation, but it
appears likely that it will be *a hard thing to do*.

Assuming the nfs umount succeeds, /home/mikew is accessed again in
namespaceA'.

(Problem 2: The daemon in namespaceB will see the event, but it already
has something mounted on its version of /home/mikew.  How does it
'send' a mountpoint to namespaceB?)

- ---

Shared subtrees may help in some administrative situations, but don't
look like the right solution for autofs.

Autofs will work with namespaces if the following functionality is added
to the kernel:  The ability to perform mount(2) operations on a
directory fd.

This has been discussed before and quickly vetoed, citing that it is a
security risk.  I still fail to understand how allowing a mount to
happen cross-namespace given a dirfd target is any worse than what is
already possible given a dirfd.  If you don't want someone to play with
your namespace, don't give them a dirfd.

Thoughts?

- --
Mike Waychison
Sun Microsystems, Inc.
1 (650) 352-5299 voice
1 (416) 202-8336 voice

~~
NOTICE:  The opinions expressed in this email are held by me,
and may not represent the views of Sun Microsystems, Inc.
~~
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFB+r1OdQs4kOxk3/MRAmSpAJ96ix25fjze6o7viCq2DCET9J/AlQCfYlC1
CoLKusJXjL+fYxgwggOCW+w=
=8bTv
-END PGP SIGNATURE-

Re: [PATCH] relayfs redux, part 2

2005-01-28 Thread Andrew Morton

Tom Zanussi <[EMAIL PROTECTED]> wrote:
>
> This patch is the result of the latest round of liposuction on relayfs
>  - the patch size is now 44K, down from 110K and the 200K before that.
>  I'm posting it as a patch against 2.6.10 rather than -mm in order to
>  make it easier to review, but will create one for -mm once the changes
>  have settled down.

Actually, I'll drop all the relayfs and ltt patches from -mm.  They seem to
have done their job ;)

When things settle down and the code is ready for a new run, you know where
I sit.

Re: kernel oops!

2005-01-28 Thread ierdnah

On Fri, 2005-01-28 at 12:28 -0800, Linus Torvalds wrote:


> I'm surprised that it makes _that_ much of a difference, but it sounds
> like you used to be borderline on CPU usage before, and this just made it
> much worse.

It's much worse: I had a load of 5 with 250 VPN connections, and now I
have a load of 200 with 150 connections.

-- 
ierdnah <[EMAIL PROTECTED]>


Re: multiple neighbour cache tables for AF_INET

2005-01-28 Thread Herbert Xu

Wilfried Weissmann <[EMAIL PROTECTED]> wrote:
> 
> The kernels 2.4.28+ and 2.6.9+ with IPv4 and ATM-CLIP enabled have bugs in
> the neighbour cache code. neigh_delete() and neigh_add() only work properly
> if one cache table per address family exist. After ATM-CLIP installed a
> second cache table for AF_INET, neigh_delete() and neigh_add() only examine
> the first table (the ATM-CLIP table if IPv4 and ATM-CLIP are compiled into
> the kernel). neigh_dump_info() is also affected if the neigh_dump_table()
> call fails.

Indeed, this has been the case for a very long time.

IMHO you need to give the user a way to specify which table they want
to operate on.  If they don't specify one, then the current behaviour
of choosing the first table found is reasonable.
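
A rough sketch of that lookup rule, written as plain user-space C rather than against the real neighbour code: `toy_neigh_table`, `find_table` and `TOY_AF_INET` are made-up names, and identifying a table by a string id is only an assumption about how a user might name one.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative family value; the sketch only needs it to be consistent. */
#define TOY_AF_INET 2

/* Hypothetical, stripped-down neighbour table descriptor. */
struct toy_neigh_table {
	int family;
	const char *id;                 /* name the user could ask for */
	struct toy_neigh_table *next;   /* tables are kept on a simple list */
};

/*
 * Return the table for 'family'. If 'id' is non-NULL the table must
 * also carry that id; if 'id' is NULL, fall back to the first table
 * registered for the family, which is the current behaviour.
 */
static struct toy_neigh_table *find_table(struct toy_neigh_table *head,
					  int family, const char *id)
{
	struct toy_neigh_table *t;

	for (t = head; t; t = t->next) {
		if (t->family != family)
			continue;
		if (!id || strcmp(t->id, id) == 0)
			return t;
	}
	return NULL;
}
```

Callers that pass no id keep today's semantics, so existing tools would see no change; only a caller that names a table can reach the second one registered for the same family.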

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-01-28 Thread Ingo Molnar


* Peter Williams <[EMAIL PROTECTED]> wrote:

> I think part of the problem here is that by comparing each task's limit
> to the runqueue's usage rate (and to some extent using a relatively
> short decay period) you're creating the need for the limits to be
> quite large i.e. it has to be big enough to be bigger than the
> combined usage rates of all the unprivileged real time tasks and also
> to handle the short term usage rate peaks of the task.

actually, at least for Jackd use, the current average worked out pretty
well - setting the limit 5-10% above that of the reported average CPU
use gave a result that was equivalent to unrestricted SCHED_FIFO
results.

Ingo

Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

2005-01-28 Thread Peter Williams

Ingo Molnar wrote:
> * Jack O'Quin <[EMAIL PROTECTED]> wrote:
>
>>> i'm wondering, couldn't Jackd solve this whole issue completely in
>>> user-space, via a simple setuid-root wrapper app that does nothing else
>>> but validates whether the user is in the 'jackd' group and then keeps a
>>> pipe open to the real jackd process which it forks off, deprivileges
>>> and exec()s? Then unprivileged jackd could request RT-priority changes
>>> via that pipe in a straightforward way. Jack normally gets installed as
>>> root/admin anyway, so it's not like this couldn't be done.
>>
>> Perhaps.
>>
>> Until recently, that didn't work because of the longstanding rlimits
>> bug in mlockall().  For scheduling only, it might be possible.
>>
>> Of course, this violates your requirement that the user not be able to
>> lock up the CPU for DoS.  The jackd watchdog is not perfect.
>
> there is a legitimate fear that if it's made "too easy" to acquire some
> sort of SCHED_FIFO priority, that an "arms race" would begin between
> desktop apps, each trying to set themselves to SCHED_FIFO (or SCHED_ISO)
> and advising users to 'raise the limit if they see delays' - just to get
> snappier than the rest.
>
> thus after a couple of years we'd end up with lots of desktop apps
> running as SCHED_FIFO, and latency would go down the drain again.
> (yeah, this feels like going back to the drawing board.)

I think part of the problem here is that by comparing each task's limit 
to the runqueue's usage rate (and to some extent using a relatively 
short decay period) you're creating the need for the limits to be quite 
large i.e. it has to be big enough to be bigger than the combined usage 
rates of all the unprivileged real time tasks and also to handle the 
short term usage rate peaks of the task.

If the average usage rate is estimated over longer periods it will be 
lower allowing lower limits to be used.  Also if the task's own usage 
rate estimates are used to test the limits then the limit can be lower.

If the default limits can be made sufficiently small then the temptation 
to use this feature by "ordinary" applications will disappear.

I'm not an expert but I imagine that the CPU usage rates of most RT 
tasks taken over reasonably long time intervals are quite low and 
therefore the default limits could also be quite low without adversely 
affecting the programs that this mechanism is meant to help.

The sched_cpustats.[ch] files that are part of my SPA scheduler patches 
provide a cheap method of estimating per task usage rates.  They 
estimate usage rates for a task over its recent scheduling cycles but 
could be modified to provide updates every tick for the currently active 
task for use with this mechanism.
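
The effect of the averaging period can be illustrated with a small sketch. This is not the sched_cpustats code; `decay_avg`, `run_ticks` and `over_limit` are hypothetical helpers, usage is expressed in parts per thousand of one CPU, and a simple integer decaying average is assumed.

```c
/*
 * One step of a decaying average: move 'avg' towards 'sample' by
 * 1/'weight' of the difference. A larger weight corresponds to a
 * longer effective averaging period.
 */
static int decay_avg(int avg, int sample, int weight)
{
	return avg + (sample - avg) / weight;
}

/* Feed 'ticks' identical samples into the average, one per tick. */
static int run_ticks(int avg, int sample, int weight, int ticks)
{
	while (ticks-- > 0)
		avg = decay_avg(avg, sample, weight);
	return avg;
}

/* Non-zero if the estimated usage exceeds the task's limit. */
static int over_limit(int avg, int limit)
{
	return avg > limit;
}
```

For a task that is idle and then runs flat out for two ticks, a weight of 4 gives an estimate of 437 while a weight of 64 gives 30, so a limit of 100 (10% of a CPU) trips with the short period but not with the long one. That is the sense in which a longer period would allow lower default limits.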

Peter
--
Peter Williams   [EMAIL PROTECTED]
"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread Lorenzo Hernández García-Hierro

El vie, 28-01-2005 a las 21:47 +0100, Arjan van de Ven escribió:
> as for obsd_get_random_long().. would it be possible to use the
> get_random_int() function from the patches I posted the other day? They
> use the existing random.c infrastructure instead of making a copy...

As seen at
http://www.kernel.org/pub/linux/kernel/people/arjan/execshield/00-randomize-A0 
you can suppose that there's no point to use that, we can easily maintain the 
functions at obsd_rand.c so we wouldn't need to add more maintenance overhead, 
I hope you can understand why I want it like that and not depending on random.c 
in more than the function exports (which make it even more independent as we 
don't need to use our proper header and add each proper include entry in the 
modified files, as most of them use or have already random.h included).

Attached you can find the new patch with the indentation fixes.

The tests on the patch are the following ones:
http://www.osdl.org/plm-cgi/plm?module=patch_info_id=4136
(above one shows that there are no SMP-related issues)
http://khack.osdl.org/stp/300417
http://khack.osdl.org/stp/300420

Cheers and thanks for the information,
-- 
Lorenzo Hernández García-Hierro <[EMAIL PROTECTED]> 
[1024D/6F2B2DEC] & [2048g/9AE91A22][http://tuxedo-es.org]
diff -Nur linux-2.6.11-rc2/include/linux/random.h linux-2.6.11-rc2.tx1/include/linux/random.h
--- linux-2.6.11-rc2/include/linux/random.h	2005-01-26 19:54:17.0 +0100
+++ linux-2.6.11-rc2.tx1/include/linux/random.h	2005-01-28 19:45:31.359923392 +0100
@@ -42,6 +42,12 @@
 
 #ifdef __KERNEL__
 
+/* OpenBSD Networking-related randomization functions - [EMAIL PROTECTED] */
+extern unsigned long obsd_get_random_long(void);
+extern __u16 ip_randomid(void);
+extern __u32 ip_randomisn(void);
+
+
 extern void rand_initialize_irq(int irq);
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
diff -Nur linux-2.6.11-rc2/net/ipv4/tcp_ipv4.c linux-2.6.11-rc2.tx1/net/ipv4/tcp_ipv4.c
--- linux-2.6.11-rc2/net/ipv4/tcp_ipv4.c	2005-01-26 19:54:19.0 +0100
+++ linux-2.6.11-rc2.tx1/net/ipv4/tcp_ipv4.c	2005-01-28 22:28:24.991105608 +0100
@@ -539,10 +539,7 @@
 
 static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct sk_buff *skb)
 {
-	return secure_tcp_sequence_number(skb->nh.iph->daddr,
-	  skb->nh.iph->saddr,
-	  skb->h.th->dest,
-	  skb->h.th->source);
+	return ip_randomisn();
 }
 
 /* called with local bh disabled */
@@ -834,13 +830,9 @@
 	tp->ext2_header_len = rt->u.dst.header_len;
 
 	if (!tp->write_seq)
-		tp->write_seq = secure_tcp_sequence_number(inet->saddr,
-			   inet->daddr,
-			   inet->sport,
-			   usin->sin_port);
-
-	inet->id = tp->write_seq ^ jiffies;
+		tp->write_seq = ip_randomisn();
 
+	inet->id = htons(ip_randomid());
 	err = tcp_connect(sk);
 	rt = NULL;
 	if (err)
@@ -1566,20 +1555,20 @@
 	newsk->sk_dst_cache = dst;
 	tcp_v4_setup_caps(newsk, dst);
 
-	newtp		  = tcp_sk(newsk);
-	newinet		  = inet_sk(newsk);
-	newinet->daddr	  = req->af.v4_req.rmt_addr;
-	newinet->rcv_saddr= req->af.v4_req.loc_addr;
-	newinet->saddr	  = req->af.v4_req.loc_addr;
-	newinet->opt	  = req->af.v4_req.opt;
-	req->af.v4_req.opt= NULL;
-	newinet->mc_index = tcp_v4_iif(skb);
-	newinet->mc_ttl	  = skb->nh.iph->ttl;
+	newtp = tcp_sk(newsk);
+	newinet = inet_sk(newsk);
+	newinet->daddr = req->af.v4_req.rmt_addr;
+	newinet->rcv_saddr = req->af.v4_req.loc_addr;
+	newinet->saddr = req->af.v4_req.loc_addr;
+	newinet->opt = req->af.v4_req.opt;
+	req->af.v4_req.opt = NULL;
+	newinet->mc_index = tcp_v4_iif(skb);
+	newinet->mc_ttl = skb->nh.iph->ttl;
 	newtp->ext_header_len = 0;
 	if (newinet->opt)
 		newtp->ext_header_len = newinet->opt->optlen;
 	newtp->ext2_header_len = dst->header_len;
-	newinet->id = newtp->write_seq ^ jiffies;
+	newinet->id = htons(ip_randomid());
 
 	tcp_sync_mss(newsk, dst_pmtu(dst));
 	newtp->advmss = dst_metric(dst, RTAX_ADVMSS);

diff -Nur linux-2.6.11-rc2/net/Makefile linux-2.6.11-rc2.tx1/net/Makefile
--- linux-2.6.11-rc2/net/Makefile	2005-01-26 19:50:49.0 +0100
+++ linux-2.6.11-rc2.tx1/net/Makefile	2005-01-28 21:01:21.870140688 +0100
@@ -11,6 +11,7 @@
 
 tmp-$(CONFIG_COMPAT) 		:= compat.o
 obj-$(CONFIG_NET)		+= $(tmp-y)
+obj-y+= obsd_rand.o
 
 # LLC has to be linked before the files in net/802/
 obj-$(CONFIG_LLC)		+= llc/
diff -Nur linux-2.6.11-rc2/net/obsd_rand.c linux-2.6.11-rc2.tx1/net/obsd_rand.c
--- linux-2.6.11-rc2/net/obsd_rand.c	1970-01-01 01:00:00.0 +0100
+++ linux-2.6.11-rc2.tx1/net/obsd_rand.c	2005-01-28 17:43:50.0 +0100
@@ -0,0 +1,269 @@
+/* $Id: openbsd-netrand-2.6.11-rc2.patch,v 1.6 2005/01/28 22:10:30 lorenzo Exp $
+ * Copyright (c) 2005 Lorenzo Hernandez Garcia-Hierro <[EMAIL PROTECTED]>.
+ * All rights reserved.
+ *
+ * Added some macros and stolen code from random.c, for individual and less
+ * "invasive" implementation.Also removed the get_random_long() macro definition,
+ * which is not good if we can

Re: Possible bug in keyboard.c (2.6.10)

2005-01-28 Thread Andries Brouwer

On Fri, Jan 28, 2005 at 12:10:05PM +0100, Vojtech Pavlik wrote:

> And, btw, raw mode in 2.6 is not badly broken. It works as it is
> intended to. If you want the 2.4 behavior on x86, you just need to
> specify "atkbd.softraw=0" on the kernel command line.

Thanks for pointing that out - I should have read patch-2.6.9 more
carefully. I'll add that to the setkeycodes.8 man page.

Nevertheless I disagree a bit. "raw mode" is by definition the mode
where scan codes are passed unmodified to user space.
So before 2.6.9 this was just broken, and since 2.6.9 it is broken
by default but there is a boot option to make it work.

What is the reason that you do not make this the default?
The current default is really messy and confusing, especially
when people have to map keys using setkeycodes.

Andries

BTW, now that I read the corresponding code:

if (atkbd_softrepeat)
atkbd_softraw = 1;

if (!atkbd_softrepeat) {
atkbd->dev.rep[REP_DELAY] = 250;
atkbd->dev.rep[REP_PERIOD] = 33;
} else atkbd_softraw = 1;

The "else" part is superfluous.
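
A quick way to check that claim is to run both variants over every input. The sketch below uses a hypothetical `toy_atkbd` struct in place of the driver state, and assumes nothing between the two quoted fragments changes atkbd_softrepeat.

```c
/* Hypothetical stand-in for the module parameters and repeat settings. */
struct toy_atkbd {
	int softrepeat;
	int softraw;
	int rep_delay;
	int rep_period;
};

/* The logic as quoted: softraw is forced in two places. */
static void setup_quoted(struct toy_atkbd *k)
{
	if (k->softrepeat)
		k->softraw = 1;

	if (!k->softrepeat) {
		k->rep_delay = 250;
		k->rep_period = 33;
	} else
		k->softraw = 1;
}

/* Same logic with the trailing else removed. */
static void setup_trimmed(struct toy_atkbd *k)
{
	if (k->softrepeat)
		k->softraw = 1;

	if (!k->softrepeat) {
		k->rep_delay = 250;
		k->rep_period = 33;
	}
}

/* Returns 1 if both variants leave identical state for these inputs. */
static int same_result(int softrepeat, int softraw)
{
	struct toy_atkbd a = { softrepeat, softraw, 0, 0 };
	struct toy_atkbd b = a;

	setup_quoted(&a);
	setup_trimmed(&b);
	return a.softraw == b.softraw &&
	       a.rep_delay == b.rep_delay &&
	       a.rep_period == b.rep_period;
}
```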

[WATCHDOG] 2.6.11-rc2 i8xx_tco.c-ICH4/6/7-patch

2005-01-28 Thread Wim Van Sebroeck

Hi Linus, Andrew,

please do a

bk pull http://linux-watchdog.bkbits.net/linux-2.6-watchdog

This will update the following files:

 drivers/char/watchdog/i8xx_tco.c |   34 +++---
 1 files changed, 27 insertions(+), 7 deletions(-)

through these ChangeSets:

<[EMAIL PROTECTED]> (05/01/28 1.1984)
   [WATCHDOG] i8xx_tco.c-ICH4/6/7-patch
   
   Added support for the ICH4-M, ICH6, ICH6R, ICH6-M, ICH6W and ICH6RW
   chipsets. Also added support for the "undocumented" ICH7.


The ChangeSets can also be looked at on:
http://linux-watchdog.bkbits.net:8080/linux-2.6-watchdog

For completeness, I added the patches below.

Greetings,
Wim.


diff -Nru a/drivers/char/watchdog/i8xx_tco.c b/drivers/char/watchdog/i8xx_tco.c
--- a/drivers/char/watchdog/i8xx_tco.c  2005-01-28 22:51:31 +01:00
+++ b/drivers/char/watchdog/i8xx_tco.c  2005-01-28 22:51:31 +01:00
@@ -1,5 +1,5 @@
 /*
- * i8xx_tco 0.06:  TCO timer driver for i8xx chipsets
+ * i8xx_tco 0.07:  TCO timer driver for i8xx chipsets
  *
  * (c) Copyright 2000 kernel concepts <[EMAIL PROTECTED]>, All Rights Reserved.
  * http://www.kernelconcepts.de
@@ -22,11 +22,22 @@
  *
  * The TCO timer is implemented in the following I/O controller hubs:
  * (See the intel documentation on http://developer.intel.com.)
- * 82801AA & 82801AB  chip : document number 290655-003, 290677-004,
- * 82801BA & 82801BAM chip : document number 290687-002, 298242-005,
- * 82801CA & 82801CAM chip : document number 290716-001, 290718-001,
- * 82801DB & 82801E   chip : document number 290744-001, 273599-001,
- * 82801EB & 82801ER  chip : document number 252516-001
+ * 82801AA  (ICH)    : document number 290655-003, 290677-014,
+ * 82801AB  (ICHO)   : document number 290655-003, 290677-014,
+ * 82801BA  (ICH2)   : document number 290687-002, 298242-027,
+ * 82801BAM (ICH2-M) : document number 290687-002, 298242-027,
+ * 82801CA  (ICH3-S) : document number 290733-003, 290739-013,
+ * 82801CAM (ICH3-M) : document number 290716-001, 290718-007,
+ * 82801DB  (ICH4)   : document number 290744-001, 290745-020,
+ * 82801DBM (ICH4-M) : document number 252337-001, 252663-005,
+ * 82801E   (C-ICH)  : document number 273599-001, 273645-002,
+ * 82801EB  (ICH5)   : document number 252516-001, 252517-003,
+ * 82801ER  (ICH5R)  : document number 252516-001, 252517-003,
+ * 82801FB  (ICH6)   : document number 301473-002, 301474-007,
+ * 82801FR  (ICH6R)  : document number 301473-002, 301474-007,
+ * 82801FBM (ICH6-M) : document number 301473-002, 301474-007,
+ * 82801FW  (ICH6W)  : document number 301473-001, 301474-007,
+ * 82801FRW (ICH6RW) : document number 301473-001, 301474-007
  *
  *  2710 Nils Faerber
  * Initial Version 0.01
@@ -49,6 +60,9 @@
  *  20030921 Wim Van Sebroeck <[EMAIL PROTECTED]>
  * 0.06 change i810_margin to heartbeat, use module_param,
  *  added notify system support, renamed module to i8xx_tco.
+ *  20050128 Wim Van Sebroeck <[EMAIL PROTECTED]>
+ * 0.07 Added support for the ICH4-M, ICH6, ICH6R, ICH6-M, ICH6W and ICH6RW
+ *  chipsets. Also added support for the "undocumented" ICH7 chipset.
  */
 
 /*
@@ -73,7 +87,7 @@
 #include "i8xx_tco.h"
 
 /* Module and version information */
-#define TCO_VERSION "0.06"
+#define TCO_VERSION "0.07"
 #define TCO_MODULE_NAME "i8xx TCO timer"
 #define TCO_DRIVER_NAME   TCO_MODULE_NAME ", v" TCO_VERSION
 #define PFX TCO_MODULE_NAME ": "
@@ -360,8 +374,14 @@
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_0,   PCI_ANY_ID, 
PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801CA_12,  PCI_ANY_ID, 
PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_0,   PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801DB_12,  PCI_ANY_ID, 
PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801E_0,PCI_ANY_ID, 
PCI_ANY_ID, },
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801EB_0,   PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_0,  PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1,  PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_2,  PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0,  PCI_ANY_ID, 
PCI_ANY_ID, },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1,  PCI_ANY_ID, 
PCI_ANY_ID, },
{ 0, }, /* End of list */
 };
 MODULE_DEVICE_TABLE (pci, i8xx_tco_pci_tbl);

Re: [2.6.11-rc2] kernel BUG at fs/reiserfs/prints.c:362

2005-01-28 Thread Lee Revell

On Thu, 2005-01-27 at 17:15 +0300, Vladimir Saveliev wrote:
> Earlier reiserfs used to lock_kernel on entering and unlock on exit. The
> reason is that reiserfs has no fine grain locking protecting access to
> its data structures.
> Since that time there could be introduced some minor improvements,
> though.

No, reiser3 still does not have proper locking.  It uses the BKL for
everything.  This will not be fixed as reiser3 is in maintenance mode.
According to Hans "the fix is reiser4".

This came up early in the voluntary preemption development process, we
found reiser3 to be unusable for low latency audio due to the excessive
BKL use disabling preemption all over the place.

It would be interesting to test reiser3 with the preemptible BKL
enabled.

Lee  


Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread David S. Miller

On Fri, 28 Jan 2005 13:34:08 -0800
Stephen Hemminger <[EMAIL PROTECTED]> wrote:

> per-cpu would be the way to go here.

Does the sbox get somehow seeded from use to use?
If not, then yes that's the thing to do.

[PATCH 1/1] tpm: insert missing up mutex in an error path

2005-01-28 Thread Kylene Hall

This patch puts in the missing up call on the tpm_mutex on an 
error condition in the tpm_transmit function.  Bug reported by Stefan 
Berger <[EMAIL PROTECTED]>.  This patch also implements a new status 
function to handle future chip configurations which may generate status 
differently.
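
The two changes can be sketched together in user-space C. Everything here is a stand-in: `toy_mutex`, `toy_chip`, `toy_vendor` and `toy_transmit` are hypothetical, the lock is just a counter, and the sketch funnels the error path through a single exit label, whereas the patch itself adds an up() call before the early return.

```c
/* Hypothetical stand-in for a kernel semaphore; it only counts holders. */
struct toy_mutex {
	int held;
};

static void toy_down(struct toy_mutex *m) { m->held++; }
static void toy_up(struct toy_mutex *m)   { m->held--; }

struct toy_chip;

/* Per-vendor hooks, loosely modelled on tpm_vendor_specific. */
struct toy_vendor {
	int (*send)(struct toy_chip *chip, int len);
	unsigned char (*status)(struct toy_chip *chip);
	unsigned char req_complete_mask;
	unsigned char req_complete_val;
};

struct toy_chip {
	struct toy_mutex lock;
	const struct toy_vendor *vendor;
	unsigned char fake_status;   /* what the pretend hardware reports */
};

static int toy_send_ok(struct toy_chip *chip, int len)
{
	(void)chip;
	return len;
}

static int toy_send_fail(struct toy_chip *chip, int len)
{
	(void)chip;
	(void)len;
	return -5;
}

/* Vendor hook replacing a hard-coded port read in the common code. */
static unsigned char toy_read_status(struct toy_chip *chip)
{
	return chip->fake_status;
}

static int toy_transmit(struct toy_chip *chip, int len)
{
	unsigned char status;
	int rc;

	toy_down(&chip->lock);

	rc = chip->vendor->send(chip, len);
	if (rc < 0)
		goto out;   /* the error path must still release the lock */

	status = chip->vendor->status(chip);
	if ((status & chip->vendor->req_complete_mask) !=
	    chip->vendor->req_complete_val)
		rc = -1;    /* not complete; a real driver would keep polling */
out:
	toy_up(&chip->lock);
	return rc;
}
```

After a failed send the holder count is back at zero, which is the property the missing up() broke; the status hook lets a different chip supply its own way of reading status without touching the common path.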

Thanks,
Kylie
  
Signed-off-by: Kylene Hall <[EMAIL PROTECTED]>
---
diff -uprN linux-2.6.10/drivers/char/tpm/tpm_atmel.c linux-2.6.10-tpm/drivers/char/tpm/tpm_atmel.c
--- linux-2.6.10/drivers/char/tpm/tpm_atmel.c   2005-01-18 16:42:17.0 -0600
+++ linux-2.6.10-tpm/drivers/char/tpm/tpm_atmel.c   2005-01-21 13:11:11.0 -0600
@@ -112,6 +112,11 @@ static void tpm_atml_cancel(struct tpm_c
outb(ATML_STATUS_ABORT, chip->vendor->base + 1);
 }
 
+static u8 tpm_atml_status(struct tpm_chip *chip)
+{
+   return inb( chip->vendor->base + 1);
+}
+
 static struct file_operations atmel_ops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
@@ -125,6 +130,7 @@ static struct tpm_vendor_specific tpm_at
.recv = tpm_atml_recv,
.send = tpm_atml_send,
.cancel = tpm_atml_cancel,
+   .status = tpm_atml_status,
.req_complete_mask = ATML_STATUS_BUSY | ATML_STATUS_DATA_AVAIL,
.req_complete_val = ATML_STATUS_DATA_AVAIL,
.base = TPM_ATML_BASE,
diff -uprN linux-2.6.10/drivers/char/tpm/tpm.c 
linux-2.6.10-tpm/drivers/char/tpm/tpm.c
--- linux-2.6.10/drivers/char/tpm/tpm.c 2005-01-21 12:53:26.0 -0600
+++ linux-2.6.10-tpm/drivers/char/tpm/tpm.c 2005-01-28 16:28:45.578493680 
-0600
@@ -152,6 +151,7 @@ static ssize_t tpm_transmit(struct tpm_c
if ((len = chip->vendor->send(chip, (u8 *) buf, count)) < 0) {
dev_err(&chip->pci_dev->dev,
"tpm_transmit: tpm_send: error %d\n", len);
+   up(&chip->tpm_mutex);
return len;
}
 
@@ -165,7 +165,7 @@ static ssize_t tpm_transmit(struct tpm_c
up(&chip->timer_manipulation_mutex);
 
do {
-   u8 status = inb(chip->vendor->base + 1);
+   u8 status = chip->vendor->status(chip);
if ((status & chip->vendor->req_complete_mask) ==
chip->vendor->req_complete_val) {
down(&chip->timer_manipulation_mutex);
diff -uprN linux-2.6.10/drivers/char/tpm/tpm.h 
linux-2.6.10-tpm/drivers/char/tpm/tpm.h
--- linux-2.6.10/drivers/char/tpm/tpm.h 2005-01-18 16:42:17.0 -0600
+++ linux-2.6.10-tpm/drivers/char/tpm/tpm.h 2005-01-21 13:10:20.0 
-0600
@@ -40,6 +40,7 @@ struct tpm_vendor_specific {
int (*recv) (struct tpm_chip *, u8 *, size_t);
int (*send) (struct tpm_chip *, u8 *, size_t);
void (*cancel) (struct tpm_chip *);
+   u8 (*status) (struct tpm_chip *);
struct miscdevice miscdev;
 };
 
diff -uprN linux-2.6.10/drivers/char/tpm/tpm_nsc.c 
linux-2.6.10-tpm/drivers/char/tpm/tpm_nsc.c
--- linux-2.6.10/drivers/char/tpm/tpm_nsc.c 2005-01-18 16:42:17.0 
-0600
+++ linux-2.6.10-tpm/drivers/char/tpm/tpm_nsc.c 2005-01-21 13:12:27.0 
-0600
@@ -219,6 +219,12 @@ static void tpm_nsc_cancel(struct tpm_ch
outb(NSC_COMMAND_CANCEL, chip->vendor->base + NSC_COMMAND);
 }
 
+
+static u8 tpm_nsc_status(struct tpm_chip *chip)
+{
+   return inb(chip->vendor->base + NSC_STATUS);
+}
+
 static struct file_operations nsc_ops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
@@ -232,6 +238,7 @@ static struct tpm_vendor_specific tpm_ns
.recv = tpm_nsc_recv,
.send = tpm_nsc_send,
.cancel = tpm_nsc_cancel,
+   .status = tpm_nsc_status,
.req_complete_mask = NSC_STATUS_OBF,
.req_complete_val = NSC_STATUS_OBF,
.base = TPM_NSC_BASE,
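
For readers following along outside the kernel tree, the two ideas in this
patch (release the lock on the early-return path, and read status through a
per-vendor hook) can be sketched in plain userspace C.  This is only an
illustration under assumptions: toy_mutex, toy_chip, toy_transmit and the
toy_send_*/toy_status_ready helpers are hypothetical stand-ins, not the
driver's real types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel semaphore: just tracks "held". */
struct toy_mutex {
	int held;
};

static void toy_down(struct toy_mutex *m)
{
	assert(!m->held);	/* a second down() would deadlock in real code */
	m->held = 1;
}

static void toy_up(struct toy_mutex *m)
{
	assert(m->held);
	m->held = 0;
}

/* Hypothetical chip with per-vendor send and status hooks. */
struct toy_chip {
	struct toy_mutex lock;
	int (*send)(struct toy_chip *chip, const unsigned char *buf, size_t len);
	unsigned char (*status)(struct toy_chip *chip);
};

/* Returns the negative send() error, or the status byte on success. */
static int toy_transmit(struct toy_chip *chip, const unsigned char *buf,
			size_t len)
{
	int rc;

	toy_down(&chip->lock);

	rc = chip->send(chip, buf, len);
	if (rc < 0) {
		toy_up(&chip->lock);	/* the release the patch adds */
		return rc;
	}

	rc = chip->status(chip);	/* vendor hook instead of a fixed port read */
	toy_up(&chip->lock);
	return rc;
}

/* Example vendor hooks, purely for demonstration. */
static int toy_send_fail(struct toy_chip *chip, const unsigned char *buf,
			 size_t len)
{
	(void)chip; (void)buf; (void)len;
	return -5;
}

static int toy_send_ok(struct toy_chip *chip, const unsigned char *buf,
		       size_t len)
{
	(void)chip; (void)buf;
	return (int)len;
}

static unsigned char toy_status_ready(struct toy_chip *chip)
{
	(void)chip;
	return 0x08;
}
```

Without the toy_up() in the error branch, the next caller of toy_transmit()
would trip the assertion in toy_down(), which is the userspace analogue of
the hang the patch fixes.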

Re: Why does the kernel need a gig of VM?

2005-01-28 Thread Oliver Neukum

Am Freitag, 28. Januar 2005 21:42 schrieb Josh Boyer:
> Because of various reasons.  Normal kernel space virtual addresses
> usually start at 0xc0000000, which is where the 3GiB userspace
> restriction comes from.  
> 
> Then there is the vmalloc virtual address space, which usually starts at
> a higher address than a normal kernel address.  Along the same lines are
> ioremap addresses, etc.
> 
> Poke around in the header files.  I bet you'll find lots of reasons.

This is probably a FAQ, but anyway: the kernel needs physical memory
present and accessible all the time from all contexts. This is mapped into
this area. All other RAM is called High Mem and needs to be specifically
mapped before it can be used from kernel space.

Regards
Oliver
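
The arithmetic behind the split can be sketched in a few lines of userspace
C.  This is a rough illustration only, assuming the default 32-bit x86
layout with the kernel starting at 0xc0000000; the TOY_ and toy_ names are
made up for this example and are not kernel symbols.  Note that the directly
mapped part is smaller than the full gigabyte, since the vmalloc and ioremap
ranges mentioned earlier in the thread share the same window.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed start of kernel virtual addresses on 32-bit x86 (3G/1G split). */
#define TOY_PAGE_OFFSET 0xC0000000u

/* Bytes of virtual address space below the split (userspace). */
static uint64_t toy_user_bytes(uint32_t page_offset)
{
	return page_offset;
}

/* Bytes of virtual address space from the split up to 4 GiB (kernel). */
static uint64_t toy_kernel_bytes(uint32_t page_offset)
{
	return (UINT64_C(1) << 32) - page_offset;
}

/* Nonzero if the address lies in the kernel part of the address space. */
static int toy_is_kernel_addr(uint32_t addr, uint32_t page_offset)
{
	return addr >= page_offset;
}

/*
 * For directly mapped low memory, the physical address is the virtual
 * address minus the split offset.  High memory has no such fixed mapping.
 */
static uint32_t toy_virt_to_phys(uint32_t vaddr, uint32_t page_offset)
{
	assert(vaddr >= page_offset);
	return vaddr - page_offset;
}
```

Moving the split (for example to a 3.5/0.5 layout) would only change the
offset passed in; the trade-off is less directly mapped memory for the
kernel in exchange for more user address space.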

Re: userspace vs. kernelspace address

2005-01-28 Thread Rock Gordon

Hi everybody,

Thanks for your replies.

Let me explain my problem a little more. I have
a thread that does exactly the same things in
kernel mode and user mode (depending on how you
invoked it; of course, the kernel one is forked using
kernel_thread(), and the user one is from
pthread_create()). The architecture-dependent stuff is
taken care of by extensive use of __KERNEL__ macro
testing.

This particular thread gets a packet of data, the
header of which contains address to where it should be
copying the payload associated with that packet. The
kernel-mode thread will need to decide how to copy
data into another process' address space, so will the
user-mode thread.

However I think my copy_to_user and copy_from_user are
failing since the kernel-mode thread is copying data
into another process's address space, and I am not
sure how to do this. Do the get_fs() and set_fs()
combinations let you do that? If not, then how do I do
it?

Something like when you invoke the ->write or ->read
functions, you need to copy the requisite data into
the buffer the application provided you with.

Thanks and regards,
Rock

--- Jan Hudec <[EMAIL PROTECTED]> wrote:

> On Fri, Jan 28, 2005 at 01:06:21 +0100, Bernd Petrovitsch wrote:
> > On Thu, 2005-01-27 at 09:14 -0800, Rock Gordon wrote:
> > > If I'm given a particular address, how do I test
> > > whether that address is from userspace or from kernel
> > > space?
> > 
> > You don't.
> > 
> > > I need to make these decisions from either inside a
> > > kernel module or a userspace program. The idea is I
> > > use memcpy() in the user-user version,
> > > copy_from/to_user in the kernel-kernel version, and
> > > prohibit the others.
> > 
> > You need to know where the address is from and use the
> > correct function.
> 
> If the interface is defined as taking a userland address, then a
> kernel function passing a kernel address in is responsible for
> calling set_fs(KERNEL_DS) before and undoing it after. That way
> copy_to/from_user does not complain.
> 
> ---
> Jan 'Bulb' Hudec <[EMAIL PROTECTED]>


Re: [BUG] 2.6.11-rc2 ALSA

2005-01-28 Thread Lee Revell

On Thu, 2005-01-27 at 08:46 +0100, Jaroslav Kysela wrote:
> Fixed the default state of "Headphone Jack Sense" switch on AD1981x
> codecs.  Setting this on affects the output of some machines (e.g.
> Thindpads).

You probably meant "Thimkpads".

Lee


Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread Stephen Hemminger

On Fri, 28 Jan 2005 12:45:17 -0800
"David S. Miller" <[EMAIL PROTECTED]> wrote:

> On Fri, 28 Jan 2005 21:34:52 +0100
> Lorenzo Hernández García-Hierro <[EMAIL PROTECTED]> wrote:
> 
> > Attached the new patch following Arjan's recommendations.
> 
> No SMP protection on the SBOX, better look into that.
> The locking you'll likely need to add will make this
> routine serialize many networking operations which is
> one thing we've been trying to avoid.
> 

per-cpu would be the way to go here.

-- 
Stephen Hemminger   <[EMAIL PROTECTED]>

[PATCH] Fix compile errors with 2.6.11-rc2

2005-01-28 Thread Manish Lachwani

Hi !

When compiling 2.6.11-rc2:

...

  CC  kernel/stop_machine.o
In file included from include/linux/sysdev.h:24,
 from include/linux/cpu.h:22,
 from include/linux/stop_machine.h:8,
 from kernel/stop_machine.c:1:
include/linux/kobject.h: In function `to_kset':
include/linux/kobject.h:116: warning: implicit declaration of function `container_of'
include/linux/kobject.h:116: error: parse error before "struct"
include/linux/kobject.h:117: warning: no return statement in function returning non-void
include/linux/kobject.h: In function `subsys_get':
include/linux/kobject.h:224: error: parse error before "struct"
include/linux/kobject.h:225: warning: no return statement in function returning non-void
make[1]: *** [kernel/stop_machine.o] Error 1
make: *** [kernel] Error 2

Attached patch fixes this.

Thanks
Manish Lachwani
Signed-off-by: Manish Lachwani <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2/include/linux/kobject.h
===
--- linux-2.6.11-rc2.orig/include/linux/kobject.h
+++ linux-2.6.11-rc2/include/linux/kobject.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define KOBJ_NAME_LEN  20

panic in raid1_end_write_request

2005-01-28 Thread Norman Gaywood

I have a Dell PE2650, Dual Xeon, 1G memory and several software raid1
partitions, ext3. Main duties include NFS, DHCP and samba. A Fedora
kernel 2.6.10-1.747_FC3smp which includes 2.6.10-ac10.

This system panics frequently, between several hours to several days. It
does not seem to be related to load. Hardware and memory tests indicate
a good system.

Panic messages are similar to:

Unable to handle kernel NULL pointer dereference at virtual address 0038
 printing eip:
f882940f
*pde = 379c9001
Oops:  [#1]
SMP 
Modules linked in: iptable_filter ip_tables nfsd exportfs md5 ipv6 parport_pc 
lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc microcode dm_mod video 
button battery ac cfi_probe gen_probe scb2_flash mtdcore chipreg map_funcs tg3 
floppy sg ext3 jbd raid1 aic7xxx sd_mod scsi_mod
CPU:3
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.10-1.747_FC3smp) 
EIP is at raid1_end_write_request+0x8e/0xb2 [raid1]
eax:    ebx: f7dda400   ecx: f79e78a0   edx: 
esi: 0018   edi: f7dd6e00   ebp: f7dda400   esp: c03aef18
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03ae000 task=f7f5fa40)
Stack: f7fbd100 1000 f8829381  c01564ce 1000 f7fbd100  
   c03aef60 c0217b6f f7bcca24    1000 f7bcca24 
   f7d4b33c f78f4080 0001 f88435ec 0001 e4d10b80 f7bcca24 f78f4080 
Call Trace:
 [] raid1_end_write_request+0x0/0xb2 [raid1]
 [] bio_endio+0x50/0x55
 [] __end_that_request_first+0xea/0x1ab
 [] scsi_end_request+0x1b/0x9d [scsi_mod]
 [] scsi_io_completion+0x206/0x40f [scsi_mod]
 [] __wake_up+0x29/0x3c
 [] scsi_finish_command+0xad/0xb1 [scsi_mod]
 [] scsi_softirq+0xb6/0xbe [scsi_mod]
 [] __do_softirq+0x4c/0xb1
 [] do_softirq+0x41/0x48
 ===
 [] do_IRQ+0x74/0x7e
 [] common_interrupt+0x1a/0x20
 [] default_idle+0x0/0x2f
 [] xfrm_sk_policy_lookup+0x2cd/0x355
 [] default_idle+0x29/0x2f
 [] cpu_idle+0x26/0x3b
Code: 53 08 89 44 0e 04 89 54 0e 08 f0 ff 0b 0f 94 c0 84 c0 74 0f 8b 43 14 e8 
bf 5f a3 c7 89 d8 e8 15 fe ff ff 8b 47 04 8b 1f 8b 04 06 <8b> 48 38 f0 ff 48 48 
0f 94 c2 84 d2 74 0d 85 c9 74 09 f0 0f ba 
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 

-- 
Norman Gaywood, Systems Administrator
School of Mathematics, Statistics and Computer Science
University of New England, Armidale, NSW 2351, Australia

[EMAIL PROTECTED]                 Phone: +61 (0)2 6773 2412
http://turing.une.edu.au/~norm    Fax:   +61 (0)2 6773 3312

Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html

Re: Real-time rw-locks (Re: [patch] Real-Time Preemption, -RT-2.6.10-rc2-mm3-V0.7.32-15)

2005-01-28 Thread Lee Revell

On Fri, 2005-01-28 at 11:18 -0800, Trond Myklebust wrote:
> In the NFS client code we may use rwsems in order to protect stateful
> operations against the (very infrequently used) server reboot recovery
> code. The point is that when the server reboots, the server forces us to
> block *all* requests that involve adding new state (e.g. opening an
> NFSv4 file, or setting up a lock) while our client and others are
> re-establishing their existing state on the server.

Hmm, when I was an ISP sysadmin I used to use this all the time.  NFS
mounts from the BSD/OS clients would start to act up under heavy web
server load and the cleanest way to get them to recover was to simulate
a reboot on the NetApp.  Of course Linux clients were unaffected, they
were just along for the ride ;-)

Lee


Re: journaled filesystems -- known instability; Was: XFS: inode with st_mode == 0

2005-01-28 Thread Jeffrey E. Hundstad

Stephen C. Tweedie wrote:

> Hi,
>
> On Fri, 2005-01-28 at 20:15, Jeffrey E. Hundstad wrote:
>
> > > > Does linux-2.6.11-rc2 have both the linux-2.6.10-ac10 fix and the xattr
> > > > problem fixed?
> >
> > > Not sure about how much of -ac went in, but it has the xattr fix.
> >
> > I've had my machine that would crash daily if not hourly stay up for 10
> > days now.  This is with the linux-2.6.10-ac10 kernel.
>
> Good to know.  Are you using xattrs extensively (eg. for ACLs, SELinux
> or Samba 4)?
>
> --Stephen

On the machines that were having problems we really weren't using them 
for anything.  I think I may have been running into the BIO problem that 
was fixed in 2.6.10-ac10.


Re: journaled filesystems -- known instability; Was: XFS: inode with st_mode == 0

2005-01-28 Thread Stephen C. Tweedie

Hi,

On Fri, 2005-01-28 at 20:15, Jeffrey E. Hundstad wrote:

> >>Does linux-2.6.11-rc2 have both the linux-2.6.10-ac10 fix and the xattr 
> >>problem fixed?

> >Not sure about how much of -ac went in, but it has the xattr fix.

> I've had my machine that would crash daily if not hourly stay up for 10 
> days now.  This is with the linux-2.6.10-ac10 kernel. 

Good to know.  Are you using xattrs extensively (eg. for ACLs, SELinux
or Samba 4)?

--Stephen


Re: Why does the kernel need a gig of VM?

2005-01-28 Thread John Richard Moser

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Wow.

I'd heard that there was a way to set 3.5/0.5 GiB split, and that there
was a patch that removed the split and isolated the kernel (but that was
slow), so I was just curious about all this stuff with people screaming
about how tight 4G of VM is vs a half gig or a gig that can be freed up.

Josh Boyer wrote:
> On Fri, 2005-01-28 at 15:06 -0500, John Richard Moser wrote:
> 
>>-BEGIN PGP SIGNED MESSAGE-
>>Hash: SHA1
>>
>>Can someone give me a layout of what exactly is up there?  I got the
>>basic idea
>>
>>K 4G
>>A 3G
>>A 2G
>>A 1G
>>
>>App has 3G, kernel has 1G at the top of VM on x86 (dunno about x86_64).
>>
>>So what's the layout of that top 1G?  What's it all used for?  Is there
>>some obscene restriction of 1G of shared memory or something that gets
>>mapped up there?
>>
>>How much does it need, and why?  What, if anything, is variable and
>>likely to do more than 10 or 15 megs of variation?
> 
> 
> Because of various reasons.  Normal kernel space virtual addresses
> usually start at 0xc0000000, which is where the 3GiB userspace
> restriction comes from.  
> 
> Then there is the vmalloc virtual address space, which usually starts at
> a higher address than a normal kernel address.  Along the same lines are
> ioremap addresses, etc.
> 
> Poke around in the header files.  I bet you'll find lots of reasons.
> 
> josh
> 
> 

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFB+qUdhDd4aOud5P8RAmU8AJ9fRQi4A+yIVaXdv/oWlPIqObROPQCfUgvU
KAsRKxYgSTWVecLsZZCvXgE=
=v+fM
-END PGP SIGNATURE-

Re: [PATCH] relayfs redux, part 2

2005-01-28 Thread Tim Bird

Tom Zanussi wrote:
> diff -urpN -X dontdiff linux-2.6.10/fs/Kconfig linux-2.6.10-cur/fs/Kconfig
...

> +   This file system is also available as a module ( = code which can be
> +   inserted in and removed from the running kernel whenever you want).
> +   The module is called relayfs.  If you want to compile it as a
> +   module, say M here and read .
...

This is a real nit, but personally I'd remove the stuff in parens above.
 It's not relayfs' job to educate users about what a module is.

I'll try to give some more substantive feedback next week.

Tim

Re: I need a hardware wizard... I have been beating my head on the wall..

2005-01-28 Thread Paulo Marques

David Sims wrote:
> On Thu, 27 Jan 2005, Jeff Garzik wrote:
> > David Sims wrote:
> > [...]
> > >  You can insert the module in a running kernel and after barking as
> > > follows (once for each disk attached) it runs just fine.
> >
> > Basically nobody has ever had hardware to test sata_vsc with that 
> > hardware.  We should probably remove the PCI ID until an engineer can 
> > fix it...
>
> Hi again,
>
>   I am willing to make this hardware available to any engineer that wants
> to help me solve this problem and I will do whatever I can to make it
> an easy job... Please help me...

Well, I don't consider myself a hardware wizard, but at least I'm an 
engineer, so I decided to give it a go :)

It seems that the driver is not acknowledging the interrupt from the 
controller. It would be nice to know what kind of interrupt is 
triggering this.

Could you run the attached patch and show the output from dmesg?
--
Paulo Marques - www.grupopie.com
All that is necessary for the triumph of evil is that good men do nothing.
Edmund Burke (1729 - 1797)
--- sata_vsc.c.orig 2005-01-28 12:23:47.0 +
+++ sata_vsc.c  2005-01-28 20:51:13.993868526 +
@@ -160,12 +160,17 @@ irqreturn_t vsc_sata_interrupt (int irq,
struct ata_host_set *host_set = dev_instance;
unsigned int i;
unsigned int handled = 0;
+static int int_count = 0;
u32 int_status;
 
spin_lock(&host_set->lock);
 
int_status = readl(host_set->mmio_base + VSC_SATA_INT_STAT_OFFSET);
 
+   int_count++;
+   if (int_count > 1000 && int_count <= 1020)
+   printk("vsc_sata int status: %08x\n", int_status);
+
for (i = 0; i < host_set->n_ports; i++) {
if (int_status & ((u32) 0xFF << (8 * i))) {
struct ata_port *ap;
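
The counting trick in the patch above (skip the first 1000 interrupts, then
print only the next 20) is a handy way to sample an interrupt storm without
flooding the log.  Here is a minimal userspace sketch of just that window
logic; toy_should_log is a hypothetical helper name, and the thresholds are
simply the ones used in the patch.

```c
#include <assert.h>

/*
 * Bump the caller's counter and report whether this call falls inside
 * the logging window, i.e. calls number 1001 through 1020.
 */
static int toy_should_log(int *count)
{
	assert(count);
	(*count)++;
	return *count > 1000 && *count <= 1020;
}
```

An interrupt handler would call this once per invocation and print the
status register only when it returns nonzero, which bounds the output to 20
lines however long the storm lasts.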

Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread David S. Miller

On Fri, 28 Jan 2005 21:34:52 +0100
Lorenzo Hernández García-Hierro <[EMAIL PROTECTED]> wrote:

> Attached the new patch following Arjan's recommendations.

No SMP protection on the SBOX, better look into that.
The locking you'll likely need to add will make this
routine serialize many networking operations which is
one thing we've been trying to avoid.

Re: Why does the kernel need a gig of VM?

2005-01-28 Thread Josh Boyer

On Fri, 2005-01-28 at 15:06 -0500, John Richard Moser wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Can someone give me a layout of what exactly is up there?  I got the
> basic idea
> 
> K 4G
> A 3G
> A 2G
> A 1G
> 
> App has 3G, kernel has 1G at the top of VM on x86 (dunno about x86_64).
> 
> So what's the layout of that top 1G?  What's it all used for?  Is there
> some obscene restriction of 1G of shared memory or something that gets
> mapped up there?
> 
> How much does it need, and why?  What, if anything, is variable and
> likely to do more than 10 or 15 megs of variation?

Because of various reasons.  Normal kernel space virtual addresses
usually start at 0xc0000000, which is where the 3GiB userspace
restriction comes from.  

Then there is the vmalloc virtual address space, which usually starts at
a higher address than a normal kernel address.  Along the same lines are
ioremap addresses, etc.

Poke around in the header files.  I bet you'll find lots of reasons.

josh


Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread Arjan van de Ven

On Fri, 2005-01-28 at 21:34 +0100, Lorenzo Hernández García-Hierro
wrote:
> Hi,
> 
> Attached the new patch following Arjan's recommendations.
> I'm sorry about not making it "inlined", but my mail agent messes up the
> diffs if I do so.
> Still waiting for the OSDL STP tests results, they will take a while to
> finish.
> 
> Cheers,

lots better already! Some more comments (now that the patch got a lot
easier to read :)

 static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct
sk_buff *skb)
 {
-   return secure_tcp_sequence_number(skb->nh.iph->daddr,
- skb->nh.iph->saddr,
- skb->h.th->dest,
- skb->h.th->source);
+
+   return ip_randomisn();
 }

is there a reason for the weird indentation?

+   if (!tp->write_seq) {
+   tp->write_seq = ip_randomisn();
+   }

spare { } pair that's not needed; it also looks like one tab too many


as for obsd_get_random_long().. would it be possible to use the
get_random_int() function from the patches I posted the other day? They
use the existing random.c infrastructure instead of making a copy...

I still don't understand why you need a obsd_rand.c and can't use the
normal random.c


  
 static inline u32 xprt_alloc_xid(struct rpc_xprt *xprt)
 {
-   return xprt->xid++;
+   /* Return randomized xprt->xid instead of prt->xid++ */
+   return (u32) obsd_get_random_long();
+
 }
 

that cast looks quite redundant...






page fault scalability patch V16 [2/4]: mm counter macros

2005-01-28 Thread Christoph Lameter

This patch extracts all the interesting pieces for handling rss and
anon_rss into definitions in include/linux/sched.h. All rss operations
are performed through the following three macros:

get_mm_counter(mm, member)  -> Obtain the value of a counter
set_mm_counter(mm, member, value)   -> Set the value of a counter
update_mm_counter(mm, member, value)-> Add a value to a counter

The simple definitions provided in this patch should result in no change
to the generated code.

With this patch it becomes easier to add new counters and it is possible
to redefine the method of counter handling (f.e. the page fault scalability
patches may want to use atomic operations or split rss).
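
To see the macros in isolation, here is a small userspace sketch.  The four
macro definitions are copied from the patch below; struct toy_mm and the
toy_map_page/toy_unmap_page helpers are hypothetical and exist only to
exercise them.

```c
#include <assert.h>

/* Counter accessors exactly as defined in the patch. */
#define set_mm_counter(mm, member, value) (mm)->member = (value)
#define get_mm_counter(mm, member) ((mm)->member)
#define update_mm_counter(mm, member, value) (mm)->member += (value)
#define MM_COUNTER_T unsigned long

/* Hypothetical cut-down mm_struct holding only the two counters. */
struct toy_mm {
	MM_COUNTER_T rss;
	MM_COUNTER_T anon_rss;
};

/* Account for one newly mapped page. */
static void toy_map_page(struct toy_mm *mm, int is_anon)
{
	update_mm_counter(mm, rss, 1);
	if (is_anon)
		update_mm_counter(mm, anon_rss, 1);
}

/* Account for one unmapped page; adding -1 wraps to a decrement. */
static void toy_unmap_page(struct toy_mm *mm, int is_anon)
{
	assert(get_mm_counter(mm, rss) > 0);
	update_mm_counter(mm, rss, -1);
	if (is_anon)
		update_mm_counter(mm, anon_rss, -1);
}
```

Because callers never touch mm->rss directly, swapping the three macros for
atomic or per-task variants later would not require changing code such as
toy_map_page().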

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/include/linux/sched.h
===
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-28 11:01:51.0 
-0800
+++ linux-2.6.10/include/linux/sched.h  2005-01-28 11:02:00.0 -0800
@@ -203,6 +203,10 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#define set_mm_counter(mm, member, value) (mm)->member = (value)
+#define get_mm_counter(mm, member) ((mm)->member)
+#define update_mm_counter(mm, member, value) (mm)->member += (value)
+#define MM_COUNTER_T unsigned long

 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */
@@ -219,7 +223,7 @@ struct mm_struct {
atomic_t mm_count;  /* How many references to 
"struct mm_struct" (users count as 1) */
int map_count;  /* number of VMAs */
struct rw_semaphore mmap_sem;
-   spinlock_t page_table_lock; /* Protects page tables, 
mm->rss, mm->anon_rss */
+   spinlock_t page_table_lock; /* Protects page tables and 
some counters */

struct list_head mmlist;/* List of maybe swapped mm's.  
These are globally strung
 * together off init_mm.mmlist, 
and are protected
@@ -229,9 +233,13 @@ struct mm_struct {
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
-   unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
+   unsigned long total_vm, locked_vm, shared_vm;
unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

+   /* Special counters protected by the page_table_lock */
+   MM_COUNTER_T rss;
+   MM_COUNTER_T anon_rss;
+
unsigned long saved_auxv[42]; /* for /proc/PID/auxv */

unsigned dumpable:1;
Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c   2005-01-28 11:01:58.0 -0800
+++ linux-2.6.10/mm/memory.c2005-01-28 11:02:00.0 -0800
@@ -324,9 +324,9 @@ copy_one_pte(struct mm_struct *dst_mm,
pte = pte_mkclean(pte);
pte = pte_mkold(pte);
get_page(page);
-   dst_mm->rss++;
+   update_mm_counter(dst_mm, rss, 1);
if (PageAnon(page))
-   dst_mm->anon_rss++;
+   update_mm_counter(dst_mm, anon_rss, 1);
set_pte(dst_pte, pte);
page_dup_rmap(page);
 }
@@ -528,7 +528,7 @@ static void zap_pte_range(struct mmu_gat
if (pte_dirty(pte))
set_page_dirty(page);
if (PageAnon(page))
-   tlb->mm->anon_rss--;
+   update_mm_counter(tlb->mm, anon_rss, -1);
else if (pte_young(pte))
mark_page_accessed(page);
tlb->freed++;
@@ -1345,13 +1345,14 @@ static int do_wp_page(struct mm_struct *
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
if (likely(pte_same(*page_table, pte))) {
-   if (PageAnon(old_page))
-   mm->anon_rss--;
+   if (PageAnon(old_page))
+   update_mm_counter(mm, anon_rss, -1);
if (PageReserved(old_page)) {
-   ++mm->rss;
+   update_mm_counter(mm, rss, 1);
acct_update_integrals();
update_mem_hiwater();
} else
+
page_remove_rmap(old_page);
break_cow(vma, new_page, address, page_table);
lru_cache_add_active(new_page);
@@ -1755,7 +1756,7 @@ static int do_swap_page(struct mm_struct
if (vm_swap_full())
remove_exclusive_swap_page(page);

-   mm->rss++;
+   update_mm_counter(mm, rss, 1);
acct_update_integrals();

page fault scalability patch V16 [3/4]: Drop page_table_lock in handle_mm_fault

2005-01-28 Thread Christoph Lameter

The page fault handler attempts to use the page_table_lock only for short
time periods. It repeatedly drops and reacquires the lock. When the lock
is reacquired, checks are made if the underlying pte has changed before
replacing the pte value. These locations are a good fit for the use of
ptep_cmpxchg.
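
As a rough userspace analogue of that pattern, the sketch below uses C11
atomics: take a snapshot of an entry, compute the new value, and install it
only if the entry still matches the snapshot.  All toy_ names and TOY_ flag
values are assumptions for illustration; this is not the kernel's
ptep_cmpxchg and real pte layouts are architecture specific.

```c
#include <assert.h>
#include <stdatomic.h>

/* Made-up flag bits standing in for "accessed" and "dirty". */
#define TOY_PTE_YOUNG 0x1ul
#define TOY_PTE_DIRTY 0x2ul

/* Diagnostic counter, in the spirit of the cmpxchg failure statistics. */
static unsigned long toy_cmpxchg_failures;

/* Install newval only if *ptep still equals oldval; 1 on success. */
static int toy_ptep_cmpxchg(_Atomic unsigned long *ptep,
			    unsigned long oldval, unsigned long newval)
{
	return atomic_compare_exchange_strong(ptep, &oldval, newval);
}

/*
 * Set the young and dirty bits based on an earlier snapshot.  If the
 * entry changed in the meantime nothing is written; the caller just
 * returns and a later fault gets another chance.
 */
static int toy_mark_young_dirty(_Atomic unsigned long *ptep,
				unsigned long seen)
{
	unsigned long entry = seen | TOY_PTE_YOUNG | TOY_PTE_DIRTY;

	if (toy_ptep_cmpxchg(ptep, seen, entry))
		return 1;

	toy_cmpxchg_failures++;
	return 0;
}
```

The point of the design is that a lost race costs one failed
compare-and-swap rather than a lock acquisition on every fault.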

The following patch makes it possible to drop the first acquisition of
the page_table_lock and use atomic operations on the page table instead.
A section using atomic pte operations is begun with

page_table_atomic_start(struct mm_struct *)

and ends with

page_table_atomic_stop(struct mm_struct *)

Both of these become spin_lock(page_table_lock) and
spin_unlock(page_table_lock) if atomic page table operations are not
configured (CONFIG_ATOMIC_TABLE_OPS undefined).

Atomic operations with pte_xchg and pte_cmpxchg only work for the lowest
layer of the page table. Higher layers may also be populated in an atomic
way by defining pmd_test_and_populate() etc. The generic versions of these
functions fall back to the page_table_lock (populating higher level page
table entries is rare and therefore this is not likely to be performance
critical). For ia64 the definitions for higher level atomic operations are
included and these may easily be added for other architectures.

This patch depends on the pte_cmpxchg patch to be applied first and will
only remove the first use of the page_table_lock in the page fault handler.
This will allow the following page table operations without acquiring
the page_table_lock:

1. Updating of access bits (handle_mm_faults)
2. Anonymous read faults (do_anonymous_page)

The page_table_lock is still acquired for creating a new pte for an anonymous
write fault and therefore the problems with rss that were addressed by splitting
rss into the task structure do not yet occur.

The patch also adds some diagnostic features by counting the number of cmpxchg
failures (useful for verification if this patch works right) and the number of
faults received that led to no change in the page table. These statistics may
be viewed via /proc/meminfo

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c   2005-01-27 16:27:59.0 -0800
+++ linux-2.6.10/mm/memory.c2005-01-27 16:28:54.0 -0800
@@ -36,6 +36,8 @@
  * ([EMAIL PROTECTED])
  *
  * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ * Jan 2005Scalability improvement by reducing the use and the length of 
time
+ * the page table lock is held (Christoph Lameter)
  */

 #include 
@@ -1285,8 +1287,8 @@ static inline void break_cow(struct vm_a
  * change only once the write actually happens. This avoids a few races,
  * and potentially makes it more efficient.
  *
- * We hold the mm semaphore and the page_table_lock on entry and exit
- * with the page_table_lock released.
+ * We hold the mm semaphore and have started atomic pte operations,
+ * exit with pte ops completed.
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct * vma,
unsigned long address, pte_t *page_table, pmd_t *pmd, pte_t pte)
@@ -1304,7 +1306,7 @@ static int do_wp_page(struct mm_struct *
pte_unmap(page_table);
printk(KERN_ERR "do_wp_page: bogus page at address %08lx\n",
address);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);
return VM_FAULT_OOM;
}
old_page = pfn_to_page(pfn);
@@ -1316,21 +1318,27 @@ static int do_wp_page(struct mm_struct *
flush_cache_page(vma, address);
entry = maybe_mkwrite(pte_mkyoung(pte_mkdirty(pte)),
  vma);
-   ptep_set_access_flags(vma, address, page_table, entry, 
1);
-   update_mmu_cache(vma, address, entry);
+   /*
+* If the bits are not updated then another fault
+* will be generated with another chance of updating.
+*/
+   if (ptep_cmpxchg(page_table, pte, entry))
+   update_mmu_cache(vma, address, entry);
+   else
+   inc_page_state(cmpxchg_fail_flag_reuse);
pte_unmap(page_table);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);
return VM_FAULT_MINOR;
}
}
pte_unmap(page_table);
+   page_table_atomic_stop(mm);

/*
 * Ok, we need to copy. Oh, well..
 */
if (!PageReserved(old_page))
page_cache_get(old_page);
-   spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))

Re: [PATCH] OpenBSD Networking-related randomization port

2005-01-28 Thread Lorenzo Hernández García-Hierro

Hi,

Attached the new patch following Arjan's recommendations.
I'm sorry about not making it "inlined", but my mail agent messes up the
diffs if I do so.
Still waiting for the OSDL STP tests results, they will take a while to
finish.

Cheers,
-- 
Lorenzo Hernández García-Hierro <[EMAIL PROTECTED]> 
[1024D/6F2B2DEC] & [2048g/9AE91A22][http://tuxedo-es.org]


diff -Nur linux-2.6.11-rc2/include/linux/random.h linux-2.6.11-rc2.tx1/include/linux/random.h
--- linux-2.6.11-rc2/include/linux/random.h	2005-01-26 19:54:17.0 +0100
+++ linux-2.6.11-rc2.tx1/include/linux/random.h	2005-01-28 19:45:31.359923392 +0100
@@ -42,6 +42,12 @@
 
 #ifdef __KERNEL__
 
+/* OpenBSD Networking-related randomization functions - [EMAIL PROTECTED] */
+extern unsigned long obsd_get_random_long(void);
+extern __u16 ip_randomid(void);
+extern __u32 ip_randomisn(void);
+
+
 extern void rand_initialize_irq(int irq);
 
 extern void add_input_randomness(unsigned int type, unsigned int code,
diff -Nur linux-2.6.11-rc2/net/ipv4/tcp_ipv4.c linux-2.6.11-rc2.tx1/net/ipv4/tcp_ipv4.c
--- linux-2.6.11-rc2/net/ipv4/tcp_ipv4.c	2005-01-26 19:54:19.0 +0100
+++ linux-2.6.11-rc2.tx1/net/ipv4/tcp_ipv4.c	2005-01-28 19:39:48.0 +0100
@@ -539,10 +539,8 @@
 
 static inline __u32 tcp_v4_init_sequence(struct sock *sk, struct sk_buff *skb)
 {
-	return secure_tcp_sequence_number(skb->nh.iph->daddr,
-	  skb->nh.iph->saddr,
-	  skb->h.th->dest,
-	  skb->h.th->source);
+
+		return ip_randomisn();
 }
 
 /* called with local bh disabled */
@@ -833,14 +831,11 @@
 	tcp_v4_setup_caps(sk, &rt->u.dst);
 	tp->ext2_header_len = rt->u.dst.header_len;
 
-	if (!tp->write_seq)
-		tp->write_seq = secure_tcp_sequence_number(inet->saddr,
-			   inet->daddr,
-			   inet->sport,
-			   usin->sin_port);
-
-	inet->id = tp->write_seq ^ jiffies;
-
+	if (!tp->write_seq) {
+			tp->write_seq = ip_randomisn();
+	}
+	
+	inet->id = htons(ip_randomid());
 	err = tcp_connect(sk);
 	rt = NULL;
 	if (err)
@@ -1579,8 +1574,8 @@
 	if (newinet->opt)
 		newtp->ext_header_len = newinet->opt->optlen;
 	newtp->ext2_header_len = dst->header_len;
-	newinet->id = newtp->write_seq ^ jiffies;
-
+	newinet->id = htons(ip_randomid());
+	
 	tcp_sync_mss(newsk, dst_pmtu(dst));
 	newtp->advmss = dst_metric(dst, RTAX_ADVMSS);
 	tcp_initialize_rcv_mss(newsk);
diff -Nur linux-2.6.11-rc2/net/Makefile linux-2.6.11-rc2.tx1/net/Makefile
--- linux-2.6.11-rc2/net/Makefile	2005-01-26 19:50:49.0 +0100
+++ linux-2.6.11-rc2.tx1/net/Makefile	2005-01-28 21:01:21.870140688 +0100
@@ -11,6 +11,7 @@
 
 tmp-$(CONFIG_COMPAT) 		:= compat.o
 obj-$(CONFIG_NET)		+= $(tmp-y)
+obj-y				+= obsd_rand.o
 
 # LLC has to be linked before the files in net/802/
 obj-$(CONFIG_LLC)		+= llc/
diff -Nur linux-2.6.11-rc2/net/obsd_rand.c linux-2.6.11-rc2.tx1/net/obsd_rand.c
--- linux-2.6.11-rc2/net/obsd_rand.c	1970-01-01 01:00:00.0 +0100
+++ linux-2.6.11-rc2.tx1/net/obsd_rand.c	2005-01-28 17:43:50.0 +0100
@@ -0,0 +1,269 @@
+/* $Id: openbsd-netrand-2.6.11-rc2.patch,v 1.5 2005/01/28 20:16:21 lorenzo Exp $
+ * Copyright (c) 2005 Lorenzo Hernandez Garcia-Hierro <[EMAIL PROTECTED]>.
+ * All rights reserved.
+ *
+ * Added some macros and stolen code from random.c, for individual and less
+ * "invasive" implementation. Also removed the get_random_long() macro definition,
+ * which is not good if we can simply call back obsd_get_random_long().
+ *
+ * Copyright (c) 1996, 1997, 2000-2002 Michael Shalayeff.
+ * 
+ * Version 1.90, last modified 28-Jan-05
+ *
+ * Copyright Theodore Ts'o, 1994, 1995, 1996, 1997, 1998, 1999.
+ * All rights reserved.
+ *
+ * Copyright 1998 Niels Provos <[EMAIL PROTECTED]>
+ * All rights reserved.
+ * Theo de Raadt <[EMAIL PROTECTED]> came up with the idea of using
+ * such a mathematical system to generate more random (yet non-repeating)
+ * ids to solve the resolver/named problem.  But Niels designed the
+ * actual system based on the constraints.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer,
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED

page fault scalability patch V16 [4/4]: Drop page_table_lock in do_anonymous_page

2005-01-28 Thread Christoph Lameter

Do not use the page_table_lock in do_anonymous_page. This will significantly
increase the parallelism in the page fault handler in SMP systems. The patch
also modifies the definitions of _mm_counter functions so that rss and anon_rss
become atomic.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.10/mm/memory.c
===
--- linux-2.6.10.orig/mm/memory.c   2005-01-27 16:39:24.0 -0800
+++ linux-2.6.10/mm/memory.c        2005-01-27 16:39:24.0 -0800
@@ -1839,12 +1839,12 @@ do_anonymous_page(struct mm_struct *mm,
 vma->vm_page_prot)),
  vma);

-   spin_lock(&mm->page_table_lock);
+   page_table_atomic_start(mm);

if (!ptep_cmpxchg(page_table, orig_entry, entry)) {
pte_unmap(page_table);
page_cache_release(page);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);
inc_page_state(cmpxchg_fail_anon_write);
return VM_FAULT_MINOR;
}
@@ -1862,7 +1862,7 @@ do_anonymous_page(struct mm_struct *mm,

update_mmu_cache(vma, addr, entry);
pte_unmap(page_table);
-   spin_unlock(&mm->page_table_lock);
+   page_table_atomic_stop(mm);

return VM_FAULT_MINOR;
 }
Index: linux-2.6.10/include/linux/sched.h
===
--- linux-2.6.10.orig/include/linux/sched.h 2005-01-27 16:39:24.0 -0800
+++ linux-2.6.10/include/linux/sched.h  2005-01-27 16:40:24.0 -0800
@@ -203,10 +203,26 @@ arch_get_unmapped_area_topdown(struct fi
 extern void arch_unmap_area(struct vm_area_struct *area);
 extern void arch_unmap_area_topdown(struct vm_area_struct *area);

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+/*
+ * Atomic page table operations require that the counters are also
+ * incremented atomically
+*/
+#define set_mm_counter(mm, member, value) atomic_set(&(mm)->member, value)
+#define get_mm_counter(mm, member) ((unsigned long)atomic_read(&(mm)->member))
+#define update_mm_counter(mm, member, value) atomic_add(value, &(mm)->member)
+#define MM_COUNTER_T atomic_t
+
+#else
+/*
+ * No atomic page table operations. Counters are protected by
+ * the page table lock
+ */
 #define set_mm_counter(mm, member, value) (mm)->member = (value)
 #define get_mm_counter(mm, member) ((mm)->member)
 #define update_mm_counter(mm, member, value) (mm)->member += (value)
 #define MM_COUNTER_T unsigned long
+#endif

 struct mm_struct {
struct vm_area_struct * mmap;   /* list of VMAs */

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

page fault scalability patch V16 [0/4]: redesign overview

2005-01-28 Thread Christoph Lameter

Changes from V15->V16 of this patch: Complete Redesign.

An introduction to what this patch does and a patch archive can be found on
http://oss.sgi.com/projects/page_fault_performance. The archive also has a
combined patch.

The basic approach in this patchset is the same as used in SGI's 2.4.X
based kernels which have been in production use in ProPack 3 for a long time.

The patchset is composed of 4 patches (and was tested against 2.6.11-rc2-bk6
on ia64, i386 and x86_64):

1/4: ptep_cmpxchg and ptep_xchg to avoid intermittent zeroing of ptes

The current way of synchronizing with the CPU or arch specific
interrupts updating page table entries is to first set a pte
to zero before writing a new value. This patch uses ptep_xchg
and ptep_cmpxchg to avoid writing the zero for certain
configurations.

The patch introduces CONFIG_ATOMIC_TABLE_OPS that may be
enabled as an experimental feature during kernel configuration
if the hardware is able to support atomic operations and if
an SMP kernel is being configured. A Kconfig update for i386,
x86_64 and ia64 has been provided. On i386 this option is
restricted to CPUs better than a 486 and non-PAE mode (that
way all the cmpxchg issues on old i386 CPUs and the problems
with 64bit atomic operations on recent i386 CPUs are avoided).

If CONFIG_ATOMIC_TABLE_OPS is not set then ptep_xchg and
ptep_cmpxchg are realized by falling back to clearing a pte
before updating it.

The patch does not change the use of mm->page_table_lock and
the only performance improvement is the replacement of
xchg-with-zero-and-then-write-new-pte-value with an xchg with
the new value for SMP on some architectures if
CONFIG_ATOMIC_TABLE_OPS is configured. It should not do anything
major to VM operations.
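
To illustrate the difference described above, here is a minimal userspace
sketch using C11 atomics. It is not the kernel code: sketch_pte_t and the
sketch_* helpers are hypothetical stand-ins for pte_t, ptep_xchg and
ptep_cmpxchg, and a plain 64-bit integer plays the role of a pte.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry; not the kernel's pte_t. */
typedef _Atomic uint64_t sketch_pte_t;

/*
 * Fallback style: clear the entry first, then write the new value.
 * Between the two steps other observers can see a zero entry.
 */
static uint64_t sketch_clear_then_set(sketch_pte_t *ptep, uint64_t newval)
{
    uint64_t old = atomic_exchange(ptep, 0);   /* entry is briefly zero */
    atomic_store(ptep, newval);
    return old;
}

/* Atomic style: swap in the new value in one step; no zero is written. */
static uint64_t sketch_xchg(sketch_pte_t *ptep, uint64_t newval)
{
    return atomic_exchange(ptep, newval);
}

/*
 * Compare-and-exchange: install newval only if the entry still holds
 * the expected old value. Returns 1 on success, 0 if it changed.
 */
static int sketch_cmpxchg(sketch_pte_t *ptep, uint64_t expected,
                          uint64_t newval)
{
    return atomic_compare_exchange_strong(ptep, &expected, newval);
}
```

Both update styles return the previous value; the only difference is
whether an intermediate zero is ever stored in the entry.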

2/4: Macros for mm counter manipulation

There are various approaches to handling mm counters if the
page_table_lock is no longer acquired. This patch defines
macros in include/linux/sched.h to handle these counters and
makes sure that these macros are used throughout the kernel
to access and manipulate rss and anon_rss. There should be
no change to the generated code as a result of this patch.
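
A rough userspace sketch of the accessor-macro idea, assuming C11 atomics
in place of the kernel's atomic_t. SKETCH_ATOMIC_TABLE_OPS, struct
sketch_mm and sketch_counter_t are made-up names for illustration; the
atomic branch corresponds to what patch 4/4 switches to, while the plain
branch is what this patch alone provides.

```c
#include <stdatomic.h>

/*
 * SKETCH_ATOMIC_TABLE_OPS is a hypothetical stand-in for
 * CONFIG_ATOMIC_TABLE_OPS; define it to select the atomic variant.
 */
#ifdef SKETCH_ATOMIC_TABLE_OPS
typedef atomic_long sketch_counter_t;
#define set_mm_counter(mm, member, value) \
    atomic_store(&(mm)->member, (value))
#define get_mm_counter(mm, member) \
    ((unsigned long)atomic_load(&(mm)->member))
#define update_mm_counter(mm, member, value) \
    atomic_fetch_add(&(mm)->member, (value))
#else
/* Plain variant: callers must hold a lock that protects the counters. */
typedef unsigned long sketch_counter_t;
#define set_mm_counter(mm, member, value) ((mm)->member = (value))
#define get_mm_counter(mm, member) ((mm)->member)
#define update_mm_counter(mm, member, value) ((mm)->member += (value))
#endif

/* Hypothetical miniature of mm_struct holding only the two counters. */
struct sketch_mm {
    sketch_counter_t rss;
    sketch_counter_t anon_rss;
};
```

Because all callers go through the three macros, switching the counter
representation later only requires changing these definitions.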

3/4: Drop the first use of the page_table_lock in handle_mm_fault

The patch introduces two new functions:

page_table_atomic_start(mm), page_table_atomic_stop(mm)

that fall back to the use of the page_table_lock if
CONFIG_ATOMIC_TABLE_OPS is not defined.

If CONFIG_ATOMIC_TABLE_OPS is defined those functions may
be used to prep the CPU for atomic table ops (i386 in PAE mode
may, for example, get the MMX register ready for 64bit atomic ops) but
are simply empty by default.

Two operations may then be performed on the page table without
acquiring the page table lock:

a) updating access bits in a pte
b) anonymous read faults installing a mapping to the zero page.

All counters are still protected with the page_table_lock thus
avoiding any issues there.

Some additional counters are added to /proc/meminfo to
report statistics on these paths. Spurious faults that have no
effect are also counted. There is a surprisingly high number of those on ia64
(used to populate the cpu caches with the pte??)
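
The start/stop fallback described above can be sketched in userspace
roughly as follows. This is an illustration only: struct sketch_pt_mm,
the sketch_spin_* helpers and SKETCH_ATOMIC_TABLE_OPS are hypothetical,
and a C11 atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>

/* Hypothetical miniature of mm_struct with only the lock. */
struct sketch_pt_mm {
    atomic_flag page_table_lock;   /* stand-in for the kernel spinlock */
};

static inline void sketch_spin_lock(atomic_flag *lock)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;
}

static inline void sketch_spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

#ifdef SKETCH_ATOMIC_TABLE_OPS
/* Atomic pte ops available: nothing to prepare by default. */
static inline void page_table_atomic_start(struct sketch_pt_mm *mm)
{
    (void)mm;
}
static inline void page_table_atomic_stop(struct sketch_pt_mm *mm)
{
    (void)mm;
}
#else
/* No atomic pte ops: fall back to taking the page table lock. */
static inline void page_table_atomic_start(struct sketch_pt_mm *mm)
{
    sketch_spin_lock(&mm->page_table_lock);
}
static inline void page_table_atomic_stop(struct sketch_pt_mm *mm)
{
    sketch_spin_unlock(&mm->page_table_lock);
}
#endif
```

Callers bracket their pte updates with the start/stop pair and do not
need to know which variant was compiled in.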

4/4: Drop the use of the page_table_lock in do_anonymous_page

The second acquisition of the page_table_lock is removed
from do_anonymous_page and allows the anonymous
write fault to be possible without the page_table_lock.

The macros for manipulating rss and anon_rss in include/linux/sched.h
are changed if CONFIG_ATOMIC_TABLE_OPS is set to use atomic
operations for rss and anon_rss (safest solution for now, other
solutions may easily be implemented by changing those macros).

This patch typically yields significant increases in page fault
performance for threaded applications on SMP systems.
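
As a rough illustration of the lock-free anonymous write fault, the
sketch below prepares a zeroed page first and then tries to install it
with a compare-and-exchange, discarding the page if another thread got
there first. It is a userspace analogy, not the kernel path:
sketch_anon_fault, the SKETCH_FAULT_* codes and the use of calloc/free in
place of page allocation and page_cache_release are all assumptions made
for the example.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result codes for the sketch. */
enum {
    SKETCH_FAULT_OOM = -1,
    SKETCH_FAULT_INSTALLED = 0,
    SKETCH_FAULT_RACED = 1
};

/*
 * slot is a hypothetical stand-in for a pte: 0 means "not present",
 * any other value is the address of the installed page.
 */
static int sketch_anon_fault(_Atomic uintptr_t *slot, size_t page_size)
{
    uintptr_t empty = 0;

    /* Prepare the zeroed page before touching the slot. */
    void *page = calloc(1, page_size);
    if (!page)
        return SKETCH_FAULT_OOM;

    /* Install only if the slot is still empty. */
    if (!atomic_compare_exchange_strong(slot, &empty, (uintptr_t)page)) {
        /* Another thread populated the slot: drop our page. */
        free(page);
        return SKETCH_FAULT_RACED;
    }
    return SKETCH_FAULT_INSTALLED;
}
```

The losing thread simply returns and lets the access be retried, which
mirrors the cmpxchg failure path in do_anonymous_page in patch 4/4.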

I have an additional patch that drops the page_table_lock for COW but that
raises a lot of other issues. I will post that patch separately and only
to linux-mm.