Re: [ 00/45] 3.0.88-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.0.88 release. There are 45 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:54:53 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.88-rc1.gz and the diffstat can be found below.

We have additional build failures.

Total builds: 54
Total build errors: 20

Link: http://desktop.roeck-us.net/buildlogs/v3.0.87-45-g367423c.2013-07-27.03:09:16

Previously:
Total builds: 54
Total build errors: 17

The additional failures are i386/allmodconfig, i386/allyesconfig, and mips/malta. I don't have time to track down the culprit tonight (I am still Down Under ;). I hope I can do it tomorrow.

Guenter
Re: [ 00/59] 3.4.55-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.4.55 release. There are 59 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:48:22 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.55-rc1.gz and the diffstat can be found below.

Cross platform build looks good:

Total builds: 58
Total build errors: 8

Details: http://desktop.roeck-us.net/buildlogs/v3.4.54-59-g956f996.2013-07-27.08:29:14

Guenter
Re: [ 00/79] 3.10.4-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.10.4 release. There are 79 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:45:08 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.10.4-rc1.gz and the diffstat can be found below.

Cross build is ok:

Total builds: 64
Total build errors: 3

Details: http://desktop.roeck-us.net/buildlogs/v3.10/v3.10.3-79-g6d0cdc6.2013-07-27.15:42:05

Guenter
Re: [ 000/103] 3.10.3-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.10.3 release. There are 103 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Thu Jul 25 22:01:33 UTC 2013. Anything received after that time might be too late.

Build results below. Same as with the previous release.

Guenter

Build x86_64:defconfig passed
Build x86_64:allyesconfig passed
Build x86_64:allmodconfig passed
Build x86_64:allnoconfig passed
Build x86_64:alldefconfig passed
Build i386:defconfig passed
Build i386:allyesconfig passed
Build i386:allmodconfig passed
Build i386:allnoconfig passed
Build i386:alldefconfig passed
Build mips:defconfig passed
Build mips:bcm47xx_defconfig passed
Build mips:bcm63xx_defconfig passed
Build mips:nlm_xlp_defconfig passed
Build mips:ath79_defconfig passed
Build mips:ar7_defconfig passed
Build mips:fuloong2e_defconfig passed
Build mips:e55_defconfig passed
Build mips:cavium_octeon_defconfig passed
Build mips:powertv_defconfig passed
Build mips:malta_defconfig passed
Build powerpc:defconfig passed
Build powerpc:allyesconfig failed
Build powerpc:allmodconfig passed
Build powerpc:chroma_defconfig passed
Build powerpc:maple_defconfig passed
Build powerpc:ppc6xx_defconfig passed
Build powerpc:mpc83xx_defconfig passed
Build powerpc:mpc85xx_defconfig passed
Build powerpc:mpc85xx_smp_defconfig passed
Build powerpc:tqm8xx_defconfig passed
Build powerpc:85xx/sbc8548_defconfig passed
Build powerpc:83xx/mpc834x_mds_defconfig passed
Build powerpc:86xx/sbc8641d_defconfig passed
Build arm:defconfig passed
Build arm:allyesconfig failed
Build arm:allmodconfig failed
Build arm:exynos4_defconfig passed
Build arm:multi_v7_defconfig passed
Build arm:kirkwood_defconfig passed
Build arm:omap2plus_defconfig passed
Build arm:tegra_defconfig passed
Build arm:u8500_defconfig passed
Build arm:at91sam9rl_defconfig passed
Build arm:ap4evb_defconfig passed
Build arm:bcm_defconfig passed
Build arm:bonito_defconfig passed
Build arm:pxa910_defconfig passed
Build arm:mvebu_defconfig passed
Build m68k:defconfig passed
Build m68k:m5272c3_defconfig passed
Build m68k:m5307c3_defconfig passed
Build m68k:m5249evb_defconfig passed
Build m68k:m5407c3_defconfig passed
Build m68k:sun3_defconfig passed
Build m68k:m5475evb_defconfig passed
Build sparc:defconfig passed
Build sparc:sparc64_defconfig passed
Build xtensa:defconfig passed
Build xtensa:iss_defconfig passed
Build microblaze:mmu_defconfig passed
Build microblaze:nommu_defconfig passed
Build blackfin:defconfig passed
Build parisc:defconfig passed
---
Total builds: 64
Total build errors: 3
DMA and my Maxtor drive
I get this when DMA is enabled:

Oct 20 15:39:07 cr753963-a kernel: hdb: timeout waiting for DMA
Oct 20 15:39:07 cr753963-a kernel: hdb: irq timeout: status=0x6e { DriveReady DeviceFault DataRequest CorrectedError Index }
ide0: reset: success
Oct 20 15:39:07 cr753963-a kernel: hdb: DMA disabled
Oct 20 15:39:07 cr753963-a kernel: ide0: reset: success

It only happens when lots of data is being transferred to, or compiled on, the drive. The drive status is this:

/dev/hdb:
Model=Maxtor 82560A4, FwRev=AA8Z2726, SerialNo=C40LTQGA
Config={ Fixed }
RawCHS=4962/16/63, TrkSize=0, SectSize=0, ECCbytes=20
BuffType=DualPortCache, BuffSize=256kB, MaxMultSect=16, MultSect=off
CurCHS=4962/16/63, CurSects=5001696, LBA=yes, LBAsects=5001728
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 *mdma2
2.4.0pre9 and an analog joystick
I just switched from 2.2.17pre9 to 2.4.0pre9, and my joystick won't work anymore. It's an analog joystick connected to an AudioPCI sound card. I can get it initialized, but I cannot access it; it seems it does not map it to js0.

Oct 24 23:15:21 cr753963-a kernel: gameport0: NS558 ISA at 0x200 size 8 speed 917 kHz
Oct 24 23:15:31 cr753963-a kernel: input0: Analog 2-axis 4-button joystick at gameport0.0 [TSC timer, 463 MHz clock, 1193 ns res]

and I can't get any further :[

Dave
Re: 2.4.0pre9 and an analog joystick
On Tue, 24 Oct 2000, Brian Gerst wrote:

> [EMAIL PROTECTED] wrote:
> >
> > I just switched from 2.2.17pre9 to 2.4.0pre9, and my joystick won't work
> > anymore. It's an analog joystick connected to an AudioPCI sound card. I
> > can get it initialized, but I can not access it, it seems it does not map
> > it to js0
> >
> > Oct 24 23:15:21 cr753963-a kernel: gameport0: NS558 ISA at 0x200 size 8
> > speed 917 kHz
> > Oct 24 23:15:31 cr753963-a kernel: input0: Analog 2-axis 4-button joystick
> > at gameport0.0 [TSC timer, 463 MHz clock, 1193 ns res]
> >
> > and I can't get any further :[
> >
> > Dave
>
> insmod joydev

Ok, I can get it to work with modules, but it will not work if it's directly compiled into the kernel. Is this a known bug?

Dave
Networking problems with 2.4.0 and 2.4.1
Since using kernels 2.4.0 and 2.4.1 I have been having very weird problems with my network. Suddenly the network connection drops and dies until I take down the interface and then successfully ping a machine. This is the only thing that I can get out of syslog that is relevant:

Jan 31 14:17:29 cr753963-a kernel: eth1: 21143 10baseT link beat good.
Jan 31 14:17:50 cr753963-a kernel: NETDEV WATCHDOG: eth1: transmit timed out
Jan 31 14:17:50 cr753963-a kernel: eth1: 21041 transmit timed out, status fc6908c5, CSR12 01c8, CSR13 ef05, CSR14 ff3f, resetting...
Jan 31 14:17:50 cr753963-a kernel: eth1: 21143 100baseTx sensed media.

The only problem is that eth1 is a 10mbit card. This also happens when I remove eth1 and only have eth0 in the computer; I put eth1 in to see if it would fix the problem. Relevant info:

Jan 30 21:26:37 cr753963-a kernel: eth1: Digital DC21041 Tulip rev 33 at 0xe400, 21041 mode, 00:E0:29:11:0F:3A, IRQ 10.
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 Media table, default media (10baseT).
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 media #0, 10baseT.
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 media #4, 10baseT-FD.
Jan 30 21:26:37 cr753963-a kernel: ne.c: ISAPnP reports Generic PNP at i/o 0x220, irq 5.
Jan 30 21:26:37 cr753963-a kernel: ne.c:v1.10 9/23/94 Donald Becker ([EMAIL PROTECTED])
Jan 30 21:26:37 cr753963-a kernel: Last modified Nov 1, 2000 by Paul Gortmaker
Jan 30 21:26:37 cr753963-a kernel: NE*000 ethercard probe at 0x220: 00 40 f6 24 34 08
Jan 30 21:26:37 cr753963-a kernel: eth0: NE2000 found at 0x220, using IRQ 5.
Re: Fortuna
entropy in the input samples from the "holdover" material, the problem would go away, but that's an entropy measurement problem! Until this cloud is dissipated by further analysis, it's not possible to say "this is shiny and new and better; let's use it!" in good conscience.
Re: Fortuna
> Waiting for 256 bits of entropy before outputting data is a good goal.
> Problem becomes how do you measure entropy in a reliable way? This had
> me lynched last time I asked it so I'll stop right there.

It's a problem. Also, with the current increase in wireless keyboards and mice, that source should possibly not be considered secure any more.

On the other hand, clock jitter in GHz clocks is a *very* rich source of seed material. One rdtsc per timer tick, even accounted for at 0.1 bit per rdtsc (I think reality is more like 3-5 bits), would keep you in clover.

> I'll not make any claim that random-fortuna.c should be mv'd to random.c, the
> patch given allows people to kernel config it in under the Cryptographic
> Options menu. Perhaps a disclaimer in the help menu would be in order to
> inform users that Fortuna has profound consequences for those expecting
> Info-theoretic /dev/random?

I don't mean to disparage your efforts, but then what's the upside to the resultant code maintenance hassles? Other than hack value, what's the advantage of even offering the choice? An option like that is justified when the two options have value for different people and it's not possible to build a single merged solution that satisfies both markets.

Also, Ted very deliberately made /dev/random non-optional so it could be relied on without a lot of annoying run-time testing. Would a separate /dev/fortuna be better?

> The case where an attacker has some small amount of unknown in the pool is a
> problem that affects BOTH random-fortuna.c and random.c (and any other
> replacement for that matter). Just an FYI.

Yes, but entropy estimation is supposed to deal with that. If the attacker is never allowed enough information out of the pool to distinguish various input states, the output is secure.

> As for the "shifting property" problem of an attacker controlling some input
> to the pooling system, I've tried to get around this:

(Code that varies the pool things get added to based on r->key[i++ & 7])

> The add_entropy_words() function uses the first 8 bytes of the central
> pool to aggravate the predictability of where entropy goes. It's still a
> linear progression until the central pool is re-keyed, then you don't know
> where it went. The central pool is reseeded every 0.1ms.

You need to think more carefully. You're just elaborating it without adding security. Yes, the attacker *does* know!

The entire underlying assumption is that the attacker knows the entire initial state of the pool, owing to some compromise or another. The assumption is that the only thing the attacker doesn't know is the exact value of the incoming seed material. (But the attacker does have an excellent model of its distribution.) Given that, the attacker knows the initial value of the key[] array, and thus knows the order in which the pools are fed. The question then arises, come next reseed, can the attacker (with a small mountain of computers, able to brute-force 40-bit problems in the blink of an eye) infer the *complete* state of the pool from the output.

The confusion is over the word "random". In programming jargon, the word is most often used to mean "arbitrary" or "the choice doesn't matter". But that doesn't capture the idea of "unpredictable to a skilled and determined opponent" that is needed in /dev/random. So while the contents of key[] may be "random-looking", they're not *unpredictable*, any more than the digits of pi are.
The attacker just has to, after each reseeding, brute-force the seed bits fed to the (predictable) pools chosen to mix in, and then use that information to infer the seeds added to the not-selected pools. If the attacker's uncertainty about the state of some of the subpools increases to the catastrophic reseeding level, then the Fortuna design goal is achieved. If the seed samples are independent, then it's easy to see that the schedule works. But if the seed samples are correlated, it gets a lot trickier. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
> What if someone accesses the seed file directly before the machine
> boots? Well, either (a) they have broken root already, or (b) have
> direct access to the disk. In either case, the attacker with such
> powers can just as easily trojan an executable or a kernel, and
> you've got far worse problems to worry about than the piddling worries
> of (cue Gilbert and Sullivan overly dramatic music) the
> dreaded state-extension attack.

So the only remaining case is where an attacker can read the random seed file before boot but can't install any trojans. Which seems like an awfully small case. (Although in some scenarios, passive attacks are much easier to mount than active ones.)

> Actually, if you check the current random.c code, you'll see that it
> has catastrophic reseeding in its design already.

Yes, I know. Fortuna's claim to fame is that it tries to achieve that without explicitly measuring entropy.

> My big concern with Fortuna is that it really is the result of
> cryptographic masturbation. It fundamentally assumes that crypto
> primitives are secure (when the recent break of MD4, MD5, and now SHA1
> should have been a warning that this is a Bad Idea (tm)), and makes
> assumptions that have no real world significance (assume the attacker
> has enough privileges that he might as well be superuser and can
> trojan your system to a fare-thee-well now, if you can't recover
> from a state extension attack, your random number generator is fatally
> flawed.)

I'm not a big fan of Fortuna either, but the issues are separate. I agree that trusting crypto primitives that you don't have to is a bad idea. If my application depends on SHA1 being secure, then I might as well go ahead and use SHA1 in my PRNG. But a kernel service doesn't know what applications are relying on.

(Speaking of which, perhaps it's time, in light of the breaking of MD5, to revisit the cut-down MD4 routine used in the TCP ISN selection? I haven't read the MD5 & SHA1 papers in enough detail to understand the flaw, but perhaps some defenses could be erected?)

But still, all else being equal, an RNG resistant to a state extension attack *is* preferable to one that's not. And the catastrophic reseeding support in /dev/random provides exactly that feature.

What Fortuna tries to do is sidestep the hard problem of entropy measurement. And that's very admirable. It's a very hard thing to do in general, and the current technique of heuristics plus a lot of derating is merely adequate. If a technique could be developed that didn't need an accurate entropy measurement, then things would be much better.

> In addition, Fortuna is profligate with entropy, and wastes it in
> order to be able to make certain claims of being able to recover from a
> state-extension attack. Hopefully everyone agrees that entropy
> collected from the hardware is precious (especially if you don't have
> a special-purpose hardware RNG), and shouldn't be wasted. Wasting
> collected entropy for no benefit, only to protect against a largely
> theoretical attack --- where if a bad guy has enough privileges to
> compromise your RNG state, there are far easier ways to compromise
> your entire system, not just the RNG --- is Just Stupid(tm).

Just to be clear, I don't remember it ever throwing entropy away, but it hoards some for years, thereby making it effectively unavailable. Any catastrophic reseeding solution has to hold back entropy for some time.
And I think that, even in the absence of special-purpose RNG hardware, synchronization jitter on modern GHz+ CPUs is a fruitful source of entropy.
Re: Fortuna
r, and can be changed without affecting the subpool structure that is Fortuna's real novel contribution. That was just what Niels and Bruce came up with to make the whole thing concrete.
Re: Fortuna
> And the argument that "random.c doesn't rely on the strength of crypto
> primitives" is kinda lame, though I see where you're coming from.
> random.c's entropy mixing and output depends on the (endian incorrect)
> SHA-1 implementation hard coded in that file to be pre-image resistant.
> If that fails (and a few other things) then it's broken.

/dev/urandom depends on the strength of the crypto primitives. /dev/random does not. All it needs is a good uniform hash. Do a bit of reading on the subject of "unicity distance".

(And as for the endianness of the SHA-1, are you trying to imply something? Because it makes zero difference, and reduces the code size and execution time. Which is obviously a Good Thing.)

As for hacking Fortuna in, could you give a clear statement of what you're trying to achieve? Do you like:
- The neat name,
- The strong ciphers used in the pools, or
- The multi-pool reseeding strategy, or
- Something else?

If you're doing it just for hack value, or to learn how to write a device driver or whatever, then fine. But if you're proposing it as a mainline patch, then could we discuss the technical goals? I don't think anyone wants to draw and quarter *you*, but your code is going to get some extremely critical examination.
Re: Fortuna
down to very low rates, that it sequesters some entropy for literally years. Ted thinks that's inexcusable, and I can't really disagree. This can be fixed to a significant degree by tweaking the number of subpools. 3) Fortuna's design doesn't actually *work*. The authors' analysis only works in the case that the entropy seeds are independent, but forgot to state the assumption. Some people reviewing the design don't notice the omission. It's that assumption which lets to "divide up" the seed material among various sub-pools. Without it, seed information leaks from the sequestered sub-pools to the more exposed ones, decreasing the "value" of the sequestered pools. I've shown a contrived pathological example, but I haven't managed to figure out how to characterize the leakage in a more general way. But let me give a realistic example. Again, suppose we have an entropy source that delivers one fresh random bit each time it is sampled. But suppose that rather than delivering a bare bit, it delivers the running sum of the bits. So adjacent samples are either the same or differ by +1. This seems to me an extremely plausible example. Consider a Fortuna-like thing with two pools. The first pool is seeded with n, then the second with n+b0, then the first again with n+b0+b1. n is the arbitrary starting count, while b0 and b1 are independent random bits. Assuming that an attacker can see the first pool, they can find n. After the second step, their uncertainty about the second pool is 1 bit, the value of b0. But the third step is interesting. The attacker can see the value of b0+b1. If the sum is 0 or 2, the value of b0 is determined uniquely. Only in the case that b0+b1 = 1 is there uncertainty. So we have only *half* a bit of uncertainty (one bit, half of the time) in the second pool. Where did the missing entropy go? Well, remember the Shannon formula for entropy, H(p_1,...,p_n) = - Sum(p_i * log(p_i)). If the log is to the base 2, the result is in bits. Well, p_0 = 1/4, p_1 = 1/2, and p_2 = 1/4. The logs of those are -2, -1, and -2, respectively. So the sum works out to 2 * 1/4 + 1 * 1/2 + 2 * 1/4 = 1.5. Half a bit of entropy has leaked from the second pool back into the first! I probably just don't have enough mathematical background, but I don't currently know how to bound this leakage. In pathological cases, *all* of the entropy leaks into the lowest-order pool, at which point the whole elaborate structure of Fortuna is completely useless. *That* is my big problem with Fortuna. If someone can finish the analysis and actually bound the leakage, then we can construct something that works. But I've pushed the idea around for a while and not figured it out. > I'll take my patch and not bother you anymore. I'm sure I've taken a > lot of your time as it is. And you've spent a lot of time preparing that patch. It's not a bad idea to revisit the ideas occasionally, but let's talk about the real *meat* of the issue. If you think my analysis of Fortuna's issues above is flawed, please say so! If you disagree about the importance of the issues, that's worth discussing too, although I can't promise that such a difference of opinions will ever be totally resolved. But arguing about the relative importance of good and bad points is meaningful. Ideally, we manage to come up with a solution that has all the good points. The only thing that's frustrating is discussing it with someone who doesn't even seem to *see* the issues. 
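To make the half-bit arithmetic above easy to check, here is a tiny standalone C program (mine, not part of the original mail); it computes the Shannon entropy of b0+b1 and the residual uncertainty left in the second pool. Build with -lm.

/*
 * For two independent fair bits b0 and b1, the observable sum b0+b1 is
 * 0, 1 or 2 with probabilities 1/4, 1/2, 1/4.  Its entropy is 1.5 bits,
 * so only half a bit of uncertainty about b0 survives once b0+b1 is seen.
 */
#include <math.h>
#include <stdio.h>

static double shannon_bits(const double *p, int n)
{
	double h = 0.0;
	int i;

	for (i = 0; i < n; i++)
		if (p[i] > 0.0)
			h -= p[i] * log2(p[i]);
	return h;
}

int main(void)
{
	double p_sum[3] = { 0.25, 0.5, 0.25 };	/* P(b0+b1 = 0), P(1), P(2) */
	double h_sum = shannon_bits(p_sum, 3);

	printf("H(b0+b1)      = %.2f bits\n", h_sum);		/* 1.50 */
	printf("H(b0 | b0+b1) = %.2f bits\n", 2.0 - h_sum);	/* 0.50 */
	return 0;
}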
> Not to sound like a "I'm taking my ball and going home" - just explaining > that I like the Fortuna design, I think it's elegant, I want it for my > systems. GPL requires I submit changes back, so I did with the unpleasant > side-dish of my opinion on random.c. Actually, the GNU GPL doesn't. It only requires that you give out the source if and when you give out the binary. You can make as many private changes as you like. (Search debian-legal for "desert island test".) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
>> First, a reminder that the design goal of /dev/random proper is >> information-theoretic security. That is, it should be secure against >> an attacker with infinite computational power. > I am skeptical. > I have never seen any convincing evidence for this claim, > and I suspect that there are cases in which /dev/random fails > to achieve this standard. I'm not sure which claim you're skeptical of. The claim that it's a design goal, or the claim that it achieves it? I'm pretty sure that's been the *goal* since the beginning, and it says so in the comments: * Even if it is possible to * analyze SHA in some clever way, as long as the amount of data * returned from the generator is less than the inherent entropy in * the pool, the output data is totally unpredictable. That's basically the information-theoretic definition, or at least alluding to it. "We're never going to give an attacker the unicity distance needed to *start* breaking the crypto." The whole division into two pools was because the original single-pool design allowed (information-theoretically) deriving previous /dev/random output from subsequent /dev/urandom output. That's discussed in section 5.3 of the paper you cited, and has been fixed. There's probably more discussion of the subject in linux-kernel around the time that change went in. Whether the goal is *achieved* is a different issue. random.c tries pretty hard, but makes some concessions to practicality, relying on computational security as a backup. (But suggestions as to how to get closer to the goal are still very much appreciated!) In particular, it is theoretically possible for an attacker to exploit knowledge of the state of the pool and the input mixing transform to feed in data that permutes the pool state to cluster in SHA1 collisions (thereby reducing output entropy), or to use the SHA1 feedback to induce state collisions (therby reducing pool entropy). But that seems to bring whole new meaning to the word "computationally infeasible", requiring first preimage solutions over probability distributions. Also, the entropy estimation may be flawed, and is pretty crude, just heavily derated for safety. And given recent developments in keyboard skiffing, and wireless keyboard deployment, I'm starting to think that the idea (taken from PGP) of using the keyboard and mouse as an entropy source is one whose time is past. Given current processor clock rates and the widespread availability of high-resolution timers, interrupt synchronization jitter seems like a much more fruitful source. I think there are many bits of entropy in the lsbits of the RDTSC time of interrupts, even from the periodic timer interrupt! Even derating that to 0.1 bit per sample, that's still a veritable flood of seed material. /dev/random has an even more important design goal of being universally available; it should never cost enough to make disabling it attractive. If this conflicts with information-theoretic security, the latter will be compromised. But if a practical information-theoretic /dev/random is (say) just too bulky for embedded systems, perhaps making a scaled-back version available for such hosts (as a config option) could satisfy both goals. Ted, you wrote the thing in the first place; is my summary of the goals correct? Would you like comment patches to clarify any of this? Thank you for pointing out the paper; Appendix A is particularly interesting. And the [BST03] reference looks *really* nice! 
I haven't finished it yet, but based on what I've read so far, I'd like to *strongly* recommend that any would-be /dev/random hackers read it carefully. It can be found at http://www.wisdom.weizmann.ac.il/~tromer/papers/rng.pdf

Happily, it *appears* to confirm the value of the LFSR-based input mixing function. Although the suggested construction in section 4.1 is different, and I haven't seen if the proof can be extended.
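As a toy illustration of the interrupt-jitter idea (mine, user space only, nothing like the kernel's actual collection path; it assumes x86 and a compiler that provides __rdtsc() via <x86intrin.h>):

#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define SAMPLES			1000
#define CREDIT_MILLIBITS	100	/* credit only 0.1 bit per sample */

int main(void)
{
	unsigned long pool = 0;
	unsigned credited = 0;
	int i;

	for (i = 0; i < SAMPLES; i++) {
		usleep(1000);		/* stand-in for "wait for the next interrupt" */
		unsigned long long t = __rdtsc();

		/* Rotate the toy pool and fold in the jittery low byte of the TSC. */
		pool = (pool << 5 | pool >> (8 * sizeof(pool) - 5)) ^ (t & 0xff);
		credited += CREDIT_MILLIBITS;
	}
	printf("pool = %lx, credited %u millibits for %d samples\n",
	       pool, credited, SAMPLES);
	return 0;
}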
Re: Fortuna
>> /dev/urandom depends on the strength of the crypto primitives. >> /dev/random does not. All it needs is a good uniform hash. > > That's not at all clear. I'll go farther: I think it is unlikely > to be true. > > If you want to think about cryptographic primitives being arbitrarily > broken, I think there will be scenarios where /dev/random is insecure. > > As for what you mean by "good uniform hash", I think you'll need to > be a bit more precise. Well, you just pointed me to a very nice paper that *makes* it precise: Boaz Barak, Ronen Shaltiel, and Eran Tromer. True random number generators secure in a changing environment. In Workshop on Cryptographic Hardware and Embedded Systems (CHES), pages 166-180, 2003. LNCS no. 2779. I haven't worked through all the proofs yet, but it looks to be highly applicable. >> Do a bit of reading on the subject of "unicity distance". > > Yes, I've read Shannon's original paper on the subject, as well > as many other treatments. I hope it's obvious that I didn't mean to patronize *you* with such a suggestion! Clearly, you're intimately familiar with the concept, and any discussion can go straight on to more detailed issues. I just hope you'll grant me that understanding the concept is pretty fundamental to any meaningful discussion of information-theoretic security. > I stand by my comments above. Cool! So there's a problem to be solved! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
> Correct me if I'm wrong here, but uniformity of the linear function isn't
> sufficient even if we implemented like this (right now it's more a+X than
> a X).
>
> The part which suggests choosing an irreducible poly and a value "a" in the
> preprocessing stage ... last I checked the value for a and the poly need to
> be secret. How do you generate poly and a, Catch-22? Perhaps I'm missing
> something and someone can point it out.

No, the value (the parameter pi) is specifically described as "the public parameter". See the "Preprocessing" paragraph at the end of section 1.2 on page 3. "This string is then hardwired into the implementation and need not be kept secret."

All that's required is that the adversary can't tailor his limited control over the input based on knowing pi. There's a simple proof in all the papers that if an adversary knows *everything* about the randomness extraction function, and has total control over the input distribution, you're screwed.

Basically, suppose you have a 1024-bit input block, the attacker is required to choose a distribution with at least 1023 bits of entropy, and you only want 1 bit out. Should be easy, right? Well, with any *fixed* function, the possible inputs are divided into those that hash to 0, and those that hash to 1. One of those sets must have at least 2^1023 members. Suppose it's 0. The attacker can choose the input distribution to be "uniformly at random from the >= 2^1023 inputs that hash to 0" and keep the promise while totally breaking your extraction function.

But this paper says that if the attacker has to choose 2^t possible input distributions (based on t bits of control over the input) *before* the random parameter pi is chosen, then they're locked out. *After* learning pi, they can choose *which* of the 2^t input distributions to use.

The thing is, you need a parameterized family of hash functions. They choose a random multiplier mod GF(2^n). Their construction is based on the well-known 2-universal family of hash functions hash(x) = (a*x+b) mod p. The /dev/random input mix is based on choosing a "random" polynomial (since there was a lot of efficiency pressure, it isn't actually very random; the question is, is it non-random enough to help an attacker). Remainder modulo a uniformly chosen random irreducible polynomial is a well-known ("division hash") family of universal hash functions, but it's a little bit weaker than the above, and I have to figure out if the proof extends.
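To make "parameterized family" concrete, here is a toy member of that classic 2-universal family (my sketch, not the paper's GF(2^n) construction): the parameters a and b play the role of the public pi, chosen at random once and publishable, but fixed before the adversary commits to an input distribution.

#include <stdint.h>
#include <stdio.h>

#define P 2147483647u		/* the Mersenne prime 2^31 - 1 */

struct uhash {
	uint64_t a;		/* 1 <= a < P, chosen uniformly at random, public */
	uint64_t b;		/* 0 <= b < P, chosen uniformly at random, public */
	uint32_t m;		/* size of the output range */
};

/* h_{a,b}(x) = ((a*x + b) mod p) mod m, for inputs x < P. */
static uint32_t uhash_eval(const struct uhash *h, uint32_t x)
{
	return (uint32_t)((h->a * x + h->b) % P) % h->m;
}

int main(void)
{
	struct uhash h = { .a = 1234567u, .b = 8901234u, .m = 1u << 16 };

	printf("h(42) = %u, h(43) = %u\n", uhash_eval(&h, 42), uhash_eval(&h, 43));
	return 0;
}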
Re: Fortuna
> re entropy of the event?" becomes a
> crucial concern here. What if, by your leading example, there is 1/2 bit
> of entropy in each event? Will the estimator even account for 1/2 bits?
> Or will it see each event as 3 bits of entropy? How much of a margin
> of error can we tolerate?

H'm... the old code *used* to handle fractional bits, but the new code seems to round down to the nearest bit. That may have to get fixed to handle low-rate inputs.

As for margin of error, any persistent entropy overestimate is Bad; a 6-fold overestimate is disastrous. What we can do is refuse to drain the main pool below, say, 128 bits of entropy. Then we're safe against any *occasional* overestimates as long as they don't add up to 128 bits.

> /dev/random will output once it has at least 160 bits of entropy
> (iirc), 1/2 bit turning into 3 bits would mean that 160 bits of output
> is effectively only 27 bits worth of true entropy (again, assuming the
> catastrophic reseeder and output function don't waste entropy).
>
> It's a lot of "ifs" for my taste.

/dev/random will output once it has as many bits of entropy as you're asking for. If you do a 20-byte read, it'll output once it has 160 bits. If you do a 1-byte read, it'll output once it has 8 bits.
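A sketch of that reserve rule (mine, with made-up names; nothing from random.c):

#include <stdio.h>
#include <stddef.h>

#define POOL_RESERVE_BITS 128	/* never hand out entropy below this floor */

/*
 * Given the current (possibly overestimated) entropy count and a request,
 * return how many bits we are willing to claim as truly random.
 */
static size_t extractable_bits(size_t pool_entropy_bits, size_t requested_bits)
{
	if (pool_entropy_bits <= POOL_RESERVE_BITS)
		return 0;
	if (requested_bits > pool_entropy_bits - POOL_RESERVE_BITS)
		return pool_entropy_bits - POOL_RESERVE_BITS;
	return requested_bits;
}

int main(void)
{
	printf("%zu\n", extractable_bits(512, 160));	/* 160: plenty in reserve */
	printf("%zu\n", extractable_bits(200, 160));	/* 72: stop at the floor */
	printf("%zu\n", extractable_bits(100, 160));	/* 0: below the floor */
	return 0;
}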
Re: kernel guide to space
my rule is that a comment or broken expression requires braces:

	if (foo) {
		/* We need to barify it, or else pagecache gets FUBAR'ed */
		bar();
	}

	if (foo) {
		bar(p->foo[hash(garply) % LARGEPRIME]->head,
		    flags & ~(FLAG_FOO | FLAG_BAR | FLAG_BAZ | FLAG_QUUX));
	}

> Thus we may be better to slightly encourage use of {}s even if they are
> not needed:
>
> if(foo) {
> 	bar();
> }

It's not horrible to include them, but it reduces clutter sometimes to leave them out.

>> if (foobar(.) + barbar * foobar(bar +
>> 	foo *
>> 	oof)) {
>> }
>
> Ugh, that's as ugly as it can get... Something like below is much
> easier to read...
>
> if (foobar(.) +
>     barbar * foobar(bar + foo * oof)) {
> }

Strongly agreed! If you have to break an expression, do it at the lowest precedence point possible!

> Even easier is
> if (foobar(.)
>     + barbar * foobar(bar + foo * oof)) {
> }
>
> Since a statement cannot start with binary operators
> and as such we are SURE that there must have been something before.

I don't tend to do this, but I see the merit. However, C uses a number of operators (+ - * &) in both unary and binary forms, so it's not always unambiguous. In such cases, I'll usually move the brace onto its own line to make the end of the condition clearer:

	if (foobar(.) +
	    barbar * foobar(bar + foo * oof))
	{
	}

Of course, better yet is to use a temporary or something to shrink the condition down to a sane size, but sometimes you just need

	if (messy_condition_one &&
	    messy_condition_two &&
	    messy_condition_three) {
	}
Re: a 15 GB file on tmpfs
> I have a 15 GB file which I want to place in memory via tmpfs. I want to do
> this because I need to have this data accessible with a very low seek time.

It should work fine. tmpfs has the same limits as any other file system: 2 TB or more, and more than that with CONFIG_LBD.

NOTE, however, that tmpfs does NOT guarantee the data will be in RAM! It uses the page cache just like any other file system, and pages out unused data just like any other file system. If you just want average-case fast, it'll work fine. If you want guaranteed fast, you'll have to work harder.

> I want to know if this is possible before spending 10,000 euros on a machine
> that has 16 GB of memory.

So create a 15 GB file on an existing machine. Make it sparse, so you don't need so much RAM, but test to verify that the kernel doesn't wrap at 4 GB, and can keep the data at offsets 0, 4 GB, 8 GB, and 12 GB separate. Works for me (test code below).

> The machine we plan to buy is a HP Proliant Xeon machine and I want to run a
> 32 bit linux kernel on it (the xeon we want doesn't have the 64-bit stuff
> yet)

If you're working with > 4 GB data sets, I would recommend you think VERY hard before deciding not to get a 64-bit machine. If you could just put all 15 GB into your application's address space:

- The application would be much simpler and faster.
- The kernel wouldn't be slowed by HIGHMEM workarounds. It's not that bad, but it's definitely noticeable.
- Your expensive new machine won't be obsolete quite as fast.

I'd also like to mention that AMD's large L2 TLB is enormously helpful when working with large data sets. It's not discussed much on the web sites that benchmark with games, but it really helps crunch a lot of data.

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64

/* The header names were lost in archiving; these are the obvious ones. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define STRIDE (1<<20)

int main(int argc, char **argv)
{
	int fd;
	off_t off;

	if (argc != 2) {
		fprintf(stderr, "Wrong number of arguments: %u\n", argc);
		return 1;
	}
	fd = open(argv[1], O_RDWR|O_CREAT|O_LARGEFILE, 0666);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}
	/* Loop bound assumed to be 16 GB (0x400000000); the constant was garbled in the archive. */
	for (off = 0; off < 0x400000000LL; off += STRIDE) {
		char buf[40];
		off_t res;
		ssize_t ss1, ss2;

		ss1 = sprintf(buf, "%llu", off);
		res = lseek(fd, off, SEEK_SET);
		if (res == (off_t)-1) {
			perror("lseek");
			return 1;
		}
		ss2 = write(fd, buf, ++ss1);
		if (ss2 != ss1) {
			perror("write");
			return 1;
		}
	}
	for (off = 0; off < 0x400000000LL; off += STRIDE) {
		char buf[40], buf2[40];
		off_t res;
		ssize_t ss1, ss2;

		ss1 = sprintf(buf, "%lld", off);
		res = lseek(fd, off, SEEK_SET);
		if (res == (off_t)-1) {
			perror("lseek");
			return 1;
		}
		ss2 = read(fd, buf2, ++ss1);
		if (ss2 != ss1 || memcmp(buf, buf2, ss1) != 0) {
			fprintf(stderr, "Mismatch at %llu: %.*s vs. %s\n",
				off, (int)ss2, buf2, buf);
			return 1;
		}
	}
	printf("All tests succeeded.\n");
	return 0;
}
Re: CCITT-CRC16 in kernel
> Does anybody know what the CRC of a known string is supposed
> to be? I have documentation that states that the CCITT CRC-16
> of "123456789" is supposed to be 0xe5cc and "A" is supposed
> to be 0x9479. The kernel one doesn't do this. In fact, I
> haven't found anything on the net that returns the "correct"
> value regardless of how it's initialized or how it's mucked
> with after the CRC (well I could just set the CRC to 0 and
> add the correct number). Anyway, how do I use the crc_citt
> in the kernel? I've grepped through some drivers that use
> it and they all seem to check the result against some
> magic rather than performing the CRC of data, but not the
> CRC, then comparing it to the CRC. One should not have
> to use magic to verify a CRC, one should just perform
> a CRC on the data, but not the CRC, then compare the result
> with the CRC. Am I missing something here?

There are two common 16-bit CRC polynomials. The original IBM CRC-16 is x^16 + x^15 + x^2 + 1. The more popular CRC-CCITT is x^16 + x^12 + x^5 + 1. Both of these include (x+1) as a factor, so provide parity detection, detecting all odd-bit errors, at the expense of reducing the largest detectable 2-bit error from 65535 bits to 32767.

All CRC algorithms work on bit strings, so an endianness convention for bits within a byte is always required. Unless specified, the little-endian RS-232 serial transmission order is generally assumed. That is, the least significant bit of the first byte is "first". This bit string is equated to a polynomial where the first bit is the coefficient of the highest power of x, and the last bit (msbit of the last byte) is the coefficient of x^0. (Some people think of this as big-endian, and get all confused.)

Using this bit-ordering, and omitting the x^16 term as is conventional (it's implicit in the implementation), the polynomials come out as:

CRC-16:    0xa001
CRC-CCITT: 0x8408

The mathematically "cleanest" CRC has the unfortunate property that leading or trailing zero bits can be added or removed without affecting the CRC computation. That is, they are not detected as errors. For fixed-size messages, this does not matter, but for variable-sized messages, a way to detect inserted or deleted padding is desirable.

To detect leading padding, it is customary to invert the first 16 bits of the message. This is equivalent to initializing the CRC accumulator to all-ones rather than 0, and is invariably implemented that way. This change is duplicated on CRC verification, and has no effect on the final result.

To detect trailing padding, it is customary to invert all 16 bits of the CRC before appending it to the message. This has an effect on CRC verification. One way to CRC-check a message is to compute the CRC of the entire message *including* the CRC. You can see this in many link-layer protocol specifications which place the trailing frame delimiter after the CRC, because the decoding hardware doesn't need to know in advance where the message stops and the CRC starts. If the CRC is NOT inverted, the CRC of a correct message should be zero. If the CRC is inverted, the correct CRC is a non-zero constant. You can still use the same "checksum everything, including the original CRC" technique, but you have to compare with a non-zero result value. For CRC-16, the final result is x^15 + x^3 + x^2 + 1 (0xb001). For CRC-CCITT, the final result is x^12 + x^11 + x^10 + x^8 + x^3 + x^2 + x + 1 (0xf0b8).

The *other* thing you have to do is append the checksum to the message correctly. As mentioned earlier, the lsbit of a byte is considered first, so the lsbyte of the 16-bit accumulator is appended first.

Anyway, with all this, and using preset-to-all-ones:

CRC-CCITT of "A" is 0x5c0a, or f5 a3 when inverted and converted to bytes.
CRC-CCITT of "123456789" is 0x6f91, or 6e 90.

(When preset to zero, the values are 0x538d and 0x2189, respectively. That would be 8d 53 or 89 21 if *not* inverted.)
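For anyone who wants to reproduce those numbers, here is a small self-contained program (mine, not from the original mail) using a bit-at-a-time routine that matches the lsbit-first description above. It prints 0x6f91 for "123456789" with preset-to-ones, and 0xf0b8 once the inverted CRC is appended and the whole frame is checksummed again:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* lsbit-first CRC-CCITT, polynomial 0x8408, bit-at-a-time. */
static uint16_t crc_ccitt_bitwise(uint16_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x8408 : 0);
	}
	return crc;
}

int main(void)
{
	const unsigned char msg[] = "123456789";
	unsigned char frame[11];
	uint16_t crc;

	crc = crc_ccitt_bitwise(0xffff, msg, 9);
	printf("CRC of \"123456789\", preset to ones: 0x%04x\n", crc);	/* 0x6f91 */

	/* Append the inverted CRC, lsbyte first, and check the whole frame. */
	memcpy(frame, msg, 9);
	frame[9]  = ~crc & 0xff;
	frame[10] = (~crc >> 8) & 0xff;
	printf("CRC over frame incl. inverted CRC:  0x%04x\n",
	       crc_ccitt_bitwise(0xffff, frame, 11));			/* 0xf0b8 */
	return 0;
}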
Re: CCITT-CRC16 in kernel
>> Using this bit-ordering, and omitting the x^16 term as is
>> conventional (it's implicit in the implementation), the polynomials
>> come out as:
>> CRC-16: 0xa001
>> CRC-CCITT: 0x8408
>
> Huh? That's the problem.
>
> X^16 + X^12 + X^5 + X^0 = 0x1021, not 0xa001
>
> Also,
>
> X^16 + X^15 + X^2 + X^0 = 0x8005, not 0x8408

You're wrong in two ways:
1) You've got CRC-16 and CRC-CCITT mixed up, and
2) You've got the bit ordering backwards.

Remember, I said very clearly, the lsbit is the first bit, and the first bit is the highest power of x. You can reverse the convention and still have a CRC, but that's not the way it's usually done and it's more awkward in software.

CRC-CCITT = X^16 + X^12 + X^5 + X^0 = 0x8408, and NOT 0x1021
CRC-16 = X^16 + X^15 + X^2 + X^0 = 0xa001, and NOT 0x8005

> Attached is a program that will generate a table of polynomials
> for the conventional CRC lookup-table code. If you look at
> the table in the kernel code, offset 1, you will see that
> the polynomial is 0x1189. This corresponds to the CRC of
> the value 1. It does not correspond to either your polynomials
> or the ones documented on numerous web pages.

No, it doesn't. The table entry at offset *128* is the CRC polynomial, which is 0x8408, exactly as the comment just above the table says.

> I think somebody just guessed and came up with "magic" because the
> table being used isn't correct.

The table being used is 100% correct. There is no mistake. If you think you've found a mistake, there's something you're not understanding. Sorry to be so blunt, but it's true.

>> The *other* thing you have to do is append the checksum to the message
>> correctly. As mentioned earlier, the lsbit of a byte is considered
>> first, so the lsbyte of the 16-bit accumulator is appended first.
>
> Right, but the hardware did that. I have no control over that. I
> have to figure out if:
>
> (1) It started with 0x or something else.
> (2) It was inverted after.
> (3) The result was byte-swapped.
>
> With the "usual" CRC-16 that I used before, using the lookup-
> table that is for the 0x1021 polynomial, hardware was found
> to have inverted and byte-swapped, but started with 0xefde
> (0x1021 inverted). Trying to use the in-kernel CRC, I was
> unable to find anything that made sense.

You can get rid of the starting value and inversion by XORing together two messages (with valid CRCs) of equal length. The result has a valid CRC with preset to 0 and no inversion. You can figure that out later. Then, the only questions are the polynomial and bit ordering. (You can also have a screwed-up CRC byte ordering, but that's rare except in software written by people who don't know better. Hardware invariably gets it right.)

As I said, the commonest case is to consider the lsbit first. However, some implementations take the msbit of each byte first. Here's code to do it both ways. This is the bit-at-a-time version, not using a table. You can verify that the first implementation, fed an initial crc=0, poly=0x8408, and all possible 1-byte messages, produces the table in crc-ccitt.c.

/*
 * Expects poly encoded so 0x8000 is x^0 and 0x0001 is x^15.
 * CRC should be appended lsbyte first.
 */
uint16_t crc_lsb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= (unsigned char)*p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
	}
	return crc;
}

/*
 * Expects poly encoded so 0x0001 is x^0 and 0x8000 is x^15.
 * CRC should be appended msbyte first.
 */
uint16_t crc_msb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= (uint16_t)(unsigned char)*p++ << 8;
		for (i = 0; i < 8; i++)
			crc = (crc << 1) ^ ((crc & 0x8000) ? poly : 0);
	}
	return crc;
}

If you're trying to reverse-engineer an unknown CRC, get two valid messages of the same length, form their XOR, and try a few different polynomials. (There's a way to do it more efficiently using a GCD, but on a modern machine, it's faster to try all 32768 possible polynomials than to write and debug the GCD code.) After that, you can figure out the preset and final inversion, if any. For fixed-length messages, you can merge them into a single 16-bit constant that you can include at the beginning or the end, but if you have variable-length messages, it matters.
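And a sketch of that brute-force search (mine, a fragment rather than a complete program: it reuses the crc_lsb_first() routine above, assumes frames of at most 256 bytes with the 16-bit CRC appended lsbyte first, and only tries the lsbit-first ordering; repeat with crc_msb_first() and the other byte order if nothing matches):

#include <stdint.h>
#include <stddef.h>

uint16_t crc_lsb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len);	/* as above */

/*
 * f1 and f2 are two captured frames of equal length "len" (including the
 * two CRC bytes).  XORing them cancels any preset and any final inversion,
 * so the XOR must check out with crc = 0 under the right polynomial.
 * Returns the matching polynomial, or 0 if none of the candidates fit.
 */
static uint16_t find_poly(const unsigned char *f1, const unsigned char *f2,
			  size_t len)
{
	unsigned char x[256];
	size_t i, n = len - 2;
	uint32_t poly;

	for (i = 0; i < len; i++)
		x[i] = f1[i] ^ f2[i];

	/* The 32768 candidates with an x^0 term (0x8000 in this encoding). */
	for (poly = 0x8000; poly <= 0xffff; poly++) {
		uint16_t crc = crc_lsb_first(0, (uint16_t)poly, x, n);

		if ((crc & 0xff) == x[n] && (crc >> 8) == x[n + 1])
			return (uint16_t)poly;
	}
	return 0;
}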
Re: CCITT-CRC16 in kernel
>> CRC-CCITT = X^16 + X^12 + X^5 + X^0 = 0x8408, and NOT 0x1021
>> CRC-16 = X^16 + X^15 + X^2 + X^0 = 0xa001, and NOT 0x8005
>
> Thank you very much for your time, but what you say is completely
> different than anything else I have found on the net.
>
> Do the math:
>
> 2^16 = 65536
> 2^12 =  4096
> 2^5  =    32
> 2^0  =     1
> --
> 69665 = 0x11021
>
> That's by convention 0x1021 as the X^16 is thrown away. I have
> no clue how you could possibly get 0x8408 out of this, nor
> how the CRC of 1 could possibly lie at offset 128 in a table
> of CRC polynomials. Now I read it in the header, but that
> doesn't make it right.

The thing is that X is not 2. x is a formal variable with no defined value.

x^0  is represented as 0x8000
x^5  is represented as 0x0400
x^12 is represented as 0x0008
x^16 is not represented by any bit
            TOTAL: 0x8408

> The "RS-232C" order to which you refer simply means that the
> string of "bits" needs to handled as a string of bytes, not
> words or longwords, in other words, not interpreted as
> words, just bytes. If this isn't correct then ZMODEM and
> a few other protocols are wrong. You certainly don't
> swap every BIT in a string do you? You are not claiming
> that (0x01 == 0x80) and (0x02 == 0x40), etc, are you?

Not at all. To repeat:

- A CRC is computed over a string of *bits*. All of its error-correction properties are described in terms of *bit* patterns and *bit* positions and runs of adjacent *bits*. It does not know or care about larger structures such as bytes.

- The CRC algorithm requires that the *first* bit it sees is the coefficient of the highest power of x, and the *last* bit it sees is the coefficient of x^0. This is because it's basically long division.

- If you are working in software, you (the implementor) must define a mapping between a byte string and a bit string. There are only two mappings that make any sense at all:

  1) The least-significant bit of each byte is considered "first", and the most-significant is considered "last".
  2) The most-significant bit of each byte is considered "first", and the least-significant is considered "last".

The logic of the CRC *does not care* which one you choose, but you have to choose one. If the bytes are to be converted to bit-serial form, it is best to choose the form actually used for transmission, to preserve the burst error detection properties of the CRC.

Note that:

- Many people (including, apparently, you) find the second choice a bit easier to visualize, as bit i is the coefficient of x^i.

- The first choice is
  a) Easier to implement in software, and
  b) Matches RS-232 transmission order, and
  c) Is used by hardware such as the Z8530 SCC and MPC860 QUICC, and
  d) Is the form invariably used by experienced software implementors.

If you have some weird piece of existing hardware, it might have chosen either. Just try them both and see which works. However, if your hardware uses the opposite bit ordering within bytes, DO NOT ATTEMPT to "fix" lib/crc-ccitt.c. It will break all of the existing users of the code.
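A quick way to convince yourself that the two camps are describing the same polynomials, just with the bits written in opposite order (my check, not part of the original mail):

#include <stdint.h>
#include <stdio.h>

/* Reverse the 16 bits of x, so bit 0 becomes bit 15 and so on. */
static uint16_t rev16(uint16_t x)
{
	uint16_t r = 0;
	int i;

	for (i = 0; i < 16; i++)
		r |= ((x >> i) & 1) << (15 - i);
	return r;
}

int main(void)
{
	printf("CRC-CCITT: 0x%04x <-> 0x%04x\n", 0x1021, rev16(0x1021));	/* 0x8408 */
	printf("CRC-16:    0x%04x <-> 0x%04x\n", 0x8005, rev16(0x8005));	/* 0xa001 */
	return 0;
}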
Re: CCITT-CRC16 in kernel
> The "Bible" has been: > http://www.joegeluso.com/software/articles/ccitt.htm This fellow is just plain Confused. First of all, The Standard Way to do it is to preset to -1 *and* to invert the result. Any combination is certainly valid, but if you don't invert the CRC before appending it, you fail to detect added or deleted trailing zero bits, which can be A Bad Thing in some applications. Secondly, I see what he's on about "trailing zero bits", but he isn't aware that *everyone* uses the "implicit" algorithm, so the reason that the specs don't explain that very well is that the spec writers forgot there was any other way to do it. So his long-hand calculations are just plain WRONG. Presetting the CRC accumulator to -1 is equivalent to *inverting the first 16 bits of the message*, and NOT to prepending 16 1 bits. Also, he's got his bit ordering wrong. The correct way to do it, long-hand, is this: Polynomial: x^16 + x^12 + x^5 + 1. In bits, that's 100010011 Message: ascii "A", 0x41. Convert to bits, lsbit first: 1010 Append trailing padding to hold CRC: 1010 Invert first 16 bits: 0101 Now let's do the long division. You can compute the quotient, but it's not needed for anything, so I'm not bothering to write it down: 0101 100010011 - 1110010111010 100010011 - 1101100110110 100010011 - 1010000101110 100010011 - 100111000 100010011 - 010100111010 - Final remainder XOR trailing padding with computed CRC (since we used padding of zero, that's equivalent to overwriting it): 1010010100111010 Or, if the CRC is inverted: 1010110101000101 To double-check, let's verify that CRC. I'm verifying the non-inverted CRC, so I expect a remainder of zero: Received message: 1010010100111010 Invert first 16 bits: 0101101000111010 0101101000111010 100010011 - 11100110100111011 100010011 - 11011101000110101 100010011 - 1010101101001 100010011 - 100010011 100010011 - 000 - Final remainder Now, note how in each step, the decision whether to XOR with the 17-bit polynomial is made based soley on the leading bit of the remainder. The trailing 16 bits are modified (by XOR with the polynomial), but not examined. This leads to a standard optimization, where the bits from the dividend are not merged into the working remainder until the moment they are needed. Each step, the leading bit of the 17-bit remainder is XORed with the next dividend bit, and the polynomial is XORed in as required to force the leading bit to 0. Then the remainder is shifted, discarding the leading 0 bit and shifting in a trailing 0 bit. This technique avoids the need for explicit padding, and is the way that the computation is invariably performed in all but pedagogical implementations. Also, the awkward 17-bit size of the remainder register can be reduced to 16 bits with care, as at any given moment, one of the bits is known to be zero. It is usually the trailing bit, but between the XOR and the shift, it is the leading bit. (Again, recall that in a typical software implementation, the "leading bit" is the lsbit and the "trailing bit" is the msbit. Because the CRC algorithm does not use addition or carries or anything of the sort, it does not care which convention software uses.) > I have spent over a week grabbing everything on the Web that > could help decipher the CCITT CRC and they all show this > same kind of code and same kind of organization. Nothing > I could find on the Web is like the linux kernel ccitt_crc. > Go figure. 
Funny, I can find it all over the place:

http://www.nongnu.org/avr-libc/user-manual/group__avr__crc.html
http://www.aerospacesoftware.com/checks.htm
http://www.bsdg.org/SWAG/CRC/0011.PAS.html
http://www.ethereal.com/lists/ethereal-dev/200406/msg00414.html
http://pajhome.org.uk/progs/crcsrcc.html
http://koders.com/c/fidE2A434B346BFDCD29DA556A54E37C99E403ED26B.aspx

> Do you suppose it was bit-swapped to bypass a patent?

There's no patent. That's just the way that the entire SDLC family of protocols (HDLC, LAPB, LAPD, SS#7, X.25, AX.25, PPP, IRDA, etc.) do it. They transmit lsbit-first, so they compute lsbit-first.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Actually, is there any place *other* than write() to the page cache that warrants a non-temporal store? Network sockets with scatter/gather and hardware checksum, maybe? This is pretty much synonymous with what is allowed to go into high memory, no?

While we're on the subject, for the copy_from_user source, prefetchnta is probably indicated. If user space hasn't caused it to be cached already (admittedly, the common case), we *know* the kernel isn't going to look at that data again.
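As a user-space illustration of the idea only (mine; not the kernel's __copy_from_user_ll, and it assumes SSE2, 16-byte alignment and a length that is a multiple of 64):

#include <emmintrin.h>
#include <stddef.h>

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	const char *s = src;
	char *d = dst;
	size_t i;

	for (i = 0; i < len; i += 64) {
		/* Hint: we will read this source line once and not come back. */
		_mm_prefetch(s + i + 256, _MM_HINT_NTA);

		__m128i a = _mm_load_si128((const __m128i *)(s + i));
		__m128i b = _mm_load_si128((const __m128i *)(s + i + 16));
		__m128i c = _mm_load_si128((const __m128i *)(s + i + 32));
		__m128i e = _mm_load_si128((const __m128i *)(s + i + 48));

		/* Streaming stores bypass the cache on the way out. */
		_mm_stream_si128((__m128i *)(d + i), a);
		_mm_stream_si128((__m128i *)(d + i + 16), b);
		_mm_stream_si128((__m128i *)(d + i + 32), c);
		_mm_stream_si128((__m128i *)(d + i + 48), e);
	}
	_mm_sfence();	/* order the streaming stores before the data is used */
}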
Re: Fortuna
se a fixed seed. Also, unless I'm misunderstanding the definition very badly, any "strong extractor" can use a fixed secret seed. > I'm not sure whether any of the above will be practically relevant. > They may be too theoretical for real-world use. But if you're interested, > I could try to give you more information about any of these categories. I'm doing some reading to see if something practical can be dug out of the pile. I'm also looking at "compressors", which are a lot like our random pools; they reduce the size of an input while preserving its entropy, just not necessarily to 100% density like an extractor. This is attractive because our entropy measurement is known to be heavily derated for safety. An extractor, in producing an output that is as large as our guaranteed entropy, will throw away any additional entropy that might be remaining. The other thing that I absolutely need is some guarantee that things will still mostly work if our entropy estimate is wrong. If /dev/random produces 128 bits of output that only have 120 bits of entropy in them, then your encryption is still secure. But these extractor constructions are very simple and linear. If everything falls apart if I overestimate the source entropy by 1 bit, it's probably a bad idea. Maybe it can be salvaged with some cryptographic techniques as backup. >> 3) Fortuna's design doesn't actually *work*. The authors' analysis >>only works in the case that the entropy seeds are independent, but >>forgot to state the assumption. Some people reviewing the design >>don't notice the omission. > Ok, now I understand your objection. Yup, this is a real objection. > You are right to ask questions about whether this is a reasonable assumption. > > I don't know whether /dev/random makes the same assumption. I suspect that > its entropy estimator is making a similar assumption (not exactly the same > one), but I don't know for sure. Well, the entropy *accumulator* doesn't use any such assumption. Fortuna uses the independence assumption when it divides up the seed material round-robin among the various subpools. The current /dev/random doesn't do anything like that. (Of course, non-independence affects us by limiting us to the conditional entropy of any given piece of seed material.) > I also don't know whether this is a realistic assumption to make about > the physical sources we currently feed into /dev/random. That would require > some analysis of the physics of those sources, and I don't have the skills > it would take to do that kind of analysis. And given the variety of platforms that Linux runs on, it gets insane. Yes, it can be proved based on fluid flow computations that hard drive rotation rates are chaotic and thus disk access timing is a usable entropy source, but then someone installs a solid-state disk. That's why I like clock jitter. That just requires studying oscillators and PLLs, which are universal across all platforms. > Actually, this example scenario is not a problem. I'll finish the > analysis for you. Er... thank you, but I already knew that; I omitted the completion because it seemed obvious. And yes, there are many other distributions which are worse. But your 200-round assumption is flawed; I'm assuming the Fortuna schedule, which is that subpool i is dumped into the main pool (and thus information-theoretically available at the output) every 2^i rounds. So the second pool is dumped in every 2 rounds, not every 200. 
And with 1/3 the entropy rate, if the first pool is brute-forceable (which is our basic assumption), then the second one certainly is. Now, this simple construction doesn't extend to more pools, but it's trying to point out the lack of a *disproof* of a source distribution where higher-order pools get exponentially less entropy per seed due to the exposure of lower-order pools. Which would turn Fortuna into an elaborate exercise in bit-shuffling for no security benefit at all. This can all be dimly seen through the papers on extractors, where low-k sources are really hard to work with; all the designs want you to accumulate enough input to get a large k. > If you want a better example of where the two-pool scheme completely > falls apart, consider this: our source picks a random bit, uses this > same bit the next two times it is queried, and then picks a new bit. > Its sequence of outputs will look like (b0,b0,b1,b1,b2,b2,..,). If > we alternate pools, then the first pool sees the sequence b0,b1,b2,.. > and the second pool sees exactly the same sequence. Consequently, an > adversary who can observe the entire evolution of the first pool can > deduce everything there is to know about the second pool. This just > illustrates that these multiple-poo
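For readers trying to follow the schedule being argued about: Fortuna spreads incoming events round-robin across its pools, and pool i is drained into the generator only on reseeds whose count is divisible by 2^i. A standalone toy (nothing cryptographic, just the bookkeeping) that prints which pools participate in each reseed:

	#include <stdio.h>

	#define NPOOLS 8	/* the real design uses 32; 8 keeps the output short */

	int main(void)
	{
		unsigned r, i;

		for (r = 1; r <= 16; r++) {
			printf("reseed %2u: pools", r);
			/* Pool i participates iff 2^i divides the reseed count. */
			for (i = 0; i < NPOOLS; i++) {
				if (r % (1u << i))
					break;
				printf(" %u", i);
			}
			printf("\n");
		}
		return 0;
	}

So pool 0 contributes every time, pool 1 every second reseed, pool 2 every fourth, and so on -- which is where the "every 2^i rounds" above comes from.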
Re: enforcing DB immutability
[A discussion on the git list about how to provide a hardlinked file that *cannot* be modified by an editor, but must be replaced by a new copy.] [EMAIL PROTECTED] wrote all of: >>> perhaps having a new 'immutable hardlink' feature in the Linux VFS >>> would help? I.e. a hardlink that can only be readonly followed, and >>> can be removed, but cannot be chmod-ed to a writeable hardlink. That i >>> think would be a large enough barrier for editors/build-tools not to >>> play the tricks they already do that makes 'readonly' files virtually >>> meaningless. >> >> immutable hardlinks have the following advantage: a hardlink by design >> hides the information where the link comes from. So even if an editor >> wanted to play stupid games and override the immutability - it doesnt >> know where the DB object is. (sure, it could find it if it wants to, >> but that needs real messing around - editors wont do _that_) > > so the only sensible thing the editor/tool can do when it wants to > change the file is precisely what we want: it will copy the hardlinked > files's contents to a new file, and will replace the old file with the > new file - a copy on write. No accidental corruption of the DB's > contents. This is not a horrible idea, but it touches on another sore point I've worried about for a while. The obvious way to do the above *without* changing anything is just to remove all write permission to the file. But because I'm the owner, some piece of software running with my permissions can just decide to change the permissions back and modify the file anyway. Good old 7th edition let you give files away, which could have addressed that (chmod a-w; chown phantom_user), but BSD took that ability away to make accounting work. The upshot is that, while separate users keep malware from harming the *system*, if I run a piece of malware, it can blow away every file I own and make me unhappy. When (notice I'm not saying "if") commercial spyware for Linux becomes common, it can also read every file I own. Unless I have root access, Linux is no safer *for me* than Redmondware! Since I *do* have root access, I often set up sandbox users and try commercial binaries in that environment, but it's a pain and laziness often wins. I want a feature that I can wrap in a script, so that I can run a commercial binary in a nicely restricted environment. Or maybe I even want to set up a "personal root" level, and run my normal interactive shells in a slightly restricted environment (within which I could make a more-restricted world to run untrusted binaries). Then I could solve the immutable DB issue by having a "setuid" binary that would make checked-in files unwriteable at my normal permission level. Obviously, a fundamental change to the Unix permissions model won't be available to solve short-term problems, but I thought I'd raise the issue to get people thinking about longer-term solutions. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
> Since fxsave leaves the FPU state intact, there ought to be a better way > to do this but it gets tricky. Maybe using the TSC to put a timestamp > in every thread save area? > > when saving FPU state: > put cpu# and timestamp in thread state info > also store timestamp in per-cpu data > > on task switch: > compare cpu# and timestamps for next task > if equal, clear TS and set TS_USEDFPU > > when state becomes invalid for some reason: > zero cpu's timestamp > > But the extra overhead might be too much in many cases. Simpler: - Thread has "CPU that I last used FPU on" pointer. Never NULL. - Each CPU has "thread whose FPU state I hold" pointer. May be NULL. When *loading* FPU state: - Set up both pointers. On task switch: - If the pointers point to each other, then clear TS and skip restore. ("Preloaded") When state becomes invalid (kernel MMX use, or whatever) - Set CPU's pointer to NULL. On thread creation: - If current CPU's thread pointer points to the newly allocated thread, clear it to NULL. - Set thread's CPU pointer to current CPU. The UP case just omits the per-thread CPU pointer. (Well, stores it in zero bits.) An alternative SMP thread-creation case would be to have a NULL value for the thread-to-CPU pointer and initialize the thread's CPU pointer to that, but that then complicates the UP case. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
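Here is a standalone sketch of the pointer scheme above, with all names invented for illustration; the real code would live in the i386 FPU switch path, not in freestanding structures like these.

	struct thread {
		int fpu_cpu;			/* CPU I last used the FPU on; never invalid */
		/* ... saved FPU image, etc. ... */
	};

	struct cpu {
		struct thread *fpu_thread;	/* thread whose FPU state I hold; may be NULL */
	};

	/* When actually loading FPU state onto a CPU: set up both pointers. */
	static void fpu_loaded(struct cpu *cpu, int cpunum, struct thread *t)
	{
		cpu->fpu_thread = t;
		t->fpu_cpu = cpunum;
	}

	/* On task switch: nonzero means clear TS and skip the restore ("preloaded"). */
	static int fpu_preloaded(const struct cpu *cpu, int cpunum, const struct thread *next)
	{
		return cpu->fpu_thread == next && next->fpu_cpu == cpunum;
	}

	/* When the state becomes invalid (kernel MMX use, or whatever): */
	static void fpu_invalidate(struct cpu *cpu)
	{
		cpu->fpu_thread = NULL;
	}

	/* On thread creation: */
	static void fpu_init_thread(struct cpu *cpu, int cpunum, struct thread *t)
	{
		if (cpu->fpu_thread == t)	/* recycled memory would be a stale match */
			cpu->fpu_thread = NULL;
		t->fpu_cpu = cpunum;
	}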
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
> OK, I guess when I get some time, I'll start testing all the i386 bitop > functions, comparing the asm with the gcc versions. Now could someone > explain to me what's wrong with testing hot cache code. Can one > instruction retrieve from memory better than others? To add one to Linus' list, note that all current AMD & Intel chips record instruction boundaries in L1 cache, either predecoding on L1 cache load, or marking the boundaries on first execution. The P4 takes it to an extreme, but P3 and K7/K8 do it too. The result is that there are additional instruction decode limits that apply to cold-cache code. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Add prefetch switch stack hook in scheduler function
> include/asm-alpha/mmu_context.h |6 ++ > include/asm-arm/mmu_context.h |6 ++ > include/asm-arm26/mmu_context.h |6 ++ > include/asm-cris/mmu_context.h |6 ++ > include/asm-frv/mmu_context.h |6 ++ > include/asm-h8300/mmu_context.h |6 ++ > include/asm-i386/mmu_context.h |6 ++ > include/asm-ia64/mmu_context.h |6 ++ > include/asm-m32r/mmu_context.h |6 ++ > include/asm-m68k/mmu_context.h |6 ++ > include/asm-m68knommu/mmu_context.h |6 ++ > include/asm-mips/mmu_context.h |6 ++ > include/asm-parisc/mmu_context.h|6 ++ > include/asm-ppc/mmu_context.h |6 ++ > include/asm-ppc64/mmu_context.h |6 ++ > include/asm-s390/mmu_context.h |6 ++ > include/asm-sh/mmu_context.h|6 ++ > include/asm-sh64/mmu_context.h |6 ++ > include/asm-sparc/mmu_context.h |6 ++ > include/asm-sparc64/mmu_context.h |6 ++ > include/asm-um/mmu_context.h|6 ++ > include/asm-v850/mmu_context.h |6 ++ > include/asm-x86_64/mmu_context.h|5 + > include/asm-xtensa/mmu_context.h|6 ++ > kernel/sched.c |9 - > 25 files changed, 151 insertions(+), 1 deletion(-) I think this pretty clearly points out the need for some arch-generic infrastructure in Linux. An awful lot of arch hooks are for one or two architectures with some peculiarities, and the other 90% of the implementations are identical. For example, this is 22 repetitions of #define MIN_KERNEL_STACK_FOOTPRINT L1_CACHE_BYTES with one different case. It would be awfully nice if there was a standard way to provide a default implementation that was automatically picked up by any architecture that didn't explicitly override it. One possibility is to use #ifndef: /* asm-$PLATFORM/foo.h */ #define MIN_KERNEL_STACK_FOOTPRINT IA64_SWITCH_STACK_SIZE inline void prefetch_task(struct task_struct const *task) { ... } #define prefetch_task prefetch_task /* asm-generic/foo.h */ #include #ifndef MIN_KERNEL_STACK_FOOTPRINT #define MIN_KERNEL_STACK_FOOTPRINT L1_CACHE_BYTES #endif #ifndef prefetch_task inline void prefetch_task(struct task_struct const *task) { } /* The #define is OPTIONAL... */ #define prefetch_task prefetch_task #endif But both understanding and maintaining the arch code could be much easier if the shared parts were collapsed. A comment in the generic versions can explain what the assumptions are. If there are cases where there is more than one implementation with multiple users, it can be stuffed into a third category of headers. E.g. and or some such, using the same duplicate-suppression technique and #included at the end of - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
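As a self-contained illustration of the override-with-default pattern being proposed (the file names, constant values and hook shown here are invented for the example; the real hook and constant would come from the patch above):

	/* asm-foo/task_prefetch.h: an architecture that wants its own versions */
	#define MIN_KERNEL_STACK_FOOTPRINT 512		/* e.g. a big switch-stack frame */

	struct task_struct;				/* forward declaration for the sketch */

	static inline void prefetch_task(const struct task_struct *task)
	{
		__builtin_prefetch(task);		/* stand-in for the arch-specific hook */
	}
	#define prefetch_task prefetch_task

	/* asm-generic/task_prefetch.h: defaults, picked up by everyone who didn't override */
	#ifndef MIN_KERNEL_STACK_FOOTPRINT
	#define MIN_KERNEL_STACK_FOOTPRINT 64		/* i.e. L1_CACHE_BYTES */
	#endif

	#ifndef prefetch_task
	static inline void prefetch_task(const struct task_struct *task) { }
	#define prefetch_task prefetch_task
	#endif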
Re: How to get dentry from inode number?
> How can I get a full pathname from an inode number ? (Our data > structure only keep track inode number instead of pathname in > order to keep thin, so don't have any information but inode > number.) Except in extreme circumstances (there's some horrible kludgery in the NFS code), you don't. Just store a dentry pointer to begin with; it's easy to map from dentry to inode. In addition to files with multiple names, you can have files with no names, made by the usual Unix trick of deleting a file after opening it. The NFS kludgery is required by the short-sighted design of the NFS protocol. Don't emulate it, or you will be lynched by a mob of angry kernel developers with torches and pitchforks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
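A minimal sketch of the workable direction. It assumes the 2.6-era VFS (dget/dput and d_inode are real); the record structure and function names are mine.

	/* Keep a pinned dentry in your bookkeeping, not an inode number. */
	struct my_record {
		struct dentry *dentry;
	};

	static void my_record_set(struct my_record *r, struct dentry *d)
	{
		r->dentry = dget(d);		/* take a reference while we hold it */
	}

	static struct inode *my_record_inode(const struct my_record *r)
	{
		return r->dentry->d_inode;	/* dentry -> inode is a field access */
	}

	static void my_record_drop(struct my_record *r)
	{
		dput(r->dentry);		/* release when the record goes away */
	}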
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
> static inline int new_find_first_bit(const unsigned long *b, unsigned size)
> {
>         int x = 0;
>         do {
>                 unsigned long v = *b++;
>                 if (v)
>                         return __ffs(v) + x;
>                 if (x >= size)
>                         break;
>                 x += 32;
>         } while (1);
>         return x;
> }

Wait a minute... suppose that size == 32 and the bitmap is one word of all zeros. Dynamic execution will overflow the buffer:

        int x = 0;
        unsigned long v = *b++;         /* Zero */
        if (v)                          /* False, v == 0 */
        if (x >= size)                  /* False, 0 < 32 */
        x += 32;
        } while (1);
        unsigned long v = *b++;         /* Buffer overflow */
        if (v)                          /* Random value, suppose non-zero */
        return __ffs(v) + x;            /* >= 32 */

That should be:

static inline int new_find_first_bit(const unsigned long *b, unsigned size)
{
        int x = 0;
        do {
                unsigned long v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += 32) < size);
        return size;
}

Note that we assume that the trailing long is padded with zeros. In truth, it should probably be either

static inline unsigned new_find_first_bit(u32 const *b, unsigned size)
{
        int x = 0;
        do {
                u32 v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += 32) < size);
        return size;
}

or

static inline unsigned new_find_first_bit(unsigned long const *b, unsigned size)
{
        unsigned x = 0;
        do {
                unsigned long v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += CHAR_BIT * sizeof *b) < size);
        return size;
}

Do we actually store bitmaps on 64-bit machines with 32 significant bits per ulong? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Need better is_better_time_interpolator() algorithm
> (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus)) (Note that 1/cpus, being a constant for all evaluations of this expression, has no effect on the final ranking.) The usual way it's done is with some fiddle factors: quality_a^a * quality_b^b * quality_c^c Or, equivalently: a * log(quality_a) + b * log(quality_b) + c * log(quality_c) Then you use the a, b and c factors to weight the relative importance of them. Your suggestion is equivalent to setting all the exponents to 1. But you can also say that "a is twice as important as b" in a consistent manner. Note that computing a few bits of log_2 is not hard to do in integer math if you're not too anxious about efficiency:

unsigned log2(unsigned x)
{
        unsigned result = 31;
        unsigned i;

        assert(x);
        while (!(x & (1u << 31))) {
                x <<= 1;
                result--;
        }
        /* Think of x as a 1.31-bit fixed-point number, 1 <= x < 2 */
        for (i = 0; i < NUM_FRACTION_BITS; i++) {
                unsigned long long y = x;

                /* Square x and compare to 2. */
                y *= x;
                result <<= 1;
                if (y & (1ull << 63)) {
                        result++;
                        x = (unsigned)(y >> 32);
                } else {
                        x = (unsigned)(y >> 31);
                }
        }
        return result;
}

Setting NUM_FRACTION_BITS to 16 or so would give enough room for reasonable-sized weights and not have the total overflow 32 bits. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
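To make the weighting concrete, here is one way the fixed-point log2() above might be combined into a single rank. The weights and parameter names are made-up examples, not tuned values, and every argument must be nonzero (per the assert).

	#define NUM_FRACTION_BITS 16

	/* Bigger is better: 3*log2(freq) - 2*log2(drift) - 2*log2(latency) - log2(jitter) */
	static int rank_interpolator(unsigned freq, unsigned drift,
	                             unsigned latency, unsigned jitter)
	{
		return 3 * (int)log2(freq) - 2 * (int)log2(drift)
		     - 2 * (int)log2(latency) - (int)log2(jitter);
	}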
2.6.24-rc6 oops in net_tx_action
Kernel is 2.6.24-rc6 + linuxpps patches, which are all to the serial port driver. 2.6.23 was known stable. I haven't tested earlier 2.6.24 releases. I think it happened once before; I got a black-screen lockup with keyboard LEDs blinking, but that was with X running so I couldn't see a console oops. But given that I installed 2.6.24-rc6 about 24 hours ago, that's a disturbing pattern. (N.B. I was pretty careful, but the following was transcribed by hand.) BUG: unable to handle kernel paging request at virtual address 00100104 printing eip: b02b3d6a *pde= Oops: 0002 [#1] Pid 3162, comm: ntop Not tainted (2.6.24-rc6 #36) EIP: 0060[] EFLAGS: 00210046 CPU: 0 EIP is at net_tx_action+0x8b/0xec EAX: 00100100 EBX: efa63924 ECX: 0801fbff EDX: 00200200 ESI: 0010 EDI: 0010 EBP: 012c ESP: b0444fc8 DS: 007b ES: 007b FS: GS: 0033 SS: 0068 Process ntop (pid: 3162, ti=b0444000 task=e9122f90 task.ti=e92ec000) Stack: 000a b02b3a84 b044007b 000a7ac5 0001 b0457a44 0009 b0118016 e92ecf74 e92ec000 00200046 b0103c3c Call Trace: [] net_tx_action+0x5a/0xa8 [] __do_softirq+0x35/0x75 [] do_softirq+0x3e/0x8f [] do_gettimeofday+0x2c/0xc6 [] handle_level_irq+0x0/0x8d [] irq_exit+0x29/0x58 [] do_IRQ+0xaf/0xc2 [] sys_gettimeofday+0x27/0x53 [] common_interrupt+0x23/0x28 === Code: 24 04 ec 61 3d b0 c7 04 24 87 01 3a b0 e8 ad 10 e6 ff e8 44 fd e4 ff c7 05 1c b9 46 b0 01 00 00 00 fa 39 fe 75 20 8b 03 8b 53 04 <89> 50 04 89 02 a1 fc b8 46 b0 c7 03 f8 b8 46 b0 89 1d fc b8 46 EIP: [] at net_tx_action+0x8b/0xec SS:ESP 0068:b0444fc8 Kernel panic - not syncing: Fatal exception in interrupt Network config is a little complex; there are 5 physical network interfaces and a bunch of netfilter rules. A quad-port 100baseT Tulip card which provides "outside-facing" interfaces (two uplinks, a DMZ, and a spare), and a gigabit VIA velocity card for the internal network. The hardware has ECC memory (1 GB, kernel starts at 2.75G) and mirrored drives, and has generally been very stable for a long time, modulo some disk hiccups. $ lspci 00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03) 00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) 00:04.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02) 00:04.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01) 00:04.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01) 00:04.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02) 00:09.0 Ethernet controller: VIA Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11) 00:0a.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 01) 00:0b.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 01) 00:0c.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02) 00:0d.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03) 01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15) 02:04.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:05.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:06.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:07.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) The tulip drivers have been solid forever. The VIA velocity driver is more suspect; I made an effort a while ago to get tagged VLANs working on it, which was a notable failure. 
Still, this oops is in core network code. As you might guess, it makes people somewhat grumpy when the main firewall/router takes a dive, but I can experiment after hours. Here's the kernel config: $ grep ^CONFIG /usr/src/linux/.config CONFIG_X86_32=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_QUICKLIST=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_SUPPORTS_OPROFILE=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_X86_BIOS_REBOOT=y CONFIG_KTIME_SCALAR=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_IKCONFIG=y CONFIG_LOG_BUF_SHIFT=15 CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y CO
Re: 2.6.24-rc6 oops in net_tx_action
> [EMAIL PROTECTED] <[EMAIL PROTECTED]> : >> Kernel is 2.6.24-rc6 + linuxpps patches, which are all to the serial >> port driver. >> >> 2.6.23 was known stable. I haven't tested earlier 2.6.24 releases. >> I think it happened once before; I got a black-screen lockup with >> keyboard LEDs blinking, but that was with X running so I couldn't see a >> console oops. But given that I installed 2.6.24-rc6 about 24 hours ago, >> that's a disturbing pattern. > It is probably this one: > > http://marc.info/?t=11978279403&r=1&w=2 Thanks! I got the patch from http://marc.info/?l=linux-netdev&m=119756785219214 (Which didn't make it into -rc7; please fix!) and am recompiling now. Actually, I grabbed the hardware mitigation followon patch while I was at it. I notice that the comment explaining the format of CSR11 and what 0x80F1 means got lost; perhaps it would be nice to resurrect it? 0x80F1 8000 = Cycle size (timer control) 7800 = TX timer in 16 * Cycle size 0700 = No. pkts before Int. (0 = interrupt per packet) 00F0 = Rx timer in Cycle size 000E = No. pkts before Int. 0001 = Continues mode (CM) (Boy, that tulip driver could use a whitespace overhaul.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc6 oops in net_tx_action
>> Thanks! I got the patch from >> http://marc.info/?l=linux-netdev&m=119756785219214 >> (Which didn't make it into -rc7; please fix!) >> and am recompiling now. > Jeff is busy so he's asked me to pick up the more important > driver bug fixes that get posted. > > I'll push this around, thanks. Much obliged. It's only 11 hours of uptime, but no problems so far, even trying abusive things like "ping -f -l64 -s8000". -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFT] Port 0x80 I/O speed
Here are a variety of machines: 600 MHz PIII (Katmai), 440BX chipset, 82371AB/EB/MB PIIX4 ISA bridge: cycles: out 794, in 348 cycles: out 791, in 348 cycles: out 791, in 349 cycles: out 791, in 348 cycles: out 791, in 348 433 MHz Celeron (Mendocino), 440 BX chipset, same ISA bridge: cycles: out 624, in 297 cycles: out 623, in 296 cycles: out 624, in 297 cycles: out 623, in 297 cycles: out 623, in 296 1100 MHz Athlon, nForce2 chipset, nForce2 ISA bridge: cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 800 MHz Transmeta Crusoe TM5800, Transmeta/ALi M7101 chipset. cycles: out 1212, in 388 cycles: out 1195, in 375 cycles: out 1197, in 377 cycles: out 1196, in 376 cycles: out 1196, in 377 2200 MHz Athlon 64, K8T890 chipset, VT8237 ISA bridge: cycles: out 1844674407370814, in 1844674407365758 cycles: out 1844674407370813, in 1844674407365756 cycles: out 1844674407370805, in 1844674407365750 cycles: out 1844674407370813, in 1844674407365755 cycles: out 1844674407370814, in 1844674407365756 Um, huh? That's gcc 4.2.3 (Debian version 4.2.2-4), -O2. Very odd. I can run it with -O0: cycles: out 4894, in 4894 cycles: out 4905, in 4917 cycles: out 4910, in 4896 cycles: out 4909, in 4896 cycles: out 4894, in 4898 cycles: out 4911, in 4898 or with -O2 -m32: cycles: out 4914, in 4927 cycles: out 4913, in 4927 cycles: out 4913, in 4913 cycles: out 4914, in 4913 cycles: out 4913, in 4929 cycles: out 4912, in 4912 cycles: out 4913, in 4915 With -O2, the cycle counts come out (before division) as out: 0xFFEA6F4F in: 0xFCE68BB6 I think the "A" constraint doesn't work quite the same in 64-bit code. The compiler seems to be using %rdx rather than %edx:%eax. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
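For comparison, here is a TSC read that sidesteps the "A" constraint entirely and behaves the same with -m32 and -m64. This is a sketch of the suspected fix, not the RFT test program itself.

	#include <stdint.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;

		/* Ask for %eax and %edx explicitly instead of using "=A". */
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}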
SCHED_FIFO & system()
Hello, I have some strange behavior in one of my systems. I have a real-time kernel thread under SCHED_FIFO which runs every 10ms. It blocks on a semaphore and is released by a timer interrupt every 10ms. Generally this works really well. However, there is a module in the system that makes a system() call from C code in user space -- system("run_my_script") -- which runs a bash script. Regardless of what the actual script looks like, the real-time kernel thread does not get scheduled for the 80ms it takes the system() call to finish. Running an LTT session, I can see that the wake_up event occurs for the real-time thread 10ms into the system() call, but the real-time kernel thread nevertheless does not get scheduled. The thread that calls system("run_my_script") is configured as SCHED_OTHER. The kernel is 2.6.21. Has anybody seen this or a similar situation? Cheers // Matias -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: /dev/urandom uses uninit bytes, leaks user data
>> There is a path that goes from user data into the pool. This path >> is subject to manipulation by an attacker, for both reading and >> writing. Are you going to guarantee that in five years nobody >> will discover a way to take advantage of it? Five years ago >> there were no public attacks against MD5 except brute force; >> now MD5 is on the "weak" list. > Yep, I'm confident about making such a guarantee. Very confident. For the writing side, there's a far easier way to inject potentially hostile data into the /dev/random pool: "echo evil intentions > /dev/random". This is allowed because it's a very specific design goal that an attacker cannot improve their knowledge of the state of the pool by feeding in chosen text. Which in turn allows /dev/random to get potential entropy from lots of sources without worrying about how good they are. It tries to account for entropy it's sure of, but it actually imports far more - it just doesn't know how much more. One of those "allowed, but uncredited" sources is whatever you want to write to /dev/random. So you can, if you like, get seed material using wget -t1 -q --no-cache -O /dev/random 'http://www.fourmilab.ch/cgi-bin/Hotbits?fmt=bin&nbytes=32' 'http://www.random.org/cgi-bin/randbyte?nbytes=32&format=f' 'http://www.randomnumbers.info/cgibin/wqrng.cgi?limit=255&amount=32' 'http://www.lavarnd.org/cgi-bin/randdist.cgi?pick_num=16&max_num=65536' I don't trust them, but IF the data is actually random, and IF it's not observed in transit, then that's four nice 256-bit random seeds. (Note: if you actually use the above, be very careful not to abuse these free services by doing it too often. Also, the latter two actually return whole HTML pages with the numbers included in ASCII. If anyone knows how to just download raw binary, please share.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
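The same uncredited feeding can be done from C: a plain write(2) mixes the bytes into the pool without crediting any entropy (crediting requires the root-only RNDADDENTROPY ioctl). A minimal sketch:

	#include <fcntl.h>
	#include <unistd.h>

	/* Mix externally obtained bytes into the pool, uncredited. */
	static int feed_random(const void *buf, size_t len)
	{
		int fd = open("/dev/random", O_WRONLY);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = write(fd, buf, len);	/* mixed in; entropy estimate unchanged */
		close(fd);
		return n == (ssize_t)len ? 0 : -1;
	}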
Re: RFC: permit link(2) to work across --bind mounts ?
> Why does link(2) not support hard-linking across bind mount points > of the same underlying filesystem ? Whenever we get mount -r --bind working properly (which I use to place copies of necessary shared libraries inside chroot jails while allowing page cache sharing), this feature would break security.

        mkdir /usr/lib/libs.jail
        for i in $LIST_OF_LIBRARIES; do
                ln /usr/lib/$i /usr/lib/libs.jail/$i
        done
        mount -r /usr/lib/libs.jail /jail/lib
        chown prisoner /usr/log/jail
        mount /usr/log/jail /jail/usr/log
        chrootuid /jail prisoner /bin/untrusted &

Although the protections should be enough, I'd rather avoid having the prisoner link /jail/lib/libfoo.so (write returns EROFS) to /jail/usr/log where it's potentially writeable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> could you please also react to this feedback: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=110698371131630&w=2 > > to quote a couple of key points from that very detailed security > analysis: > > " I'm not sure how the OpenBSD code is better in any way. (Notice that > it uses the same "half_md4_transform" as Linux; you just added another > copy.) Is there a design note on how the design was chosen? " Just note that, in addition to the security aspects, there are also a whole set of multiprocessor issues. OpenBSD added SMP support in June 2004, and it looks like this code dates back to before that. It might be worth looking at what OpenBSD does now. Note that I have NOT looked at the patch other than the TCP ISN generation. However, given the condition of the ISN code, I am inclined to take a "guilty until proven innocent" view of the rest of it. Don't merge it until someone has really grokked it, not just kibitzed about code style issues. (The homebrew 15-bit block cipher in this code does show how much the world needs a small block cipher for some of these applications.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Linux Kernel Subversion Howto
I have really grown bored of this thread. Can you all ask yourselves one thing? If someone started reading the Linux kernel sources right now, would they be able to understand every aspect and part of the code? Do you understand every aspect? Is it still "opensource", or is it starting to become a "closedsource" software "product", despite the fact that it is still free to the community? Don't say again that the source is there and all you have to do is read it. Someone on this list (very popular) said some years ago that even if Micro$oft gave out its source, no one would be able to make changes, and some of us would never be able to understand the product. So, in the spirit of that idea, is Linux still an "opensource" idea, or a "closed" one delivered by those who already have the information needed to maintain it and who give it to us freely? I have been developing on Linux for only 4 years now, and I tried to make some changes in the kernel so that it would conform to my needs... I have to say that it was very, very time consuming: no doc, no comments, no explanation. I think that a lot of us have the same question: "Opensource" || "Closedsource"? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> [EMAIL PROTECTED] writes: >> (The homebrew 15-bit block cipher in this code does show how much the >> world needs a small block cipher for some of these applications.) > > Doesn't TEA fill this niche? It's certainly used for this in the Linux > kernel, e.g. in reiserfs (although I have my doubts it is really useful > there) Sorry; ambiguous parsing. I meant "(small block) cipher", not "small (block cipher)". TEA is intended for the latter niche. What I meant was a cipher that could encrypt blocks smaller than 64 bits. It's easy to make a smaller hash by just throwing bits away, but a block cipher is a permutation, and has to be invertible. For example, if I take a k-bit counter and encrypt it with a k-bit block cipher, the output is guaranteed not to repeat in less than 2^k steps, but the value after a given value is hard to predict. There is a well-known technique for reducing the block size of a cipher by a small factor, such as from a power of 2 to a prime number slightly lower. That is:

unsigned encrypt_mod_n(unsigned x, unsigned n)
{
        assert(x < n);
        do {
                x = encrypt(x);
        } while (x >= n);
        return x;
}

It takes a bit of thinking to realize why this creates a bijection from [0..n-1] -> [0..n-1], but it's kind of a neat "aha!" when it does. Remember, encrypt() is a bijection from [0..N-1] -> [0..N-1] for some N >= n. Typically N = 2^k for some k. However, this technique requires N/n calls to encrypt(). I.e. n calls to encrypt_mod_n() will cause N calls to encrypt(). It's generally considered practical up to N/n = 2, so we can encrypt modulo any modulus n if we have encrypt() functions for any N = 2^k a power of 2. I.e. a k-bit block cipher. For example, suppose we want to encrypt 7-digit North American telephone numbers. These are of the form NXX-XXXX, where N is a digit other than 0 or 1, and X is any digit. There are 8e6 possibilities. Using this scheme and a 23-bit block cipher, we can encrypt them to different valid 7-digit telephone numbers. Likewise, 10-digit numbers with area codes, +1 NXX NXX-XXXX (but not starting with N11), are also possible. There are 792 area codes and 8e6 numbers for a total of 6,336,000,000 < 2^33 combinations. This sort of thing is very useful for adding encryption to protocols and file formats not designed for it. However, the standard literature is notably lacking in block ciphers in funny block sizes. There was one AES submission (The Hasty Pudding Cipher, http://www.cs.arizona.edu/~rcs/hpc/) that supported variable block sizes, but it was eliminated fairly early. To start with, consider very small blocks: 1, 2 or 3 bits. There are only two possible things encrypt() can do with a 1-bit value: either invert it or leave it alone. There are 4! = 24 possible 2-bit encryption operations. Ideally, the key should specify them all with equal probability, but 24 does not evenly divide the (power of 2 sized) keyspace. It is interesting to look at how uniformly the possibilities are covered. It's fun to consider a Feistel network, dividing the plaintext into 1-bit L and R values, and alternating L ^= f(R), R ^= f(L) for (not necessarily invertible) round functions f. Since there are only 4 possible 1-bit functions (1, 0, x and !x), you can consider each round to have an independent 2-bit round subkey and see how the cipher's uniformity develops as you increase the number of rounds and the key length to go with it. There are 8! = 40320 3-bit encryption operations. Again, all should be covered uniformly. An odd number of bits makes a Feistel design more challenging. 
But if you don't allow odd numbers of bits, you have to push the shrinking technique to N/n = 4, which starts to get unpleasant. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
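If anyone wants to play with the 1+1-bit Feistel exercise suggested above, here is a standalone brute-force sketch (my own toy, nothing to do with the patch): it enumerates every r-round key, treats each 2-bit subkey as selecting one of the four 1-bit round functions, and reports how evenly the 24 possible permutations are covered.

	#include <stdio.h>

	/* The four 1-bit round functions: k=0 -> 0, k=1 -> 1, k=2 -> x, k=3 -> !x */
	static unsigned f(unsigned k, unsigned x)
	{
		return ((k & 2) ? x : 0) ^ (k & 1);
	}

	int main(void)
	{
		int rounds;

		for (rounds = 1; rounds <= 8; rounds++) {
			unsigned count[256] = { 0 };	/* map encoded as 2 bits per plaintext */
			unsigned long nkeys = 1UL << (2 * rounds);
			unsigned long key;
			unsigned perms = 0, min = ~0u, max = 0, i;

			for (key = 0; key < nkeys; key++) {
				unsigned code = 0, x;

				for (x = 0; x < 4; x++) {
					unsigned l = x & 1, r = x >> 1;
					unsigned long k = key;
					int j;

					/* Alternate L ^= f(R), R ^= f(L), one 2-bit subkey per round */
					for (j = 0; j < rounds; j++, k >>= 2) {
						if (j & 1)
							r ^= f(k & 3, l);
						else
							l ^= f(k & 3, r);
					}
					code |= (l | r << 1) << (2 * x);
				}
				count[code]++;
			}
			for (i = 0; i < 256; i++) {
				if (!count[i])
					continue;
				perms++;
				if (count[i] < min)
					min = count[i];
				if (count[i] > max)
					max = count[i];
			}
			printf("%d rounds: %2u of 24 permutations, each hit by %u to %u keys\n",
			       rounds, perms, min, max);
		}
		return 0;
	}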
Re: [PATCH] OpenBSD Networking-related randomization port
linux> It's easy to make a smaller hash by just throwing bits away, linux> but a block cipher is a permutation, and has to be linux> invertible. linux> For example, if I take a k-bit counter and encrypt it with linux> a k-bit block cipher, the output is guaranteed not to linux> repeat in less than 2^k steps, but the value after a given linux> value is hard to predict. > Huh? What if my cipher consists of XOR-ing with a k-bit pattern? > That's a permutation on the set of k-bit blocks but it happens to > decompose as a product of (non-overlapping) swaps. > > In general for more realistic block ciphers like DES it seems > extremely unlikely that the cipher has only a single orbit when viewed > as a permutation. I would expect a real block cipher to behave more > like a random permutation, which means that the expected number of > orbits for a k-bit cipher should be about ln(2^k) or roughly .7 * k. I think you misunderstand; your comments don't seem to make sense unless I assume you're imagining output feedback mode:

        x[0] = encrypt(IV)
        x[1] = encrypt(x[0])
        x[2] = encrypt(x[1])
        etc.

Obviously, this pattern will repeat after some unpredictable interval. (However, owing to the invertibility of encryption, looping can be easily detected by noticing that x[i] = IV.) But I was talking about counter mode:

        x[0] = encrypt(0)
        x[1] = encrypt(1)
        x[2] = encrypt(2)
        etc.

It should be obvious that this will not repeat until the counter overflows k bits and you try to compute encrypt(2^k) = encrypt(0). One easy way to generate unpredictable 16-bit port numbers that don't repeat too fast is:

        highbit = 0;
        for (;;) {
                generate_random_encryption_key(key);
                for (i = 0; i < 2; i++)
                        use(highbit | encrypt15(i, key));
                highbit ^= 0x8000;
        }

Note that this does NOT use all 32K values before switching to another key; if that were the case, an attacker who kept a big bitmap of previously seen values could predict the last few values based on knowing what hadn't been seen already. Of course, you can always wrap a layer of Knuth's Algorithm B (randomization by shuffling) around anything:

#include "basic_rng.h"

#define SHUFFLE_SIZE 32 /* Power of 2 is more efficient */

struct better_rng_state {
        struct basic_rng_state basic;
        unsigned y;
        unsigned z[SHUFFLE_SIZE];
};

void better_rng_seed(struct better_rng_state *state, unsigned seed)
{
        unsigned i;

        basic_rng_seed(&state->basic, seed);
        for (i = 0; i < SHUFFLE_SIZE; i++)
                state->z[i] = basic_rng(&state->basic);
        state->y = basic_rng(&state->basic) % SHUFFLE_SIZE;
}

unsigned better_rng(struct better_rng_state *state)
{
        unsigned x = state->z[state->y];

        state->y = (state->z[state->y] = basic_rng(&state->basic)) % SHUFFLE_SIZE;
        return x;
}

(You can reduce code size by reducing modulo SHUFFLE_SIZE when you use state->y rather than when storing into it, but I have done it the other way to make clear exactly how much "effective" state is stored. You can also just initialize state->y to a fixed value.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> It adds support for advanced networking-related randomization, in > concrete it adds support for TCP ISNs randomization Er... did you read the existing Linux TCP ISN generation code? Which is quite thoroughly randomized already? I'm not sure how the OpenBSD code is better in any way. (Notice that it uses the same "half_md4_transform" as Linux; you just added another copy.) Is there a design note on how the design was chosen? I don't wish to be *too* discouraging to someone who's *trying* to help, but could you *please* check a little more carefully in future to make sire it's actually an improvement? I fear there's some ignorance of what the TCP ISN does, why it's chosen the way it is, and what the current Linux algorithm is designed to do. So here's a summary of what's going on. But even as a summary, it's pretty long... First, a little background on the selection of the TCP ISN... TCP is designed to work in an environment where packets are delayed. If a packet is delayed enough, TCP will retransmit it. If one of the copies floats around the Internet for long enough and then arrives long after it is expected, this is a "delayed duplicate". TCP connections are between (host, port, host port) quadruples, and packets that don't match some "current connection" in all four fields will have no effect on the current connection. This is why systems try to avoid re-using source port numbers when making connections to well-known destination ports. However, sometimes the source port number is explicitly specified and must be reused. The problem then arises, how do we avoid having any possible delayed packets from the previous use of this address pair show up during the current connection and confuse the heck out of things by acknowledging data that was never received, or shutting down a connection that's supposed to stay open, or something like that? First of all, protocols assume a maximum packet lifetime in the Internet. The "Maximum Segment Lifetime" was originally specified as 120 seconds, but many implementations optimize this to 60 or 30 seconds. The longest time that a response can be delayed is 2*MSL - one delay for the packet eliciting the response, and another for the response. In truth, there are few really-hard guarantees on how long a packet can be delayed. IP does have a TTL field, and a requirement that a packet's TTL field be decremented for each hop between routers *or each second of delay within a router*, but that latter portion isn't widely implemented. Still, it is an identified design goal, and is pretty reliable in practice. The solution is twofold: First, refuse to accept packets whose acks aren't in the current transmission window. That is, if the last ack I got was for byte 1000, and I have sent 1100 bytes (numbers 0 through 1099), then if the incoming packet's ack isn't somewhere between 1000 and 1100, it's not relevant. If it's 950, it might be an old ack from the current connection (which doesn't include anything interesting), but in any case it can be safely ignored, and should be. The only remaining issue is, how to choose the first sequence number to use in a connection, the Initial Sequence Number (ISN)? If you start every connection at zero, then you have the risk that packets from an old connection between the same endpoints will show up at a bad time, with in-range sequence numbers, and confuse the current connection. So what you do is, start at a sequence number higher than the last one used in the old connection. Then there can't be any confusion. 
But this requires remembering the last sequence number used on every connection ever. And there are at least 2^48 addresses allowed to connect to each port on the local machine. At 4 bytes per sequence number, that's a Petabyte of storage... Well, first of all, after 2*MSL, you can forget about it and use whatever sequence number you like, because you know that there won't be any old packets floating around to crash the party. But still, it can be quite a burden on a busy web server. And you might crash and lose all your notes. Do you want to have to wait 2*MSL before rebooting? So the TCP designers (I'm not on page 27 of RFC 793, if you want to follow along) specified a time of day based ISN. If you use a clock to generate an ISN which counts up faster than your network connection can send data (and thus crank up its sequence numbers), you can be sure that your ISN is always higher than the last one used by an old connection without having to remember it explicitly. RFC 793 specifies a 250,000 bytes/second counting rate. Most implementations since Ethernet used a 1,000,000 byte/second counting rate, which matches the capabilities of 10base5 and 10base2 quite well, and is easy to get from the gettimeofday() call. Note that there are
Re: Patch 4/6 randomize the stack pointer
> Why not compromise, if possible? 256M of randomization, but move the > split up to 3.5/0.5 gig, if possible. I seem to recall seeing an option > (though I think it was UML) to do 3.5/0.5 before; and I'm used to "a > little worse" meaning "microbenches say it's worse, but you won't notice > it," so perhaps this would be a good compromise. How well tuned can > 3G/1G be? Come on, 1G is just a big friggin' even number. Ah, grasshopper, so much you have to learn... In particular, prople these days are more likely to want to move the split DOWN rather than UP. First point: it is important that the split happens at an "even" boundary for the highest-level page table. This makes it as simple as possible to copy the shared global pages into each process' page tables. On typical x86, each table is 1024 entries long, so the top table maps 4G/1024 = 4M sections. However, with PAE (Physical Address Extensions), a 32-bit page table entry is no longer enough to hold the 36-bit physical address. Instead, the entries are 64 bits long, so only 512 fit into a page. With a 4K page and 18 more bits from page tables, two levels will map only 30 bits of the 32-bit virtual address space. So Intel added a small, 4-entry third-level page table. With PAE, you are indeed limited to 1G boundaries. (Unless you want to seriously overhaul mm setup and teardown.) Secondly, remember that, unless you want to pay a performance penalty for enabling one of the highmem options, you have to fit ALL of physical memory, PLUS memory-mapped IO (say around 128M) into the kernel's portion of the address space. 512M of kernel space isn't enough unless you have less than 512M (like 384M) of memory to keep track of. That is getting less common, *especially* on servers. (Which are presumably an important target audience for buffer overflow defenses.) Indeed, if you have a lot of RAM and you don't have a big database that needs tons of virtual address space, it's usually worth moving the split DOWN. Now, what about the case where you have gobs of RAM and need a highmem option anyway? Well, there's a limit to what you can use high mem for. Application memory and page cache, yes. Kernel data structures, no. You can't use it for dcache or inodes or network sockets or page tables or the struct page array. And the mem_map array of struct pages (on x86, it's 32 bytes per page, or 1/128 of physical memory; 32M for a 4G machine) is a fixed overhead that's subtracted before you even start. Fully-populated 64G x86 machines need 512M of mem_map, and the remaining space isn't enough to really run well in. If you crunch kernel lowmem too tightly, that becomes the performance-limiting resource. Anyway, the split between user and kernel address space is mandated by: - Kernel space wants to be as bit as physical RAM if possible, or not more than 10x smaller if not. - User space really depends on the application, but larger than 2-3x physical memory is pointless, as trying to actually use it all will swap you to death. So for 1G of physical RAM, 3G:1G is pretty close to perfect. It was NOT pulled out of a hat. Depending on your applications, you may be able to get away with a smaller user virtual address space, which could allow you to work with more RAM without needing to slow the kernel with highmem code. You'll find another discussion of the issues at http://kerneltrap.org/node/2450 http://lwn.net/Articles/75174/ Finally, could I suggest a little more humility when addressing the assembled linux-kernel developers? 
I've seen Linus have to eat his words a time or two, and I know I can't do as well. http://marc.theaimsgroup.com/?m=91723854823435 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Copyright / licensing question
I'll respond in terms of U.S. law; if you want something else, please mention it. You might find a lot of useful information at http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter9/index.html http://www.usg.edu/admin/legal/copyright/#part3d3a http://en.wikipedia.org/wiki/Fair_use ttp://www.nolo.com/lawcenter/ency/article.cfm/ObjectID/C3E49F67-1AA3-4293-9312FE5C119B5806/catID/2EB060FE-5A4B-4D81-883B0E540CC4CB1E > 1. For explaining the internals of a filesystem in detail, I need to >take their code from kernel sources 'as it is' in the book. Do I need >to take any permissions from the owner/maintainer regarding this ? >Will it violate any license if reproduce the driver source code in >my book ?? This is exactly the sort of "Comment and criticism" that is anticipated and covered by the fair use exemption. In judging whether the use is fair, 17 USC 107 says: # § 107. Limitations on exclusive rights: Fair use # # Release date: 2004-04-30 # # Notwithstanding the provisions of sections 106 and 106A, the fair use # of a copyrighted work, including such use by reproduction in copies # or phonorecords or by any other means specified by that section, for # purposes such as criticism, comment, news reporting, teaching (including # multiple copies for classroom use), scholarship, or research, is not # an infringement of copyright. In determining whether the use made of a # work in any particular case is a fair use the factors to be considered # shall include: # (1) the purpose and character of the use, including whether such use # is of a commercial nature or is for nonprofit educational purposes; # (2) the nature of the copyrighted work; # (3) the amount and substantiality of the portion used in relation to # the copyrighted work as a whole; and # (4) the effect of the use upon the potential market for or value of * the copyrighted work. # The fact that a work is unpublished shall not itself bar a finding # of fair use if such finding is made upon consideration of all the above # factors. Going through those in your case, they are: 1. The Transformative Factor: The Purpose and Character of Your Use It's commercial use, but the non-commercial exemptions are a relatively recent addition to copyright law. The original, classic "fair use" is commentary and criticism. I.e. are you adding something to the quoted material? Have you added new information or insights? This is one of the most important factors, and in your case, assuming the book is worth anything at all, the answer is clearly "yes". On this ground alone, you're probably safe. 2. The Nature of the Copyrighted Work Scope for fair use is broader for published than unpublished works (because the potential future value of an unpublished work is affected more by copious excerpting), and broader for factual works than fiction (because facts and ideas cannot be copyrighted, so it takes more quoting to include a threshold amount of copyrightable "expression"). The Linux kernel is clearly "published", and while the second part is a little fuzzy (and I'm not eager enough to chase it back to original case law), I think the functional nature of software places it in the "factual" category. 3. The Amount and Substantiality of the Portion Taken Your publisher won't let you waste enough paper to print a huge fraction of the Linux kernel. Yes, it may be a lot of code, but it's not going to be "most" by a long shot. In general the standard is that "no more was taken than was necessary" to achieve the purpose for which the copying was done. 
I think you'll do this anyway, and the law doesn't require you to be super anal about eliding every snippet and #define that's not directly referenced. The Lions book, in contrast, included most of 6th edition Unix, leading to the need for negotiations. Also, the 6th edition wasn't published, leading to problems with the previous factor. The legally fuzzy issue is what constitutes a "work" here. The function? The source file? The tarball? I'd have to look for a case involving copying of entire entries from an encyclopedia or dictionary to get it fully untangled. However, you're helped here by the GPL, which can be used to show the original author's intentions. It defines the "work" as an entire program that compiles to an executable that does something. As long as your excerpts don't compile to a working kernel, you're pretty safe. 4. The Effect of the Use Upon the Potential Market Will it hurt the copyright owner? This is typically expressed in terms of income, which doesn't apply very much. But your intent is clearly to *add* value to the Linux kernel, so this factor militates in your favor. > 2. I will write some custom drivers also for illustration. F
Re: [PATCH] OpenBSD Networking-related randomization port
*Sigh*. This thread is heading into the weeds. I have things I should be doing instead, but since nobody seems to actually be looking at what the patch *does*, I guess I'll have to dig into it a bit more... Yes, licensing issues need to be resolved before a patch can go in. Yes, code style standards needs to be kept up. And yes, SMP-locking issues need to be looked at. (And yes, ipv6 needs to be looked at, too!) But before getting sidetracked into the fine details, could folks please take a step back from the trees and look at the forest? Several people have asked (especially when the first patch came out), but I haven't seen any answers to the Big Questions: 1) Does this patch improve Linux's networking behaviour in any way? 2) Are the techniques in this patch a good way to achieve those improvements? Let's look at the various parts of the change: - Increases the default random pool size. Opinion: whatever. No real cost, except memory. Increases the maximum amount that can be read from /dev/random without blocking. Note that this is already adjustable at run time, so the question is why put it in the kernel config. If you want this, I'd suggest instead an option under CONFIG_EMBEDDED to shrink the pools and possibly get rid of the run-time changing code, then you could increase the default with less concern. - Changes the TCP ISN generation algorithm. I have't seen any good side to this. The current algorithm can be used for OS fingerprinting based on starting two TCP connections from different sources (ports or IPs) and noticing that the ISNs only differ in the low 24 bits, but is that a serious issue? If it is, there are better ways to deal with it that still preserve the valuable timer property. I point out that the entire reason for the cryptographically marginal half_md4_transform oprtation was that a full MD5 was a very noticeable performance bottleneck; the hash was only justified by the significant real-world attacks. obsd_get_random uses two calls to half_md4_transform. Which is the same cost as a full MD4 call. Frankly, they could just change half_md4_transform to return 64 bits instead of 32 and make do with one call. - Changes to the IP ID generation algorithm. All it actually does is change the way the initial inet->id is initialized for the inet_opt structure associated with the TCP socket. And if you look at ip_output.c:ip_push_pending_frames(), you'll see that, if DF is set (as is usual for a TCP connection), iph->id (the actual IP header ID) is set to htons(inet->id++). So it's still an incrementing sequence. This is in fact (see the comment in ip.h:ip_select_ident()) a workaround for a Microsoft VJ compression bug. The fix was added in 2.4.4 (via DaveM's zerocopy-beta-3 patch); before that, Linux 2.4 sent a constant zero as the IP ID of DF packets. See discussion at http://www.postel.org/pipermail/end2end-interest/2001-May/thread.html http://tcp-impl.lerc.nasa.gov/tcp-impl/list/archive/2378.html I'm not finding the diagnosis of the problem. I saw one report at http://oss.sgi.com/projects/netdev/archive/2001-01/msg6.html and Dave Miller is pretty much on top of it when he posts http://marc.theaimsgroup.com/?l=linux-kernel&m=98275316400452&w=2 but I haven't found the actual debugging leading to the conclusion. 
This also led to some discussion of the OpenBSD IP ID algorithm that I haven't fully waded through at http://mail-index.netbsd.org/tech-net/2003/11/ If the packet is fragmentable (the only time the IP ID is really needed by the receiver), it's done by route.c:__ip_select_ident(). Wherein the system uses inet_getid to assign p->ip_id_count++ based on the route cache's struct inet_peer *p. (If the route cache is OOM, the system falls back on random IP ID assignment.) This latter technique nicely prevents the sort of stealth port scanning that was mentioned earlier in this thread, and prevents a person at address A from guessing the IP ID range I'm using to talk to address B. So note that the boast about "Randomized IP IDs" in the grsecurity description at http://www.gentoo.org/proj/en/hardened/grsecurity.xml is, as far as I can tell from a quick look at the code, simply false. As for the algorithm itself, it's described at http://www.usenix.org/events/usenix99/full_papers/deraadt/deraadt_html/node18.html but it's not obvious to me that it'd be hard to cryptanalyze given a stream of consecutive IDs. You need to recover: - The n value for each inter-ID gap, - The LCRNG state ru_a, ru_x, ru_b, - The 15-bit XOR masks ru_seed and ru_seed2, and - The discrete log generator ru_j (ru_g = 2^ru_j mod RU_N). Which is actually just a multiplier (mod RU_N-1 = 32748) on th
Re: thoughts on kernel security issues
I followed the start of this thread when it was about security mailing lists and bug-disclosure rules, and then lost interest. I just looked in again, and I seem to be seeing discussion of merging grsecurity patches into mainline. I haven't yet found a message where this is proposed explicitly, so if I am inferring incorrectly, I apologize. (And you can ignore the rest of this missive.)

However, I did look carefully at an earlier patch that claimed to be a Linux port of some OpenBSD networking randomization code, ostensibly to make packet-guessing attacks more difficult. http://marc.theaimsgroup.com/?l=linux-kernel&m=110693283511865 It was further claimed that this code came via grsecurity. I did verify that the code looked a lot like pieces of OpenBSD, but didn't look at grsecurity at all. However, I did look in some detail at the code itself. http://marc.theaimsgroup.com/?l=linux-netdev&m=110736479712671

What I concluded was that it was broken beyond belief, and the effect on the networking code varied from (due to putting the IP ID generation code in the wrong place) wasting a lot of time randomizing a number that could be a constant zero if not for working around a bug in Microsoft's PPP stack, to (RPC XID generation) severe protocol violation. Not to mention race conditions out the wazoo due to porting single-threaded code. After careful review, I couldn't find a single redeeming feature, or even a good idea that was merely implemented badly. See the posting for details and more colorful criticism.

Now, as I said, I have *not* gone to the trouble of seeing if this patch really did come from grsecurity, or if it was horribly damaged in the process of splitting it out. So I may be unfairly blaming grsecurity, but I didn't feel like seeking out more horrible code to torture my sanity with. My personal, judgemental opinion was that if that was typical of grsecurity, it's a festering pile of pus that I'm not going to let anywhere near my kernel, thank you very much. But to the extent that this excerpt constitutes reasonable grounds for suspicion, I would like to recommend a particularly careful review of any grsecurity patches, in addition to Linus' dislike of monolithic patches.

Just my $0.02. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
What's the status of kernel PNP?
I just noticed that 2.4.6-ac1 parport won't compile (well, link) without the kernel PnP stuff configured. So I tried turning it on. It prints a line saying that it found my modem at boot time, but doesn't actually configure it, so I have to run isapnp anyway if I want to use it. Okay, RTFM time... Documentation/isapnp.txt doesn't say anything about boot time (only /proc/isapnp usage after boot and some function call interfaces for kernel programming that are hard to follow). kernel-parameters.txt gives a hint, although it required reading the source code to figure out what to pass as "isapnp=" to turn verbose up. A lot of google searching comes up with a lot of stale data, but the only 2.4-relevant kernel ISAPNP howto is written in Japanese. Lots of stuff describes it as a feature in the 2.4 kernels, but I can't find anything on how to use it.

MAINTAINERS claims that it's maintained, but the web page is down (the whole site has moved, and /~pnp doesn't exist on the new site) and the only mailing list archives I can find for pnp-devel (at geocrawler) don't have any updates since the year 2000 - and those are all spam. I'm a little suspicious of that maintained status, although I haven't written the maintainer yet. But the upshot of all of this is that I can't figure out WTF to do with this "feature", since I haven't noticed it actually doing anything except taking up kernel memory.

Another machine has an ISA PCMCIA adapter, which works with isapnp and David Hinds' PCMCIA package, but if I try to use the 2.4 cardbus code, it fails to probe the PCMCIA adapter, apparently because the PnP code again didn't set it up. (And there's no obvious way to force a re-probe after boot unless I build the whole thing as a module.) Again, the PnP code cheerfully points out that the PCMCIA adapter exists, but doesn't appear to grasp the concept that I didn't put the adapter into the machine because it looks pretty.

Can someone point me at TFM or some other source of information? I'd be much obliged. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make pipe data structure be a circular list of pages, rather
ation can read directly (e.g. DMA). - If all else fails, it may be necessary to allocate a bounce buffer accessible to both source and destination and copy the data. Because of the advantages of PCI writes, I assume that having the source initiate the DMA is preferable, so the above are listed in decreasing order of preference.

The obvious thing to me is for there to be a way to extract the following information from the source and destination:

"I am already mapped in the following address spaces"
"I can be mapped into the following address spaces"
"I can read from/write to the following address spaces"

Only the third list is required to be non-empty. Note that various kinds of "low memory" count as separate "address spaces" for this purpose. Then some generic code can look for the cheapest way to get the data from the source to the destination: mapping one and passing the address to the other's read/write routine if possible, or allocating a bounce buffer and using a read followed by a write.

Questions for those who know odd machines and big iron better:

- Are there any cases where a (source,dest) specific routine could do better than this "map one and access it from the other" scheme? I.e. do we need to support the n^2 case?

- How many different kinds of "address spaces" do we need to deal with? Low kernel memory (below 1G), below 4G (32-bit DMA), and high memory are the obvious ones. And below 16M for legacy DMA, I assume. x86 machines have I/O space as well. But what about machines with multiple PCI buses? Can you DMA from one to the other, or is each one an "address space" that can have intra-bus DMA but needs a main-memory bounce buffer to cross buses? Can this be handled by a "PCI bus number" that must match, or are there cases where there are more intricate hierarchies of mutual accessibility?

(A similar case is a copy from a process' address space. Normally, single-copy pipes require a kmap in one of the processes. But if the copying code can notice that source and dest both have the same mm structure, a direct copy becomes possible.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
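P.S. To make the query idea concrete, here is a purely illustrative C sketch; every name in it is invented, and each "address space" is just a bit in a mask. It is only meant to show the shape of the decision the generic code would make, preferring source-initiated writes as argued above.

struct xfer_caps {
        unsigned long mapped_in;   /* already mapped in these address spaces */
        unsigned long mappable_in; /* can be mapped into these address spaces */
        unsigned long can_access;  /* can read/write these address spaces (never empty) */
};

/* Pick the cheapest path between a source and a destination. */
static int choose_path(const struct xfer_caps *src, const struct xfer_caps *dst)
{
        if (src->can_access & dst->mapped_in)
                return 0;       /* source writes straight into the destination */
        if (dst->can_access & src->mapped_in)
                return 1;       /* destination reads straight from the source */
        if (src->can_access & dst->mappable_in)
                return 2;       /* map the destination, then the source writes */
        if (dst->can_access & src->mappable_in)
                return 3;       /* map the source, then the destination reads */
        return 4;               /* give up: bounce buffer both can access */
}

The return value is just a stand-in for "which strategy the generic code picked"; a real implementation would obviously return something more useful than a number.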
mmap tricks and writing to files without reading first
) I'm not sure which of those would be "best" in the sense of minimum overhead. Does anyone have any suggestions? Or a completely different way to zero out a chunk of a file without reading it in? I don't want to actually make a hole in the log file or I'd fragment it and increase the risk of ENOSPC problems. I could create just a single zero page and writev() multiple copies of it, but then I have to worry about the system page size (I'm not sure if the kernel will DTRT and not page in half of an 8K page if I writev() two 4K vectors to it), and it prevents me from using pwrite(). I haven't tracked down the splice() idea that sct mentioned in http://www.ussg.iu.edu/hypermail/linux/kernel/0002.3/0057.html It appears that sendfile() can't be used for the purpose.

Finally, is there a standard semantics for the interaction between mmap() and read()/write()? I have a dim recollection of seeing Linus rant that anything other than making writes via one path instantly available to the other is completely brain-dead, which would make the most sense if some standard somewhere allows weaker synchronization, but I can't seem to find that rant again, and it was a long time ago. I also note that there doesn't seem to be an msync() flag for "make changes visible to read(2) users (i.e. flush them to the buffer cache), but DON'T schedule a disk write yet", which I assume a weaker synchronization model would provide. I also can't find any mention of the possibility of weaker ordering in the descriptions of mmap() I've seen at www.opengroup.org. But it doesn't come right out and clearly require strong ordering, either, and I can just imagine some vendor with a virtually-addressed cache getting creative and saying "show me where it says I can't do that!". The one phrase that concerns me is the caution in SUSv2 that

# The application must ensure correct synchronisation when using mmap()
# in conjunction with any other file access method, such as read() and
# write(), standard input/output, and shmat().
http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

Great, but what's "correct"? The part of the semantics I particularly need to be clearly defined is what happens if my application crashes after writing to an mmap buffer but before msync() or munmap(). Thanks for any enlightenment on this somewhat confusing issue! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
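P.S. For concreteness, the "one zero page, many iovecs" idea looks roughly like this. This is an untested user-space sketch with error handling omitted; it assumes the length is a multiple of the page size, and the names are mine, not anything standard.

#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

static void zero_range(int fd, off_t off, size_t len)
{
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        void *zeros = calloc(1, pagesz);        /* the single zero page */
        struct iovec iov[16];
        int i;

        for (i = 0; i < 16; i++) {
                iov[i].iov_base = zeros;        /* every vector is the same page */
                iov[i].iov_len = pagesz;
        }
        lseek(fd, off, SEEK_SET);
        while (len >= pagesz) {
                int n = len >= 16 * pagesz ? 16 : (int)(len / pagesz);

                writev(fd, iov, n);             /* should check for short writes */
                len -= (size_t)n * pagesz;
        }
        free(zeros);
}

The lseek() there is exactly the pwrite()-vs-writev() tradeoff mentioned above.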
Re: Make pipe data structure be a circular list of pages, rather than
[EMAIL PROTECTED] wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors. The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them. The file decriptors
> are only involved if you provide them with those interfaces(which you
> would, it makes sense). You are hung up on the pipe idea, the idea I
> see in my head is far more generic. Anything can play and you don't
> need a pipe at all, you need

I was fantasizing about more generality as well. In particular, my original fantasy allowed data to, in theory and with compatible devices, be read from one PCI device, passed through a series of pipes, and written to another without ever hitting main memory - only one PCI-PCI DMA operation performed. A slightly more common case would be zero-copy, where data gets DMAed from the source into memory and from memory to the destination. That's roughly Larry's pull/push model. The direct DMA case requires buffer memory on one of the two cards. (And would possibly be a fruitful source of hardware bugs, since I suspect that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow data to be "renamed" straight into the page cache. However, the page also has to be properly aligned and not in some awkward highmem address space. I'm not currently convinced that this would happen often enough to be worth the extra implementation hair, but feel free to argue otherwise. (And Larry, what's the "loan" bit for? When is loan != !gift ?)

The big gotcha, as Larry's original paper properly points out, is handling write errors. We need some sort of "unpull" operation to put data back if the destination can't accept it. Otherwise, what do you return from splice()? If the source is seekable, that's easy, and a pipe isn't much harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that provide data buffering, is to split "pull" into two operations: "Show me some buffered data" and "Consume some buffered data". The first returns a buffer pointer (to a const buffer) and length. (The length must be non-zero except at EOF, but may be 1 byte.) The second advances the buffer pointer. The advance distance must be no more than the length returned previously, but may be less. In typical single-threaded code, I allow not calling the advance function or calling it multiple times, but they're typically called 1:1, and requiring that would give you a good place to do locking. A character device, network stream, or the like, would acquire an exclusive lock. A block device or file would not need to (or could make it a shared lock or refcount).

The same technique can be used when writing data to a module that does buffering: "Give me some buffer space" and "Okay, I filled some part of it in." In some devices, the latter call can fail, and the writer has to be able to cope with that. By allowing both of those (and, knowing that PCI writes are more efficient than PCI reads, giving the latter preference if both are available), you can do direct device-to-device copies on splice(). The problem with Larry's separate pull() and push() calls is that you then need a user-visible abstraction for "pulled but not yet pushed" data, which seems like an unnecessary abstraction violation.

The main infrastructure hassle you need to support this *universally* is the unget() on "character devices" like pipes and network sockets. 
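Spelled out as C declarations (all names invented purely for illustration), the two-call pull interface I use looks like:

#include <stddef.h>
#include <unistd.h>

struct pull_src;        /* opaque: some module that buffers data */

/* "Show me some buffered data": return a pointer to const data and its
 * length in *len.  The length is nonzero except at EOF, but may be as
 * little as 1 byte. */
const void *pull_peek(struct pull_src *src, size_t *len);

/* "Consume some buffered data": advance by n bytes, where n is no more
 * than the length returned by the matching pull_peek(), but may be less. */
void pull_consume(struct pull_src *src, size_t n);

/* Typical 1:1 peek/consume pairing; the consumer may take less than shown. */
static void copy_some(struct pull_src *src, int dst_fd)
{
        size_t len;
        const void *p = pull_peek(src, &len);

        if (len) {
                ssize_t n = write(dst_fd, p, len);

                pull_consume(src, n > 0 ? (size_t)n : 0);
        }
}

The buffering that pull_consume() leans on behind the scenes is the same unget()-style support I'm talking about.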
Ideally, it would be some generic buffer front end that could be used by the device for normal data as well as the special case. Ooh. Need to think. If there's a -EIO problem on one of the file descriptors, how does the caller know which one? That's an argument for separate pull and push (although the splice() library routine still has the problem). Any suggestions? Does userland need to fall back on read()/write() for a byte? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
Just a note on the attribute((acquire,release)) proposal: It's nice to be able to annotate functions, but please don't forget to provide a way to write such functions. Ultimately, there will be an asm() or assignment that is the acquire or release point, and GCC needs to know that so it can compile the function itself (possibly inline). Having just a function attribute leaves the problem that

void __attribute__((noreturn)) _exit(int status)
{
        asm("int $0x80" : : "a" (__NR_exit), "b" (status));
}

generates a complaint about a noreturn function returning, because there's no way to tell GCC about a non-returning statement. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
May I just say, that this is f***ing brilliant. It completely separates the threadlet/fibril core from the (contentious) completion notification debate, and allows you to use whatever mechanism you like. (fd, signal, kevent, futex, ...) You can also add a "macro syscall" like the original syslet idea, and it can be independent of the threadlet mechanism but provide the same effects. If the macros can be designed to always exit when done, a guarantee never to return to user space, then you can always recycle the stack after threadlet_exec() returns, whether it blocked in the syscall or not, and you have your original design.

May I just suggest, however, that the interface be:

        tid = threadlet_exec(...)

Where tid < 0 means error, tid == 0 means completed synchronously, and tid > 0 identifies the child so it can be waited for? Anyway, this is a really excellent user-space API. (You might add some sort of "am I synchronous?" query, or maybe you could just use gettid() for the purpose.) The one interesting question is, can you nest threadlet_exec() calls? I think it's implementable, and I can definitely see the attraction of being able to call libraries that use it internally (to do async read-ahead or whatever) from a threadlet function. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
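P.S. To show what I mean by that calling convention, here's how I picture user code using it. threadlet_exec() is of course hypothetical at this point, so the prototype below is made up purely for illustration:

#include <stdio.h>

/* Hypothetical prototype, only to make the return convention concrete. */
extern long threadlet_exec(long (*fn)(void *arg), void *arg);

static long handle_request(void *arg)
{
        (void)arg;
        /* ... may block; if it does, the submitter returns early ... */
        return 0;
}

static void submit(void *req)
{
        long tid = threadlet_exec(handle_request, req);

        if (tid < 0)
                fprintf(stderr, "threadlet_exec failed: %ld\n", tid);
        else if (tid == 0)
                printf("ran to completion synchronously\n");
        else
                printf("blocked; continuing in child %ld, wait for it later\n", tid);
}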
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
> It's brilliant for disk I/O, not for networking for which
> blocking is the norm not the exception.
>
> So people will have to likely do something like divide their
> applications into handling for I/O to files and I/O to networking.
> So beautiful. :-)
>
> Nobody has proposed anything yet which scales well and handles both
> cases.

The truly brilliant thing about the whole "create a thread on blocking" is that you immediately make *every* system call asynchronous-capable, including the thousands of obscure ioctls, without having to boil the ocean rewriting 5/6 of the kernel from implicit (stack-based) to explicit state machines. You're right that it doesn't solve everything, but it's a big step forward while keeping a reasonably clean interface. Now, we have some portions of the kernel (to be precise, those that currently support poll() and select()) that are written as explicit state machines and can block on a much smaller context structure.

In truth, the division you assume above isn't so terrible. My applications are *already* written like that. It's just "poll() until I accumulate a whole request, then fork a thread to handle it." The only way to avoid allocating a kernel stack is to have the entire handling code path, including the return to user space, written in explicit state machine style. (Once you get to user space, you can have a threading library there if you like.)

All the flaming about different ways to implement completion notification is precisely because not much is known about the best way to do it; there aren't a lot of applications that work that way. (Certainly that's because it wasn't possible before, but it's clearly an area that requires research, so not committing to an implementation is A Good Thing.) But once that is solved, and "system call complete" can be reported without returning to a user-space thread (which is basically an alternate system call submission interface, *independent* of the fibril/threadlet non-blocking implementation), then you can find the hot paths in the kernel and special-case them to avoid creating a whole thread. To use a networking analogy, this is a cleanly layered protocol design, with an optimized fast path *implementation* that blurs the boundaries.

As for the overhead of threading, there are basically three parts:

1) System call (user/kernel boundary crossing) costs. These depend only on the total number of system calls and not on the number of threads making them. They can be mitigated *if necessary* with a syslet-like "macro syscall" mechanism to increase the work per boundary crossing. The only place threading might increase these numbers is thread synchronization, and futexes already solve that pretty well.

2) Register and stack swapping. These (and associated cache issues) are basically unavoidable, and are the bare minimum that longjmp() does. Nothing thread-based is going to reduce this. (Actually, the kernel can do better than user space because it can do lazy FPU state swapping.)

3) MMU context switch costs. These are the big ones, particularly on x86 without TLB context IDs. However, these fall into a few categories:

- Mandatory switches because the entire application is blocked. I don't see how this can be avoided; these are the cases where even a user-space longjmp-based thread library would context switch.

- Context switches between threads in an application. 
The Linux kernel already optimizes out the MMU context switch in this case, and the scheduler already knows that such context switches are cheaper and preferred. The one further optimization that's possible is if you have a system call that (in a common case) blocks multiple times *without accessing user memory*. This is not a read() or write(), but could be something like fsync() or ftruncate(). In this case, you could temporarily mark the thread as a "kernel thread" that can run in any MMU context, and then fix it explicitly when you unmark it on the return path. I can see the space overhead of 1:1 threading, but I really don't think there's much time overhead. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
First of all, may I say, this is a wonderful piece of work. It absolutely reeks of The Right Thing. Well done! However, while I need to study it in a lot more detail, I think Ingo's implementation ideas make a lot more immediate sense. It's the same idea that I thought up. Let me make it concrete. When you start an async system call: - Preallocate a second kernel stack, but don't do anything with it. There should probably be a per-CPU pool of preallocated threads to avoid too much allocation and deallocation. - Also at this point, do any resource limiting. - Set the (normally NULL) "thread blocked" hook pointer to point to a handler, as explained below. - Start down the regular system call path. - In the fast-path case, the system call completes without blocking and we set up the completion structure and return to user space. We may want to return a special value to user space to tell it that there's no need to call asys_await_completion. I think of it as the Amiga's IOF_QUICK. - Also, when returning, check and clear the thread-blocked hook. Note that we use one (cache-hot) stack for everything and do as little setup as possible on the fast path. However, if something blocks, it hits the slow path: - If something would block the thread, the scheduler invokes the thread-blocked hook before scheduling a new thread. - The hook copies the necessary state to a new (preallocated) kernel stack, which takes over the original caller's identity, so it can return immediately to user space with an "operation in progress" indicator. - The scheduler hook is also cleared. - The original thread is blocked. - The new thread returns to user space and execution continues. - The original thread completes the system call. It may block again, but as its block hook is now clear, no more scheduler magic happens. - When the operation completes and returns to sys_sys_submit(), it notices that its scheduler hook is no longer set. Thus, this is a kernel-only worker thread, and it fills in the completion structure, places itself back in the available pool, and commits suicide. Now, there is no chance that we will ever implement kernel state machines for every little ioctl. However, there may be some "async fast paths" state machines that we can use. If we're in a situation where we can complete the operation without a kernel thread at all, then we can detect the "would block" case (probably in-line, but you could use a different scheduler hook function) and set up the state machine structure. Then return "operation in progress" and let the I/O complete in its own good time. Note that you don't need to implement all of a system call as an explicit state machine; only its completion. So, for example, you could do indirect block lookups via an implicit (stack-based) state machine, but the final I/O via an explicit one. And you could do this only for normal block devices and not NFS. You only need to convert the hot paths to the explicit state machine form; the bulk of the kernel code can use separate kernel threads to do async system calls. I'm also in the "why do we need fibrils?" camp. I'm studying the code, and looking for a reason, but using the existing thread abstraction seems best. If you encountered some fundamental reason why kernel threads were Really Really Hard, then maybe it's worth it, but it's a new entity, and entia non sunt multiplicanda praeter necessitatem. 
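To restate that slow path as (pseudo)code - every name below is invented, this is just the flow I described and not anybody's actual patch:

/* Pseudocode for the slow path above; all names are made up. */

struct kthread;                                 /* a kernel thread/stack */

extern struct kthread *current_thread(void);
extern struct kthread *grab_preallocated_stack(void);  /* per-CPU pool */
extern void take_over_user_identity(struct kthread *from, struct kthread *to);
extern void clear_blocked_hook(struct kthread *t);
extern void make_runnable(struct kthread *t);

/* The scheduler calls this (if the hook is set) just before blocking. */
void async_blocked_hook(void)
{
        struct kthread *cur = current_thread();
        struct kthread *worker = grab_preallocated_stack();

        /* The new thread takes over the caller's identity and returns to
         * user space right away with an "operation in progress" result. */
        take_over_user_identity(cur, worker);
        make_runnable(worker);

        /* Clear the hook: any further blocking by the original thread,
         * which now just finishes the system call, is handled normally. */
        clear_blocked_hook(cur);
}

The fast path never touches any of this; it only pays for setting and clearing the hook pointer.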
One thing you can do for real-time tasks is, in addition to the non-blocking flag (return EAGAIN from asys_submit rather than blocking), you could have an "atomic" flag that would avoid blocking to preallocate the additional kernel thread! Then you'd really be guaranteed no long delays, ever. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is "Memory split" Kconfig option only for EMBEDDED?
> I have not had yet any problems with VMSPLIT_3G_OPT ever since I
> used it -- which dates back to when it was a feature of Con
> Kolivas's patchset (known as LOWMEM1G), [even] before it got
> merged in mainline.
>
> (Excluding the cases Adrian Bunk listed: WINE, which I don't use, and
> also 'some Java programs' which I have not seen.)

Seconded. I have several servers with 1G of memory, and appreciate the option very much; I maintained it as a custom patch long before it became a CONFIG option. Turning on CONFIG_EMBEDDED makes it a bit annoying to be sure not to play with any of the other far more dangerous options that enables. (I suppose I could just maintain a local patch to remove that from Kconfig.) The last I remember hearing, the vm system wasn't very happy with highmem much smaller than lowmem (128M/896M = 1/7) anyway. There's nothing wrong with a stern warning, but I'd think that disabling CONFIG_NET would break a lot more user-space programs, and that's not protected.

How about the following (which also fixes a bug if you select VMSPLIT_2G and HIGHMEM; with 64-bit page tables, the split must be on a 1G boundary):

choice
	depends on EXPERIMENTAL
	prompt "Memory split"
	default VMSPLIT_3G
	help
	  Select the desired split between kernel and user memory.

	  If you are not absolutely sure what you are doing, leave this
	  option alone!

	  There are important efficiency reasons why the user address space
	  and the kernel address space must both fit into the 4G linear
	  virtual address space provided by the x86 architecture. Normally,
	  Linux divides this into 3G for user virtual memory and 1G for
	  kernel memory, which holds up to 896M of RAM plus all
	  memory-mapped peripheral (e.g. PCI) devices. Excess RAM is ignored.

	  If the "High memory support" options are enabled, the excess
	  memory is available as "high memory", which can be used for user
	  data, including file system caches, but not kernel data
	  structures. However, accessing high memory from the kernel is
	  slightly more costly than low memory, as it has to be mapped into
	  the kernel address range first.

	  This option lets systems choose to have a larger "low memory"
	  space, either to avoid the need for high memory support entirely,
	  or for workloads which require particularly large kernel data
	  structures.

	  The downside is that the available user address space is reduced.
	  While most programs do not care, this is an incompatible change to
	  the kernel binary interface, and must be made with caution. Some
	  programs that process a lot of data will work more slowly or fail,
	  and some programs that do clever things with virtual memory will
	  crash immediately. In particular, changing this option from the
	  default breaks valgrind version 3.1.0, VMware, and some Java
	  virtual machines.

config VMSPLIT_3G
	bool "Default 896MB lowmem (3G/1G user/kernel split)"
config VMSPLIT_3G_OPT
	depends on !HIGHMEM
	bool "1G lowmem (2.75G/1.25G user/kernel split) CAUTION"
config VMSPLIT_2G
	bool "1.875G lowmem (2G/2G user/kernel split) CAUTION"
config VMSPLIT_2G_OPT
	depends on !HIGHMEM
	bool "2G lowmem (1.875G/2.125G user/kernel split) CAUTION"
config VMSPLIT_1G
	bool "2.875G lowmem (1G/3G user/kernel split) CAUTION"
config VMSPLIT_1G_OPT
	depends on !HIGHMEM
	bool "3G lowmem (896M/3.125G user/kernel split) CAUTION"
endchoice

config PAGE_OFFSET
	hex
	default 0xB0000000 if VMSPLIT_3G_OPT
	default 0x80000000 if VMSPLIT_2G
	default 0x78000000 if VMSPLIT_2G_OPT
	default 0x40000000 if VMSPLIT_1G
	default 0x38000000 if VMSPLIT_1G_OPT
	default 0xC0000000

(Copyright on the above abandoned to the public domain.) 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
on (a.k.a. obfuscation). And it lets you optimize them better. I apologize for not having counted them before. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
>> to keep the amount of code between ll and sc to an absolute minimum
>> to avoid interference which causes livelock. Processor timeouts
>> are generally much longer than any reasonable code sequence.
> "Generally" does not mean you can just ignore it and hope the C compiler
> does the right thing. Nor is it enough for just SOME of the architectures
> to have the properties you require.

If it's an order of magnitude larger than the common case, then yes you can. Do we worry about writing functions so big that they exceed branch displacement limits? That's detected at compile time, but LL/SC pair distance is in principle straightforward to measure, too.

> Ralf tells us that MIPS cannot execute any loads, stores, or sync
> instructions on MIPS. Ivan says no loads, stores, taken branches etc
> on Alpha.
>
> MIPS also has a limit of 2048 bytes between the ll and sc.

I agree with you about the Alpha, and that will have to be directly coded. But on MIPS, the R4000 manual (2nd ed, covering the R4400 as well) says

> The link is broken in the following circumstances:
> · if any external request (invalidate, snoop, or intervention)
>   changes the state of the line containing the lock variable to
>   invalid
> · upon completion of an ERET (return from exception)
>   instruction
> · an external update to the cache line containing the lock
>   variable

Are you absolutely sure of what you are reporting about MIPS? Have you got a source? I've been checking the most authoritative references I can find and can't find mention of such a restriction. (The R8000 User's Manual doesn't appear to mention LL/SC at all, sigh.) One thing I DID find is the "R4000MC Errata, Processor Revision 2.2 and 3.0", which documents several LL/SC bugs (Numbers 10, 12, 13) and #12 in particular requires extremely careful coding in the workaround. That may completely scuttle the idea of using generic LL/SC functions.

> So you almost definitely cannot have gcc generated assembly between. I
> think we agree on that much.

We don't. I think that if that restriction applies, it's worthless, because you can't achieve a net reduction in arch-dependent code. GCC specifically says that if you want a 100% guarantee of no reloads between asm instructions, place them in a single asm() statement.

> In truth, however, realizing that we're only talking about three
> architectures (two of which have 32 & 64-bit versions) it's probably not
> worth it. If there were five, it would probably be a savings, but 3x
> code duplication of some small, well-defined primitives is a fair price
> to pay for avoiding another layer of abstraction (a.k.a. obfuscation).
>
> And it lets you optimize them better.
>
> I apologize for not having counted them before.
> I also disagree that the architectures don't matter. ARM and PPC are
> pretty important, and I believe Linux on MIPS is growing too.

Er... I definitely don't see where I said, and I don't even see where I implied - or even hinted - that MIPS, ARM and PPC "don't matter." I use Linux on ARM daily. I just thought that writing a nearly-optimal generic primitive is about 3x harder than writing a single-architecture one, so even for primitives yet to be written, it's just as easy to do it fully arch-specific. Plus you have corner cases like the R5900 that don't have LL/SC at all. (Can it be used multiprocessor?)

> One proposal that I could buy is an atomic_ll/sc API, which mapped
> to a cmpxchg emulation even on those llsc architectures which had
> any sort of restriction whatsoever.
> This could be used in regular C
> code (eg. you indicate powerpc might be able to do this). But it may
> also help cmpxchg architectures optimise their code, because the
> load really wants to be a "load with intent to store" -- and is
> IMO the biggest suboptimal aspect of current atomic_cmpxchg.

Or, possibly, an interface like

	do {
		oldvalue = ll(addr);
		newvalue = ... oldvalue ...;
	} while (!sc(addr, oldvalue, newvalue))

Where sc() could be a cmpxchg. But, more importantly, if the architecture did implement LL/SC, it could be a "try plain SC; if that fails try CMPXCHG built out of LL/SC; if that fails, loop". Actually, I'd want something a bit more integrated, that could have the option of fetching the new oldvalue as part of the sc() implementation if that failed. Something like

	DO_ATOMIC(addr, oldvalue) {
		... code ...
	} UNTIL_ATOMIC(addr, oldvalue, newvalue);

or perhaps, to encourage short code sections,

	DO_ATOMIC(addr, oldvalue, code, newvalue);

The problem is, that's already not optimal for spinlocks, where you want to use a non-linked load while spinning. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
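P.S. One possible cmpxchg-only spelling of those macros, purely as a sketch - it leans on gcc's typeof and on a cmpxchg(ptr, old, new) that returns the value previously at ptr, which is what the kernel's does:

#define DO_ATOMIC(addr, oldvalue)					\
	do {								\
		(oldvalue) = *(volatile typeof(*(addr)) *)(addr);

#define UNTIL_ATOMIC(addr, oldvalue, newvalue)				\
	} while (cmpxchg((addr), (oldvalue), (newvalue)) != (oldvalue))

so that

	DO_ATOMIC(&counter, oldvalue) {
		newvalue = oldvalue + 1;
	} UNTIL_ATOMIC(&counter, oldvalue, newvalue);

retries with a freshly reloaded oldvalue until the store goes through. An LL/SC architecture would of course open-code the load and store instead.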
Re: [PATCH 2/3] ensure unique i_ino in filesystems without permanent
> Good catch on the inlining. I had meant to do that and missed it.

Er... if you want it to *be* inlined, you have to put it into the .h file so the compiler knows about it at the call site. "static inline" tells gcc to avoid emitting a separate callable version. Something like the following. (You'll also need to add a "#include ", unless you expand the "bool", "false" and "true" macros to their values "_Bool", "0" and "1" by hand.)

--- linux-2.6/include/linux/fs.h.super	2006-12-12 08:53:34.0 -0500
+++ linux-2.6/include/linux/fs.h	2006-12-12 08:54:14.0 -0500
@@ -1879,7 +1879,32 @@
 extern struct inode_operations simple_dir_inode_operations;
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
-extern int simple_fill_super(struct super_block *, int, struct tree_descr *);
+extern int __simple_fill_super(struct super_block *s, int magic,
+			struct tree_descr *files, bool registered);
 extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
 extern void simple_release_fs(struct vfsmount **mount, int *count);
 
+/*
+ * Fill a superblock with a standard set of fields, and add the entries in the
+ * "files" struct. Assign i_ino values to the files sequentially. This function
+ * is appropriate for filesystems that need a particular i_ino value assigned
+ * to a particular "files" entry.
+ */
+static inline int simple_fill_super(struct super_block *s, int magic,
+				    struct tree_descr *files)
+{
+	return __simple_fill_super(s, magic, files, false);
+}
+
+/*
+ * Just like simple_fill_super, but does an iunique_register on the inodes
+ * created for "files" entries. This function is appropriate when you don't
+ * need a particular i_ino value assigned to each files entry, and when the
+ * filesystem will have other registered inodes.
+ */
+static inline int registered_fill_super(struct super_block *s, int magic,
+					struct tree_descr *files)
+{
+	return __simple_fill_super(s, magic, files, true);
+}
+

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
+#define F3(x,y,z) \
+	movl	x, TMP2;	\
+	andl	y, TMP2;	\
+	movl	x, TMP;	\
+	orl	y, TMP;	\
+	andl	z, TMP;	\
+	orl	TMP2, TMP

*Sigh*. You don't need TMP2 to compute the majority function. You're implementing it as (x & y) | ((x | y) & z). Look at the rephrasing in lib/sha1.c:

#define f3(x,y,z) ((x & y) + (z & (x ^ y)))	/* majority */

By changing the second OR to x^y, you ensure that the two halves of the first disjunction are distinct, so you can replace the OR with XOR, or better yet, +. Then you can just do two adds to e. That is, write:

/* Bitwise select: x ? y : z, which is (z ^ (x & (y ^ z))) */
#define F1(x,y,z,dest) \
	movl	z, TMP;	\
	xorl	y, TMP;	\
	andl	x, TMP;	\
	xorl	z, TMP;	\
	addl	TMP, dest

/* Three-way XOR (x ^ y ^ z) */
#define F2(x,y,z,dest) \
	movl	z, TMP;	\
	xorl	x, TMP;	\
	xorl	y, TMP;	\
	addl	TMP, dest

/* Majority: (x&y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y) */
#define F3(x,y,z,dest) \
	movl	z, TMP;	\
	andl	x, TMP;	\
	addl	TMP, dest;	\
	movl	z, TMP;	\
	xorl	x, TMP;	\
	andl	y, TMP;	\
	addl	TMP, dest

Since y is the most recently computed result (it's rotated in the previous round), I arranged the code to delay its use as late as possible. Now you have one more register to play with.

I thought I had some good sha1 asm code lying around, but I can't seem to find it. (I have some excellent PowerPC asm if anyone wants it.)

Here's a basic implementation question: SHA-1 is made up of 80 rounds, 20 of each of 4 types. There are 5 working variables, a through e. The basic round is:

	t = F(b, c, d) + K + rol32(a, 5) + e + W[i];
	e = d; d = c; c = rol32(b, 30); b = a; a = t;

where W[] is the input array. W[0..15] are the input words, and W[16..79] are computed by a sort of LFSR from W[0..15]. Each group of 20 rounds has a different F() and K. This is the smallest way to write the function, but all the register shuffling makes for a bit of a speed penalty. A faster way is to unroll 5 iterations and do:

	e += F(b, c, d) + K + rol32(a, 5) + W[i];   b = rol32(b, 30);
	d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
	c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
	b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
	a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);

then loop over that 4 times each. This is somewhat larger, but still reasonably compact; only 20 of the 80 rounds are written out long-hand. Faster yet is to unroll all 80 rounds directly. But it also takes the most code space, and as we have learned, when your code is not the execution time hot spot, less cache use is faster code. Is there a preferred implementation?

Another implementation choice has to do with the computation of W[]. W[i] is a function of W[i-3], W[i-8], W[i-14] and W[i-16]. It is possible to keep a 16-word circular buffer with only the most recent 16 values of W[i%16] and compute each new word as it is needed. However, the offsets i%16 repeat every 16 rounds, which is an awkward fit with the 5-round repeating pattern of the main computation. One option is to compute all the W[] values in a pre-pass beforehand. Simple and small, but uses 320 bytes of data on the stack or wherever. An intermediate one is to keep a 20-word buffer, and compute 20 words at a time just before each of the 20-round groups. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
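P.S. If anyone wants to convince themselves the majority rewrite is exact, a tiny exhaustive check does it - the point being that the two addends can never both be 1 in the same bit position, so OR, XOR and + all agree. This is just a throwaway verification program, not anything proposed for the kernel:

#include <assert.h>

int main(void)
{
	unsigned x, y, z;

	for (x = 0; x < 2; x++)
		for (y = 0; y < 2; y++)
			for (z = 0; z < 2; z++) {
				unsigned maj = (x & y) | ((x | y) & z);

				/* both rewrites compute the same bit... */
				assert(maj == ((x & y) + (z & (x ^ y))));
				assert(maj == ((x & z) + ((x ^ z) & y)));
				/* ...and the addends never overlap */
				assert(((x & y) & (z & (x ^ y))) == 0);
				assert(((x & z) & ((x ^ z) & y)) == 0);
			}
	return 0;
}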
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
ndif
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
	.size	sha_transform5, .-sha_transform5	# Size is 0xDE6 = 3558 bytes

	.globl	sha_stackwipe
	.type	sha_stackwipe, @function
# void sha_stackwipe(void)
# After one or more sha_transform calls, we have left the contents of W[]
# on the stack, and from any 16 of those 80 words, the entire input
# can be reconstructed. If the caller cares, this function obliterates
# the relevant portion of the stack.
# 2 words of argument + 4 words of saved registers + 80 words of W[]
sha_stackwipe:
	xorl	%eax,%eax
	movl	$86,%ecx	# Damn, I had hoped that loop; pushl %eax would work..
1:	decl	%ecx
	pushl	%eax
	jne	1b
	addl	$4*86,%esp
	ret
	.size	sha_stackwipe, .-sha_stackwipe

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
Given that incomprehensible help texts are a bit of a pet peeve of mine (I just last weekend figured out that you don't need to select an I2C algorithm driver to have working I2C - I had thought it was a "one from column A, one from column B" thing), let me take a crack...

	PAE doubles the size of each page table entry, increasing kernel
	memory consumption and slowing page table access. However, it enables:

	- Addressing more than 4G of physical RAM (CONFIG_HIGHMEM is also
	  required)
	- Marking pages as readable but not executable using the NX
	  (no-execute) bit, which protects applications from stack overflow
	  attacks.
	- Swap files or partitions larger than 64G each. (Only needed with
	  >4G RAM or very heavy tmpfs use.)

	A kernel compiled with this option cannot boot on a processor
	without PAE support.

	Enabling this also disables the (expert use only)
	CONFIG_VMSPLIT_[23]G_OPT options.

Does that seem reasonably user-oriented? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] random: fix folding
> Folding is done to minimize the theoretical possibility of systematic
> weakness in the particular bits of the SHA1 hash output. The result of
> this bug is that 16 out of 80 bits are un-folded. Without a major new
> vulnerability being found in SHA1, this is harmless, but still worth
> fixing.

Actually, even WITH a major new vulnerability found in SHA1, it's harmless. Sorry to put BUG in caps earlier; it actually doesn't warrant the sort of adjective I used. The purpose of the folding is to ensure that the feedback includes bits underivable from the output. Just outputting the first 80 bits and feeding back all 160 would achieve that effect; the folding is of pretty infinitesimal benefit. Note that the last five rounds have as major outputs e, d, c, b, and a, in that order. Thus, the first words are the "most hashed" and the ones most worth using as output... which happens naturally with no folding. The folding is a submicroscopic bit of additional mixing. Frankly, the code size savings probably makes it worth deleting it. (That would also give you more flexibility to select the output/feedback ratio in whatever way you like.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
2,W[i+4]+K4); } digest[0] += a; digest[1] += b; digest[2] += c; digest[3] += d; digest[4] += e; } extern void sha_transform2(uint32_t digest[5], const char in[64]); extern void sha_transform3(uint32_t digest[5], const char in[64]); extern void sha_transform5(uint32_t digest[5], const char in[64]); extern void sha_stackwipe(void); void sha_init(uint32_t buf[5]) { buf[0] = 0x67452301; buf[1] = 0xefcdab89; buf[2] = 0x98badcfe; buf[3] = 0x10325476; buf[4] = 0xc3d2e1f0; } #include #include #include #include #if 1 void sha_stackwipe2(void) { uint32_t buf[90]; memset(buf, 0, sizeof buf); asm("" : : "r" (&buf)); /* Force the compiler to do the memset */ } #endif #define TEST_SIZE (10*1024*1024) int main(void) { uint32_t W[80]; uint32_t out[5]; char const text[64] = "Hello, world!\n"; char *buf; uint32_t *p; unsigned i; struct timeval start, stop; sha_init(out); sha_transform(out, text, W); printf(" One: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform4(out, text, W); printf(" Four: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform2(out, text); printf(" Two: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform3(out, text); printf("Three: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform5(out, text); printf(" Five: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_stackwipe(); #if 1 /* Set up a large buffer full of stuff */ buf = malloc(TEST_SIZE); p = (uint32_t *)buf; memcpy(p, W+80-16, 16*sizeof *p); for (i = 0; i < TEST_SIZE/sizeof *p - 16; i++) { uint32_t a = p[i+13] ^ p[i+8] ^ p[i+2] ^ p[i]; p[i+16] = rol32(a, 1); } sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform(out, buf+i, W); gettimeofday(&stop, 0); printf(" One: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform4(out, buf+i, W); gettimeofday(&stop, 0); printf(" Four: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform2(out, buf+i); gettimeofday(&stop, 0); printf(" Two: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform3(out, buf+i); gettimeofday(&stop, 0); printf("Three: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform5(out, buf+i); gettimeofday(&stop, 0); printf(" Five: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_stackwipe(); #endif return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
msleep(1000) vs. schedule_timeout_uninterruptible(HZ+1)
I was looking at some of the stupider code that calls msleep(), particularly that which does msleep(jiffies_to_msecs(jiff)) and I noticed that msleep() just calls schedule_timeout_uninterruptible(). But it does it in a loop. The basic question is, when does the loop make a difference? Is it only when you're on a wait queue? Or are there other kinds of unexpected wakeups that can arrive? I see all kinds of uses of both kinds for simple "wait a while" operations, and I'm not sure if one is more correct than the other. (And, in drivers/media/video/cpia2/cpia2_v4l.c:cpia2_exit(), a lovely example of calling schedule_timeout() without set_current_state() first.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
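P.S. For reference, the implementation I'm puzzling over is, if I'm reading kernel/timer.c right, essentially this:

	void msleep(unsigned int msecs)
	{
		unsigned long timeout = msecs_to_jiffies(msecs) + 1;

		while (timeout)
			timeout = schedule_timeout_uninterruptible(timeout);
	}

So the question boils down to: can schedule_timeout_uninterruptible() ever return a nonzero remainder for a task that isn't sitting on a wait queue?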
Re: Why can't we sleep in an ISR?
Sleeping in an ISR is not fundamentally impossible - I could design a multitasker that permitted it - but has significant problems, and most multitaskers, including Linux, forbid it.

The first problem is the scheduler. "Sleeping" is actually a call into the scheduler to choose another process to run. There are times - so-called critical sections - when the scheduler can't be called. If an interrupt can call the scheduler, then every critical section has to disable interrupts. Otherwise, an interrupt might arrive and end up calling the scheduler. This increases interrupt latency. If interrupts are forbidden to sleep, then there's no need to disable interrupts in critical sections, so interrupts can be responded to faster. Most multitaskers find this worth the price.

The second problem is shared interrupts. You want to sleep until something happens. The processor hears about that event via an interrupt. Inside an ISR, interrupts are disabled. You have to somehow enable the interrupt that will wake up the sleeping ISR without enabling the interrupt that the ISR is in the middle of handling (or the handler will start a second time and make a mess). This is complicated and prone to error. And, in the case of shared interrupts (as allowed by PCI), it's possible that the interrupt you need to wait for is exactly the same interrupt as the one you're in the middle of handling. So it might be impossible!

The third problem is that you're obviously increasing the latency of the interrupt whose handler you're sleeping in.

Finally, if you're even *thinking* of wanting to sleep in an ISR, you probably have a deadlock waiting to happen. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.22-rcX Transmeta/APM regression
Hardware: Fujitsu Lifebook P-2040, TM5800 800 MHz processor 2.6.21: Closing the lid causes APM suspend. Opening it resumes just fine. 2.6.22-rc5/-rc6: On resume, backlight comes on, but system is otherwise frozen. Nothing happens until I hold the power button to force a power off. I'm trying to bisect, but there's a large range of commits which crash on boot in init_transmeta, which is slowing me down. However, I did manage to find a kernel version that gives an error message instead of a blank screen, which might be useful. I can even switch VTs and type into the shell afterwards, but actually trying to do anything hangs. Which includes anything like run a command to capture this to a file or another machine on the network, even if I took care to cache the necessary executables and libraries before suspending. So the following is transcribed by hand. general protection fault: [#1] Modules linked in: CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010246 (2.6.21-gba7cc09c #16) EIP is at get_fixed_ranges+0x9/0x60 eax: c0338d24 ebx: c03589a0 ecx: 0250 edx: esi: c0338d24 edi: 000a ebp: esp: cefa4f5c ds: 007b es: 007b fs: gs: 000 ss: 0068 Process kapmd (pid: 70, ti=cefa4000 task=cef89550 task.ti=cefa4000) Stack: c03589a0 c010b1a0 c0238e4d c010b1a0 000a c010addf c010b1a0 000a c010b5f1 cefa4fc4 cefa4fc0 cefa4fbc cefa4fb8 cefa4fb8 0001 e45a3b0f cef89550 c0110f20 c02f53e4 c02f5e34 c010b1a0 Call Trace: [] apm+0x0/0x500 [] __save_process_rstate+0xd/0x50 [] apm+0x0/0x500 [] suspend+0x1f/0xb0 [] apm+0x0/0x500 [] apm+0x451/0x500 [] default_wake_function+0x0/0x10 [] apm+0x0/0x500 [] apm+0x0/0x500 [] kthread+0x39/0x60 [] kthread+0x0/0x60 [] kernel_thread_helper+0x7/0x10 === Code: 46 83 c7 04 39 ee 0f 8c 40 ff ff ff 83 c4 3c 31 c0 5b 5e 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 56 b9 50 02 00 00 53 89 c6 <0f> 32 89 06 89 d0 b1 58 31 d2 89 46 04 0f 32 89 46 08 89 d0 b1 EIP: [] get_fixed_ranges+0x9/0x60 SS:ESP 0068:cefa4f5c The init_transmeta crash looks like the following: Calibrating delay using timer specific routine.. 1630.69 BogoMIPS (lpj=8153474) Mount-cache hash table entries: 512 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (32 bytes/line) CPU: L2 Cache: 512K (128 bytes/line) CPU: Processor revision 1.4.1.0, 800 MHz CPU: Code Morphing Software revision 4.2.6-8-168 CPU: 20010703 00:29 official release 4.2.6#2 general protection fault: [#1] Modules linked in: CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010286 (2.6.21-g1e7371c1 #18) EIP is at init_transmeta+0x1d5/0x230 eax: ebx: ecx: 80860004 edx: esi: edi: ebp: 80860004 esp: c030fed0 ds: 007b es: 007b fs: gs: ss: 0068 Process swapper (pid: 0, ti=c030f000 task=c02ed280 task.ti=c030f000) Stack: c02b5c52 c030ff1b 0002 0006 0008 00a8 cefc2600 0246 c030d2e0 0320 0020 c01b553f 3200 30313030 20333037 323a3030 666f2039 69636966 /* "20010703 00:29 offici" */ Call Trace: [] idr_get_new_above_int+0x10f/0x1f0 [] identify_cpu+0x20e/0x370 [] idr_get_new+0xd/0x30 [] proc_register+0x30/0xe0 [] identify_boot_cpu+0xd/0x20 [] check_bugs+0x8/0x100 [] start_kernel+0x203/0x210 [] unknown_bootoption+0x0/0x210 === Code: 00 c6 84 24 8b 00 00 00 00 89 7c 24 04 c7 04 24 52 5c 2b c0 ed 8d fc df ff bd 04 00 86 80 89 e9 0f 32 89 c6 93 c8 ff 89 d7 89 c2 <0f> 30 31 c9 b8 01 00 00 00 0f a2 8b 44 24 28 89 e9 89 50 0c b8 EIP: [] init_transmeta+0x1d5/0x230 SS:ESP 0068:c030fed0 Kernel panic - not syncing: Attempted to kill the idle task! 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22-rcX Transmeta/APM regression
> .config and contents of /proc/cpuinfo would be helpful... Apologies! I'm still working on the bisection, but... The following is from 2.6.21-gae1ee11b, which works. $ cat /tmp/cpuinfo processor : 0 vendor_id : GenuineTMx86 cpu family : 6 model : 4 model name : Transmeta(tm) Crusoe(tm) Processor TM5800 stepping: 3 cpu MHz : 300.000 cache size : 512 KB fdiv_bug: no hlt_bug : no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr cx8 sep cmov mmx longrun lrti constant_tsc bogomips: 1608.38 clflush size: 32 $ lspci -nn 00:00.0 Host bridge [0600]: Transmeta Corporation LongRun Northbridge [1279:0395] (rev 01) 00:00.1 RAM memory [0500]: Transmeta Corporation SDRAM controller [1279:0396] 00:00.2 RAM memory [0500]: Transmeta Corporation BIOS scratchpad [1279:0397] 00:02.0 USB Controller [0c03]: ALi Corporation USB 1.1 Controller [10b9:5237] (rev 03) 00:04.0 Multimedia audio controller [0401]: ALi Corporation M5451 PCI AC-Link Controller Audio Device [10b9:5451] (rev 01) 00:06.0 Bridge [0680]: ALi Corporation M7101 Power Management Controller [PMU] [10b9:7101] 00:07.0 ISA bridge [0601]: ALi Corporation M1533/M1535 PCI to ISA Bridge [Aladdin IV/V/V+] [10b9:1533] 00:0c.0 CardBus bridge [0607]: Texas Instruments PCI1410 PC card Cardbus Controller [104c:ac50] (rev 01) 00:0f.0 IDE interface [0101]: ALi Corporation M5229 IDE [10b9:5229] (rev c3) 00:12.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ [10ec:8139] (rev 10) 00:13.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link) [104c:8026] 00:14.0 VGA compatible controller [0300]: ATI Technologies Inc Rage Mobility P/M [1002:4c52] (rev 64) $ grep ^CONFIG /usr/src/linux/.config CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_IKCONFIG=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_KMOD=y CONFIG_BLOCK=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_CFQ=y CONFIG_DEFAULT_IOSCHED="cfq" CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_X86_PC=y CONFIG_MCRUSOE=y CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=5 CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_TSC=y CONFIG_HPET_TIMER=y CONFIG_PREEMPT_NONE=y CONFIG_X86_UP_APIC=y CONFIG_X86_UP_IOAPIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_VM86=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_NOHIGHMEM=y CONFIG_PAGE_OFFSET=0xC000 
CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_ZONE_DMA_FLAG=1 CONFIG_MTRR=y CONFIG_SECCOMP=y CONFIG_HZ_100=y CONFIG_HZ=100 CONFIG_PHYSICAL_START=0x10 CONFIG_PHYSICAL_ALIGN=0x10 CONFIG_PM=y CONFIG_PM_DEBUG=y CONFIG_SOFTWARE_SUSPEND=y CONFIG_PM_STD_PARTITION="/dev/hda2" CONFIG_APM=y CONFIG_APM_CPU_IDLE=y CONFIG_APM_DISPLAY_BLANK=y CONFIG_APM_RTC_IS_GMT=y CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y CONFIG_CPU_FREQ_DEBUG=y CONFIG_CPU_FREQ_STAT=y CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=y CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y CONFIG_X86_LONGRUN=y CONFIG_PCI=y CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_ISA_DMA_API=y CONFIG_PCCARD=y CONFIG_PCMCIA=y CONFIG_PCMCIA_LOAD_CIS=y CONFIG_PCMCIA_IOCTL=y CONFIG_CARDBUS=y CONFIG_YENTA=y CONFIG_YENTA_O2=y CONFIG_YENTA_RICOH=y CONFIG_YENTA_TI=y CONFIG_YENTA_ENE_TUNE=y CONFIG_YENTA_TOSHIBA=y CONFIG_PCCARD_NONSTATIC=
Re: 2.6.22-rcX Transmeta/APM regression
Okay, after a ridiculous amount of bisecting and recompiling and rebooting... First I had to find out that the kernel stops booting as of bf50467204: "i386: Use per-cpu GDT immediately on boot" (With this commit, it silently stops booting. The GP fault I posted earlier comes a little later, but I didn't bother tracking it down.) and starts again as of b0b73cb41d: "i386: msr.h: be paranoid about types and parentheses" However, one commit before the former suspends properly, and the latter fails to suspend (exactly the same problem at get_fixed_ranges+0x9/0x60), so I had to bisect further between the two, backporting the msr.h changes across the msr-index.h splitoff. Anyway, the patch which introduces the problem is the aptly named 3ebad: 3ebad59056: [PATCH] x86: Save and restore the fixed-range MTRRs of the BSP when suspending 2.6.22-rc6 plus that one commit reverted successfully does APM suspend (and resume) for me. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22-rcX Transmeta/APM regression
Responding to various proposed fixes: > Index: linux/arch/i386/kernel/cpu/mtrr/main.c > === > --- linux.orig/arch/i386/kernel/cpu/mtrr/main.c > +++ linux/arch/i386/kernel/cpu/mtrr/main.c > @@ -734,8 +734,11 @@ void mtrr_ap_init(void) > */ > void mtrr_save_state(void) > { > - int cpu = get_cpu(); > + int cpu; > > + if (!cpu_has_mtrr) > + return; > + cpu = get_cpu(); > if (cpu == 0) > mtrr_save_fixed_ranges(NULL); > else This does not change the symptoms in any way. > --- a/arch/i386/kernel/cpu/mtrr/generic.c~i386-mtrr-crash-fix > +++ a/arch/i386/kernel/cpu/mtrr/generic.c > @@ -65,7 +65,8 @@ get_fixed_ranges(mtrr_type * frs) > > void mtrr_save_fixed_ranges(void *info) > { > - get_fixed_ranges(mtrr_state.fixed_ranges); > + if (cpu_has_mtrr) > + get_fixed_ranges(mtrr_state.fixed_ranges); > } > > static void print_fixed(unsigned base, unsigned step, const mtrr_type*types) This works great, thanks! Please consider the regression diagnosed and fixed. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
A simpler variant on sys_indirect?
I was just thinking, while sys_indirect is an interesting way to add features to a system call, the argument marshalling in user space is a bit of a pain. An alternate idea would be to instead have a "prefix system call" that sets some flags that apply to the next system call made by that thread only. They wouldn't be global mode flags that would mess up libraries. Maybe I've just been programming x86s too long, but this seems like a nicer mental model. The downsides are that you need to save and restore the prefix flags across signal delivery, and you have a second user/kernel/user transition. Most of the options seem to be applied to system calls that resolve path names. While that is certainly a very important code path, it's also of non-trivial length, even with the dcache. How much would one extra kernel entry bloat the budget? And if the kernel entry overhead IS a problem, wouldn't you want to batch together the non-prefix system calls as well, using something like the syslet ideas that were kicked around recently? That would allow less than 1 kernel entry per system call, even with prefixes. Oh! That suggests an interesting possibility that solves the signal handling problem as well: - Make a separate prefix system call, BUT - The flags are reset on each return to user space, THUS - You *have* to use a batch-system-call mechanism for the prefix system calls to do anything. Of course, this takes us right back to the beginning with respect to messy user-space argument marshalling. But at least it's only one indirect system call mechanism, not two. Wrapping indirect system call mechanism #1 (to set syscall options) in indirect system call mechanism #2 (to batch system calls) seems like a bit of a nightmare. I'm not at all sure that these are good ideas, but they're not obviously bad ones, to me. Is it worth looking for synergy between various "indirect system call" ideas? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
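To make the calling pattern above concrete, here is a minimal user-space sketch of the "prefix system call" idea. No such syscall exists in any kernel; __NR_prefix, PREFIX_FD_CLOEXEC and the semantics sketched in the comments are invented purely for illustration and are not taken from the sys_indirect proposal.

/*
 * Purely illustrative: one cheap extra syscall arms per-thread flags
 * that apply only to the next system call this thread makes, and the
 * kernel clears them again on the way back to user space.
 */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <unistd.h>

#define __NR_prefix        512            /* hypothetical syscall number */
#define PREFIX_FD_CLOEXEC  0x00000001     /* hypothetical: next new fd is close-on-exec */

static int socket_cloexec(int domain, int type, int protocol)
{
        /* Arm the flag for this thread's next syscall only. */
        syscall(__NR_prefix, PREFIX_FD_CLOEXEC);
        return socket(domain, type, protocol);
}

In the batched variant suggested at the end of the message, the prefix call and the call it modifies would instead be submitted together through the batching mechanism, which is what makes the reset-on-every-return-to-user-space rule workable.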
Re: [patch] change futex_wait() to hrtimers
> BTW. my futex man page says timeout's contents "describe the maximum duration > of the wait". Surely that should be *minimum*? Michael cc'ed. Er, the intent of the wording is to say "futex will wait until uaddr no longer contains val, or the timeout expires, whichever happens first". One option for selecting different clock resolutions is to use the clockid_t from the POSIX clock_gettime() family. That is, specify the clock that a wait uses, and then have a separate mechanism for turning a resolution requirement into a clockid_t. (And there can be default clocks for interfaces that don't specify one explicitly.) Although clockid_t is pretty generic, it's biased toward an enumerated list of clocks rather than a continuous resolution. Fortunately, that seems to match the implementation ideas. The question is how much the timeout gets rounded, and the choices are currently jiffies or microseconds. A related option may be whether rounding down is acceptable. For some applications (periodic polling for events), it's fine. For others, it's not. Thus, while it's okay to specify such clocks explicitly, it'd probably be a good idea to forbid selecting them as the default for interfaces that don't specify a clock explicitly. I had some code that suffered 1 ms buzz-loops on Solaris because poll(2) would round the timeout interval down, but the loop calling it would explicitly check whether the timeout had expired using gettimeofday() and would keep re-invoking poll(pollfds, npollfds, 1) until the timeout really did expire. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
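As a concrete illustration of that failure mode, here is a minimal sketch of the kind of loop described above; the helper names and the deadline bookkeeping are assumptions for illustration, not the original code. The caller never passes a zero timeout, so if poll() rounds its timeout interval down and returns early, the loop spins until the deadline really passes.

#include <poll.h>
#include <stddef.h>
#include <sys/time.h>

static long ms_until(const struct timeval *deadline)
{
        struct timeval now;

        gettimeofday(&now, NULL);
        return (deadline->tv_sec - now.tv_sec) * 1000 +
               (deadline->tv_usec - now.tv_usec) / 1000;
}

static int wait_for_events(struct pollfd *fds, nfds_t nfds,
                           const struct timeval *deadline)
{
        for (;;) {
                long remaining = ms_until(deadline);
                int n;

                if (remaining <= 0)
                        return 0;               /* the timeout really has expired */

                /* Never pass 0; that would mean "do not block at all". */
                n = poll(fds, nfds, remaining < 1 ? 1 : (int)remaining);
                if (n != 0)
                        return n;               /* events, or an error */
                /* n == 0: poll() gave up early; re-check the clock and retry. */
        }
}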
swapper: page allocation failure. order:0, mode:0x20
I'm not used to seeing order-0 allocation failures on lightly loaded 2 GB (amd64, so it's all low memory) machines. Can anyone tell me what happened? It happened just as I was transferring a large file to the machine for later crunching (the "sgrep" program is a local number-crunching application that was getting alignment errors in SSE code), and the network stopped working. The "NETDEV WATCHDOG" message happened a few minutes later, during the head-scratching phase. I ended up rebooting the machine to get on with the number-crunching, but this is a bit mysterious. The ethernet driver is forcedeth. Does it appear to be at fault? Here's a dmesg log, with /proc/*info and lspci appended. amd64 uniprocessor, with ECC memory. Stock 2.6.21 + linuxpps patches. Thanks for any suggestions! er [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12 serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: AT Translated Set 2 keyboard as /class/input/input2 input: PC Speaker as /class/input/input3 input: PS/2 Generic Mouse as /class/input/input4 i2c_adapter i2c-0: nForce2 SMBus adapter at 0x1c00 i2c_adapter i2c-1: nForce2 SMBus adapter at 0x1c40 it87: Found IT8712F chip at 0x290, revision 7 it87: in3 is VCC (+5V) it87: in7 is VCCH (+5V Stand-By) md: raid0 personality registered for level 0 md: raid1 personality registered for level 1 md: raid10 personality registered for level 10 raid6: int64x1 2052 MB/s raid6: int64x2 2606 MB/s raid6: int64x4 2579 MB/s raid6: int64x8 1838 MB/s raid6: sse2x12817 MB/s raid6: sse2x23738 MB/s raid6: sse2x44021 MB/s raid6: using algorithm sse2x4 (4021 MB/s) md: raid6 personality registered for level 6 md: raid5 personality registered for level 5 md: raid4 personality registered for level 4 raid5: automatically using best checksumming function: generic_sse generic_sse: 7089.000 MB/sec raid5: using function: generic_sse (7089.000 MB/sec) EDAC MC: Ver: 2.0.1 Apr 26 2007 netem: version 1.2 Netfilter messages via NETLINK v0.30. ip_tables: (C) 2000-2006 Netfilter Core Team TCP cubic registered Initializing XFRM netlink socket NET: Registered protocol family 1 NET: Registered protocol family 17 NET: Registered protocol family 15 802.1Q VLAN Support v1.8 Ben Greear <[EMAIL PROTECTED]> All bugs added by David S. Miller <[EMAIL PROTECTED]> powernow-k8: Found 1 AMD Athlon(tm) 64 Processor 3700+ processors (version 2.00.00) powernow-k8:0 : fid 0xe (2200 MHz), vid 0x6 powernow-k8:1 : fid 0xc (2000 MHz), vid 0x8 powernow-k8:2 : fid 0xa (1800 MHz), vid 0xa powernow-k8:3 : fid 0x2 (1000 MHz), vid 0x12 md: Autodetecting RAID arrays. md: autorun ... md: considering sdf4 ... md: adding sdf4 ... md: sdf3 has different UUID to sdf4 md: sdf2 has different UUID to sdf4 md: sdf1 has different UUID to sdf4 md: adding sde4 ... md: sde3 has different UUID to sdf4 md: sde2 has different UUID to sdf4 md: sde1 has different UUID to sdf4 md: adding sdd4 ... md: sdd3 has different UUID to sdf4 md: sdd2 has different UUID to sdf4 md: sdd1 has different UUID to sdf4 md: adding sdc4 ... md: sdc3 has different UUID to sdf4 md: sdc2 has different UUID to sdf4 md: sdc1 has different UUID to sdf4 md: adding sdb4 ... md: sdb3 has different UUID to sdf4 md: sdb2 has different UUID to sdf4 md: sdb1 has different UUID to sdf4 md: adding sda4 ... 
md: sda3 has different UUID to sdf4 md: sda2 has different UUID to sdf4 md: sda1 has different UUID to sdf4 md: created md5 md: bind md: bind md: bind md: bind md: bind md: bind md: running: raid5: device sdf4 operational as raid disk 5 raid5: device sde4 operational as raid disk 4 raid5: device sdd4 operational as raid disk 3 raid5: device sdc4 operational as raid disk 2 raid5: device sdb4 operational as raid disk 1 raid5: device sda4 operational as raid disk 0 raid5: allocated 6362kB for md5 raid5: raid level 5 set md5 active with 6 out of 6 devices, algorithm 2 RAID5 conf printout: --- rd:6 wd:6 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:sde4 disk 5, o:1, dev:sdf4 md5: bitmap initialized from disk: read 11/11 pages, set 1 bits, status: 0 created bitmap (164 pages) for device md5 md: considering sdf3 ... md: adding sdf3 ... md: sdf2 has different UUID to sdf3 md: sdf1 has different UUID to sdf3 md: adding sde3 ... md: sde2 has different UUID to sdf3 md: sde1 has different UUID to sdf3 md: adding sdd3 ... md: sdd2 has different UUID to sdf3 md: sdd1 has different UUID to sdf3 md: adding sdc3 ... md: sdc2 has different UUID to sdf3 md: sdc1 has different UUID to sdf3 md: adding sdb3 ... md: sdb2 has different UUID to sdf3 md: sdb1 has different UUID to sdf3 md: adding sda3 ... md: sda2 has different UUID to sdf3 md: sda1 has different UUID to sdf3 md: created md4 md: bind md: bind md: bind md: bind md: bind md: bind md: running: raid10: raid set md4 active with 6 out of 6 devices md4: b
Re: increase Linux kernel address space 3.5 G memory on Redhat Enterprise
> Hi: > I am running Redhat Linux Enterprise version 4 update 4 on a dual-core > 4G memory machine. There are many references on the web talking about > increasing default user address space to 3.5 G however lacking specific > instructions. My questions: > > 1. What are the specific steps to be done for the kernel to support 3.5 G > address space? > 2. Do I need to re-compile kernel to make this happen? If so, any > specific instruction? 2. Yes, you need to re-compile the kernel. Instructions are all over the web. Basically, "cd /usr/src/linux", make sure a reasonable default .config file is installed (the distribution should supply one for its default kernel), "make menuconfig" or "make xconfig", change the options you want changed, then "make" and "make install". The latter *usually* works; the usual worst case is that it installs the kernel somewhere other than where your boot loader is looking, and rebooting will find the old kernel. The fun comes when you've left an option out of your new kernel that you need to boot - like the hard drive controller! Then you need to go back to an old, known working kernel. It's not at all difficult, but you do need to be careful; a mistake can be awkward to recover from if you don't plan ahead. 1. First of all, that's not necessarily a good idea. Doing that would limit you to 384 MB of kernel memory, after the usual 128 MB deduction for PCI devices. That has to fit the kernel binary, all page tables, inode cache, network buffers, and so on. For some workloads, that can be a bottleneck. If your application is heavily biased toward file data that the kernel doesn't have to look at, such as databases, it might be okay. A much better thing would be to take advantage of the fact that every multi-core processor I've heard of (IBM's POWER4, Sun Niagara, and a few by some companies you may not have heard of like Intel and AMD) is a 64-bit processor. So you can run a 64-bit kernel and get terabytes of user address space. Even 32-bit applications get a full 4G of address space, as the 64-bit kernel doesn't need to share. That would make your user application happier *and* the kernel happier. It would increase kernel data structure size, but it's still usually a net win. 1b. If you really want to do it, it's not a normally selectable option, but you can add it to arch/i386/Kconfig by following the pattern of the others. You need CONFIG_PAGE_OFFSET=0xE0000000, and you need to be sure that CONFIG_HIGHMEM64G is turned OFF. (If it's on, you're using PAE, and its 3-level page table structure requires using a 1G boundary.) Then configure the kernel, select CONFIG_EMBEDDED under "General setup", then your memory split under "Processor type and features". Compile, install the new kernel, and reboot. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
> 3 (I think) separate instances of this, each involving raid5. Is your > array degraded or fully operational? Ding! A drive fell out the other day, which is why the problems only appeared recently. md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0] 1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [_U] bitmap: 149/164 pages [596KB], 1024KB chunk H'm... this means that my alarm scripts aren't working. Well, that's good to know. The drive is being re-integrated now. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.20.3 AMD64 oops in CFQ code
c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c RIP [] cfq_dispatch_insert+0x18/0x68 RSP CR2: 0098 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
As an additional data point, here's a libata problem I'm having trying to rebuild the array. I have six identical 400 GB drives (ST3400832AS), and one is giving me hassles. I've run SMART short and long diagnostics, badblocks, and Seagate's "seatools" diagnostic software, and none of these find problems. It is the only one of the six with a non-zero reallocated sector count (it's 26). Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5 part. The RAID-10 part integrated successfully, but the RAID-5 got to about 60% and then puked: ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata5.00: configured for UDMA/100 ata5: EH complete SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB) sde: Write Protect is off SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: softreset failed (timeout) ata5: softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: follow-up softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: reset failed, giving up ata5.00: disabled ata5: EH complete sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 91795259 md: super_written gets error=-5, uptodate=0 raid10: Disk failure on sde3, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 481942994 raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 481944018 md: md5: recovery done. 
RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 3, wo:1, o:0, dev:sde3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:0, dev:sde4 disk 5, o:1, dev:sdf4 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 5, o:1, dev:sdf4 The first error address is just barely inside the RAID-10 part (which ends at sector 91,795,410), while the second and third errors (at 481,942,994) look like where the reconstruction was working. Anyway, what's annoying is that I can't figure out how to bring the drive back on line without resetting the box. It's in a hot-swap enclosure, but power cycling the drive doesn't seem to help. I thought libata hotplug was working? (SiI3132 card, using the sil24 driver.) (H'm... after rebooting, reallocated sectors jumped from 26 to 39. Something is up with that drive.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.6.21
Today, 26 April, is the *21*st anniversary of the nuclear explosion at the Chernobyl station (ex-USSR). And Linux 2.6.*21* is released. Nice! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> Yes, this will make msync(MS_ASYNC) more heavyweight again. But if an > application doesn't want to update the timestamps, it should just omit > this call, since it does nothing else. Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC) and doesn't care about timestamps. (In fact, sometimes it's configured to write to a raw device and there are no timestamps.) It's used as a poor man's portable async I/O. The application logs data to disk, and sometimes needs to sync it to disk to ensure it has all been written. To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC) as soon as a page is filled up to prompt asynchronous writeback. "I'm done writing this page and don't intend to write it again. Please start committing it to stable storage, but don't block me." Then, occasionally, there's an msync(MS_SYNC) call to be sure the data is synced to disk. This caused annoying hiccups before the MS_ASYNC calls were added. I agree that msync(MS_ASYNC) has no semantics if time is ignored. But it's a useful way to tell the OS that the page is not going to be dirtied again. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
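A minimal sketch of that logging pattern, assuming a page-aligned file-backed mapping obtained from mmap(); the structure and helper names are invented for illustration, not taken from the application described above.

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

struct maplog {
        char   *base;       /* page-aligned file-backed mapping from mmap() */
        size_t  pagesz;     /* page size, e.g. sysconf(_SC_PAGESIZE) */
        size_t  off;        /* next byte to write; caller keeps it in bounds */
};

static void maplog_append(struct maplog *log, const void *rec, size_t len)
{
        size_t old_page = log->off / log->pagesz;
        size_t new_page;

        memcpy(log->base + log->off, rec, len);
        log->off += len;
        new_page = log->off / log->pagesz;

        /*
         * Every page that has just been completely filled will not be
         * written again: hint that writeback can start now, without
         * blocking the logger.
         */
        if (new_page != old_page)
                msync(log->base + old_page * log->pagesz,
                      (new_page - old_page) * log->pagesz, MS_ASYNC);
}

static void maplog_checkpoint(struct maplog *log)
{
        /* The occasional "make sure it is really on disk" point. */
        msync(log->base, log->off, MS_SYNC);
}

The MS_ASYNC call is purely a writeback hint here; the occasional MS_SYNC is what actually guarantees the data is on stable storage, and it stays cheap only if the earlier hints kept the amount of dirty data small.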
Re: Why is NCQ enabled by default by libata? (2.6.20)
Here's some more data. 6x ST3400832AS (Seagate 7200.8) 400 GB drives. 3x SiI3232 PCIe SATA controllers 2.2 GHz Athlon 64, 1024k cache (3700+), 2 GB RAM Linux 2.6.20.4, 64-bit kernel Tested able to sustain reads at 60 MB/sec/drive simultaneously. RAID-10 is across 6 drives, first part of drive. RAID-5 most of the drive, so depending on allocation policies, may be a bit slower. The test sequence actually was: 1) raid5ncq 2) raid5noncq 3) raid10noncq 4) raid10ncq 5) raid5ncq 6) raid5noncq but I rearranged things to make it easier to compare. Note that NCQ makes writes faster (oh... I have write cacheing turned off; perhaps I should turn it on and do another round), but no-NCQ seems to have a read advantage. [EMAIL PROTECTED]@#ing bonnie++ overflows and won't print file read times; I haven't bothered to fix that yet. NCQ seems to have a pretty significant effect on the file operations, especially deletes. Update: added 7) wcache5noncq - RAID 5 with no NCQ but write cache enabled 8) wcache5ncq - RAID 5 with NCQ and write cache enabled RAID=5, NCQ Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP raid5ncq 7952M 31688 53 34760 10 25327 4 57908 86 167680 13 292.2 0 raid5ncq 7952M 30357 50 34154 10 24876 4 59692 89 165663 13 285.6 0 raid5noncq7952M 29015 48 31627 9 24263 4 61154 91 185389 14 286.6 0 raid5noncq7952M 28447 47 31163 9 23306 4 60456 89 198624 15 293.4 0 wcache5ncq7952M 32433 54 35413 10 26139 4 59898 89 168032 13 303.6 0 wcache5noncq 7952M 31768 53 34597 10 25849 4 61049 90 193351 14 304.8 0 raid10ncq 7952M 54043 89 110804 32 48859 9 58809 87 142140 12 363.8 0 raid10noncq 7952M 48912 81 68428 21 38906 7 57824 87 146030 12 358.2 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min/sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16:10:16/64 1351 25 + +++ 941 3 2887 42 31526 96 382 1 16:10:16/64 1400 18 + +++ 386 1 4959 69 32118 95 570 2 16:10:16/64 636 8 + +++ 176 0 1649 23 + +++ 245 1 16:10:16/64 715 12 + +++ 164 0 156 2 11023 32 2161 8 16:10:16/64 1291 26 + +++ 2778 10 2424 33 31127 93 483 2 16:10:16/64 1236 26 + +++ 840 3 2519 37 30366 91 445 2 16:10:16/64 1714 37 + +++ 1652 6 789 11 4700 14 12264 48 16:10:16/64 634 11 + +++ 1035 3 338 4 + +++ 1349 5 raid5ncq,7952M,31688,53,34760,10,25327,4,57908,86,167680,13,292.2,0,16:10:16/64,1351,25,+,+++,941,3,2887,42,31526,96,382,1 raid5ncq,7952M,30357,50,34154,10,24876,4,59692,89,165663,13,285.6,0,16:10:16/64,1400,18,+,+++,386,1,4959,69,32118,95,570,2 raid5noncq,7952M,29015,48,31627,9,24263,4,61154,91,185389,14,286.6,0,16:10:16/64,636,8,+,+++,176,0,1649,23,+,+++,245,1 raid5noncq,7952M,28447,47,31163,9,23306,4,60456,89,198624,15,293.4,0,16:10:16/64,715,12,+,+++,164,0,156,2,11023,32,2161,8 wcache5ncq,7952M,32433,54,35413,10,26139,4,59898,89,168032,13,303.6,0,16:10:16/64,1291,26,+,+++,2778,10,2424,33,31127,93,483,2 wcache5noncq,7952M,31768,53,34597,10,25849,4,61049,90,193351,14,304.8,0,16:10:16/64,1236,26,+,+++,840,3,2519,37,30366,91,445,2 raid10ncq,7952M,54043,89,110804,32,48859,9,58809,87,142140,12,363.8,0,16:10:16/64,1714,37,+,+++,1652,6,789,11,4700,14,12264,48 raid10noncq,7952M,48912,81,68428,21,38906,7,57824,87,146030,12,358.2,0,16:10:16/64,634,11,+,+++,1035,3,338,4,+,+++,1349,5 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is NCQ enabled by default by libata? (2.6.20)
>From [EMAIL PROTECTED] Tue Mar 27 16:25:58 2007 Date: Tue, 27 Mar 2007 12:25:52 -0400 (EDT) From: Justin Piszcz <[EMAIL PROTECTED]> X-X-Sender: [EMAIL PROTECTED] To: [EMAIL PROTECTED] cc: [EMAIL PROTECTED], [EMAIL PROTECTED], linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: Why is NCQ enabled by default by libata? (2.6.20) In-Reply-To: <[EMAIL PROTECTED]> References: <[EMAIL PROTECTED]> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed On Tue, 27 Mar 2007, [EMAIL PROTECTED] wrote: > Here's some more data. > > 6x ST3400832AS (Seagate 7200.8) 400 GB drives. > 3x SiI3232 PCIe SATA controllers > 2.2 GHz Athlon 64, 1024k cache (3700+), 2 GB RAM > Linux 2.6.20.4, 64-bit kernel > > Tested able to sustain reads at 60 MB/sec/drive simultaneously. > > RAID-10 is across 6 drives, first part of drive. > RAID-5 most of the drive, so depending on allocation policies, > may be a bit slower. > > The test sequence actually was: > 1) raid5ncq > 2) raid5noncq > 3) raid10noncq > 4) raid10ncq > 5) raid5ncq > 6) raid5noncq > but I rearranged things to make it easier to compare. > > Note that NCQ makes writes faster (oh... I have write cacheing turned off; > perhaps I should turn it on and do another round), but no-NCQ seems to have > a read advantage. [EMAIL PROTECTED]@#ing bonnie++ overflows and won't print > file > read times; I haven't bothered to fix that yet. > > NCQ seems to have a pretty significant effect on the file operations, > especially deletes. > > Update: added > 7) wcache5noncq - RAID 5 with no NCQ but write cache enabled > 8) wcache5ncq - RAID 5 with NCQ and write cache enabled > > > RAID=5, NCQ > Version 1.03 --Sequential Output-- --Sequential Input- > --Random- >-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec > %CP > raid5ncq 7952M 31688 53 34760 10 25327 4 57908 86 167680 13 292.2 > 0 > raid5ncq 7952M 30357 50 34154 10 24876 4 59692 89 165663 13 285.6 > 0 > raid5noncq7952M 29015 48 31627 9 24263 4 61154 91 185389 14 286.6 > 0 > raid5noncq7952M 28447 47 31163 9 23306 4 60456 89 198624 15 293.4 > 0 > wcache5ncq7952M 32433 54 35413 10 26139 4 59898 89 168032 13 303.6 > 0 > wcache5noncq 7952M 31768 53 34597 10 25849 4 61049 90 193351 14 304.8 > 0 > raid10ncq 7952M 54043 89 110804 32 48859 9 58809 87 142140 12 363.8 > 0 > raid10noncq 7952M 48912 81 68428 21 38906 7 57824 87 146030 12 358.2 > 0 > >--Sequential Create-- Random Create >-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files:max:min/sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec > %CP >16:10:16/64 1351 25 + +++ 941 3 2887 42 31526 96 382 1 >16:10:16/64 1400 18 + +++ 386 1 4959 69 32118 95 570 2 >16:10:16/64 636 8 + +++ 176 0 1649 23 + +++ 245 1 >16:10:16/64 715 12 + +++ 164 0 156 2 11023 32 2161 8 >16:10:16/64 1291 26 + +++ 2778 10 2424 33 31127 93 483 2 >16:10:16/64 1236 26 + +++ 840 3 2519 37 30366 91 445 2 >16:10:16/64 1714 37 + +++ 1652 6 789 11 4700 14 12264 48 >16:10:16/64 634 11 + +++ 1035 3 338 4 + +++ 1349 5 > > raid5ncq,7952M,31688,53,34760,10,25327,4,57908,86,167680,13,292.2,0,16:10:16/64,1351,25,+,+++,941,3,2887,42,31526,96,382,1 > raid5ncq,7952M,30357,50,34154,10,24876,4,59692,89,165663,13,285.6,0,16:10:16/64,1400,18,+,+++,386,1,4959,69,32118,95,570,2 > raid5noncq,7952M,29015,48,31627,9,24263,4,61154,91,185389,14,286.6,0,16:10:16/64,636,8,+,+++,176,0,1649,23,+,+++,245,1 > 
raid5noncq,7952M,28447,47,31163,9,23306,4,60456,89,198624,15,293.4,0,16:10:16/64,715,12,+,+++,164,0,156,2,11023,32,2161,8 > wcache5ncq,7952M,32433,54,35413,10,26139,4,59898,89,168032,13,303.6,0,16:10:16/64,1291,26,+,+++,2778,10,2424,33,31127,93,483,2 > wcache5noncq,7952M,31768,53,34597,10,25849,4,61049,90,193351,14,304.8,0,16:10:16/64,1236,26,+,+++,840,3,2519,37,30366,91,445,2 > raid10ncq,7952M,54043,89,110804,32,48859,9,58809,87,142140,12,363.8,0,16:10:16/64,1714,37,+,+++,1652,6,789,11,4700,14,12264,48 > raid10noncq,7952M,48912,81,68428,21,38906,7,57824,87,146030,12,358.2,0,16:10:16/64,634,11,+,+++,1035,3,338,4,+,+++,1349,5 > > I would try with write-caching enabled. I did. See the "wcache5" lines? > Also, the RAID5/RAID10 you mention seems lik
Re: Why is NCQ enabled by default by libata? (2.6.20)
> I meant you do not allocate the entire disk per raidset, which may alter > performance numbers. No, that would be silly. It does lower the average performance of the large RAID-5 area, but I don't know how ext3fs is allocating the blocks anyway, so > 04:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II > Controller (rev 01) > I assume you mean 3132 right? Yes; did I mistype? 02:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) 03:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) 04:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) > I also have 6 seagates, I'd need to run one > of these tests on them as well, also you took the micro jumper off the > Seagate 400s in the back as well right? Um... no, I don't remember doing anything like that. What micro jumper? It's been a while, but I just double-checked the drive manual and it doesn't mention any jumpers. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add support for ITE887x serial chipsets
Minor point: the chip part numbers are actually IT887x, not ITE887x. I STFW for a data sheet, but didn't have immediate luck. Does anyone know where to find documentation? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67). Yes, I noticed. See http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html for a bug report on the subject from February 2006. That's why this application is still running on 2.4. As I mentioned at the time, the SUS says: (http://opengroup.org/onlinepubs/007908799/xsh/msync.html) "When MS_ASYNC is specified, msync() returns immediately once all the write operations are initiated or queued for servicing." You can argue that putting it on the dirty list constitutes "queued for servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed to start the I/O. Although strict standards-ese parsing says that either branch of an "or" is acceptable, it is a common English language convention that the first alternative is preferred and the second is a fallback. It makes sense in this case: start the write or, if that's not possible (the disk is already busy), queue it for service as soon as the disk is available. They perhaps didn't mandate it this strictly, but that's clearly the intent. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> Suggest you use msync(MS_ASYNC), then > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE). Thank you; I didn't know about that. And I can handle -ENOSYS by falling back to the old behavior. > We can fix your application, and we'll break someone else's. If you can point to an application that it'll break, I'd be a lot more understanding. Nobody did, last year. > I don't think it's solveable, really - the range of applications is so > broad, and the "standard" is so vague as to be useless. I agree that standards are sometimes vague, but that one seemed about as clear as it's possible to be without imposing unreasonably on the file system and device driver layers. What part of "The msync() function writes all modified data to permanent storage locations [...] For mappings to files, the msync() function ensures that all write operations are completed as defined for synchronised I/O data integrity completion." suggests that it's not supposed to do disk I/O? How is that uselessly vague? It says to me that msync's raison d'ĂȘtre is to write data from RAM to stable storage. If an application calls it too often, that's the application's fault just as if it called sync(2) too often. > This is why we've > been extending these things with linux-specific goodies which permit > applications to actually tell the kernel what they want to be done in a > more finely-grained fashion. Well, I still think the current Linux behavior is a bug, but there's a usable (and run-time compatible) workaround that doesn't unreasonably complicate the code, and that's good enough. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
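A minimal sketch of the suggested combination, with the -ENOSYS fallback mentioned above; the helper name and the idea of passing both the mapping address and the matching file offset are assumptions for illustration.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Start writeback of a dirty, page-aligned mapped range without waiting. */
static int start_writeback(int fd, void *addr, off_t offset, size_t len)
{
        /* Keep the portable hint (and the pre-2.6.17 behaviour). */
        if (msync(addr, len, MS_ASYNC) < 0)
                return -1;

        /* Explicitly kick off the I/O where sync_file_range() exists. */
        if (sync_file_range(fd, offset, len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) < 0) {
                if (errno == ENOSYS)
                        return 0;       /* older kernel: fall back to msync() alone */
                return -1;
        }
        return 0;
}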
Re: [patch resend v4] update ctime and mtime for mmaped write
> Linux _will_ write all modified data to permanent storage locations. > Since 2.6.17 it will do this regardless of msync(). Before 2.6.17 you > do need msync() to enable data to be written back. > > But it will not start I/O immediately, which is not a requirement in > the standard, or at least it's pretty vague about that. As I've said before, I disagree, but I'm not going to start a major flame war about it. The most relevant paragraph is: # When MS_ASYNC is specified, msync() returns immediately once all the # write operations are initiated or queued for servicing; when MS_SYNC is # specified, msync() will not return until all write operations are # completed as defined for synchronised I/O data integrity completion. # Either MS_ASYNC or MS_SYNC is specified, but not both. Note two things: 1) In the paragraphs before, what msync does is defined independently of the MS_* flags. Only the time of the return to user space varies. Thus, whatever the delay between calling msync() and the data being written, it should be the same whether MS_SYNC or MS_ASYNC is used. The implementation intended is: - Start all I/O - If MS_SYNC, wait for I/O to complete - Return to user space 2) "all the write operations are initiated or queued for servicing". It is a common convention in English (and most languages, I expect) that an "or" expresses a preference for the first alternative. The second is a permitted alternative if the first is not possible. And "queued for servicing", especially "initiated or queued for servicing", to me implies queuing while waiting for some resource. To have the resource being waited for be a timer expiry seems like rather a cheat to me. It perhaps doesn't break the letter of the standard, but it definitely bends it. It feels like a fiddle. Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished: "I don't expect to write this page any more, so now would be a good time to clean it." It would just make my life easier if the kernel procrastinated less. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> But if you didn't notice until now, then the current implementation > must be pretty reasonable for your use as well. Oh, I definitely noticed. As soon as I tried to port my application to 2.6, it broke - as evidenced by my complaints last year. The current solution is simple - since it's running on dedicated boxes, leave them on 2.4. I've now got the hint on how to make it work on 2.6 (sync_file_range()), so I can try again. But the pressure to upgrade is not strong, so it might be a while. You may recall, this subthread started when I responded to "the only reason to use msync(MS_ASYNC) is to update timestamps" with a counterexample. I still think the purpose of the call is a hint to the kernel that writing to the specified page(s) is complete and now would be a good time to clean them. Which has very little to do with timestamps. Now, my application, which leaves less than a second between the MS_ASYNC and a subsequent MS_SYNC to check whether it's done, broke, but I can imagine similar cases where MS_ASYNC would remain a useful hint to reduce the sort of memory hogging generally associated with "dd if=/dev/zero" type operations. Reading between the lines of the standard, that seems (to me, at least) to obviously be the intended purpose of msync(MS_ASYNC). I wonder if there's any historical documentation describing the original intent behind creating the call. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is NCQ enabled by default by libata? (2.6.20)
> But when writing, what is the difference between queuing multiple tagged > writes, and sending down multiple untagged cached writes that complete > immediately and actually hit the disk later? Either way the host keeps > sending writes to the disk until its buffers are full, and the disk is > constantly trying to commit those buffers to the media in the most > optimal order. Well, theoretically it allows more buffering, without hurting read cacheing. With NCQ, the drive gets the command, and then tells the host when it wants the corresponding data. It can ask for the data in any order it likes, when it's decided which write will be serviced next. So it doesn't have to fill up its RAM with the write data. This leaves more RAM free for things like read-ahead. Another trick, that I know SCSI can do and I expect NCQ can do, is that the drive can ask for the data for a single write in different orders. This is particularly useful for reads, where a drive asked for blocks 100-199 can deliver blocks 150-199 first, then 100-149 when the drive spins around. This is, unfortunately, kind of theoretical. I don't actually know how hard drive cacheing algorithms work, but I assume it's mostly a readahead cache. The host has much more RAM than the drive, so any block that it's read won't be requested again for a long time. So the drive doesn't want to keep that in cache. But any sectors that the drive happens to read nearby requested sectors are worth keeping. I'm not sure it's a big deal, as 32 (tags) x 128K (largest LBA28 write size) is 4M, only half of a typical drive's cache RAM. But it's possible that there's some difference. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
data 73728 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/70:10:62:30:ba/01:00:1c:00:00/40 tag 2 cdb 0x0 data 188416 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:18:d2:31:ba/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:20:d2:32:ba/01:00:1c:00:00/40 tag 4 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:28:d2:33:ba/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:30:d2:34:ba/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:38:d2:35:ba/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5: soft resetting port 14:56:43: ata5: softreset failed (timeout) 14:56:43: ata5: softreset failed, retrying in 5 secs 14:56:48: ata5: hard resetting port 14:57:20: ata5: softreset failed (timeout) 14:57:20: ata5: follow-up softreset failed, retrying in 5 secs 14:57:25: ata5: hard resetting port 14:57:58: ata5: softreset failed (timeout) 14:57:58: ata5: reset failed, giving up 14:57:58: ata5.00: disabled 14:57:58: ata5: EH complete 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965522 14:57:58: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965266 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965010 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964754 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964498 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964130 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481963986 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481941210 14:57:58: md: md5: recovery done. 14:57:58: RAID5 conf printout: 14:57:58: --- rd:6 wd:5 14:57:58: disk 0, o:1, dev:sda4 14:57:58: disk 1, o:1, dev:sdb4 14:57:58: disk 2, o:1, dev:sdc4 14:57:58: disk 3, o:1, dev:sdd4 14:57:58: disk 4, o:0, dev:sde4 14:57:58: disk 5, o:1, dev:sdf4 14:57:58: RAID5 conf printout: 14:57:58: --- rd:6 wd:5 14:57:58: disk 0, o:1, dev:sda4 14:57:58: disk 1, o:1, dev:sdb4 14:57:58: disk 2, o:1, dev:sdc4 14:57:58: disk 3, o:1, dev:sdd4 14:57:58: disk 5, o:1, dev:sdf4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 1/5] hwmon: (core) Inherit power properties to hdev
Quoting Nicolin Chen : The new hdev is a child device related to the original parent hwmon driver and its device. However, it doesn't support the power features, typically being defined in the parent driver. So this patch inherits three necessary power properties from the parent dev to hdev: power, pm_domain and driver pointers. Note that the dev->driver pointer is the place that contains a dev_pm_ops pointer defined in the parent device driver and the pm runtime core also checks this pointer: if (!cb && dev->driver && dev->driver->pm) Signed-off-by: Nicolin Chen --- Changelog v2->v3: * N/A v1->v2: * Added device pointers drivers/hwmon/hwmon.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c index 975c95169884..14cfab64649f 100644 --- a/drivers/hwmon/hwmon.c +++ b/drivers/hwmon/hwmon.c @@ -625,7 +625,12 @@ __hwmon_device_register(struct device *dev, const char *name, void *drvdata, hwdev->name = name; hdev->class = &hwmon_class; hdev->parent = dev; - hdev->of_node = dev ? dev->of_node : NULL; + if (dev) { + hdev->driver = dev->driver; + hdev->power = dev->power; + hdev->pm_domain = dev->pm_domain; + hdev->of_node = dev->of_node; + } We'll need to dig into this more; I suspect it may be inappropriate to do this. With this change, every hwmon driver supporting (runtime ?) suspend/resume will have the problem worked around in #5, and that just seems wrong. Guenter hwdev->chip = chip; dev_set_drvdata(hdev, drvdata); dev_set_name(hdev, HWMON_ID_FORMAT, id); -- 2.17.1
[PATCH] ftrace: Remove unused list 'ftrace_direct_funcs'
From: "Dr. David Alan Gilbert" Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped using 'ftrace_direct_funcs' (and the associated struct ftrace_direct_func). Remove them. Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE enabled) Signed-off-by: Dr. David Alan Gilbert --- include/linux/ftrace.h | 1 - kernel/trace/ftrace.c | 8 2 files changed, 9 deletions(-) diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 54d53f345d149..b01cca36147ff 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -83,7 +83,6 @@ static inline void early_trace_init(void) { } struct module; struct ftrace_hash; -struct ftrace_direct_func; #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_MODULES) && \ defined(CONFIG_DYNAMIC_FTRACE) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index da1710499698b..b18b4ece3d7c9 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -5318,14 +5318,6 @@ ftrace_set_addr(struct ftrace_ops *ops, unsigned long *ips, unsigned int cnt, #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -struct ftrace_direct_func { - struct list_headnext; - unsigned long addr; - int count; -}; - -static LIST_HEAD(ftrace_direct_funcs); - static int register_ftrace_function_nolock(struct ftrace_ops *ops); /* -- 2.45.0
[PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'
From: "Dr. David Alan Gilbert" It doesn't look like this was ever used. Build tested only. Signed-off-by: Dr. David Alan Gilbert --- drivers/virt/acrn/irqfd.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c index d4ad211dce7a3..346cf0be4aac7 100644 --- a/drivers/virt/acrn/irqfd.c +++ b/drivers/virt/acrn/irqfd.c @@ -16,8 +16,6 @@ #include "acrn_drv.h" -static LIST_HEAD(acrn_irqfd_clients); - /** * struct hsm_irqfd - Properties of HSM irqfd * @vm:Associated VM pointer -- 2.45.0
[PATCH] ftrace: Remove unused global 'ftrace_direct_func_count'
From: "Dr. David Alan Gilbert" Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped setting the 'ftrace_direct_func_count' variable, but left it around. Clean it up. Signed-off-by: Dr. David Alan Gilbert --- include/linux/ftrace.h | 2 -- kernel/trace/fgraph.c | 11 --- kernel/trace/ftrace.c | 1 - 3 files changed, 14 deletions(-) diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index b01cca36147ff..e3a83ebd1b333 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -413,7 +413,6 @@ struct ftrace_func_entry { }; #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -extern int ftrace_direct_func_count; unsigned long ftrace_find_rec_direct(unsigned long ip); int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr); int unregister_ftrace_direct(struct ftrace_ops *ops, unsigned long addr, @@ -425,7 +424,6 @@ void ftrace_stub_direct_tramp(void); #else struct ftrace_ops; -# define ftrace_direct_func_count 0 static inline unsigned long ftrace_find_rec_direct(unsigned long ip) { return 0; diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index c83c005e654e3..a130b2d898f7c 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -125,17 +125,6 @@ int function_graph_enter(unsigned long ret, unsigned long func, { struct ftrace_graph_ent trace; -#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS - /* -* Skip graph tracing if the return location is served by direct trampoline, -* since call sequence and return addresses are unpredictable anyway. -* Ex: BPF trampoline may call original function and may skip frame -* depending on type of BPF programs attached. -*/ - if (ftrace_direct_func_count && - ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE)) - return -EBUSY; -#endif trace.func = func; trace.depth = ++current->curr_ret_depth; diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index b18b4ece3d7c9..adf34167c3418 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2538,7 +2538,6 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec) /* Protected by rcu_tasks for reading, and direct_mutex for writing */ static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH; static DEFINE_MUTEX(direct_mutex); -int ftrace_direct_func_count; /* * Search the direct_functions hash to see if the given instruction pointer -- 2.45.0
Re: [RFC] New kernel-message logging API
> I don't know. Compare the following two lines: > > printk(KERN_INFO "Message.\n"); > kprint_info("Message."); > > By dropping the lengthy macro (it's not like it's going to change > while we're running anyway, so why not make it a part of the function > name?) and the final newline, we actually end up with a net decrease > in line length. Agreed. In fact, you may want to write a header that implements the kprint_ functions in terms of printk for out-of-core driver writers to incorporate into their code bases, so they can upgrade their API while maintaining backward compatibility. (If it were me, I'd also give it a very permissive license, like outright public domain, to encourage use.) > I thought it would be nice to have something that looks familiar, > since that would ease an eventual transition. klog is a valid > alternative, but isn't kp a bit cryptic? Well, in context: kp_info("Message."); Even the "kp_" prefix is actually pretty unnecessary. It's "info" and a human-readable string that make it recognizable as a log message. Another reason to keep it short is just that it's going to get typed a LOT. Anyway, just MHO. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
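A sketch of what such a compatibility header might look like, assuming the kprint_<level>() names from the RFC being discussed (they are not an existing kernel API) and the usual printk severity prefixes; the trailing newline is appended by the macro, which requires the format string to be a literal, as it almost always is in printk-style calls.

/*
 * kprint-compat.h: thin printk() wrappers using the kprint_* spelling
 * proposed in the RFC, so out-of-tree code could adopt the new names
 * early while still building against kernels that only have printk().
 */
#ifndef _KPRINT_COMPAT_H
#define _KPRINT_COMPAT_H

#include <linux/kernel.h>

#define kprint_err(fmt, ...)     printk(KERN_ERR     fmt "\n", ##__VA_ARGS__)
#define kprint_warning(fmt, ...) printk(KERN_WARNING fmt "\n", ##__VA_ARGS__)
#define kprint_notice(fmt, ...)  printk(KERN_NOTICE  fmt "\n", ##__VA_ARGS__)
#define kprint_info(fmt, ...)    printk(KERN_INFO    fmt "\n", ##__VA_ARGS__)
#define kprint_debug(fmt, ...)   printk(KERN_DEBUG   fmt "\n", ##__VA_ARGS__)
/* ...and so on for whatever other levels the final API settles on. */

#endif /* _KPRINT_COMPAT_H */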
Re: [RFC] New kernel-message logging API
>> Even the "kp_" prefix is actually pretty unnecessary. It's "info" >> and a human-readable string that make it recognizable as a log message. > While I agree a prefix isn't necessary, info, warn, err > are already frequently #define'd and used. > > kp_ isn't currently in use. > > $ egrep -r -l --include=*.[ch] > "^[[:space:]]*#[[:space:]]*define[[:space:]]+(info|err|warn)\b" * | wc -l > 29 Sorry for being unclear. I wasn't seriously recommending no prefix, due to name collisions (exactly your point), but rather saying that no prefix is necessary for human understanding. Something to avoid the ambiguity is still useful. I was just saying that it can be pretty much anything withouyt confusing the casual reader. We're in violent agreement, I just didn't say it very well the first time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] New kernel-message logging API (take 2)
> Example: { > struct kprint_block out; > kprint_block_init(&out, KPRINT_DEBUG); > kprint_block(&out, "Stack trace:"); > > while(unwind_stack()) { > kprint_block(&out, "%p %s", address, symbol); > } > kprint_block_flush(&out); > } Assuming that kprint_block_flush() is a combination of kprint_block_printit() and kprint_block_abort(), you could make a macro wrapper for this to preclude leaks: #define KPRINT_BLOCK(block, level, code) \ do { \ struct kprint_block block; \ kprint_block_init(&block, KPRINT_##level); \ do { \ code ; \ kprint_block_printit(&block); \ } while (0); \ kprint_block_abort(&block); \ } while(0) The inner do { } while(0) region is so you can abort with "break". (Or you can split it into KPRINT_BEGIN() and KPRINT_END() macros, if that works out to be cleaner.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
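A usage sketch for the wrapper above, reusing the stack-trace example from the quoted proposal; unwind_stack(), address and symbol are placeholders from that example, and kprint_block() itself is the RFC API rather than an existing kernel call. A bare break at the top level of the passed code skips kprint_block_printit(), so a partially built message is abandoned rather than printed.

static void dump_trace_example(void)
{
        KPRINT_BLOCK(out, DEBUG,
                kprint_block(&out, "Stack trace:");
                while (unwind_stack())
                        kprint_block(&out, "%p %s", address, symbol)
        );
}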
2.6.23-rc8 network problem. Mem leak? ip1000a?
Uniprocessor Althlon 64, 64-bit kernel, 2G ECC RAM, 2.6.23-rc8 + linuxpps (5.0.0) + ip1000a driver. (patch from http://marc.info/?l=linux-netdev&m=118980588419882) After a few hours of operation, ntp loses the ability to send packets. sendto() returns -EAGAIN to everything, including the 24-byte UDP packet that is a response to ntpq. -EAGAIN on a sendto() makes me think of memory problems, so here's meminfo at the time: ### FAILED state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 15332 kB Buffers:665608 kB Cached: 18212 kB SwapCached: 0 kB Active: 380384 kB Inactive: 355020 kB SwapTotal: 5855208 kB SwapFree: 5854552 kB Dirty: 28504 kB Writeback: 0 kB AnonPages: 51608 kB Mapped: 11852 kB Slab: 1285348 kB SReclaimable: 152968 kB SUnreclaim:1132380 kB PageTables: 3888 kB NFS_Unstable:0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 590528 kB VmallocTotal: 34359738367 kB VmallocUsed:265628 kB VmallocChunk: 34359472059 kB Killing and restarting ntpd gets it running again for a few hours. Here's after about two hours of successful operation. (I'll try to remember to run slabinfo before killing ntpd next time.) ### WORKING state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 20252 kB Buffers:242688 kB Cached: 41556 kB SwapCached:200 kB Active: 285012 kB Inactive: 147348 kB SwapTotal: 5855208 kB SwapFree: 5854212 kB Dirty: 36 kB Writeback: 0 kB AnonPages: 148052 kB Mapped: 12756 kB Slab: 1582512 kB SReclaimable: 134348 kB SUnreclaim:1448164 kB PageTables: 4500 kB NFS_Unstable:0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 689956 kB VmallocTotal: 34359738367 kB VmallocUsed:265628 kB VmallocChunk: 34359472059 kB # /usr/src/linux/Documentation/vm/slabinfo Name Objects ObjsizeSpace Slabs/Part/Cpu O/S O %Fr %Ef Flg :016 1478 1624.5K 6/3/1 256 0 50 96 * :024 170 24 4.0K 1/0/1 170 0 0 99 * :032 1339 3245.0K 11/2/1 128 0 18 95 * :040 102 40 4.0K 1/0/1 102 0 0 99 * :064 5937 64 413.6K 101/15/1 64 0 14 91 * :07256 72 4.0K 1/0/1 56 0 0 98 * :088 6946 88 618.4K151/0/1 46 0 0 98 * :096 23851 96 2.5M 616/144/1 42 0 23 90 * :128 730 128 114.6K 28/6/1 32 0 21 81 * :136 232 13636.8K 9/6/1 30 0 66 85 * :192 474 19298.3K 24/4/1 21 0 16 92 * :256 1385376 256 354.6M 86587/0/1 16 0 0 99 * :32012 304 4.0K 1/0/1 12 0 0 89 *A :384 359 384 180.2K44/23/1 10 0 52 76 *A :512 1384316 512 708.7M 173040/1/18 0 0 99 * :64072 61653.2K 13/5/16 0 38 83 *A :704 1870 696 1.3M170/0/1 11 1 0 93 *A :0001024 4271024 454.6K111/9/14 0 8 96 * :0001472 1501472 245.7K 30/0/15 1 0 89 * :00020481589912048 325.7M 39759/25/14 1 0 99 * :0004096514096 245.7K 30/9/12 1 30 85 * Acpi-State 51 80 4.0K 1/0/1 51 0 0 99 anon_vma 1032 1628.6K 7/5/1 170 0 71 57 bdev_cache 43 72036.8K 9/1/15 0 11 83 Aa blkdev_requests 42 28812.2K 3/0/1 14 0 0 98 buffer_head 59173 10411.1M2734/1690/1 39 0 61 54 a cfq_io_context 223 15240.9K 10/6/1 26 0 60 82 dentry 98641 19219.7M 4813/274/1 21 0 5 96 a ext3_inode_cache115690 68886.3M 10545/77/1 11 1 0 92 a file_lock_cache 23 168 4.0K 1/0/1 23 0 0 94 idr_layer_cache118 52869.6K 17/1/17 0 5 89 inode_cache 1365 528 798.7K195/0/17 0 0 90 a kmalloc-131072 1 131072 131.0K 1/0/11 5 0 100 kmalloc-163848 16384 131.0K 8/0/11 2 0 100 kmalloc-327681 3276832.7K 1/0/11 3 0 100 kmalloc-8 1535 812.2K 3/1/1 512 0 33 99 kmalloc-819