Re: [ 00/45] 3.0.88-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.0.88 release. There are 45 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:54:53 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.0.88-rc1.gz and the diffstat can be found below.

We have additional build failures.

Total builds: 54
Total build errors: 20

Link: http://desktop.roeck-us.net/buildlogs/v3.0.87-45-g367423c.2013-07-27.03:09:16

Previously:
Total builds: 54
Total build errors: 17

The additional failures are i386/allmodconfig, i386/allyesconfig, and mips/malta. I don't have time to track down the culprit tonight (I am still Down Under ;). I hope I can do it tomorrow.

Guenter
Re: [ 00/59] 3.4.55-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.4.55 release. There are 59 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:48:22 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.4.55-rc1.gz and the diffstat can be found below.

Cross platform build looks good:

Total builds: 58
Total build errors: 8

Details: http://desktop.roeck-us.net/buildlogs/v3.4.54-59-g956f996.2013-07-27.08:29:14

Guenter
Re: [ 00/79] 3.10.4-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.10.4 release. There are 79 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Sun Jul 28 20:45:08 UTC 2013. Anything received after that time might be too late. The whole patch series can be found in one patch at: kernel.org/pub/linux/kernel/v3.0/stable-review/patch-3.10.4-rc1.gz and the diffstat can be found below.

Cross build is ok:

Total builds: 64
Total build errors: 3

Details: http://desktop.roeck-us.net/buildlogs/v3.10/v3.10.3-79-g6d0cdc6.2013-07-27.15:42:05

Guenter
Re: [ 000/103] 3.10.3-stable review
Quoting Greg Kroah-Hartman:
> This is the start of the stable review cycle for the 3.10.3 release. There are 103 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Thu Jul 25 22:01:33 UTC 2013. Anything received after that time might be too late.

Build results below. Same as with the previous release.

Guenter

Build x86_64:defconfig passed
Build x86_64:allyesconfig passed
Build x86_64:allmodconfig passed
Build x86_64:allnoconfig passed
Build x86_64:alldefconfig passed
Build i386:defconfig passed
Build i386:allyesconfig passed
Build i386:allmodconfig passed
Build i386:allnoconfig passed
Build i386:alldefconfig passed
Build mips:defconfig passed
Build mips:bcm47xx_defconfig passed
Build mips:bcm63xx_defconfig passed
Build mips:nlm_xlp_defconfig passed
Build mips:ath79_defconfig passed
Build mips:ar7_defconfig passed
Build mips:fuloong2e_defconfig passed
Build mips:e55_defconfig passed
Build mips:cavium_octeon_defconfig passed
Build mips:powertv_defconfig passed
Build mips:malta_defconfig passed
Build powerpc:defconfig passed
Build powerpc:allyesconfig failed
Build powerpc:allmodconfig passed
Build powerpc:chroma_defconfig passed
Build powerpc:maple_defconfig passed
Build powerpc:ppc6xx_defconfig passed
Build powerpc:mpc83xx_defconfig passed
Build powerpc:mpc85xx_defconfig passed
Build powerpc:mpc85xx_smp_defconfig passed
Build powerpc:tqm8xx_defconfig passed
Build powerpc:85xx/sbc8548_defconfig passed
Build powerpc:83xx/mpc834x_mds_defconfig passed
Build powerpc:86xx/sbc8641d_defconfig passed
Build arm:defconfig passed
Build arm:allyesconfig failed
Build arm:allmodconfig failed
Build arm:exynos4_defconfig passed
Build arm:multi_v7_defconfig passed
Build arm:kirkwood_defconfig passed
Build arm:omap2plus_defconfig passed
Build arm:tegra_defconfig passed
Build arm:u8500_defconfig passed
Build arm:at91sam9rl_defconfig passed
Build arm:ap4evb_defconfig passed
Build arm:bcm_defconfig passed
Build arm:bonito_defconfig passed
Build arm:pxa910_defconfig passed
Build arm:mvebu_defconfig passed
Build m68k:defconfig passed
Build m68k:m5272c3_defconfig passed
Build m68k:m5307c3_defconfig passed
Build m68k:m5249evb_defconfig passed
Build m68k:m5407c3_defconfig passed
Build m68k:sun3_defconfig passed
Build m68k:m5475evb_defconfig passed
Build sparc:defconfig passed
Build sparc:sparc64_defconfig passed
Build xtensa:defconfig passed
Build xtensa:iss_defconfig passed
Build microblaze:mmu_defconfig passed
Build microblaze:nommu_defconfig passed
Build blackfin:defconfig passed
Build parisc:defconfig passed
---
Total builds: 64
Total build errors: 3
DMA and my Maxtor drive
I get this when DMA is enabled:

Oct 20 15:39:07 cr753963-a kernel: hdb: timeout waiting for DMA
Oct 20 15:39:07 cr753963-a kernel: hdb: irq timeout: status=0x6e { DriveReady DeviceFault DataRequest CorrectedError Index }
ide0: reset: success
Oct 20 15:39:07 cr753963-a kernel: hdb: DMA disabled
Oct 20 15:39:07 cr753963-a kernel: ide0: reset: success

It only happens when lots of data is being transferred to, or compiled on, the drive. The drive status is this:

/dev/hdb:
Model=Maxtor 82560A4, FwRev=AA8Z2726, SerialNo=C40LTQGA
Config={ Fixed }
RawCHS=4962/16/63, TrkSize=0, SectSize=0, ECCbytes=20
BuffType=DualPortCache, BuffSize=256kB, MaxMultSect=16, MultSect=off
CurCHS=4962/16/63, CurSects=5001696, LBA=yes, LBAsects=5001728
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 *mdma2
2.4.0pre9 and an analog joystick
I just switched from 2.2.17pre9 to 2.4.0pre9, and my joystick won't work anymore. It's an analog joystick connected to an AudioPCI sound card. I can get it initialized, but I cannot access it; it seems it does not map it to js0.

Oct 24 23:15:21 cr753963-a kernel: gameport0: NS558 ISA at 0x200 size 8 speed 917 kHz
Oct 24 23:15:31 cr753963-a kernel: input0: Analog 2-axis 4-button joystick at gameport0.0 [TSC timer, 463 MHz clock, 1193 ns res]

and I can't get any further :[

Dave
Re: 2.4.0pre9 and an analog joystick
On Tue, 24 Oct 2000, Brian Gerst wrote:

> [EMAIL PROTECTED] wrote:
> >
> > I just switched from 2.2.17pre9 to 2.4.0pre9, and my joystick won't work
> > anymore. It's an analog joystick connected to an AudioPCI sound card. I
> > can get it initialized, but I can not access it, it seems it does not map
> > it to js0
> >
> > Oct 24 23:15:21 cr753963-a kernel: gameport0: NS558 ISA at 0x200 size 8
> > speed 917 kHz
> > Oct 24 23:15:31 cr753963-a kernel: input0: Analog 2-axis 4-button joystick
> > at gameport0.0 [TSC timer, 463 MHz clock, 1193 ns res]
> >
> > and I can't get any further :[
> >
> > Dave
>
> insmod joydev

Ok, I can get it to work with modules, but it will not work if it's directly compiled into the kernel. Is this a known bug?

Dave
Networking problems with 2.4.0 and 2.4.1
Since using kernels 2.4.0 and 2.4.1 I have been having very weird problems with my network. Suddenly the network connection drops and dies until I take down the interface and then successfully ping a machine. This is the only thing that I can get out of syslog that is relevant:

Jan 31 14:17:29 cr753963-a kernel: eth1: 21143 10baseT link beat good.
Jan 31 14:17:50 cr753963-a kernel: NETDEV WATCHDOG: eth1: transmit timed out
Jan 31 14:17:50 cr753963-a kernel: eth1: 21041 transmit timed out, status fc6908c5, CSR12 01c8, CSR13 ef05, CSR14 ff3f, resetting...
Jan 31 14:17:50 cr753963-a kernel: eth1: 21143 100baseTx sensed media.

The only problem is that eth1 is a 10mbit card. This also happens when I remove eth1 and only have eth0 in the computer; I put eth1 in to see if it would fix the problem. Relevant info:

Jan 30 21:26:37 cr753963-a kernel: eth1: Digital DC21041 Tulip rev 33 at 0xe400, 21041 mode, 00:E0:29:11:0F:3A, IRQ 10.
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 Media table, default media (10baseT).
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 media #0, 10baseT.
Jan 30 21:26:37 cr753963-a kernel: eth1: 21041 media #4, 10baseT-FD.
Jan 30 21:26:37 cr753963-a kernel: ne.c: ISAPnP reports Generic PNP at i/o 0x220, irq 5.
Jan 30 21:26:37 cr753963-a kernel: ne.c:v1.10 9/23/94 Donald Becker ([EMAIL PROTECTED])
Jan 30 21:26:37 cr753963-a kernel: Last modified Nov 1, 2000 by Paul Gortmaker
Jan 30 21:26:37 cr753963-a kernel: NE*000 ethercard probe at 0x220: 00 40 f6 24 34 08
Jan 30 21:26:37 cr753963-a kernel: eth0: NE2000 found at 0x220, using IRQ 5.
Re: Fortuna
entropy in the input samples from the "holdover" material, the problem would go away, but that's an entropy measurement problem! Until this cloud is dissipated by further analysis, it's not possible to say "this is shiny and new and better; let's use it!" in good conscience.
Re: Fortuna
> Waiting for 256 bits of entropy before outputting data is a good goal.
> Problem becomes how do you measure entropy in a reliable way? This had
> me lynched last time I asked it so I'll stop right there.

It's a problem. Also, with the current increase in wireless keyboards and mice, that source should possibly not be considered secure any more.

On the other hand, clock jitter in GHz clocks is a *very* rich source of seed material. One rdtsc per timer tick, even accounted for at 0.1 bit per rdtsc (I think reality is more like 3-5 bits), would keep you in clover.

> I'll not make any claim that random-fortuna.c should be mv'd to random.c, the
> patch given allows people to kernel config it in under the Cryptographic
> Options menu. Perhaps a disclaimer in the help menu would be in order to
> inform users that Fortuna has profound consequences for those expecting
> Info-theoretic /dev/random?

I don't mean to disparage your efforts, but then what's the upside to the resultant code maintenance hassles? Other than hack value, what's the advantage of even offering the choice? An option like that is justified when the two options have value for different people and it's not possible to build a single merged solution that satisfies both markets.

Also, Ted very deliberately made /dev/random non-optional so it could be relied on without a lot of annoying run-time testing. Would a separate /dev/fortuna be better?

> The case where an attacker has some small amount of unknown in the pool is a
> problem that affects BOTH random-fortuna.c and random.c (and any other
> replacement for that matter). Just an FYI.

Yes, but entropy estimation is supposed to deal with that. If the attacker is never allowed enough information out of the pool to distinguish various input states, the output is secure.

> As for the "shifting property" problem of an attacker controlling some input
> to the pooling system, I've tried to get around this:

(Code that varies the pool things get added to based on r->key[i++ & 7])

> The add_entropy_words() function uses the first 8 bytes of the central
> pool to aggravate the predictability of where entropy goes. It's still a
> linear progression until the central pool is re-keyed, then you don't know
> where it went. The central pool is reseeded every 0.1ms.

You need to think more carefully. You're just elaborating it without adding security. Yes, the attacker *does* know!

The entire underlying assumption is that the attacker knows the entire initial state of the pool, owing to some compromise or another. The assumption is that the only thing the attacker doesn't know is the exact value of the incoming seed material. (But the attacker does have an excellent model of its distribution.) Given that, the attacker knows the initial value of the key[] array, and thus knows the order in which the pools are fed. The question then arises, come next reseed, can the attacker (with a small mountain of computers, able to brute-force 40-bit problems in the blink of an eye) infer the *complete* state of the pool from the output.

The confusion is over the word "random". In programming jargon, the word is most often used to mean "arbitrary" or "the choice doesn't matter". But that doesn't capture the idea of "unpredictable to a skilled and determined opponent" that is needed in /dev/random. So while the contents of key[] may be "random-looking", they're not *unpredictable*, any more than the digits of pi are.
The attacker just has to, after each reseeding, brute-force the seed bits fed to the (predictable) pools chosen to mix in, and then use that information to infer the seeds added to the not-selected pools. If the attacker's uncertainty about the state of some of the subpools increases to the catastrophic reseeding level, then the Fortuna design goal is achieved. If the seed samples are independent, then it's easy to see that the schedule works. But if the seed samples are correlated, it gets a lot trickier. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
> What if someone accesses the seed file directly before the machine
> boots? Well, either (a) they have broken root already, or (b) have
> direct access to the disk. In either case, the attacker with such
> powers can just as easily trojan an executable or a kernel, and
> you've got far worse problems to worry about than the piddling worries
> of (cue Gilbert and Sullivan overly dramatic music) the
> dreaded state-extension attack.

So the only remaining case is where an attacker can read the random seed file before boot but can't install any trojans. Which seems like an awfully small case. (Although in some scenarios, passive attacks are much easier to mount than active ones.)

> Actually, if you check the current random.c code, you'll see that it
> has catastrophic reseeding in its design already.

Yes, I know. Fortuna's claim to fame is that it tries to achieve that without explicitly measuring entropy.

> My big concern with Fortuna is that it really is the result of
> cryptographic masturbation. It fundamentally assumes that crypto
> primitives are secure (when the recent break of MD4, MD5, and now SHA1
> should have been a warning that this is a Bad Idea (tm)), and makes
> assumptions that have no real world significance (assume the attacker
> has enough privileges that he might as well be superuser and can
> trojan your system to a fare-thee-well now, if you can't recover
> from a state extension attack, your random number generator is fatally
> flawed.)

I'm not a big fan of Fortuna either, but the issues are separate. I agree that trusting crypto primitives that you don't have to is a bad idea. If my application depends on SHA1 being secure, then I might as well go ahead and use SHA1 in my PRNG. But a kernel service doesn't know what applications are relying on.

(Speaking of which, perhaps it's time, in light of the breaking of MD5, to revisit the cut-down MD4 routine used in the TCP ISN selection? I haven't read the MD5 & SHA1 papers in enough detail to understand the flaw, but perhaps some defenses could be erected?)

But still, all else being equal, an RNG resistant to a state extension attack *is* preferable to one that's not. And the catastrophic reseeding support in /dev/random provides exactly that feature.

What Fortuna tries to do is sidestep the hard problem of entropy measurement. And that's very admirable. It's a very hard thing to do in general, and the current technique of heuristics plus a lot of derating is merely adequate. If a technique could be developed that didn't need an accurate entropy measurement, then things would be much better.

> In addition, Fortuna is profligate with entropy, and wastes it in
> order to be able to make certain claims of being able to recover from a
> state-extension attack. Hopefully everyone agrees that entropy
> collected from the hardware is precious (especially if you don't have
> a special-purpose hardware RNG), and shouldn't be wasted. Wasting
> collected entropy for no benefit, only to protect against a largely
> theoretical attack --- where if a bad guy has enough privileges to
> compromise your RNG state, there are far easier ways to compromise
> your entire system, not just the RNG --- is Just Stupid(tm).

Just to be clear, I don't remember it ever throwing entropy away, but it hoards some for years, thereby making it effectively unavailable. Any catastrophic reseeding solution has to hold back entropy for some time.
And I think that, even in the absence of special-purpose RNG hardware, synchronization jitter on modern GHz+ CPUs is a fruitful source of entropy.
Re: Fortuna
r, and can be changed without affecting the subpool structure that is Fortuna's real novel contribution. That was just what Niels and Bruce came up with to make the whole thing concrete.
Re: Fortuna
> And the argument that "random.c doesn't rely on the strength of crypto
> primitives" is kinda lame, though I see where you're coming from.
> random.c's entropy mixing and output depends on the (endian incorrect)
> SHA-1 implementation hard coded in that file to be pre-image resistant.
> If that fails (and a few other things) then it's broken.

/dev/urandom depends on the strength of the crypto primitives. /dev/random does not. All it needs is a good uniform hash. Do a bit of reading on the subject of "unicity distance".

(And as for the endianness of the SHA-1, are you trying to imply something? Because it makes zero difference, and reduces the code size and execution time. Which is obviously a Good Thing.)

As for hacking Fortuna in, could you give a clear statement of what you're trying to achieve? Do you like:
- The neat name,
- The strong ciphers used in the pools, or
- The multi-pool reseeding strategy, or
- Something else?

If you're doing it just for hack value, or to learn how to write a device driver or whatever, then fine. But if you're proposing it as a mainline patch, then could we discuss the technical goals? I don't think anyone wants to draw and quarter *you*, but your code is going to get some extremely critical examination.
Re: Fortuna
down to very low rates, that it sequesters some entropy for literally years. Ted thinks that's inexcusable, and I can't really disagree. This can be fixed to a significant degree by tweaking the number of subpools. 3) Fortuna's design doesn't actually *work*. The authors' analysis only works in the case that the entropy seeds are independent, but forgot to state the assumption. Some people reviewing the design don't notice the omission. It's that assumption which lets to "divide up" the seed material among various sub-pools. Without it, seed information leaks from the sequestered sub-pools to the more exposed ones, decreasing the "value" of the sequestered pools. I've shown a contrived pathological example, but I haven't managed to figure out how to characterize the leakage in a more general way. But let me give a realistic example. Again, suppose we have an entropy source that delivers one fresh random bit each time it is sampled. But suppose that rather than delivering a bare bit, it delivers the running sum of the bits. So adjacent samples are either the same or differ by +1. This seems to me an extremely plausible example. Consider a Fortuna-like thing with two pools. The first pool is seeded with n, then the second with n+b0, then the first again with n+b0+b1. n is the arbitrary starting count, while b0 and b1 are independent random bits. Assuming that an attacker can see the first pool, they can find n. After the second step, their uncertainty about the second pool is 1 bit, the value of b0. But the third step is interesting. The attacker can see the value of b0+b1. If the sum is 0 or 2, the value of b0 is determined uniquely. Only in the case that b0+b1 = 1 is there uncertainty. So we have only *half* a bit of uncertainty (one bit, half of the time) in the second pool. Where did the missing entropy go? Well, remember the Shannon formula for entropy, H(p_1,...,p_n) = - Sum(p_i * log(p_i)). If the log is to the base 2, the result is in bits. Well, p_0 = 1/4, p_1 = 1/2, and p_2 = 1/4. The logs of those are -2, -1, and -2, respectively. So the sum works out to 2 * 1/4 + 1 * 1/2 + 2 * 1/4 = 1.5. Half a bit of entropy has leaked from the second pool back into the first! I probably just don't have enough mathematical background, but I don't currently know how to bound this leakage. In pathological cases, *all* of the entropy leaks into the lowest-order pool, at which point the whole elaborate structure of Fortuna is completely useless. *That* is my big problem with Fortuna. If someone can finish the analysis and actually bound the leakage, then we can construct something that works. But I've pushed the idea around for a while and not figured it out. > I'll take my patch and not bother you anymore. I'm sure I've taken a > lot of your time as it is. And you've spent a lot of time preparing that patch. It's not a bad idea to revisit the ideas occasionally, but let's talk about the real *meat* of the issue. If you think my analysis of Fortuna's issues above is flawed, please say so! If you disagree about the importance of the issues, that's worth discussing too, although I can't promise that such a difference of opinions will ever be totally resolved. But arguing about the relative importance of good and bad points is meaningful. Ideally, we manage to come up with a solution that has all the good points. The only thing that's frustrating is discussing it with someone who doesn't even seem to *see* the issues. 
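To make the half-bit arithmetic above easy to check, here is a tiny standalone C program (mine, not part of the original mail); it computes the Shannon entropy of b0+b1 and the residual uncertainty left in the second pool. Build with -lm.

/*
 * For two independent fair bits b0 and b1, the observable sum b0+b1 is
 * 0, 1 or 2 with probabilities 1/4, 1/2, 1/4.  Its entropy is 1.5 bits,
 * so only half a bit of uncertainty about b0 survives once b0+b1 is seen.
 */
#include <math.h>
#include <stdio.h>

static double shannon_bits(const double *p, int n)
{
	double h = 0.0;
	int i;

	for (i = 0; i < n; i++)
		if (p[i] > 0.0)
			h -= p[i] * log2(p[i]);
	return h;
}

int main(void)
{
	double p_sum[3] = { 0.25, 0.5, 0.25 };	/* P(b0+b1 = 0), P(1), P(2) */
	double h_sum = shannon_bits(p_sum, 3);

	printf("H(b0+b1)      = %.2f bits\n", h_sum);		/* 1.50 */
	printf("H(b0 | b0+b1) = %.2f bits\n", 2.0 - h_sum);	/* 0.50 */
	return 0;
}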
> Not to sound like a "I'm taking my ball and going home" - just explaining > that I like the Fortuna design, I think it's elegant, I want it for my > systems. GPL requires I submit changes back, so I did with the unpleasant > side-dish of my opinion on random.c. Actually, the GNU GPL doesn't. It only requires that you give out the source if and when you give out the binary. You can make as many private changes as you like. (Search debian-legal for "desert island test".) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
>> First, a reminder that the design goal of /dev/random proper is >> information-theoretic security. That is, it should be secure against >> an attacker with infinite computational power. > I am skeptical. > I have never seen any convincing evidence for this claim, > and I suspect that there are cases in which /dev/random fails > to achieve this standard. I'm not sure which claim you're skeptical of. The claim that it's a design goal, or the claim that it achieves it? I'm pretty sure that's been the *goal* since the beginning, and it says so in the comments: * Even if it is possible to * analyze SHA in some clever way, as long as the amount of data * returned from the generator is less than the inherent entropy in * the pool, the output data is totally unpredictable. That's basically the information-theoretic definition, or at least alluding to it. "We're never going to give an attacker the unicity distance needed to *start* breaking the crypto." The whole division into two pools was because the original single-pool design allowed (information-theoretically) deriving previous /dev/random output from subsequent /dev/urandom output. That's discussed in section 5.3 of the paper you cited, and has been fixed. There's probably more discussion of the subject in linux-kernel around the time that change went in. Whether the goal is *achieved* is a different issue. random.c tries pretty hard, but makes some concessions to practicality, relying on computational security as a backup. (But suggestions as to how to get closer to the goal are still very much appreciated!) In particular, it is theoretically possible for an attacker to exploit knowledge of the state of the pool and the input mixing transform to feed in data that permutes the pool state to cluster in SHA1 collisions (thereby reducing output entropy), or to use the SHA1 feedback to induce state collisions (therby reducing pool entropy). But that seems to bring whole new meaning to the word "computationally infeasible", requiring first preimage solutions over probability distributions. Also, the entropy estimation may be flawed, and is pretty crude, just heavily derated for safety. And given recent developments in keyboard skiffing, and wireless keyboard deployment, I'm starting to think that the idea (taken from PGP) of using the keyboard and mouse as an entropy source is one whose time is past. Given current processor clock rates and the widespread availability of high-resolution timers, interrupt synchronization jitter seems like a much more fruitful source. I think there are many bits of entropy in the lsbits of the RDTSC time of interrupts, even from the periodic timer interrupt! Even derating that to 0.1 bit per sample, that's still a veritable flood of seed material. /dev/random has an even more important design goal of being universally available; it should never cost enough to make disabling it attractive. If this conflicts with information-theoretic security, the latter will be compromised. But if a practical information-theoretic /dev/random is (say) just too bulky for embedded systems, perhaps making a scaled-back version available for such hosts (as a config option) could satisfy both goals. Ted, you wrote the thing in the first place; is my summary of the goals correct? Would you like comment patches to clarify any of this? Thank you for pointing out the paper; Appendix A is particularly interesting. And the [BST03] reference looks *really* nice! 
I haven't finished it yet, but based on what I've read so far, I'd like to *strongly* recommend that any would-be /dev/random hackers read it carefully. It can be found at http://www.wisdom.weizmann.ac.il/~tromer/papers/rng.pdf

Happily, it *appears* to confirm the value of the LFSR-based input mixing function. Although the suggested construction in section 4.1 is different, and I haven't seen if the proof can be extended.
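As a toy illustration of the interrupt-jitter idea (mine, user space only, nothing like the kernel's actual collection path; it assumes x86 and a compiler that provides __rdtsc() via <x86intrin.h>):

#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>

#define SAMPLES			1000
#define CREDIT_MILLIBITS	100	/* credit only 0.1 bit per sample */

int main(void)
{
	unsigned long pool = 0;
	unsigned credited = 0;
	int i;

	for (i = 0; i < SAMPLES; i++) {
		usleep(1000);		/* stand-in for "wait for the next interrupt" */
		unsigned long long t = __rdtsc();

		/* Rotate the toy pool and fold in the jittery low byte of the TSC. */
		pool = (pool << 5 | pool >> (8 * sizeof(pool) - 5)) ^ (t & 0xff);
		credited += CREDIT_MILLIBITS;
	}
	printf("pool = %lx, credited %u millibits for %d samples\n",
	       pool, credited, SAMPLES);
	return 0;
}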
Re: Fortuna
>> /dev/urandom depends on the strength of the crypto primitives. >> /dev/random does not. All it needs is a good uniform hash. > > That's not at all clear. I'll go farther: I think it is unlikely > to be true. > > If you want to think about cryptographic primitives being arbitrarily > broken, I think there will be scenarios where /dev/random is insecure. > > As for what you mean by "good uniform hash", I think you'll need to > be a bit more precise. Well, you just pointed me to a very nice paper that *makes* it precise: Boaz Barak, Ronen Shaltiel, and Eran Tromer. True random number generators secure in a changing environment. In Workshop on Cryptographic Hardware and Embedded Systems (CHES), pages 166-180, 2003. LNCS no. 2779. I haven't worked through all the proofs yet, but it looks to be highly applicable. >> Do a bit of reading on the subject of "unicity distance". > > Yes, I've read Shannon's original paper on the subject, as well > as many other treatments. I hope it's obvious that I didn't mean to patronize *you* with such a suggestion! Clearly, you're intimately familiar with the concept, and any discussion can go straight on to more detailed issues. I just hope you'll grant me that understanding the concept is pretty fundamental to any meaningful discussion of information-theoretic security. > I stand by my comments above. Cool! So there's a problem to be solved! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Fortuna
> Correct me if I'm wrong here, but uniformity of the linear function isn't
> sufficient even if we implemented like this (right now it's more a+X than
> a X).
>
> The part which suggests choosing an irreducible poly and a value "a" in the
> preprocessing stage ... last I checked the value for a and the poly need to
> be secret. How do you generate poly and a, Catch-22? Perhaps I'm missing
> something and someone can point it out.

No, the value (the parameter pi) is specifically described as "the public parameter". See the "Preprocessing" paragraph at the end of section 1.2 on page 3. "This string is then hardwired into the implementation and need not be kept secret."

All that's required is that the adversary can't tailor his limited control over the input based on knowing pi. There's a simple proof in all the papers that if an adversary knows *everything* about the randomness extraction function, and has total control over the input distribution, you're screwed.

Basically, suppose you have a 1024-bit input block, the attacker is required to choose a distribution with at least 1023 bits of entropy, and you only want 1 bit out. Should be easy, right? Well, with any *fixed* function, the possible inputs are divided into those that hash to 0, and those that hash to 1. One of those sets must have at least 2^1023 members. Suppose it's 0. The attacker can choose the input distribution to be "uniformly at random from the >= 2^1023 inputs that hash to 0" and keep the promise while totally breaking your extraction function.

But this paper says that if the attacker has to choose 2^t possible input distributions (based on t bits of control over the input) *before* the random parameter pi is chosen, then they're locked out. *After* learning pi, they can choose *which* of the 2^t input distributions to use.

The thing is, you need a parameterized family of hash functions. They choose a random multiplier mod GF(2^n). Their construction is based on the well-known 2-universal family of hash functions hash(x) = (a*x+b) mod p. The /dev/random input mix is based on choosing a "random" polynomial (since there was a lot of efficiency pressure, it isn't actually very random; the question is, is it non-random enough to help an attacker). Remainder modulo a uniformly chosen random irreducible polynomial is a well-known ("division hash") family of universal hash functions, but it's a little bit weaker than the above, and I have to figure out if the proof extends.
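To make "parameterized family" concrete, here is a toy member of that classic 2-universal family (my sketch, not the paper's GF(2^n) construction): the parameters a and b play the role of the public pi, chosen at random once and publishable, but fixed before the adversary commits to an input distribution.

#include <stdint.h>
#include <stdio.h>

#define P 2147483647u		/* the Mersenne prime 2^31 - 1 */

struct uhash {
	uint64_t a;		/* 1 <= a < P, chosen uniformly at random, public */
	uint64_t b;		/* 0 <= b < P, chosen uniformly at random, public */
	uint32_t m;		/* size of the output range */
};

/* h_{a,b}(x) = ((a*x + b) mod p) mod m, for inputs x < P. */
static uint32_t uhash_eval(const struct uhash *h, uint32_t x)
{
	return (uint32_t)((h->a * x + h->b) % P) % h->m;
}

int main(void)
{
	struct uhash h = { .a = 1234567u, .b = 8901234u, .m = 1u << 16 };

	printf("h(42) = %u, h(43) = %u\n", uhash_eval(&h, 42), uhash_eval(&h, 43));
	return 0;
}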
Re: Fortuna
> re entropy of the event?" becomes a
> crucial concern here. What if, by your leading example, there is 1/2 bit
> of entropy in each event? Will the estimator even account for 1/2 bits?
> Or will it see each event as 3 bits of entropy? How much of a margin
> of error can we tolerate?

H'm... the old code *used* to handle fractional bits, but the new code seems to round down to the nearest bit. That may have to get fixed to handle low-rate inputs.

As for margin of error, any persistent entropy overestimate is Bad; a 6-fold overestimate is disastrous. What we can do is refuse to drain the main pool below, say, 128 bits of entropy. Then we're safe against any *occasional* overestimates as long as they don't add up to 128 bits.

> /dev/random will output once it has at least 160 bits of entropy
> (iirc), 1/2 bit turning into 3 bits would mean that 160 bits of output
> is effectively only 27 bits worth of true entropy (again, assuming the
> catastrophic reseeder and output function don't waste entropy).
>
> It's a lot of "ifs" for my taste.

/dev/random will output once it has as many bits of entropy as you're asking for. If you do a 20-byte read, it'll output once it has 160 bits. If you do a 1-byte read, it'll output once it has 8 bits.
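A sketch of that reserve rule (mine, with made-up names; nothing from random.c):

#include <stdio.h>
#include <stddef.h>

#define POOL_RESERVE_BITS 128	/* never hand out entropy below this floor */

/*
 * Given the current (possibly overestimated) entropy count and a request,
 * return how many bits we are willing to claim as truly random.
 */
static size_t extractable_bits(size_t pool_entropy_bits, size_t requested_bits)
{
	if (pool_entropy_bits <= POOL_RESERVE_BITS)
		return 0;
	if (requested_bits > pool_entropy_bits - POOL_RESERVE_BITS)
		return pool_entropy_bits - POOL_RESERVE_BITS;
	return requested_bits;
}

int main(void)
{
	printf("%zu\n", extractable_bits(512, 160));	/* 160: plenty in reserve */
	printf("%zu\n", extractable_bits(200, 160));	/* 72: stop at the floor */
	printf("%zu\n", extractable_bits(100, 160));	/* 0: below the floor */
	return 0;
}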
Re: kernel guide to space
my rule is that a comment or broken expression requires braces:

	if (foo) {
		/* We need to barify it, or else pagecache gets FUBAR'ed */
		bar();
	}

	if (foo) {
		bar(p->foo[hash(garply) % LARGEPRIME]->head,
		    flags & ~(FLAG_FOO | FLAG_BAR | FLAG_BAZ | FLAG_QUUX));
	}

> Thus we may be better to slightly encourage use of {}s even if they are
> not needed:
>
> if(foo) {
> 	bar();
> }

It's not horrible to include them, but it reduces clutter sometimes to leave them out.

>> if (foobar(.) + barbar * foobar(bar +
>> 	foo *
>> 	oof)) {
>> }
>
> Ugh, that's as ugly as it can get... Something like below is much
> easier to read...
>
> if (foobar(.) +
>     barbar * foobar(bar + foo * oof)) {
> }

Strongly agreed! If you have to break an expression, do it at the lowest precedence point possible!

> Even easier is
> if (foobar(.)
>     + barbar * foobar(bar + foo * oof)) {
> }
>
> Since a statement cannot start with binary operators
> and as such we are SURE that there must have been something before.

I don't tend to do this, but I see the merit. However, C uses a number of operators (+ - * &) in both unary and binary forms, so it's not always unambiguous. In such cases, I'll usually move the brace onto its own line to make the end of the condition clearer:

	if (foobar(.) +
	    barbar * foobar(bar + foo * oof))
	{
	}

Of course, better yet is to use a temporary or something to shrink the condition down to a sane size, but sometimes you just need

	if (messy_condition_one &&
	    messy_condition_two &&
	    messy_condition_three) {
	}
Re: a 15 GB file on tmpfs
> I have a 15 GB file which I want to place in memory via tmpfs. I want to do
> this because I need to have this data accessible with a very low seek time.

It should work fine. tmpfs has the same limits as any other file system: 2 TB or more, and more than that with CONFIG_LBD.

NOTE, however, that tmpfs does NOT guarantee the data will be in RAM! It uses the page cache just like any other file system, and pages out unused data just like any other file system. If you just want average-case fast, it'll work fine. If you want guaranteed fast, you'll have to work harder.

> I want to know if this is possible before spending 10,000 euros on a machine
> that has 16 GB of memory.

So create a 15 GB file on an existing machine. Make it sparse, so you don't need so much RAM, but test to verify that the kernel doesn't wrap at 4 GB, and can keep the data at offsets 0, 4 GB, 8 GB, and 12 GB separate. Works for me (test code below).

> The machine we plan to buy is a HP Proliant Xeon machine and I want to run a
> 32 bit linux kernel on it (the xeon we want doesn't have the 64-bit stuff
> yet)

If you're working with > 4 GB data sets, I would recommend you think VERY hard before deciding not to get a 64-bit machine. If you could just put all 15 GB into your application's address space:

- The application would be much simpler and faster.
- The kernel wouldn't be slowed by HIGHMEM workarounds. It's not that bad, but it's definitely noticeable.
- Your expensive new machine won't be obsolete quite as fast.

I'd also like to mention that AMD's large L2 TLB is enormously helpful when working with large data sets. It's not discussed much on the web sites that benchmark with games, but it really helps crunch a lot of data.

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64

/* The header names were lost in archiving; these are the obvious ones. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define STRIDE (1<<20)

int main(int argc, char **argv)
{
	int fd;
	off_t off;

	if (argc != 2) {
		fprintf(stderr, "Wrong number of arguments: %u\n", argc);
		return 1;
	}
	fd = open(argv[1], O_RDWR|O_CREAT|O_LARGEFILE, 0666);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}
	/* Loop bound assumed to be 16 GB (0x400000000); the constant was garbled in the archive. */
	for (off = 0; off < 0x400000000LL; off += STRIDE) {
		char buf[40];
		off_t res;
		ssize_t ss1, ss2;

		ss1 = sprintf(buf, "%llu", off);
		res = lseek(fd, off, SEEK_SET);
		if (res == (off_t)-1) {
			perror("lseek");
			return 1;
		}
		ss2 = write(fd, buf, ++ss1);
		if (ss2 != ss1) {
			perror("write");
			return 1;
		}
	}
	for (off = 0; off < 0x400000000LL; off += STRIDE) {
		char buf[40], buf2[40];
		off_t res;
		ssize_t ss1, ss2;

		ss1 = sprintf(buf, "%lld", off);
		res = lseek(fd, off, SEEK_SET);
		if (res == (off_t)-1) {
			perror("lseek");
			return 1;
		}
		ss2 = read(fd, buf2, ++ss1);
		if (ss2 != ss1 || memcmp(buf, buf2, ss1) != 0) {
			fprintf(stderr, "Mismatch at %llu: %.*s vs. %s\n",
				off, (int)ss2, buf2, buf);
			return 1;
		}
	}
	printf("All tests succeeded.\n");
	return 0;
}
Re: CCITT-CRC16 in kernel
> Does anybody know what the CRC of a known string is supposed
> to be? I have documentation that states that the CCITT CRC-16
> of "123456789" is supposed to be 0xe5cc and "A" is supposed
> to be 0x9479. The kernel one doesn't do this. In fact, I
> haven't found anything on the net that returns the "correct"
> value regardless of how it's initialized or how it's mucked
> with after the CRC (well I could just set the CRC to 0 and
> add the correct number). Anyway, how do I use the crc_citt
> in the kernel? I've grepped through some drivers that use
> it and they all seem to check the result against some
> magic rather than performing the CRC of data, but not the
> CRC, then comparing it to the CRC. One should not have
> to use magic to verify a CRC, one should just perform
> a CRC on the data, but not the CRC, then compare the result
> with the CRC. Am I missing something here?

There are two common 16-bit CRC polynomials. The original IBM CRC-16 is x^16 + x^15 + x^2 + 1. The more popular CRC-CCITT is x^16 + x^12 + x^5 + 1. Both of these include (x+1) as a factor, so provide parity detection, detecting all odd-bit errors, at the expense of reducing the largest detectable 2-bit error from 65535 bits to 32767.

All CRC algorithms work on bit strings, so an endianness convention for bits within a byte is always required. Unless specified, the little-endian RS-232 serial transmission order is generally assumed. That is, the least significant bit of the first byte is "first". This bit string is equated to a polynomial where the first bit is the coefficient of the highest power of x, and the last bit (msbit of the last byte) is the coefficient of x^0. (Some people think of this as big-endian, and get all confused.)

Using this bit-ordering, and omitting the x^16 term as is conventional (it's implicit in the implementation), the polynomials come out as:

CRC-16:    0xa001
CRC-CCITT: 0x8408

The mathematically "cleanest" CRC has the unfortunate property that leading or trailing zero bits can be added or removed without affecting the CRC computation. That is, they are not detected as errors. For fixed-size messages, this does not matter, but for variable-sized messages, a way to detect inserted or deleted padding is desirable.

To detect leading padding, it is customary to invert the first 16 bits of the message. This is equivalent to initializing the CRC accumulator to all-ones rather than 0, and is invariably implemented that way. This change is duplicated on CRC verification, and has no effect on the final result.

To detect trailing padding, it is customary to invert all 16 bits of the CRC before appending it to the message. This has an effect on CRC verification. One way to CRC-check a message is to compute the CRC of the entire message *including* the CRC. You can see this in many link-layer protocol specifications which place the trailing frame delimiter after the CRC, because the decoding hardware doesn't need to know in advance where the message stops and the CRC starts. If the CRC is NOT inverted, the CRC of a correct message should be zero. If the CRC is inverted, the correct CRC is a non-zero constant. You can still use the same "checksum everything, including the original CRC" technique, but you have to compare with a non-zero result value. For CRC-16, the final result is x^15 + x^3 + x^2 + 1 (0xb001). For CRC-CCITT, the final result is x^12 + x^11 + x^10 + x^8 + x^3 + x^2 + x + 1 (0xf0b8).

The *other* thing you have to do is append the checksum to the message correctly. As mentioned earlier, the lsbit of a byte is considered first, so the lsbyte of the 16-bit accumulator is appended first.

Anyway, with all this, and using preset-to-all-ones:

CRC-CCITT of "A" is 0x5c0a, or f5 a3 when inverted and converted to bytes.
CRC-CCITT of "123456789" is 0x6f91, or 6e 90.

(When preset to zero, the values are 0x538d and 0x2189, respectively. That would be 8d 53 or 89 21 if *not* inverted.)
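For anyone who wants to reproduce those numbers, here is a small self-contained program (mine, not from the original mail) using a bit-at-a-time routine that matches the lsbit-first description above. It prints 0x6f91 for "123456789" with preset-to-ones, and 0xf0b8 once the inverted CRC is appended and the whole frame is checksummed again:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* lsbit-first CRC-CCITT, polynomial 0x8408, bit-at-a-time. */
static uint16_t crc_ccitt_bitwise(uint16_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= *p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0x8408 : 0);
	}
	return crc;
}

int main(void)
{
	const unsigned char msg[] = "123456789";
	unsigned char frame[11];
	uint16_t crc;

	crc = crc_ccitt_bitwise(0xffff, msg, 9);
	printf("CRC of \"123456789\", preset to ones: 0x%04x\n", crc);	/* 0x6f91 */

	/* Append the inverted CRC, lsbyte first, and check the whole frame. */
	memcpy(frame, msg, 9);
	frame[9]  = ~crc & 0xff;
	frame[10] = (~crc >> 8) & 0xff;
	printf("CRC over frame incl. inverted CRC:  0x%04x\n",
	       crc_ccitt_bitwise(0xffff, frame, 11));			/* 0xf0b8 */
	return 0;
}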
Re: CCITT-CRC16 in kernel
>> Using this bit-ordering, and omitting the x^16 term as is
>> conventional (it's implicit in the implementation), the polynomials
>> come out as:
>> CRC-16: 0xa001
>> CRC-CCITT: 0x8408
>
> Huh? That's the problem.
>
> X^16 + X^12 + X^5 + X^0 = 0x1021, not 0xa001
>
> Also,
>
> X^16 + X^15 + X^2 + X^0 = 0x8005, not 0x8408

You're wrong in two ways:
1) You've got CRC-16 and CRC-CCITT mixed up, and
2) You've got the bit ordering backwards.

Remember, I said very clearly, the lsbit is the first bit, and the first bit is the highest power of x. You can reverse the convention and still have a CRC, but that's not the way it's usually done and it's more awkward in software.

CRC-CCITT = X^16 + X^12 + X^5 + X^0 = 0x8408, and NOT 0x1021
CRC-16 = X^16 + X^15 + X^2 + X^0 = 0xa001, and NOT 0x8005

> Attached is a program that will generate a table of polynomials
> for the conventional CRC lookup-table code. If you look at
> the table in the kernel code, offset 1, you will see that
> the polynomial is 0x1189. This corresponds to the CRC of
> the value 1. It does not correspond to either your polynomials
> or the ones documented on numerous web pages.

No, it doesn't. The table entry at offset *128* is the CRC polynomial, which is 0x8408, exactly as the comment just above the table says.

> I think somebody just guessed and came up with "magic" because the
> table being used isn't correct.

The table being used is 100% correct. There is no mistake. If you think you've found a mistake, there's something you're not understanding. Sorry to be so blunt, but it's true.

>> The *other* thing you have to do is append the checksum to the message
>> correctly. As mentioned earlier, the lsbit of a byte is considered
>> first, so the lsbyte of the 16-bit accumulator is appended first.
>
> Right, but the hardware did that. I have no control over that. I
> have to figure out if:
>
> (1) It started with 0x or something else.
> (2) It was inverted after.
> (3) The result was byte-swapped.
>
> With the "usual" CRC-16 that I used before, using the lookup-
> table that is for the 0x1021 polynomial, hardware was found
> to have inverted and byte-swapped, but started with 0xefde
> (0x1021 inverted). Trying to use the in-kernel CRC, I was
> unable to find anything that made sense.

You can get rid of the starting value and inversion by XORing together two messages (with valid CRCs) of equal length. The result has a valid CRC with preset to 0 and no inversion. You can figure that out later. Then, the only questions are the polynomial and bit ordering. (You can also have a screwed-up CRC byte ordering, but that's rare except in software written by people who don't know better. Hardware invariably gets it right.)

As I said, the commonest case is to consider the lsbit first. However, some implementations take the msbit of each byte first. Here's code to do it both ways. This is the bit-at-a-time version, not using a table. You can verify that the first implementation, fed an initial crc=0, poly=0x8408, and all possible 1-byte messages, produces the table in crc-ccitt.c.

/*
 * Expects poly encoded so 0x8000 is x^0 and 0x0001 is x^15.
 * CRC should be appended lsbyte first.
 */
uint16_t crc_lsb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= (unsigned char)*p++;
		for (i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
	}
	return crc;
}

/*
 * Expects poly encoded so 0x0001 is x^0 and 0x8000 is x^15.
 * CRC should be appended msbyte first.
 */
uint16_t crc_msb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len)
{
	while (len--) {
		unsigned i;

		crc ^= (uint16_t)(unsigned char)*p++ << 8;
		for (i = 0; i < 8; i++)
			crc = (crc << 1) ^ ((crc & 0x8000) ? poly : 0);
	}
	return crc;
}

If you're trying to reverse-engineer an unknown CRC, get two valid messages of the same length, form their XOR, and try a few different polynomials. (There's a way to do it more efficiently using a GCD, but on a modern machine, it's faster to try all 32768 possible polynomials than to write and debug the GCD code.) After that, you can figure out the preset and final inversion, if any. For fixed-length messages, you can merge them into a single 16-bit constant that you can include at the beginning or the end, but if you have variable-length messages, it matters.
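And a sketch of that brute-force search (mine, a fragment rather than a complete program: it reuses the crc_lsb_first() routine above, assumes frames of at most 256 bytes with the 16-bit CRC appended lsbyte first, and only tries the lsbit-first ordering; repeat with crc_msb_first() and the other byte order if nothing matches):

#include <stdint.h>
#include <stddef.h>

uint16_t crc_lsb_first(uint16_t crc, uint16_t poly,
		       unsigned char const *p, size_t len);	/* as above */

/*
 * f1 and f2 are two captured frames of equal length "len" (including the
 * two CRC bytes).  XORing them cancels any preset and any final inversion,
 * so the XOR must check out with crc = 0 under the right polynomial.
 * Returns the matching polynomial, or 0 if none of the candidates fit.
 */
static uint16_t find_poly(const unsigned char *f1, const unsigned char *f2,
			  size_t len)
{
	unsigned char x[256];
	size_t i, n = len - 2;
	uint32_t poly;

	for (i = 0; i < len; i++)
		x[i] = f1[i] ^ f2[i];

	/* The 32768 candidates with an x^0 term (0x8000 in this encoding). */
	for (poly = 0x8000; poly <= 0xffff; poly++) {
		uint16_t crc = crc_lsb_first(0, (uint16_t)poly, x, n);

		if ((crc & 0xff) == x[n] && (crc >> 8) == x[n + 1])
			return (uint16_t)poly;
	}
	return 0;
}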
Re: CCITT-CRC16 in kernel
>> CRC-CCITT = X^16 + X^12 + X^5 + X^0 = 0x8408, and NOT 0x1021
>> CRC-16 = X^16 + X^15 + X^2 + X^0 = 0xa001, and NOT 0x8005
>
> Thank you very much for your time, but what you say is completely
> different than anything else I have found on the net.
>
> Do the math:
>
> 2^16 = 65536
> 2^12 =  4096
> 2^5  =    32
> 2^0  =     1
> --
> 69665 = 0x11021
>
> That's by convention 0x1021 as the X^16 is thrown away. I have
> no clue how you could possibly get 0x8408 out of this, nor
> how the CRC of 1 could possibly lie at offset 128 in a table
> of CRC polynomials. Now I read it in the header, but that
> doesn't make it right.

The thing is that X is not 2. x is a formal variable with no defined value.

x^0  is represented as 0x8000
x^5  is represented as 0x0400
x^12 is represented as 0x0008
x^16 is not represented by any bit
            TOTAL: 0x8408

> The "RS-232C" order to which you refer simply means that the
> string of "bits" needs to handled as a string of bytes, not
> words or longwords, in other words, not interpreted as
> words, just bytes. If this isn't correct then ZMODEM and
> a few other protocols are wrong. You certainly don't
> swap every BIT in a string do you? You are not claiming
> that (0x01 == 0x80) and (0x02 == 0x40), etc, are you?

Not at all. To repeat:

- A CRC is computed over a string of *bits*. All of its error-correction properties are described in terms of *bit* patterns and *bit* positions and runs of adjacent *bits*. It does not know or care about larger structures such as bytes.

- The CRC algorithm requires that the *first* bit it sees is the coefficient of the highest power of x, and the *last* bit it sees is the coefficient of x^0. This is because it's basically long division.

- If you are working in software, you (the implementor) must define a mapping between a byte string and a bit string. There are only two mappings that make any sense at all:

  1) The least-significant bit of each byte is considered "first", and the most-significant is considered "last".
  2) The most-significant bit of each byte is considered "first", and the least-significant is considered "last".

The logic of the CRC *does not care* which one you choose, but you have to choose one. If the bytes are to be converted to bit-serial form, it is best to choose the form actually used for transmission, to preserve the burst error detection properties of the CRC.

Note that:

- Many people (including, apparently, you) find the second choice a bit easier to visualize, as bit i is the coefficient of x^i.

- The first choice is
  a) Easier to implement in software, and
  b) Matches RS-232 transmission order, and
  c) Is used by hardware such as the Z8530 SCC and MPC860 QUICC, and
  d) Is the form invariably used by experienced software implementors.

If you have some weird piece of existing hardware, it might have chosen either. Just try them both and see which works. However, if your hardware uses the opposite bit ordering within bytes, DO NOT ATTEMPT to "fix" lib/crc-ccitt.c. It will break all of the existing users of the code.
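A quick way to convince yourself that the two camps are describing the same polynomials, just with the bits written in opposite order (my check, not part of the original mail):

#include <stdint.h>
#include <stdio.h>

/* Reverse the 16 bits of x, so bit 0 becomes bit 15 and so on. */
static uint16_t rev16(uint16_t x)
{
	uint16_t r = 0;
	int i;

	for (i = 0; i < 16; i++)
		r |= ((x >> i) & 1) << (15 - i);
	return r;
}

int main(void)
{
	printf("CRC-CCITT: 0x%04x <-> 0x%04x\n", 0x1021, rev16(0x1021));	/* 0x8408 */
	printf("CRC-16:    0x%04x <-> 0x%04x\n", 0x8005, rev16(0x8005));	/* 0xa001 */
	return 0;
}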
Re: CCITT-CRC16 in kernel
> The "Bible" has been: > http://www.joegeluso.com/software/articles/ccitt.htm This fellow is just plain Confused. First of all, The Standard Way to do it is to preset to -1 *and* to invert the result. Any combination is certainly valid, but if you don't invert the CRC before appending it, you fail to detect added or deleted trailing zero bits, which can be A Bad Thing in some applications. Secondly, I see what he's on about "trailing zero bits", but he isn't aware that *everyone* uses the "implicit" algorithm, so the reason that the specs don't explain that very well is that the spec writers forgot there was any other way to do it. So his long-hand calculations are just plain WRONG. Presetting the CRC accumulator to -1 is equivalent to *inverting the first 16 bits of the message*, and NOT to prepending 16 1 bits. Also, he's got his bit ordering wrong. The correct way to do it, long-hand, is this: Polynomial: x^16 + x^12 + x^5 + 1. In bits, that's 100010011 Message: ascii "A", 0x41. Convert to bits, lsbit first: 1010 Append trailing padding to hold CRC: 1010 Invert first 16 bits: 0101 Now let's do the long division. You can compute the quotient, but it's not needed for anything, so I'm not bothering to write it down: 0101 100010011 - 1110010111010 100010011 - 1101100110110 100010011 - 1010000101110 100010011 - 100111000 100010011 - 010100111010 - Final remainder XOR trailing padding with computed CRC (since we used padding of zero, that's equivalent to overwriting it): 1010010100111010 Or, if the CRC is inverted: 1010110101000101 To double-check, let's verify that CRC. I'm verifying the non-inverted CRC, so I expect a remainder of zero: Received message: 1010010100111010 Invert first 16 bits: 0101101000111010 0101101000111010 100010011 - 11100110100111011 100010011 - 11011101000110101 100010011 - 1010101101001 100010011 - 100010011 100010011 - 000 - Final remainder Now, note how in each step, the decision whether to XOR with the 17-bit polynomial is made based soley on the leading bit of the remainder. The trailing 16 bits are modified (by XOR with the polynomial), but not examined. This leads to a standard optimization, where the bits from the dividend are not merged into the working remainder until the moment they are needed. Each step, the leading bit of the 17-bit remainder is XORed with the next dividend bit, and the polynomial is XORed in as required to force the leading bit to 0. Then the remainder is shifted, discarding the leading 0 bit and shifting in a trailing 0 bit. This technique avoids the need for explicit padding, and is the way that the computation is invariably performed in all but pedagogical implementations. Also, the awkward 17-bit size of the remainder register can be reduced to 16 bits with care, as at any given moment, one of the bits is known to be zero. It is usually the trailing bit, but between the XOR and the shift, it is the leading bit. (Again, recall that in a typical software implementation, the "leading bit" is the lsbit and the "trailing bit" is the msbit. Because the CRC algorithm does not use addition or carries or anything of the sort, it does not care which convention software uses.) > I have spent over a week grabbing everything on the Web that > could help decipher the CCITT CRC and they all show this > same kind of code and same kind of organization. Nothing > I could find on the Web is like the linux kernel ccitt_crc. > Go figure. 
Funny, I can find it all over the place:

http://www.nongnu.org/avr-libc/user-manual/group__avr__crc.html
http://www.aerospacesoftware.com/checks.htm
http://www.bsdg.org/SWAG/CRC/0011.PAS.html
http://www.ethereal.com/lists/ethereal-dev/200406/msg00414.html
http://pajhome.org.uk/progs/crcsrcc.html
http://koders.com/c/fidE2A434B346BFDCD29DA556A54E37C99E403ED26B.aspx

> Do you suppose it was bit-swapped to bypass a patent?

There's no patent. That's just the way that the entire SDLC family of protocols (HDLC, LAPB, LAPD, SS#7, X.25, AX.25, PPP, IRDA, etc.) do it. They transmit lsbit-first, so they compute lsbit-first.
Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Actually, is there any place *other* than write() to the page cache that warrants a non-temporal store? Network sockets with scatter/gather and hardware checksum, maybe? This is pretty much synonymous with what is allowed to go into high memory, no?

While we're on the subject, for the copy_from_user source, prefetchnta is probably indicated. If user space hasn't caused it to be cached already (admittedly, the common case), we *know* the kernel isn't going to look at that data again.
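As a user-space illustration of the idea only (mine; not the kernel's __copy_from_user_ll, and it assumes SSE2, 16-byte alignment and a length that is a multiple of 64):

#include <emmintrin.h>
#include <stddef.h>

static void copy_nontemporal(void *dst, const void *src, size_t len)
{
	const char *s = src;
	char *d = dst;
	size_t i;

	for (i = 0; i < len; i += 64) {
		/* Hint: we will read this source line once and not come back. */
		_mm_prefetch(s + i + 256, _MM_HINT_NTA);

		__m128i a = _mm_load_si128((const __m128i *)(s + i));
		__m128i b = _mm_load_si128((const __m128i *)(s + i + 16));
		__m128i c = _mm_load_si128((const __m128i *)(s + i + 32));
		__m128i e = _mm_load_si128((const __m128i *)(s + i + 48));

		/* Streaming stores bypass the cache on the way out. */
		_mm_stream_si128((__m128i *)(d + i), a);
		_mm_stream_si128((__m128i *)(d + i + 16), b);
		_mm_stream_si128((__m128i *)(d + i + 32), c);
		_mm_stream_si128((__m128i *)(d + i + 48), e);
	}
	_mm_sfence();	/* order the streaming stores before the data is used */
}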
Re: Fortuna
se a fixed seed. Also, unless I'm misunderstanding the definition very badly, any "strong extractor" can use a fixed secret seed. > I'm not sure whether any of the above will be practically relevant. > They may be too theoretical for real-world use. But if you're interested, > I could try to give you more information about any of these categories. I'm doing some reading to see if something practical can be dug out of the pile. I'm also looking at "compressors", which are a lot like our random pools; they reduce the size of an input while preserving its entropy, just not necessarily to 100% density like an extractor. This is attractive because our entropy measurement is known to be heavily derated for safety. An extractor, in producing an output that is as large as our guaranteed entropy, will throw away any additional entropy that might be remaining. The other thing that I absolutely need is some guarantee that things will still mostly work if our entropy estimate is wrong. If /dev/random produces 128 bits of output that only have 120 bits of entropy in them, then your encryption is still secure. But these extractor constructions are very simple and linear. If everything falls apart if I overestimate the source entropy by 1 bit, it's probably a bad idea. Maybe it can be salvaged with some cryptographic techniques as backup. >> 3) Fortuna's design doesn't actually *work*. The authors' analysis >>only works in the case that the entropy seeds are independent, but >>forgot to state the assumption. Some people reviewing the design >>don't notice the omission. > Ok, now I understand your objection. Yup, this is a real objection. > You are right to ask questions about whether this is a reasonable assumption. > > I don't know whether /dev/random makes the same assumption. I suspect that > its entropy estimator is making a similar assumption (not exactly the same > one), but I don't know for sure. Well, the entropy *accumulator* doesn't use any such assumption. Fortuna uses the independence assumption when it divides up the seed material round-robin among the various subpools. The current /dev/random doesn't do anything like that. (Of course, non-independence affects us by limiting us to the conditional entropy of any given piece of seed material.) > I also don't know whether this is a realistic assumption to make about > the physical sources we currently feed into /dev/random. That would require > some analysis of the physics of those sources, and I don't have the skills > it would take to do that kind of analysis. And given the variety of platforms that Linux runs on, it gets insane. Yes, it can be proved based on fluid flow computations that hard drive rotation rates are chaotic and thus disk access timing is a usable entropy source, but then someone installs a solid-state disk. That's why I like clock jitter. That just requires studying oscillators and PLLs, which are universal across all platforms. > Actually, this example scenario is not a problem. I'll finish the > analysis for you. Er... thank you, but I already knew that; I omitted the completion because it seemed obvious. And yes, there are many other distributions which are worse. But your 200-round assumption is flawed; I'm assuming the Fortuna schedule, which is that subpool i is dumped into the main pool (and thus information-theoretically available at the output) every 2^i rounds. So the second pool is dumped in every 2 rounds, not every 200. 
And with 1/3 the entropy rate, if the first pool is brute-forceable (which is our basic assumption), then the second one certainly is. Now, this simple construction doesn't extend to more pools, but it's trying to point out the lack of a *disproof* of a source distribution where higher-order pools get exponentially less entropy per seed due to the exposure of lower-order pools. Which would turn Fortuna into an elaborate exercise in bit-shuffling for no security benefit at all. This can all be dimly seen through the papers on extractors, where low-k sources are really hard to work with; all the designs want you to accumulate enough input to get a large k. > If you want a better example of where the two-pool scheme completely > falls apart, consider this: our source picks a random bit, uses this > same bit the next two times it is queried, and then picks a new bit. > Its sequence of outputs will look like (b0,b0,b1,b1,b2,b2,..,). If > we alternate pools, then the first pool sees the sequence b0,b1,b2,.. > and the second pool sees exactly the same sequence. Consequently, an > adversary who can observe the entire evolution of the first pool can > deduce everything there is to know about the second pool. This just > illustrates that these multiple-poo
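For readers trying to follow the schedule being argued about: Fortuna spreads incoming events round-robin across its pools, and pool i is drained into the generator only on reseeds whose count is divisible by 2^i. A standalone toy (nothing cryptographic, just the bookkeeping) that prints which pools participate in each reseed:

	#include <stdio.h>

	#define NPOOLS 8	/* the real design uses 32; 8 keeps the output short */

	int main(void)
	{
		unsigned r, i;

		for (r = 1; r <= 16; r++) {
			printf("reseed %2u: pools", r);
			/* Pool i participates iff 2^i divides the reseed count. */
			for (i = 0; i < NPOOLS; i++) {
				if (r % (1u << i))
					break;
				printf(" %u", i);
			}
			printf("\n");
		}
		return 0;
	}

So pool 0 contributes every time, pool 1 every second reseed, pool 2 every fourth, and so on -- which is where the "every 2^i rounds" above comes from.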
Re: enforcing DB immutability
[A discussion on the git list about how to provide a hardlinked file that *cannot* be modified by an editor, but must be replaced by a new copy.] [EMAIL PROTECTED] wrote all of: >>> perhaps having a new 'immutable hardlink' feature in the Linux VFS >>> would help? I.e. a hardlink that can only be readonly followed, and >>> can be removed, but cannot be chmod-ed to a writeable hardlink. That i >>> think would be a large enough barrier for editors/build-tools not to >>> play the tricks they already do that makes 'readonly' files virtually >>> meaningless. >> >> immutable hardlinks have the following advantage: a hardlink by design >> hides the information where the link comes from. So even if an editor >> wanted to play stupid games and override the immutability - it doesnt >> know where the DB object is. (sure, it could find it if it wants to, >> but that needs real messing around - editors wont do _that_) > > so the only sensible thing the editor/tool can do when it wants to > change the file is precisely what we want: it will copy the hardlinked > files's contents to a new file, and will replace the old file with the > new file - a copy on write. No accidental corruption of the DB's > contents. This is not a horrible idea, but it touches on another sore point I've worried about for a while. The obvious way to do the above *without* changing anything is just to remove all write permission to the file. But because I'm the owner, some piece of software running with my permissions can just decide to change the permissions back and modify the file anyway. Good old 7th edition let you give files away, which could have addressed that (chmod a-w; chown phantom_user), but BSD took that ability away to make accounting work. The upshot is that, while separate users keep malware from harming the *system*, if I run a piece of malware, it can blow away every file I own and make me unhappy. When (notice I'm not saying "if") commercial spyware for Linux becomes common, it can also read every file I own. Unless I have root access, Linux is no safer *for me* than Redmondware! Since I *do* have root access, I often set up sandbox users and try commercial binaries in that environment, but it's a pain and laziness often wins. I want a feature that I can wrap in a script, so that I can run a commercial binary in a nicely restricted environment. Or maybe I even want to set up a "personal root" level, and run my normal interactive shells in a slightly restricted environment (within which I could make a more-restricted world to run untrusted binaries). Then I could solve the immutable DB issue by having a "setuid" binary that would make checked-in files unwriteable at my normal permission level. Obviously, a fundamental change to the Unix permissions model won't be available to solve short-term problems, but I thought I'd raise the issue to get people thinking about longer-term solutions. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
> Since fxsave leaves the FPU state intact, there ought to be a better way > to do this but it gets tricky. Maybe using the TSC to put a timestamp > in every thread save area? > > when saving FPU state: > put cpu# and timestamp in thread state info > also store timestamp in per-cpu data > > on task switch: > compare cpu# and timestamps for next task > if equal, clear TS and set TS_USEDFPU > > when state becomes invalid for some reason: > zero cpu's timestamp > > But the extra overhead might be too much in many cases. Simpler: - Thread has "CPU that I last used FPU on" pointer. Never NULL. - Each CPU has "thread whose FPU state I hold" pointer. May be NULL. When *loading* FPU state: - Set up both pointers. On task switch: - If the pointers point to each other, then clear TS and skip restore. ("Preloaded") When state becomes invalid (kernel MMX use, or whatever) - Set CPU's pointer to NULL. On thread creation: - If current CPU's thread pointer points to the newly allocated thread, clear it to NULL. - Set thread's CPU pointer to current CPU. The UP case just omits the per-thread CPU pointer. (Well, stores it in zero bits.) An alternative SMP thread-creation case would be to have a NULL value for the thread-to-CPU pointer and initialize the thread's CPU pointer to that, but that then complicates the UP case. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
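Here is a standalone sketch of the pointer scheme above, with all names invented for illustration; the real code would live in the i386 FPU switch path, not in freestanding structures like these.

	struct thread {
		int fpu_cpu;			/* CPU I last used the FPU on; never invalid */
		/* ... saved FPU image, etc. ... */
	};

	struct cpu {
		struct thread *fpu_thread;	/* thread whose FPU state I hold; may be NULL */
	};

	/* When actually loading FPU state onto a CPU: set up both pointers. */
	static void fpu_loaded(struct cpu *cpu, int cpunum, struct thread *t)
	{
		cpu->fpu_thread = t;
		t->fpu_cpu = cpunum;
	}

	/* On task switch: nonzero means clear TS and skip the restore ("preloaded"). */
	static int fpu_preloaded(const struct cpu *cpu, int cpunum, const struct thread *next)
	{
		return cpu->fpu_thread == next && next->fpu_cpu == cpunum;
	}

	/* When the state becomes invalid (kernel MMX use, or whatever): */
	static void fpu_invalidate(struct cpu *cpu)
	{
		cpu->fpu_thread = NULL;
	}

	/* On thread creation: */
	static void fpu_init_thread(struct cpu *cpu, int cpunum, struct thread *t)
	{
		if (cpu->fpu_thread == t)	/* recycled memory would be a stale match */
			cpu->fpu_thread = NULL;
		t->fpu_cpu = cpunum;
	}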
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
> OK, I guess when I get some time, I'll start testing all the i386 bitop > functions, comparing the asm with the gcc versions. Now could someone > explain to me what's wrong with testing hot cache code. Can one > instruction retrieve from memory better than others? To add one to Linus' list, note that all current AMD & Intel chips record instruction boundaries in L1 cache, either predecoding on L1 cache load, or marking the boundaries on first execution. The P4 takes it to an extreme, but P3 and K7/K8 do it too. The result is that there are additional instruction decode limits that apply to cold-cache code. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Add prefetch switch stack hook in scheduler function
> include/asm-alpha/mmu_context.h |6 ++ > include/asm-arm/mmu_context.h |6 ++ > include/asm-arm26/mmu_context.h |6 ++ > include/asm-cris/mmu_context.h |6 ++ > include/asm-frv/mmu_context.h |6 ++ > include/asm-h8300/mmu_context.h |6 ++ > include/asm-i386/mmu_context.h |6 ++ > include/asm-ia64/mmu_context.h |6 ++ > include/asm-m32r/mmu_context.h |6 ++ > include/asm-m68k/mmu_context.h |6 ++ > include/asm-m68knommu/mmu_context.h |6 ++ > include/asm-mips/mmu_context.h |6 ++ > include/asm-parisc/mmu_context.h|6 ++ > include/asm-ppc/mmu_context.h |6 ++ > include/asm-ppc64/mmu_context.h |6 ++ > include/asm-s390/mmu_context.h |6 ++ > include/asm-sh/mmu_context.h|6 ++ > include/asm-sh64/mmu_context.h |6 ++ > include/asm-sparc/mmu_context.h |6 ++ > include/asm-sparc64/mmu_context.h |6 ++ > include/asm-um/mmu_context.h|6 ++ > include/asm-v850/mmu_context.h |6 ++ > include/asm-x86_64/mmu_context.h|5 + > include/asm-xtensa/mmu_context.h|6 ++ > kernel/sched.c |9 - > 25 files changed, 151 insertions(+), 1 deletion(-) I think this pretty clearly points out the need for some arch-generic infrastructure in Linux. An awful lot of arch hooks are for one or two architectures with some peculiarities, and the other 90% of the implementations are identical. For example, this is 22 repetitions of #define MIN_KERNEL_STACK_FOOTPRINT L1_CACHE_BYTES with one different case. It would be awfully nice if there was a standard way to provide a default implementation that was automatically picked up by any architecture that didn't explicitly override it. One possibility is to use #ifndef: /* asm-$PLATFORM/foo.h */ #define MIN_KERNEL_STACK_FOOTPRINT IA64_SWITCH_STACK_SIZE inline void prefetch_task(struct task_struct const *task) { ... } #define prefetch_task prefetch_task /* asm-generic/foo.h */ #include #ifndef MIN_KERNEL_STACK_FOOTPRINT #define MIN_KERNEL_STACK_FOOTPRINT L1_CACHE_BYTES #endif #ifndef prefetch_task inline void prefetch_task(struct task_struct const *task) { } /* The #define is OPTIONAL... */ #define prefetch_task prefetch_task #endif But both understanding and maintaining the arch code could be much easier if the shared parts were collapsed. A comment in the generic versions can explain what the assumptions are. If there are cases where there is more than one implementation with multiple users, it can be stuffed into a third category of headers. E.g. and or some such, using the same duplicate-suppression technique and #included at the end of - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
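As a self-contained illustration of the override-with-default pattern being proposed (the file names, constant values and hook shown here are invented for the example; the real hook and constant would come from the patch above):

	/* asm-foo/task_prefetch.h: an architecture that wants its own versions */
	#define MIN_KERNEL_STACK_FOOTPRINT 512		/* e.g. a big switch-stack frame */

	struct task_struct;				/* forward declaration for the sketch */

	static inline void prefetch_task(const struct task_struct *task)
	{
		__builtin_prefetch(task);		/* stand-in for the arch-specific hook */
	}
	#define prefetch_task prefetch_task

	/* asm-generic/task_prefetch.h: defaults, picked up by everyone who didn't override */
	#ifndef MIN_KERNEL_STACK_FOOTPRINT
	#define MIN_KERNEL_STACK_FOOTPRINT 64		/* i.e. L1_CACHE_BYTES */
	#endif

	#ifndef prefetch_task
	static inline void prefetch_task(const struct task_struct *task) { }
	#define prefetch_task prefetch_task
	#endif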
Re: How to get dentry from inode number?
> How can I get a full pathname from an inode number ? (Our data > structure only keep track inode number instead of pathname in > order to keep thin, so don't have any information but inode > number.) Except in extreme circumstances (there's some horrible kludgery in the NFS code), you don't. Just store a dentry pointer to begin with; it's easy to map from dentry to inode. In addition to files with multiple names, you can have files with no names, made by the usual Unix trick of deleting a file after opening it. The NFS kludgery is required by the short-sighted design of the NFS protocol. Don't emulate it, or you will be lynched by a mob of angry kernel developers with torches and pitchforks. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
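A minimal sketch of the workable direction. It assumes the 2.6-era VFS (dget/dput and d_inode are real); the record structure and function names are mine.

	/* Keep a pinned dentry in your bookkeeping, not an inode number. */
	struct my_record {
		struct dentry *dentry;
	};

	static void my_record_set(struct my_record *r, struct dentry *d)
	{
		r->dentry = dget(d);		/* take a reference while we hold it */
	}

	static struct inode *my_record_inode(const struct my_record *r)
	{
		return r->dentry->d_inode;	/* dentry -> inode is a field access */
	}

	static void my_record_drop(struct my_record *r)
	{
		dput(r->dentry);		/* release when the record goes away */
	}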
Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)
> static inline int new_find_first_bit(const unsigned long *b, unsigned size)
> {
>         int x = 0;
>         do {
>                 unsigned long v = *b++;
>                 if (v)
>                         return __ffs(v) + x;
>                 if (x >= size)
>                         break;
>                 x += 32;
>         } while (1);
>         return x;
> }

Wait a minute... suppose that size == 32 and the bitmap is one word of all zeros. Dynamic execution will overflow the buffer:

        int x = 0;
        unsigned long v = *b++;         /* Zero */
        if (v)                          /* False, v == 0 */
        if (x >= size)                  /* False, 0 < 32 */
        x += 32;
        } while (1);
        unsigned long v = *b++;         /* Buffer overflow */
        if (v)                          /* Random value, suppose non-zero */
        return __ffs(v) + x;            /* >= 32 */

That should be:

static inline int new_find_first_bit(const unsigned long *b, unsigned size)
{
        int x = 0;
        do {
                unsigned long v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += 32) < size);
        return size;
}

Note that we assume that the trailing long is padded with zeros. In truth, it should probably be either

static inline unsigned new_find_first_bit(u32 const *b, unsigned size)
{
        int x = 0;
        do {
                u32 v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += 32) < size);
        return size;
}

or

static inline unsigned new_find_first_bit(unsigned long const *b, unsigned size)
{
        unsigned x = 0;
        do {
                unsigned long v = *b++;
                if (v)
                        return __ffs(v) + x;
        } while ((x += CHAR_BIT * sizeof *b) < size);
        return size;
}

Do we actually store bitmaps on 64-bit machines with 32 significant bits per ulong? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Need better is_better_time_interpolator() algorithm
> (frequency) * (1/drift) * (1/latency) * (1/(jitter_factor * cpus)) (Note that 1/cpus, being a constant for all evaluations of this expression, has no effect on the final ranking.) The usual way it's done is with some fiddle factors: quality_a^a * quality_b^b * quality_c^c Or, equivalently: a * log(quality_a) + b * log(quality_b) + c * log(quality_c) Then you use the a, b and c factors to weight the relative importance of them. Your suggestion is equivalent to setting all the exponents to 1. But you can also say that "a is twice as important as b" in a consistent manner. Note that computing a few bits of log_2 is not hard to do in integer math if you're not too anxious about efficiency:

unsigned log2(unsigned x)
{
        unsigned result = 31;
        unsigned i;

        assert(x);
        while (!(x & (1u << 31))) {
                x <<= 1;
                result--;
        }
        /* Think of x as a 1.31-bit fixed-point number, 1 <= x < 2 */
        for (i = 0; i < NUM_FRACTION_BITS; i++) {
                unsigned long long y = x;

                /* Square x and compare to 2. */
                y *= x;
                result <<= 1;
                if (y & (1ull << 63)) {
                        result++;
                        x = (unsigned)(y >> 32);
                } else {
                        x = (unsigned)(y >> 31);
                }
        }
        return result;
}

Setting NUM_FRACTION_BITS to 16 or so would give enough room for reasonable-sized weights and not have the total overflow 32 bits. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
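To make the weighting concrete, here is one way the fixed-point log2() above might be combined into a single rank. The weights and parameter names are made-up examples, not tuned values, and every argument must be nonzero (per the assert).

	#define NUM_FRACTION_BITS 16

	/* Bigger is better: 3*log2(freq) - 2*log2(drift) - 2*log2(latency) - log2(jitter) */
	static int rank_interpolator(unsigned freq, unsigned drift,
	                             unsigned latency, unsigned jitter)
	{
		return 3 * (int)log2(freq) - 2 * (int)log2(drift)
		     - 2 * (int)log2(latency) - (int)log2(jitter);
	}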
2.6.24-rc6 oops in net_tx_action
Kernel is 2.6.24-rc6 + linuxpps patches, which are all to the serial port driver. 2.6.23 was known stable. I haven't tested earlier 2.6.24 releases. I think it happened once before; I got a black-screen lockup with keyboard LEDs blinking, but that was with X running so I couldn't see a console oops. But given that I installed 2.6.24-rc6 about 24 hours ago, that's a disturbing pattern. (N.B. I was pretty careful, but the following was transcribed by hand.) BUG: unable to handle kernel paging request at virtual address 00100104 printing eip: b02b3d6a *pde= Oops: 0002 [#1] Pid 3162, comm: ntop Not tainted (2.6.24-rc6 #36) EIP: 0060[] EFLAGS: 00210046 CPU: 0 EIP is at net_tx_action+0x8b/0xec EAX: 00100100 EBX: efa63924 ECX: 0801fbff EDX: 00200200 ESI: 0010 EDI: 0010 EBP: 012c ESP: b0444fc8 DS: 007b ES: 007b FS: GS: 0033 SS: 0068 Process ntop (pid: 3162, ti=b0444000 task=e9122f90 task.ti=e92ec000) Stack: 000a b02b3a84 b044007b 000a7ac5 0001 b0457a44 0009 b0118016 e92ecf74 e92ec000 00200046 b0103c3c Call Trace: [] net_tx_action+0x5a/0xa8 [] __do_softirq+0x35/0x75 [] do_softirq+0x3e/0x8f [] do_gettimeofday+0x2c/0xc6 [] handle_level_irq+0x0/0x8d [] irq_exit+0x29/0x58 [] do_IRQ+0xaf/0xc2 [] sys_gettimeofday+0x27/0x53 [] common_interrupt+0x23/0x28 === Code: 24 04 ec 61 3d b0 c7 04 24 87 01 3a b0 e8 ad 10 e6 ff e8 44 fd e4 ff c7 05 1c b9 46 b0 01 00 00 00 fa 39 fe 75 20 8b 03 8b 53 04 <89> 50 04 89 02 a1 fc b8 46 b0 c7 03 f8 b8 46 b0 89 1d fc b8 46 EIP: [] at net_tx_action+0x8b/0xec SS:ESP 0068:b0444fc8 Kernel panic - not syncing: Fatal exception in interrupt Network config is a little complex; there are 5 physical network interfaces and a bunch of netfilter rules. A quad-port 100baseT Tulip card which provides "outside-facing" interfaces (two uplinks, a DMZ, and a spare), and a gigabit VIA velocity card for the internal network. The hardware has ECC memory (1 GB, kernel starts at 2.75G) and mirrored drives, and has generally been very stable for a long time, modulo some disk hiccups. $ lspci 00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03) 00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03) 00:04.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02) 00:04.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01) 00:04.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01) 00:04.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02) 00:09.0 Ethernet controller: VIA Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11) 00:0a.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 01) 00:0b.0 Mass storage controller: Promise Technology, Inc. PDC20268 (Ultra100 TX2) (rev 01) 00:0c.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02) 00:0d.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03) 01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15) 02:04.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:05.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:06.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) 02:07.0 Ethernet controller: Digital Equipment Corporation DECchip 21142/43 (rev 41) The tulip drivers have been solid forever. The VIA velocity driver is more suspect; I made an effort a while ago to get tagged VLANs working on it, which was a notable failure. 
Still, this oops is in core network code. As you might guess, it makes people somewhat grumpy when the main firewall/router takes a dive, but I can experiment after hours. Here's the kernel config: $ grep ^CONFIG /usr/src/linux/.config CONFIG_X86_32=y CONFIG_X86=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_QUICKLIST=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_SUPPORTS_OPROFILE=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_X86_BIOS_REBOOT=y CONFIG_KTIME_SCALAR=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_IKCONFIG=y CONFIG_LOG_BUF_SHIFT=15 CONFIG_FAIR_GROUP_SCHED=y CONFIG_FAIR_USER_SCHED=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y CO
Re: 2.6.24-rc6 oops in net_tx_action
> [EMAIL PROTECTED] <[EMAIL PROTECTED]> : >> Kernel is 2.6.24-rc6 + linuxpps patches, which are all to the serial >> port driver. >> >> 2.6.23 was known stable. I haven't tested earlier 2.6.24 releases. >> I think it happened once before; I got a black-screen lockup with >> keyboard LEDs blinking, but that was with X running so I couldn't see a >> console oops. But given that I installed 2.6.24-rc6 about 24 hours ago, >> that's a disturbing pattern. > It is probably this one: > > http://marc.info/?t=11978279403&r=1&w=2 Thanks! I got the patch from http://marc.info/?l=linux-netdev&m=119756785219214 (Which didn't make it into -rc7; please fix!) and am recompiling now. Actually, I grabbed the hardware mitigation followon patch while I was at it. I notice that the comment explaining the format of CSR11 and what 0x80F1 means got lost; perhaps it would be nice to resurrect it? 0x80F1 8000 = Cycle size (timer control) 7800 = TX timer in 16 * Cycle size 0700 = No. pkts before Int. (0 = interrupt per packet) 00F0 = Rx timer in Cycle size 000E = No. pkts before Int. 0001 = Continues mode (CM) (Boy, that tulip driver could use a whitespace overhaul.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc6 oops in net_tx_action
>> Thanks! I got the patch from >> http://marc.info/?l=linux-netdev&m=119756785219214 >> (Which didn't make it into -rc7; please fix!) >> and am recompiling now. > Jeff is busy so he's asked me to pick up the more important > driver bug fixes that get posted. > > I'll push this around, thanks. Much obliged. It's only 11 hours of uptime, but no problems so far, even trying abusive things like "ping -f -l64 -s8000". -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFT] Port 0x80 I/O speed
Here are a variety of machines: 600 MHz PIII (Katmai), 440BX chipset, 82371AB/EB/MB PIIX4 ISA bridge: cycles: out 794, in 348 cycles: out 791, in 348 cycles: out 791, in 349 cycles: out 791, in 348 cycles: out 791, in 348 433 MHz Celeron (Mendocino), 440 BX chipset, same ISA bridge: cycles: out 624, in 297 cycles: out 623, in 296 cycles: out 624, in 297 cycles: out 623, in 297 cycles: out 623, in 296 1100 MHz Athlon, nForce2 chipset, nForce2 ISA bridge: cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 cycles: out 1295, in 1162 800 MHz Transmeta Crusoe TM5800, Transmeta/ALi M7101 chipset. cycles: out 1212, in 388 cycles: out 1195, in 375 cycles: out 1197, in 377 cycles: out 1196, in 376 cycles: out 1196, in 377 2200 MHz Athlon 64, K8T890 chipset, VT8237 ISA bridge: cycles: out 1844674407370814, in 1844674407365758 cycles: out 1844674407370813, in 1844674407365756 cycles: out 1844674407370805, in 1844674407365750 cycles: out 1844674407370813, in 1844674407365755 cycles: out 1844674407370814, in 1844674407365756 Um, huh? That's gcc 4.2.3 (Debian version 4.2.2-4), -O2. Very odd. I can run it with -O0: cycles: out 4894, in 4894 cycles: out 4905, in 4917 cycles: out 4910, in 4896 cycles: out 4909, in 4896 cycles: out 4894, in 4898 cycles: out 4911, in 4898 or with -O2 -m32: cycles: out 4914, in 4927 cycles: out 4913, in 4927 cycles: out 4913, in 4913 cycles: out 4914, in 4913 cycles: out 4913, in 4929 cycles: out 4912, in 4912 cycles: out 4913, in 4915 With -O2, the cycle counts come out (before division) as out: 0xFFEA6F4F in: 0xFCE68BB6 I think the "A" constraint doesn't work quite the same in 64-bit code. The compiler seems to be using %rdx rather than %edx:%eax. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
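For comparison, here is a TSC read that sidesteps the "A" constraint entirely and behaves the same with -m32 and -m64. This is a sketch of the suspected fix, not the RFT test program itself.

	#include <stdint.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;

		/* Ask for %eax and %edx explicitly instead of using "=A". */
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}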
SCHED_FIFO & system()
Hello, I have some strange behavior in one of my systems. I have a real-time kernel thread under SCHED_FIFO which runs every 10ms. It blocks on a semaphore and is released by a timer interrupt every 10ms. Generally this works really well. However, there is a module in the system that makes a system() call from C code in user space -- system("run_my_script") -- which runs a bash script. Regardless of what the actual script looks like, the real-time kernel thread does not get scheduled for the 80ms it takes the system() call to finish. Running an LTT session, I can see that the wake_up event occurs for the real-time thread 10ms into the system() call, but the real-time kernel thread nevertheless does not get scheduled. The thread that calls system("run_my_script") is configured as SCHED_OTHER. The kernel is 2.6.21. Has anybody seen this or a similar situation? Cheers // Matias -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: /dev/urandom uses uninit bytes, leaks user data
>> There is a path that goes from user data into the pool. This path >> is subject to manipulation by an attacker, for both reading and >> writing. Are you going to guarantee that in five years nobody >> will discover a way to take advantage of it? Five years ago >> there were no public attacks against MD5 except brute force; >> now MD5 is on the "weak" list. > Yep, I'm confident about making such a guarantee. Very confident. For the writing side, there's a far easier way to inject potentially hostile data into the /dev/random pool: "echo evil intentions > /dev/random". This is allowed because it's a very specific design goal that an attacker cannot improve their knowledge of the state of the pool by feeding in chosen text. Which in turn allows /dev/random to get potential entropy from lots of sources without worrying about how good they are. It tries to account for entropy it's sure of, but it actually imports far more - it just doesn't know how much more. One of those "allowed, but uncredited" sources is whatever you want to write to /dev/random. So you can, if you like, get seed material using wget -t1 -q --no-cache -O /dev/random 'http://www.fourmilab.ch/cgi-bin/Hotbits?fmt=bin&nbytes=32' 'http://www.random.org/cgi-bin/randbyte?nbytes=32&format=f' 'http://www.randomnumbers.info/cgibin/wqrng.cgi?limit=255&amount=32' 'http://www.lavarnd.org/cgi-bin/randdist.cgi?pick_num=16&max_num=65536' I don't trust them, but IF the data is actually random, and IF it's not observed in transit, then that's four nice 256-bit random seeds. (Note: if you actually use the above, be very careful not to abuse these free services by doing it too often. Also, the latter two actually return whole HTML pages with the numbers included in ASCII. If anyone knows how to just download raw binary, please share.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
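The same uncredited feeding can be done from C: a plain write(2) mixes the bytes into the pool without crediting any entropy (crediting requires the root-only RNDADDENTROPY ioctl). A minimal sketch:

	#include <fcntl.h>
	#include <unistd.h>

	/* Mix externally obtained bytes into the pool, uncredited. */
	static int feed_random(const void *buf, size_t len)
	{
		int fd = open("/dev/random", O_WRONLY);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = write(fd, buf, len);	/* mixed in; entropy estimate unchanged */
		close(fd);
		return n == (ssize_t)len ? 0 : -1;
	}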
Re: RFC: permit link(2) to work across --bind mounts ?
> Why does link(2) not support hard-linking across bind mount points > of the same underlying filesystem ? Whenever we get mount -r --bind working properly (which I use to place copies of necessary shared libraries inside chroot jails while allowing page cache sharing), this feature would break security.

        mkdir /usr/lib/libs.jail
        for i in $LIST_OF_LIBRARIES; do
                ln /usr/lib/$i /usr/lib/libs.jail/$i
        done
        mount -r /usr/lib/libs.jail /jail/lib
        chown prisoner /usr/log/jail
        mount /usr/log/jail /jail/usr/log
        chrootuid /jail prisoner /bin/untrusted &

Although the protections should be enough, I'd rather avoid having the prisoner link /jail/lib/libfoo.so (write returns EROFS) to /jail/usr/log where it's potentially writeable. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> could you please also react to this feedback: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=110698371131630&w=2 > > to quote a couple of key points from that very detailed security > analysis: > > " I'm not sure how the OpenBSD code is better in any way. (Notice that > it uses the same "half_md4_transform" as Linux; you just added another > copy.) Is there a design note on how the design was chosen? " Just note that, in addition to the security aspects, there are also a whole set of multiprocessor issues. OpenBSD added SMP support in June 2004, and it looks like this code dates back to before that. It might be worth looking at what OpenBSD does now. Note that I have NOT looked at the patch other than the TCP ISN generation. However, given the condition of the ISN code, I am inclined to take a "guilty until proven innocent" view of the rest of it. Don't merge it until someone has really grokked it, not just kibitzed about code style issues. (The homebrew 15-bit block cipher in this code does show how much the world needs a small block cipher for some of these applications.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Linux Kernel Subversion Howto
I have really grown bored of this thread. Can you all ask yourselves one thing? If someone started reading the Linux kernel sources right now, would they be able to understand every aspect and part of the code? Do you understand every aspect? Is it still "opensource", or is it starting to become a "closedsource" software "product", despite the fact that it is still free to the community? Don't say again that the source is there and all you have to do is read it. Someone on this list (very popular) said some years ago that even if Micro$oft gave out its source, no one would be able to make changes, and some of us would never be able to understand the product. So, in the spirit of that idea, is Linux still an "opensource" idea, or a "closed" one delivered by those who already have the information needed to maintain it and who give it to us freely? I have been developing on Linux for only 4 years now, and I tried to make some changes in the kernel so that it would conform to my needs... I have to say that it was very, very time consuming: no doc, no comments, no explanation. I think that a lot of us have the same question: "Opensource" || "Closedsource"? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> [EMAIL PROTECTED] writes: >> (The homebrew 15-bit block cipher in this code does show how much the >> world needs a small block cipher for some of these applications.) > > Doesn't TEA fill this niche? It's certainly used for this in the Linux > kernel, e.g. in reiserfs (although I have my doubts it is really useful > there) Sorry; ambiguous parsing. I meant "(small block) cipher", not "small (block cipher)". TEA is intended for the latter niche. What I meant was a cipher that could encrypt blocks smaller than 64 bits. It's easy to make a smaller hash by just throwing bits away, but a block cipher is a permutation, and has to be invertible. For example, if I take a k-bit counter and encrypt it with a k-bit block cipher, the output is guaranteed not to repeat in less than 2^k steps, but the value after a given value is hard to predict. There is a well-known technique for reducing the block size of a cipher by a small factor, such as from a power of 2 to a prime number slightly lower. That is:

unsigned encrypt_mod_n(unsigned x, unsigned n)
{
        assert(x < n);
        do {
                x = encrypt(x);
        } while (x >= n);
        return x;
}

It takes a bit of thinking to realize why this creates a bijection from [0..n-1] -> [0..n-1], but it's kind of a neat "aha!" when it does. Remember, encrypt() is a bijection from [0..N-1] -> [0..N-1] for some N >= n. Typically N = 2^k for some k. However, this technique requires N/n calls to encrypt(). I.e. n calls to encrypt_mod_n() will cause N calls to encrypt(). It's generally considered practical up to N/n = 2, so we can encrypt modulo any modulus n if we have encrypt() functions for any N = 2^k a power of 2. I.e. a k-bit block cipher. For example, suppose we want to encrypt 7-digit North American telephone numbers. These are of the form NXX-XXXX, where N is a digit other than 0 or 1, and X is any digit. There are 8e6 possibilities. Using this scheme and a 23-bit block cipher, we can encrypt them to different valid 7-digit telephone numbers. Likewise, 10-digit numbers with area codes, +1 NXX NXX-XXXX (but not starting with N11), are also possible. There are 792 area codes and 8e6 numbers for a total of 6,336,000,000 < 2^33 combinations. This sort of thing is very useful for adding encryption to protocols and file formats not designed for it. However, the standard literature is notably lacking in block ciphers in funny block sizes. There was one AES submission (The Hasty Pudding Cipher, http://www.cs.arizona.edu/~rcs/hpc/) that supported variable block sizes, but it was eliminated fairly early. To start with, consider very small blocks: 1, 2 or 3 bits. There are only two possible things encrypt() can do with a 1-bit value: either invert it or leave it alone. There are 4! = 24 possible 2-bit encryption operations. Ideally, the key should specify them all with equal probability, but 24 does not evenly divide the (power of 2 sized) keyspace. It is interesting to look at how uniformly the possibilities are covered. It's fun to consider a Feistel network, dividing the plaintext into 1-bit L and R values, and alternating L ^= f(R), R ^= f(L) for (not necessarily invertible) round functions f. Since there are only 4 possible 1-bit functions (1, 0, x and !x), you can consider each round to have an independent 2-bit round subkey and see how the cipher's uniformity develops as you increase the number of rounds and the key length to go with it. There are 8! = 40320 3-bit encryption operations. Again, all should be covered uniformly. An odd number of bits makes a Feistel design more challenging. 
But if you don't allow odd numbers of bits, you have to push the shrinking technique to N/n = 4, which starts to get unpleasant. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
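If anyone wants to play with the 1+1-bit Feistel exercise suggested above, here is a standalone brute-force sketch (my own toy, nothing to do with the patch): it enumerates every r-round key, treats each 2-bit subkey as selecting one of the four 1-bit round functions, and reports how evenly the 24 possible permutations are covered.

	#include <stdio.h>

	/* The four 1-bit round functions: k=0 -> 0, k=1 -> 1, k=2 -> x, k=3 -> !x */
	static unsigned f(unsigned k, unsigned x)
	{
		return ((k & 2) ? x : 0) ^ (k & 1);
	}

	int main(void)
	{
		int rounds;

		for (rounds = 1; rounds <= 8; rounds++) {
			unsigned count[256] = { 0 };	/* map encoded as 2 bits per plaintext */
			unsigned long nkeys = 1UL << (2 * rounds);
			unsigned long key;
			unsigned perms = 0, min = ~0u, max = 0, i;

			for (key = 0; key < nkeys; key++) {
				unsigned code = 0, x;

				for (x = 0; x < 4; x++) {
					unsigned l = x & 1, r = x >> 1;
					unsigned long k = key;
					int j;

					/* Alternate L ^= f(R), R ^= f(L), one 2-bit subkey per round */
					for (j = 0; j < rounds; j++, k >>= 2) {
						if (j & 1)
							r ^= f(k & 3, l);
						else
							l ^= f(k & 3, r);
					}
					code |= (l | r << 1) << (2 * x);
				}
				count[code]++;
			}
			for (i = 0; i < 256; i++) {
				if (!count[i])
					continue;
				perms++;
				if (count[i] < min)
					min = count[i];
				if (count[i] > max)
					max = count[i];
			}
			printf("%d rounds: %2u of 24 permutations, each hit by %u to %u keys\n",
			       rounds, perms, min, max);
		}
		return 0;
	}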
Re: [PATCH] OpenBSD Networking-related randomization port
linux> It's easy to make a smaller hash by just throwing bits away, linux> but a block cipher is a permutation, and has to be linux> invertible. linux> For example, if I take a k-bit counter and encrypt it with linux> a k-bit block cipher, the output is guaranteed not to linux> repeat in less than 2^k steps, but the value after a given linux> value is hard to predict. > Huh? What if my cipher consists of XOR-ing with a k-bit pattern? > That's a permutation on the set of k-bit blocks but it happens to > decompose as a product of (non-overlapping) swaps. > > In general for more realistic block ciphers like DES it seems > extremely unlikely that the cipher has only a single orbit when viewed > as a permutation. I would expect a real block cipher to behave more > like a random permutation, which means that the expected number of > orbits for a k-bit cipher should be about ln(2^k) or roughly .7 * k. I think you misunderstand; your comments don't seem to make sense unless I assume you're imagining output feedback mode:

        x[0] = encrypt(IV)
        x[1] = encrypt(x[0])
        x[2] = encrypt(x[1])
        etc.

Obviously, this pattern will repeat after some unpredictable interval. (However, owing to the invertibility of encryption, looping can be easily detected by noticing that x[i] = IV.) But I was talking about counter mode:

        x[0] = encrypt(0)
        x[1] = encrypt(1)
        x[2] = encrypt(2)
        etc.

It should be obvious that this will not repeat until the counter overflows k bits and you try to compute encrypt(2^k) = encrypt(0). One easy way to generate unpredictable 16-bit port numbers that don't repeat too fast is:

        highbit = 0;
        for (;;) {
                generate_random_encryption_key(key);
                for (i = 0; i < 2; i++)
                        use(highbit | encrypt15(i, key));
                highbit ^= 0x8000;
        }

Note that this does NOT use all 32K values before switching to another key; if that were the case, an attacker who kept a big bitmap of previously seen values could predict the last few values based on knowing what hadn't been seen already. Of course, you can always wrap a layer of Knuth's Algorithm B (randomization by shuffling) around anything:

#include "basic_rng.h"

#define SHUFFLE_SIZE 32 /* Power of 2 is more efficient */

struct better_rng_state {
        struct basic_rng_state basic;
        unsigned y;
        unsigned z[SHUFFLE_SIZE];
};

void better_rng_seed(struct better_rng_state *state, unsigned seed)
{
        unsigned i;

        basic_rng_seed(&state->basic, seed);
        for (i = 0; i < SHUFFLE_SIZE; i++)
                state->z[i] = basic_rng(&state->basic);
        state->y = basic_rng(&state->basic) % SHUFFLE_SIZE;
}

unsigned better_rng(struct better_rng_state *state)
{
        unsigned x = state->z[state->y];

        state->y = (state->z[state->y] = basic_rng(&state->basic)) % SHUFFLE_SIZE;
        return x;
}

(You can reduce code size by reducing modulo SHUFFLE_SIZE when you use state->y rather than when storing into it, but I have done it the other way to make clear exactly how much "effective" state is stored. You can also just initialize state->y to a fixed value.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] OpenBSD Networking-related randomization port
> It adds support for advanced networking-related randomization, in > concrete it adds support for TCP ISNs randomization Er... did you read the existing Linux TCP ISN generation code? Which is quite thoroughly randomized already? I'm not sure how the OpenBSD code is better in any way. (Notice that it uses the same "half_md4_transform" as Linux; you just added another copy.) Is there a design note on how the design was chosen? I don't wish to be *too* discouraging to someone who's *trying* to help, but could you *please* check a little more carefully in future to make sire it's actually an improvement? I fear there's some ignorance of what the TCP ISN does, why it's chosen the way it is, and what the current Linux algorithm is designed to do. So here's a summary of what's going on. But even as a summary, it's pretty long... First, a little background on the selection of the TCP ISN... TCP is designed to work in an environment where packets are delayed. If a packet is delayed enough, TCP will retransmit it. If one of the copies floats around the Internet for long enough and then arrives long after it is expected, this is a "delayed duplicate". TCP connections are between (host, port, host port) quadruples, and packets that don't match some "current connection" in all four fields will have no effect on the current connection. This is why systems try to avoid re-using source port numbers when making connections to well-known destination ports. However, sometimes the source port number is explicitly specified and must be reused. The problem then arises, how do we avoid having any possible delayed packets from the previous use of this address pair show up during the current connection and confuse the heck out of things by acknowledging data that was never received, or shutting down a connection that's supposed to stay open, or something like that? First of all, protocols assume a maximum packet lifetime in the Internet. The "Maximum Segment Lifetime" was originally specified as 120 seconds, but many implementations optimize this to 60 or 30 seconds. The longest time that a response can be delayed is 2*MSL - one delay for the packet eliciting the response, and another for the response. In truth, there are few really-hard guarantees on how long a packet can be delayed. IP does have a TTL field, and a requirement that a packet's TTL field be decremented for each hop between routers *or each second of delay within a router*, but that latter portion isn't widely implemented. Still, it is an identified design goal, and is pretty reliable in practice. The solution is twofold: First, refuse to accept packets whose acks aren't in the current transmission window. That is, if the last ack I got was for byte 1000, and I have sent 1100 bytes (numbers 0 through 1099), then if the incoming packet's ack isn't somewhere between 1000 and 1100, it's not relevant. If it's 950, it might be an old ack from the current connection (which doesn't include anything interesting), but in any case it can be safely ignored, and should be. The only remaining issue is, how to choose the first sequence number to use in a connection, the Initial Sequence Number (ISN)? If you start every connection at zero, then you have the risk that packets from an old connection between the same endpoints will show up at a bad time, with in-range sequence numbers, and confuse the current connection. So what you do is, start at a sequence number higher than the last one used in the old connection. Then there can't be any confusion. 
But this requires remembering the last sequence number used on every connection ever. And there are at least 2^48 addresses allowed to connect to each port on the local machine. At 4 bytes per sequence number, that's a Petabyte of storage... Well, first of all, after 2*MSL, you can forget about it and use whatever sequence number you like, because you know that there won't be any old packets floating around to crash the party. But still, it can be quite a burden on a busy web server. And you might crash and lose all your notes. Do you want to have to wait 2*MSL before rebooting? So the TCP designers (I'm not on page 27 of RFC 793, if you want to follow along) specified a time of day based ISN. If you use a clock to generate an ISN which counts up faster than your network connection can send data (and thus crank up its sequence numbers), you can be sure that your ISN is always higher than the last one used by an old connection without having to remember it explicitly. RFC 793 specifies a 250,000 bytes/second counting rate. Most implementations since Ethernet used a 1,000,000 byte/second counting rate, which matches the capabilities of 10base5 and 10base2 quite well, and is easy to get from the gettimeofday() call. Note that there are
Re: Patch 4/6 randomize the stack pointer
> Why not compromise, if possible? 256M of randomization, but move the > split up to 3.5/0.5 gig, if possible. I seem to recall seeing an option > (though I think it was UML) to do 3.5/0.5 before; and I'm used to "a > little worse" meaning "microbenches say it's worse, but you won't notice > it," so perhaps this would be a good compromise. How well tuned can > 3G/1G be? Come on, 1G is just a big friggin' even number. Ah, grasshopper, so much you have to learn... In particular, prople these days are more likely to want to move the split DOWN rather than UP. First point: it is important that the split happens at an "even" boundary for the highest-level page table. This makes it as simple as possible to copy the shared global pages into each process' page tables. On typical x86, each table is 1024 entries long, so the top table maps 4G/1024 = 4M sections. However, with PAE (Physical Address Extensions), a 32-bit page table entry is no longer enough to hold the 36-bit physical address. Instead, the entries are 64 bits long, so only 512 fit into a page. With a 4K page and 18 more bits from page tables, two levels will map only 30 bits of the 32-bit virtual address space. So Intel added a small, 4-entry third-level page table. With PAE, you are indeed limited to 1G boundaries. (Unless you want to seriously overhaul mm setup and teardown.) Secondly, remember that, unless you want to pay a performance penalty for enabling one of the highmem options, you have to fit ALL of physical memory, PLUS memory-mapped IO (say around 128M) into the kernel's portion of the address space. 512M of kernel space isn't enough unless you have less than 512M (like 384M) of memory to keep track of. That is getting less common, *especially* on servers. (Which are presumably an important target audience for buffer overflow defenses.) Indeed, if you have a lot of RAM and you don't have a big database that needs tons of virtual address space, it's usually worth moving the split DOWN. Now, what about the case where you have gobs of RAM and need a highmem option anyway? Well, there's a limit to what you can use high mem for. Application memory and page cache, yes. Kernel data structures, no. You can't use it for dcache or inodes or network sockets or page tables or the struct page array. And the mem_map array of struct pages (on x86, it's 32 bytes per page, or 1/128 of physical memory; 32M for a 4G machine) is a fixed overhead that's subtracted before you even start. Fully-populated 64G x86 machines need 512M of mem_map, and the remaining space isn't enough to really run well in. If you crunch kernel lowmem too tightly, that becomes the performance-limiting resource. Anyway, the split between user and kernel address space is mandated by: - Kernel space wants to be as bit as physical RAM if possible, or not more than 10x smaller if not. - User space really depends on the application, but larger than 2-3x physical memory is pointless, as trying to actually use it all will swap you to death. So for 1G of physical RAM, 3G:1G is pretty close to perfect. It was NOT pulled out of a hat. Depending on your applications, you may be able to get away with a smaller user virtual address space, which could allow you to work with more RAM without needing to slow the kernel with highmem code. You'll find another discussion of the issues at http://kerneltrap.org/node/2450 http://lwn.net/Articles/75174/ Finally, could I suggest a little more humility when addressing the assembled linux-kernel developers? 
I've seen Linus have to eat his words a time or two, and I know I can't do as well. http://marc.theaimsgroup.com/?m=91723854823435 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Copyright / licensing question
I'll respond in terms of U.S. law; if you want something else, please mention it. You might find a lot of useful information at http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter9/index.html http://www.usg.edu/admin/legal/copyright/#part3d3a http://en.wikipedia.org/wiki/Fair_use ttp://www.nolo.com/lawcenter/ency/article.cfm/ObjectID/C3E49F67-1AA3-4293-9312FE5C119B5806/catID/2EB060FE-5A4B-4D81-883B0E540CC4CB1E > 1. For explaining the internals of a filesystem in detail, I need to >take their code from kernel sources 'as it is' in the book. Do I need >to take any permissions from the owner/maintainer regarding this ? >Will it violate any license if reproduce the driver source code in >my book ?? This is exactly the sort of "Comment and criticism" that is anticipated and covered by the fair use exemption. In judging whether the use is fair, 17 USC 107 says: # § 107. Limitations on exclusive rights: Fair use # # Release date: 2004-04-30 # # Notwithstanding the provisions of sections 106 and 106A, the fair use # of a copyrighted work, including such use by reproduction in copies # or phonorecords or by any other means specified by that section, for # purposes such as criticism, comment, news reporting, teaching (including # multiple copies for classroom use), scholarship, or research, is not # an infringement of copyright. In determining whether the use made of a # work in any particular case is a fair use the factors to be considered # shall include: # (1) the purpose and character of the use, including whether such use # is of a commercial nature or is for nonprofit educational purposes; # (2) the nature of the copyrighted work; # (3) the amount and substantiality of the portion used in relation to # the copyrighted work as a whole; and # (4) the effect of the use upon the potential market for or value of * the copyrighted work. # The fact that a work is unpublished shall not itself bar a finding # of fair use if such finding is made upon consideration of all the above # factors. Going through those in your case, they are: 1. The Transformative Factor: The Purpose and Character of Your Use It's commercial use, but the non-commercial exemptions are a relatively recent addition to copyright law. The original, classic "fair use" is commentary and criticism. I.e. are you adding something to the quoted material? Have you added new information or insights? This is one of the most important factors, and in your case, assuming the book is worth anything at all, the answer is clearly "yes". On this ground alone, you're probably safe. 2. The Nature of the Copyrighted Work Scope for fair use is broader for published than unpublished works (because the potential future value of an unpublished work is affected more by copious excerpting), and broader for factual works than fiction (because facts and ideas cannot be copyrighted, so it takes more quoting to include a threshold amount of copyrightable "expression"). The Linux kernel is clearly "published", and while the second part is a little fuzzy (and I'm not eager enough to chase it back to original case law), I think the functional nature of software places it in the "factual" category. 3. The Amount and Substantiality of the Portion Taken Your publisher won't let you waste enough paper to print a huge fraction of the Linux kernel. Yes, it may be a lot of code, but it's not going to be "most" by a long shot. In general the standard is that "no more was taken than was necessary" to achieve the purpose for which the copying was done. 
I think you'll do this anyway, and the law doesn't require you to be super anal about eliding every snippet and #define that's not directly referenced. The Lions book, in contrast, included most of 6th edition Unix, leading to the need for negotiations. Also, the 6th edition wasn't published, leading to problems with the previous factor. The legally fuzzy issue is what constitutes a "work" here. The function? The source file? The tarball? I'd have to look for a case involving copying of entire entries from an encyclopedia or dictionary to get it fully untangled. However, you're helped here by the GPL, which can be used to show the original author's intentions. It defines the "work" as an entire program that compiles to an executable that does something. As long as your excerpts don't compile to a working kernel, you're pretty safe. 4. The Effect of the Use Upon the Potential Market Will it hurt the copyright owner? This is typically expressed in terms of income, which doesn't apply very much. But your intent is clearly to *add* value to the Linux kernel, so this factor militates in your favor. > 2. I will write some custom drivers also for illustration. F
Re: [PATCH] OpenBSD Networking-related randomization port
*Sigh*. This thread is heading into the weeds. I have things I should be doing instead, but since nobody seems to actually be looking at what the patch *does*, I guess I'll have to dig into it a bit more... Yes, licensing issues need to be resolved before a patch can go in. Yes, code style standards needs to be kept up. And yes, SMP-locking issues need to be looked at. (And yes, ipv6 needs to be looked at, too!) But before getting sidetracked into the fine details, could folks please take a step back from the trees and look at the forest? Several people have asked (especially when the first patch came out), but I haven't seen any answers to the Big Questions: 1) Does this patch improve Linux's networking behaviour in any way? 2) Are the techniques in this patch a good way to achieve those improvements? Let's look at the various parts of the change: - Increases the default random pool size. Opinion: whatever. No real cost, except memory. Increases the maximum amount that can be read from /dev/random without blocking. Note that this is already adjustable at run time, so the question is why put it in the kernel config. If you want this, I'd suggest instead an option under CONFIG_EMBEDDED to shrink the pools and possibly get rid of the run-time changing code, then you could increase the default with less concern. - Changes the TCP ISN generation algorithm. I have't seen any good side to this. The current algorithm can be used for OS fingerprinting based on starting two TCP connections from different sources (ports or IPs) and noticing that the ISNs only differ in the low 24 bits, but is that a serious issue? If it is, there are better ways to deal with it that still preserve the valuable timer property. I point out that the entire reason for the cryptographically marginal half_md4_transform oprtation was that a full MD5 was a very noticeable performance bottleneck; the hash was only justified by the significant real-world attacks. obsd_get_random uses two calls to half_md4_transform. Which is the same cost as a full MD4 call. Frankly, they could just change half_md4_transform to return 64 bits instead of 32 and make do with one call. - Changes to the IP ID generation algorithm. All it actually does is change the way the initial inet->id is initialized for the inet_opt structure associated with the TCP socket. And if you look at ip_output.c:ip_push_pending_frames(), you'll see that, if DF is set (as is usual for a TCP connection), iph->id (the actual IP header ID) is set to htons(inet->id++). So it's still an incrementing sequence. This is in fact (see the comment in ip.h:ip_select_ident()) a workaround for a Microsoft VJ compression bug. The fix was added in 2.4.4 (via DaveM's zerocopy-beta-3 patch); before that, Linux 2.4 sent a constant zero as the IP ID of DF packets. See discussion at http://www.postel.org/pipermail/end2end-interest/2001-May/thread.html http://tcp-impl.lerc.nasa.gov/tcp-impl/list/archive/2378.html I'm not finding the diagnosis of the problem. I saw one report at http://oss.sgi.com/projects/netdev/archive/2001-01/msg6.html and Dave Miller is pretty much on top of it when he posts http://marc.theaimsgroup.com/?l=linux-kernel&m=98275316400452&w=2 but I haven't found the actual debugging leading to the conclusion. 
This also led to some discussion of the OpenBSD IP ID algorithm that I haven't fully waded through at http://mail-index.netbsd.org/tech-net/2003/11/ If the packet is fragmentable (the only time the IP ID is really needed by the receiver), it's done by route.c:__ip_select_ident(). Wherein the system uses inet_getid to assign p->ip_id_count++ based on the route cache's struct inet_peer *p. (If the route cache is OOM, the system falls back on random IP ID assignment.) This latter technique nicely prevents the sort of stealth port scanning that was mentioned earlier in this thread, and prevents a person at address A from guessing the IP ID range I'm using to talk to address B. So note that the boast about "Randomized IP IDs" in the grsecurity description at http://www.gentoo.org/proj/en/hardened/grsecurity.xml is, as far as I can tell from a quick look at the code, simply false. As for the algorithm itself, it's described at http://www.usenix.org/events/usenix99/full_papers/deraadt/deraadt_html/node18.html but it's not obvious to me that it'd be hard to cryptanalyze given a stream of consecutive IDs. You need to recover: - The n value for each inter-ID gap, - The LCRNG state ru_a, ru_x, ru_b, - The 15-bit XOR masks ru_seed and ru_seed2, and - The discrete log generator ru_j (ru_g = 2^ru_j mod RU_N). Which is actually just a multiplier (mod RU_N-1 = 32748) on th
Re: thoughts on kernel security issues
I followed the start of this thread when it was about security mailing lists and bug-disclosure rules, and then lost interest. I just looked in again, and I seem to be seeing discussion of merging grsecurity patches into mainline. I haven't yet found a message where this is proposed explicitly, so if I am inferring incorrectly, I apologize. (And you can ignore the rest of this missive.)

However, I did look carefully at an earlier patch that claimed to be a Linux port of some OpenBSD networking randomization code, ostensibly to make packet-guessing attacks more difficult. http://marc.theaimsgroup.com/?l=linux-kernel&m=110693283511865 It was further claimed that this code came via grsecurity. I did verify that the code looked a lot like pieces of OpenBSD, but didn't look at grsecurity at all. However, I did look in some detail at the code itself. http://marc.theaimsgroup.com/?l=linux-netdev&m=110736479712671

What I concluded was that it was broken beyond belief, and the effect on the networking code varied from (due to putting the IP ID generation code in the wrong place) wasting a lot of time randomizing a number that could be a constant zero if not for working around a bug in Microsoft's PPP stack, to (RPC XID generation) severe protocol violation. Not to mention race conditions out the wazoo due to porting single-threaded code. After careful review, I couldn't find a single redeeming feature, or even a good idea that was merely implemented badly. See the posting for details and more colorful criticism.

Now, as I said, I have *not* gone to the trouble of seeing if this patch really did come from grsecurity, or if it was horribly damaged in the process of splitting it out. So I may be unfairly blaming grsecurity, but I didn't feel like seeking out more horrible code to torture my sanity with. My personal, judgemental opinion was that if that was typical of grsecurity, it's a festering pile of pus that I'm not going to let anywhere near my kernel, thank you very much. But to the extent that this excerpt constitutes reasonable grounds for suspicion, I would like to recommend a particularly careful review of any grsecurity patches, in addition to Linus' dislike of monolithic patches.

Just my $0.02. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
What's the status of kernel PNP?
I just noticed that 2.4.6-ac1 parport won't compile (well, link) without the kernel PnP stuff configured. So I tried turning it on. It prints a line saying that it found my modem at boot time, but doesn't actually configure it, so I have to run isapnp anyway if I want to use it. Okay, RTFM time... Documentation/isapnp.txt doesn't say anything about boot time (only /proc/isapnp usage after boot and some function call interfaces for kernel programming that are hard to follow). kernel-parameters.txt gives a hint, although it required reading the source code to figure out what to pass as "isapnp=" to turn verbose up. A lot of google searching comes up with a lot of stale data, but the only 2.4-relevant kernel ISAPNP howto is written in Japanese. Lots of stuff describes it as a feature in the 2.4 kernels, but I can't find anything on how to use it.

MAINTAINERS claims that it's maintained, but the web page is down (the whole site has moved, and /~pnp doesn't exist on the new site) and the only mailing list archives I can find for pnp-devel (at geocrawler) don't have any updates since the year 2000 - and those are all spam. I'm a little suspicious of that maintained status, although I haven't written the maintainer yet. But the upshot of all of this is that I can't figure out WTF to do with this "feature", since I haven't noticed it actually doing anything except taking up kernel memory.

Another machine has an ISA PCMCIA adapter, which works with isapnp and David Hinds' PCMCIA package, but if I try to use the 2.4 cardbus code, it fails to probe the PCMCIA adapter, apparently because the PnP code again didn't set it up. (And there's no obvious way to force a re-probe after boot unless I build the whole thing as a module.) Again, the PnP code cheerfully points out that the PCMCIA adapter exists, but doesn't appear to grasp the concept that I didn't put the adapter into the machine because it looks pretty.

Can someone point me at TFM or some other source of information? I'd be much obliged. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Make pipe data structure be a circular list of pages, rather
ation can read directly (e.g. DMA). - If all else fails, it may be necessary to allocate a bounce buffer accessible to both source and destination and copy the data. Because of the advantages of PCI writes, I assume that having the source initiate the DMA is preferable, so the above are listed in decreasing order of preference.

The obvious thing to me is for there to be a way to extract the following information from the source and destination:

"I am already mapped in the following address spaces"
"I can be mapped into the following address spaces"
"I can read from/write to the following address spaces"

Only the third list is required to be non-empty. Note that various kinds of "low memory" count as separate "address spaces" for this purpose. Then some generic code can look for the cheapest way to get the data from the source to the destination: mapping one and passing the address to the other's read/write routine if possible, or allocating a bounce buffer and using a read followed by a write.

Questions for those who know odd machines and big iron better:

- Are there any cases where a (source,dest) specific routine could do better than this "map one and access it from the other" scheme? I.e. do we need to support the n^2 case?

- How many different kinds of "address spaces" do we need to deal with? Low kernel memory (below 1G), below 4G (32-bit DMA), and high memory are the obvious ones. And below 16M for legacy DMA, I assume. x86 machines have I/O space as well. But what about machines with multiple PCI buses? Can you DMA from one to the other, or is each one an "address space" that can have intra-bus DMA but needs a main-memory bounce buffer to cross buses? Can this be handled by a "PCI bus number" that must match, or are there cases where there are more intricate hierarchies of mutual accessibility?

(A similar case is a copy from a process' address space. Normally, single-copy pipes require a kmap in one of the processes. But if the copying code can notice that source and dest both have the same mm structure, a direct copy becomes possible.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
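P.S. To make the query idea concrete, here is a purely illustrative C sketch; every name in it is invented, and each "address space" is just a bit in a mask. It is only meant to show the shape of the decision the generic code would make, preferring source-initiated writes as argued above.

struct xfer_caps {
        unsigned long mapped_in;   /* already mapped in these address spaces */
        unsigned long mappable_in; /* can be mapped into these address spaces */
        unsigned long can_access;  /* can read/write these address spaces (never empty) */
};

/* Pick the cheapest path between a source and a destination. */
static int choose_path(const struct xfer_caps *src, const struct xfer_caps *dst)
{
        if (src->can_access & dst->mapped_in)
                return 0;       /* source writes straight into the destination */
        if (dst->can_access & src->mapped_in)
                return 1;       /* destination reads straight from the source */
        if (src->can_access & dst->mappable_in)
                return 2;       /* map the destination, then the source writes */
        if (dst->can_access & src->mappable_in)
                return 3;       /* map the source, then the destination reads */
        return 4;               /* give up: bounce buffer both can access */
}

The return value is just a stand-in for "which strategy the generic code picked"; a real implementation would obviously return something more useful than a number.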
mmap tricks and writing to files without reading first
) I'm not sure which of those would be "best" in the sense of minimum overhead. Does anyone have any suggestions? Or a completely different way to zero out a chunk of a file without reading it in? I don't want to actually make a hole in the log file or I'd fragment it and increase the risk of ENOSPC problems. I could create just a single zero page and writev() multiple copies of it, but then I have to worry about the system page size (I'm not sure if the kernel will DTRT and not page in half of an 8K page if I writev() two 4K vectors to it), and it prevents me from using pwrite(). I haven't tracked down the splice() idea that sct mentioned in http://www.ussg.iu.edu/hypermail/linux/kernel/0002.3/0057.html It appears that sendfile() can't be used for the purpose.

Finally, is there a standard semantics for the interaction between mmap() and read()/write()? I have a dim recollection of seeing Linus rant that anything other than making writes via one path instantly available to the other is completely brain-dead, which would make the most sense if some standard somewhere allows weaker synchronization, but I can't seem to find that rant again, and it was a long time ago. I also note that there doesn't seem to be an msync() flag for "make changes visible to read(2) users (i.e. flush them to the buffer cache), but DON'T schedule a disk write yet", which I assume a weaker synchronization model would provide. I also can't find any mention of the possibility of weaker ordering in the descriptions of mmap() I've seen at www.opengroup.org. But it doesn't come right out and clearly require strong ordering, either, and I can just imagine some vendor with a virtually-addressed cache getting creative and saying "show me where it says I can't do that!". The one phrase that concerns me is the caution in SUSv2 that

# The application must ensure correct synchronisation when using mmap()
# in conjunction with any other file access method, such as read() and
# write(), standard input/output, and shmat().
http://www.opengroup.org/onlinepubs/007908799/xsh/mmap.html

Great, but what's "correct"? The part of the semantics I particularly need to be clearly defined is what happens if my application crashes after writing to an mmap buffer but before msync() or munmap(). Thanks for any enlightenment on this somewhat confusing issue! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
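P.S. For concreteness, the "one zero page, many iovecs" idea looks roughly like this. This is an untested user-space sketch with error handling omitted; it assumes the length is a multiple of the page size, and the names are mine, not anything standard.

#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

static void zero_range(int fd, off_t off, size_t len)
{
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        void *zeros = calloc(1, pagesz);        /* the single zero page */
        struct iovec iov[16];
        int i;

        for (i = 0; i < 16; i++) {
                iov[i].iov_base = zeros;        /* every vector is the same page */
                iov[i].iov_len = pagesz;
        }
        lseek(fd, off, SEEK_SET);
        while (len >= pagesz) {
                int n = len >= 16 * pagesz ? 16 : (int)(len / pagesz);

                writev(fd, iov, n);             /* should check for short writes */
                len -= (size_t)n * pagesz;
        }
        free(zeros);
}

The lseek() there is exactly the pwrite()-vs-writev() tradeoff mentioned above.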
Re: Make pipe data structure be a circular list of pages, rather than
[EMAIL PROTECTED] wrote:
> You seem to have misunderstood the original proposal, it had little to do
> with file descriptors. The idea was that different subsystems in the OS
> export pull() and push() interfaces and you use them. The file decriptors
> are only involved if you provide them with those interfaces(which you
> would, it makes sense). You are hung up on the pipe idea, the idea I
> see in my head is far more generic. Anything can play and you don't
> need a pipe at all, you need

I was fantasizing about more generality as well. In particular, my original fantasy allowed data to, in theory and with compatible devices, be read from one PCI device, passed through a series of pipes, and written to another without ever hitting main memory - only one PCI-PCI DMA operation performed. A slightly more common case would be zero-copy, where data gets DMAed from the source into memory and from memory to the destination. That's roughly Larry's pull/push model. The direct DMA case requires buffer memory on one of the two cards. (And would possibly be a fruitful source of hardware bugs, since I suspect that Windows Doesn't Do That.)

Larry has the "full-page gift" optimization, which could in theory allow data to be "renamed" straight into the page cache. However, the page also has to be properly aligned and not in some awkward highmem address space. I'm not currently convinced that this would happen often enough to be worth the extra implementation hair, but feel free to argue otherwise. (And Larry, what's the "loan" bit for? When is loan != !gift ?)

The big gotcha, as Larry's original paper properly points out, is handling write errors. We need some sort of "unpull" operation to put data back if the destination can't accept it. Otherwise, what do you return from splice()? If the source is seekable, that's easy, and a pipe isn't much harder, but for a general character device, we need a bit of help.

The way I handle this in user-level software, to connect modules that provide data buffering, is to split "pull" into two operations: "Show me some buffered data" and "Consume some buffered data". The first returns a buffer pointer (to a const buffer) and length. (The length must be non-zero except at EOF, but may be 1 byte.) The second advances the buffer pointer. The advance distance must be no more than the length returned previously, but may be less. In typical single-threaded code, I allow not calling the advance function or calling it multiple times, but they're typically called 1:1, and requiring that would give you a good place to do locking. A character device, network stream, or the like, would acquire an exclusive lock. A block device or file would not need to (or could make it a shared lock or refcount).

The same technique can be used when writing data to a module that does buffering: "Give me some buffer space" and "Okay, I filled some part of it in." In some devices, the latter call can fail, and the writer has to be able to cope with that. By allowing both of those (and, knowing that PCI writes are more efficient than PCI reads, giving the latter preference if both are available), you can do direct device-to-device copies on splice(). The problem with Larry's separate pull() and push() calls is that you then need a user-visible abstraction for "pulled but not yet pushed" data, which seems like an unnecessary abstraction violation.

The main infrastructure hassle you need to support this *universally* is the unget() on "character devices" like pipes and network sockets. 
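Spelled out as C declarations (all names invented purely for illustration), the two-call pull interface I use looks like:

#include <stddef.h>
#include <unistd.h>

struct pull_src;        /* opaque: some module that buffers data */

/* "Show me some buffered data": return a pointer to const data and its
 * length in *len.  The length is nonzero except at EOF, but may be as
 * little as 1 byte. */
const void *pull_peek(struct pull_src *src, size_t *len);

/* "Consume some buffered data": advance by n bytes, where n is no more
 * than the length returned by the matching pull_peek(), but may be less. */
void pull_consume(struct pull_src *src, size_t n);

/* Typical 1:1 peek/consume pairing; the consumer may take less than shown. */
static void copy_some(struct pull_src *src, int dst_fd)
{
        size_t len;
        const void *p = pull_peek(src, &len);

        if (len) {
                ssize_t n = write(dst_fd, p, len);

                pull_consume(src, n > 0 ? (size_t)n : 0);
        }
}

The buffering that pull_consume() leans on behind the scenes is the same unget()-style support I'm talking about.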
Ideally, it would be some generic buffer front end that could be used by the device for normal data as well as the special case. Ooh. Need to think. If there's a -EIO problem on one of the file descriptors, how does the caller know which one? That's an argument for separate pull and push (although the splice() library routine still has the problem). Any suggestions? Does userland need to fall back on read()/write() for a byte? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
Just a note on the attribute((acquire,release)) proposal: It's nice to be able to annotate functions, but please don't forget to provide a way to write such functions. Ultimately, there will be an asm() or assignment that is the acquire or release point, and GCC needs to know that so it can compile the function itself (possibly inline). Having just a function attribute leaves the problem that

void __attribute__((noreturn)) _exit(int status)
{
        asm("int $0x80" : : "a" (__NR_exit), "b" (status));
}

generates a complaint about a noreturn function returning, because there's no way to tell GCC about a non-returning statement. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
May I just say, that this is f***ing brilliant. It completely separates the threadlet/fibril core from the (contentious) completion notification debate, and allows you to use whatever mechanism you like. (fd, signal, kevent, futex, ...) You can also add a "macro syscall" like the original syslet idea, and it can be independent of the threadlet mechanism but provide the same effects. If the macros can be designed to always exit when done, a guarantee never to return to user space, then you can always recycle the stack after threadlet_exec() returns, whether it blocked in the syscall or not, and you have your original design.

May I just suggest, however, that the interface be:

        tid = threadlet_exec(...)

Where tid < 0 means error, tid == 0 means completed synchronously, and tid > 0 identifies the child so it can be waited for? Anyway, this is a really excellent user-space API. (You might add some sort of "am I synchronous?" query, or maybe you could just use gettid() for the purpose.) The one interesting question is, can you nest threadlet_exec() calls? I think it's implementable, and I can definitely see the attraction of being able to call libraries that use it internally (to do async read-ahead or whatever) from a threadlet function. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
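P.S. To show what I mean by that calling convention, here's how I picture user code using it. threadlet_exec() is of course hypothetical at this point, so the prototype below is made up purely for illustration:

#include <stdio.h>

/* Hypothetical prototype, only to make the return convention concrete. */
extern long threadlet_exec(long (*fn)(void *arg), void *arg);

static long handle_request(void *arg)
{
        (void)arg;
        /* ... may block; if it does, the submitter returns early ... */
        return 0;
}

static void submit(void *req)
{
        long tid = threadlet_exec(handle_request, req);

        if (tid < 0)
                fprintf(stderr, "threadlet_exec failed: %ld\n", tid);
        else if (tid == 0)
                printf("ran to completion synchronously\n");
        else
                printf("blocked; continuing in child %ld, wait for it later\n", tid);
}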
Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
> It's brilliant for disk I/O, not for networking for which
> blocking is the norm not the exception.
>
> So people will have to likely do something like divide their
> applications into handling for I/O to files and I/O to networking.
> So beautiful. :-)
>
> Nobody has proposed anything yet which scales well and handles both
> cases.

The truly brilliant thing about the whole "create a thread on blocking" is that you immediately make *every* system call asynchronous-capable, including the thousands of obscure ioctls, without having to boil the ocean rewriting 5/6 of the kernel from implicit (stack-based) to explicit state machines. You're right that it doesn't solve everything, but it's a big step forward while keeping a reasonably clean interface. Now, we have some portions of the kernel (to be precise, those that currently support poll() and select()) that are written as explicit state machines and can block on a much smaller context structure.

In truth, the division you assume above isn't so terrible. My applications are *already* written like that. It's just "poll() until I accumulate a whole request, then fork a thread to handle it." The only way to avoid allocating a kernel stack is to have the entire handling code path, including the return to user space, written in explicit state machine style. (Once you get to user space, you can have a threading library there if you like.)

All the flaming about different ways to implement completion notification is precisely because not much is known about the best way to do it; there aren't a lot of applications that work that way. (Certainly that's because it wasn't possible before, but it's clearly an area that requires research, so not committing to an implementation is A Good Thing.) But once that is solved, and "system call complete" can be reported without returning to a user-space thread (which is basically an alternate system call submission interface, *independent* of the fibril/threadlet non-blocking implementation), then you can find the hot paths in the kernel and special-case them to avoid creating a whole thread. To use a networking analogy, this is a cleanly layered protocol design, with an optimized fast path *implementation* that blurs the boundaries.

As for the overhead of threading, there are basically three parts:

1) System call (user/kernel boundary crossing) costs. These depend only on the total number of system calls and not on the number of threads making them. They can be mitigated *if necessary* with a syslet-like "macro syscall" mechanism to increase the work per boundary crossing. The only place threading might increase these numbers is thread synchronization, and futexes already solve that pretty well.

2) Register and stack swapping. These (and associated cache issues) are basically unavoidable, and are the bare minimum that longjmp() does. Nothing thread-based is going to reduce this. (Actually, the kernel can do better than user space because it can do lazy FPU state swapping.)

3) MMU context switch costs. These are the big ones, particularly on x86 without TLB context IDs. However, these fall into a few categories:

- Mandatory switches because the entire application is blocked. I don't see how this can be avoided; these are the cases where even a user-space longjmp-based thread library would context switch.

- Context switches between threads in an application. 
The Linux kernel already optimizes out the MMU context switch in this case, and the scheduler already knows that such context switches are cheaper and preferred. The one further optimization that's possible is if you have a system call that (in a common case) blocks multiple times *without accessing user memory*. This is not a read() or write(), but could be something like fsync() or ftruncate(). In this case, you could temporarily mark the thread as a "kernel thread" that can run in any MMU context, and then fix it explicitly when you unmark it on the return path. I can see the space overhead of 1:1 threading, but I really don't think there's much time overhead. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2 of 4] Introduce i386 fibril scheduling
First of all, may I say, this is a wonderful piece of work. It absolutely reeks of The Right Thing. Well done! However, while I need to study it in a lot more detail, I think Ingo's implementation ideas make a lot more immediate sense. It's the same idea that I thought up. Let me make it concrete. When you start an async system call: - Preallocate a second kernel stack, but don't do anything with it. There should probably be a per-CPU pool of preallocated threads to avoid too much allocation and deallocation. - Also at this point, do any resource limiting. - Set the (normally NULL) "thread blocked" hook pointer to point to a handler, as explained below. - Start down the regular system call path. - In the fast-path case, the system call completes without blocking and we set up the completion structure and return to user space. We may want to return a special value to user space to tell it that there's no need to call asys_await_completion. I think of it as the Amiga's IOF_QUICK. - Also, when returning, check and clear the thread-blocked hook. Note that we use one (cache-hot) stack for everything and do as little setup as possible on the fast path. However, if something blocks, it hits the slow path: - If something would block the thread, the scheduler invokes the thread-blocked hook before scheduling a new thread. - The hook copies the necessary state to a new (preallocated) kernel stack, which takes over the original caller's identity, so it can return immediately to user space with an "operation in progress" indicator. - The scheduler hook is also cleared. - The original thread is blocked. - The new thread returns to user space and execution continues. - The original thread completes the system call. It may block again, but as its block hook is now clear, no more scheduler magic happens. - When the operation completes and returns to sys_sys_submit(), it notices that its scheduler hook is no longer set. Thus, this is a kernel-only worker thread, and it fills in the completion structure, places itself back in the available pool, and commits suicide. Now, there is no chance that we will ever implement kernel state machines for every little ioctl. However, there may be some "async fast paths" state machines that we can use. If we're in a situation where we can complete the operation without a kernel thread at all, then we can detect the "would block" case (probably in-line, but you could use a different scheduler hook function) and set up the state machine structure. Then return "operation in progress" and let the I/O complete in its own good time. Note that you don't need to implement all of a system call as an explicit state machine; only its completion. So, for example, you could do indirect block lookups via an implicit (stack-based) state machine, but the final I/O via an explicit one. And you could do this only for normal block devices and not NFS. You only need to convert the hot paths to the explicit state machine form; the bulk of the kernel code can use separate kernel threads to do async system calls. I'm also in the "why do we need fibrils?" camp. I'm studying the code, and looking for a reason, but using the existing thread abstraction seems best. If you encountered some fundamental reason why kernel threads were Really Really Hard, then maybe it's worth it, but it's a new entity, and entia non sunt multiplicanda praeter necessitatem. 
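To restate that slow path as (pseudo)code - every name below is invented, this is just the flow I described and not anybody's actual patch:

/* Pseudocode for the slow path above; all names are made up. */

struct kthread;                                 /* a kernel thread/stack */

extern struct kthread *current_thread(void);
extern struct kthread *grab_preallocated_stack(void);  /* per-CPU pool */
extern void take_over_user_identity(struct kthread *from, struct kthread *to);
extern void clear_blocked_hook(struct kthread *t);
extern void make_runnable(struct kthread *t);

/* The scheduler calls this (if the hook is set) just before blocking. */
void async_blocked_hook(void)
{
        struct kthread *cur = current_thread();
        struct kthread *worker = grab_preallocated_stack();

        /* The new thread takes over the caller's identity and returns to
         * user space right away with an "operation in progress" result. */
        take_over_user_identity(cur, worker);
        make_runnable(worker);

        /* Clear the hook: any further blocking by the original thread,
         * which now just finishes the system call, is handled normally. */
        clear_blocked_hook(cur);
}

The fast path never touches any of this; it only pays for setting and clearing the hook pointer.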
One thing you can do for real-time tasks is, in addition to the non-blocking flag (return EAGAIN from asys_submit rather than blocking), you could have an "atomic" flag that would avoid blocking to preallocate the additional kernel thread! Then you'd really be guaranteed no long delays, ever. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is "Memory split" Kconfig option only for EMBEDDED?
> I have not had yet any problems with VMSPLIT_3G_OPT ever since I
> used it -- which dates back to when it was a feature of Con
> Kolivas's patchset (known as LOWMEM1G), [even] before it got
> merged in mainline.
>
> (Excluding the cases Adrian Bunk listed: WINE, which I don't use, and
> also 'some Java programs' which I have not seen.)

Seconded. I have several servers with 1G of memory, and appreciate the option very much; I maintained it as a custom patch long before it became a CONFIG option. Turning on CONFIG_EMBEDDED makes it a bit annoying to be sure not to play with any of the other far more dangerous options that enables. (I suppose I could just maintain a local patch to remove that from Kconfig.) The last I remember hearing, the vm system wasn't very happy with highmem much smaller than lowmem (128M/896M = 1/7) anyway. There's nothing wrong with a stern warning, but I'd think that disabling CONFIG_NET would break a lot more user-space programs, and that's not protected.

How about the following (which also fixes a bug if you select VMSPLIT_2G and HIGHMEM; with 64-bit page tables, the split must be on a 1G boundary):

choice
	depends on EXPERIMENTAL
	prompt "Memory split"
	default VMSPLIT_3G
	help
	  Select the desired split between kernel and user memory.

	  If you are not absolutely sure what you are doing, leave this
	  option alone!

	  There are important efficiency reasons why the user address space
	  and the kernel address space must both fit into the 4G linear
	  virtual address space provided by the x86 architecture. Normally,
	  Linux divides this into 3G for user virtual memory and 1G for
	  kernel memory, which holds up to 896M of RAM plus all
	  memory-mapped peripheral (e.g. PCI) devices. Excess RAM is ignored.

	  If the "High memory support" options are enabled, the excess
	  memory is available as "high memory", which can be used for user
	  data, including file system caches, but not kernel data
	  structures. However, accessing high memory from the kernel is
	  slightly more costly than low memory, as it has to be mapped into
	  the kernel address range first.

	  This option lets systems choose to have a larger "low memory"
	  space, either to avoid the need for high memory support entirely,
	  or for workloads which require particularly large kernel data
	  structures.

	  The downside is that the available user address space is reduced.
	  While most programs do not care, this is an incompatible change to
	  the kernel binary interface, and must be made with caution. Some
	  programs that process a lot of data will work more slowly or fail,
	  and some programs that do clever things with virtual memory will
	  crash immediately. In particular, changing this option from the
	  default breaks valgrind version 3.1.0, VMware, and some Java
	  virtual machines.

config VMSPLIT_3G
	bool "Default 896MB lowmem (3G/1G user/kernel split)"
config VMSPLIT_3G_OPT
	depends on !HIGHMEM
	bool "1G lowmem (2.75G/1.25G user/kernel split) CAUTION"
config VMSPLIT_2G
	bool "1.875G lowmem (2G/2G user/kernel split) CAUTION"
config VMSPLIT_2G_OPT
	depends on !HIGHMEM
	bool "2G lowmem (1.875G/2.125G user/kernel split) CAUTION"
config VMSPLIT_1G
	bool "2.875G lowmem (1G/3G user/kernel split) CAUTION"
config VMSPLIT_1G_OPT
	depends on !HIGHMEM
	bool "3G lowmem (896M/3.125G user/kernel split) CAUTION"
endchoice

config PAGE_OFFSET
	hex
	default 0xB0000000 if VMSPLIT_3G_OPT
	default 0x80000000 if VMSPLIT_2G
	default 0x78000000 if VMSPLIT_2G_OPT
	default 0x40000000 if VMSPLIT_1G
	default 0x38000000 if VMSPLIT_1G_OPT
	default 0xC0000000

(Copyright on the above abandoned to the public domain.) 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
on (a.k.a. obfuscation). And it lets you optimize them better. I apologize for not having counted them before. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
>> to keep the amount of code between ll and sc to an absolute minimum
>> to avoid interference which causes livelock. Processor timeouts
>> are generally much longer than any reasonable code sequence.
> "Generally" does not mean you can just ignore it and hope the C compiler
> does the right thing. Nor is it enough for just SOME of the architectures
> to have the properties you require.

If it's an order of magnitude larger than the common case, then yes you can. Do we worry about writing functions so big that they exceed branch displacement limits? That's detected at compile time, but LL/SC pair distance is in principle straightforward to measure, too.

> Ralf tells us that MIPS cannot execute any loads, stores, or sync
> instructions on MIPS. Ivan says no loads, stores, taken branches etc
> on Alpha.
>
> MIPS also has a limit of 2048 bytes between the ll and sc.

I agree with you about the Alpha, and that will have to be directly coded. But on MIPS, the R4000 manual (2nd ed, covering the R4400 as well) says

> The link is broken in the following circumstances:
> · if any external request (invalidate, snoop, or intervention)
>   changes the state of the line containing the lock variable to
>   invalid
> · upon completion of an ERET (return from exception)
>   instruction
> · an external update to the cache line containing the lock
>   variable

Are you absolutely sure of what you are reporting about MIPS? Have you got a source? I've been checking the most authoritative references I can find and can't find mention of such a restriction. (The R8000 User's Manual doesn't appear to mention LL/SC at all, sigh.) One thing I DID find is the "R4000MC Errata, Processor Revision 2.2 and 3.0", which documents several LL/SC bugs (Numbers 10, 12, 13) and #12 in particular requires extremely careful coding in the workaround. That may completely scuttle the idea of using generic LL/SC functions.

> So you almost definitely cannot have gcc generated assembly between. I
> think we agree on that much.

We don't. I think that if that restriction applies, it's worthless, because you can't achieve a net reduction in arch-dependent code. GCC specifically says that if you want a 100% guarantee of no reloads between asm instructions, place them in a single asm() statement.

> In truth, however, realizing that we're only talking about three
> architectures (two of which have 32 & 64-bit versions) it's probably not
> worth it. If there were five, it would probably be a savings, but 3x
> code duplication of some small, well-defined primitives is a fair price
> to pay for avoiding another layer of abstraction (a.k.a. obfuscation).
>
> And it lets you optimize them better.
>
> I apologize for not having counted them before.
> I also disagree that the architectures don't matter. ARM and PPC are
> pretty important, and I believe Linux on MIPS is growing too.

Er... I definitely don't see where I said, and I don't even see where I implied - or even hinted - that MIPS, ARM and PPC "don't matter." I use Linux on ARM daily. I just thought that writing a nearly-optimal generic primitive is about 3x harder than writing a single-architecture one, so even for primitives yet to be written, it's just as easy to do it fully arch-specific. Plus you have corner cases like the R5900 that don't have LL/SC at all. (Can it be used multiprocessor?)

> One proposal that I could buy is an atomic_ll/sc API, which mapped
> to a cmpxchg emulation even on those llsc architectures which had
> any sort of restriction whatsoever.
> This could be used in regular C
> code (eg. you indicate powerpc might be able to do this). But it may
> also help cmpxchg architectures optimise their code, because the
> load really wants to be a "load with intent to store" -- and is
> IMO the biggest suboptimal aspect of current atomic_cmpxchg.

Or, possibly, an interface like

	do {
		oldvalue = ll(addr);
		newvalue = ... oldvalue ...;
	} while (!sc(addr, oldvalue, newvalue))

Where sc() could be a cmpxchg. But, more importantly, if the architecture did implement LL/SC, it could be a "try plain SC; if that fails try CMPXCHG built out of LL/SC; if that fails, loop". Actually, I'd want something a bit more integrated, that could have the option of fetching the new oldvalue as part of the sc() implementation if that failed. Something like

	DO_ATOMIC(addr, oldvalue) {
		... code ...
	} UNTIL_ATOMIC(addr, oldvalue, newvalue);

or perhaps, to encourage short code sections,

	DO_ATOMIC(addr, oldvalue, code, newvalue);

The problem is, that's already not optimal for spinlocks, where you want to use a non-linked load while spinning. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
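P.S. One possible cmpxchg-only spelling of those macros, purely as a sketch - it leans on gcc's typeof and on a cmpxchg(ptr, old, new) that returns the value previously at ptr, which is what the kernel's does:

#define DO_ATOMIC(addr, oldvalue)					\
	do {								\
		(oldvalue) = *(volatile typeof(*(addr)) *)(addr);

#define UNTIL_ATOMIC(addr, oldvalue, newvalue)				\
	} while (cmpxchg((addr), (oldvalue), (newvalue)) != (oldvalue))

so that

	DO_ATOMIC(&counter, oldvalue) {
		newvalue = oldvalue + 1;
	} UNTIL_ATOMIC(&counter, oldvalue, newvalue);

retries with a freshly reloaded oldvalue until the store goes through. An LL/SC architecture would of course open-code the load and store instead.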
Re: [PATCH 2/3] ensure unique i_ino in filesystems without permanent
> Good catch on the inlining. I had meant to do that and missed it.

Er... if you want it to *be* inlined, you have to put it into the .h file so the compiler knows about it at the call site. "static inline" tells gcc to avoid emitting a separate callable version. Something like the following. (You'll also need to add a "#include ", unless you expand the "bool", "false" and "true" macros to their values "_Bool", "0" and "1" by hand.)

--- linux-2.6/include/linux/fs.h.super	2006-12-12 08:53:34.0 -0500
+++ linux-2.6/include/linux/fs.h	2006-12-12 08:54:14.0 -0500
@@ -1879,7 +1879,32 @@
 extern struct inode_operations simple_dir_inode_operations;
 struct tree_descr { char *name; const struct file_operations *ops; int mode; };
 struct dentry *d_alloc_name(struct dentry *, const char *);
-extern int simple_fill_super(struct super_block *, int, struct tree_descr *);
+extern int __simple_fill_super(struct super_block *s, int magic,
+			struct tree_descr *files, bool registered);
 extern int simple_pin_fs(struct file_system_type *, struct vfsmount **mount, int *count);
 extern void simple_release_fs(struct vfsmount **mount, int *count);
 
+/*
+ * Fill a superblock with a standard set of fields, and add the entries in the
+ * "files" struct. Assign i_ino values to the files sequentially. This function
+ * is appropriate for filesystems that need a particular i_ino value assigned
+ * to a particular "files" entry.
+ */
+static inline int simple_fill_super(struct super_block *s, int magic,
+				    struct tree_descr *files)
+{
+	return __simple_fill_super(s, magic, files, false);
+}
+
+/*
+ * Just like simple_fill_super, but does an iunique_register on the inodes
+ * created for "files" entries. This function is appropriate when you don't
+ * need a particular i_ino value assigned to each files entry, and when the
+ * filesystem will have other registered inodes.
+ */
+static inline int registered_fill_super(struct super_block *s, int magic,
+					struct tree_descr *files)
+{
+	return __simple_fill_super(s, magic, files, true);
+}
+

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
+#define F3(x,y,z) \
+	movl	x, TMP2;	\
+	andl	y, TMP2;	\
+	movl	x, TMP;	\
+	orl	y, TMP;	\
+	andl	z, TMP;	\
+	orl	TMP2, TMP

*Sigh*. You don't need TMP2 to compute the majority function. You're implementing it as (x & y) | ((x | y) & z). Look at the rephrasing in lib/sha1.c:

#define f3(x,y,z) ((x & y) + (z & (x ^ y)))	/* majority */

By changing the second OR to x^y, you ensure that the two halves of the first disjunction are distinct, so you can replace the OR with XOR, or better yet, +. Then you can just do two adds to e. That is, write:

/* Bitwise select: x ? y : z, which is (z ^ (x & (y ^ z))) */
#define F1(x,y,z,dest) \
	movl	z, TMP;	\
	xorl	y, TMP;	\
	andl	x, TMP;	\
	xorl	z, TMP;	\
	addl	TMP, dest

/* Three-way XOR (x ^ y ^ z) */
#define F2(x,y,z,dest) \
	movl	z, TMP;	\
	xorl	x, TMP;	\
	xorl	y, TMP;	\
	addl	TMP, dest

/* Majority: (x&y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y) */
#define F3(x,y,z,dest) \
	movl	z, TMP;	\
	andl	x, TMP;	\
	addl	TMP, dest;	\
	movl	z, TMP;	\
	xorl	x, TMP;	\
	andl	y, TMP;	\
	addl	TMP, dest

Since y is the most recently computed result (it's rotated in the previous round), I arranged the code to delay its use as late as possible. Now you have one more register to play with.

I thought I had some good sha1 asm code lying around, but I can't seem to find it. (I have some excellent PowerPC asm if anyone wants it.)

Here's a basic implementation question: SHA-1 is made up of 80 rounds, 20 of each of 4 types. There are 5 working variables, a through e. The basic round is:

	t = F(b, c, d) + K + rol32(a, 5) + e + W[i];
	e = d; d = c; c = rol32(b, 30); b = a; a = t;

where W[] is the input array. W[0..15] are the input words, and W[16..79] are computed by a sort of LFSR from W[0..15]. Each group of 20 rounds has a different F() and K. This is the smallest way to write the function, but all the register shuffling makes for a bit of a speed penalty. A faster way is to unroll 5 iterations and do:

	e += F(b, c, d) + K + rol32(a, 5) + W[i];   b = rol32(b, 30);
	d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
	c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
	b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
	a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);

then loop over that 4 times each. This is somewhat larger, but still reasonably compact; only 20 of the 80 rounds are written out long-hand. Faster yet is to unroll all 80 rounds directly. But it also takes the most code space, and as we have learned, when your code is not the execution time hot spot, less cache use is faster code. Is there a preferred implementation?

Another implementation choice has to do with the computation of W[]. W[i] is a function of W[i-3], W[i-8], W[i-14] and W[i-16]. It is possible to keep a 16-word circular buffer with only the most recent 16 values of W[i%16] and compute each new word as it is needed. However, the offsets i%16 repeat every 16 rounds, which is an awkward fit with the 5-round repeating pattern of the main computation. One option is to compute all the W[] values in a pre-pass beforehand. Simple and small, but uses 320 bytes of data on the stack or wherever. An intermediate one is to keep a 20-word buffer, and compute 20 words at a time just before each of the 20-round groups. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
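P.S. If anyone wants to convince themselves the majority rewrite is exact, a tiny exhaustive check does it - the point being that the two addends can never both be 1 in the same bit position, so OR, XOR and + all agree. This is just a throwaway verification program, not anything proposed for the kernel:

#include <assert.h>

int main(void)
{
	unsigned x, y, z;

	for (x = 0; x < 2; x++)
		for (y = 0; y < 2; y++)
			for (z = 0; z < 2; z++) {
				unsigned maj = (x & y) | ((x | y) & z);

				/* both rewrites compute the same bit... */
				assert(maj == ((x & y) + (z & (x ^ y))));
				assert(maj == ((x & z) + ((x ^ z) & y)));
				/* ...and the addends never overlap */
				assert(((x & y) & (z & (x ^ y))) == 0);
				assert(((x & z) & ((x ^ z) & y)) == 0);
			}
	return 0;
}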
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
ndif
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
	.size	sha_transform5, .-sha_transform5	# Size is 0xDE6 = 3558 bytes

	.globl	sha_stackwipe
	.type	sha_stackwipe, @function
# void sha_stackwipe(void)
# After one or more sha_transform calls, we have left the contents of W[]
# on the stack, and from any 16 of those 80 words, the entire input
# can be reconstructed. If the caller cares, this function obliterates
# the relevant portion of the stack.
# 2 words of argument + 4 words of saved registers + 80 words of W[]
sha_stackwipe:
	xorl	%eax,%eax
	movl	$86,%ecx	# Damn, I had hoped that loop; pushl %eax would work..
1:	decl	%ecx
	pushl	%eax
	jne	1b
	addl	$4*86,%esp
	ret
	.size	sha_stackwipe, .-sha_stackwipe

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G
Given that incomprehensible help texts are a bit of a pet peeve of mine (I just last weekend figured out that you don't need to select an I2C algorithm driver to have working I2C - I had thought it was a "one from column A, one from column B" thing), let me take a crack...

	PAE doubles the size of each page table entry, increasing kernel
	memory consumption and slowing page table access. However, it enables:

	- Addressing more than 4G of physical RAM (CONFIG_HIGHMEM is also
	  required)
	- Marking pages as readable but not executable using the NX
	  (no-execute) bit, which protects applications from stack overflow
	  attacks.
	- Swap files or partitions larger than 64G each. (Only needed with
	  >4G RAM or very heavy tmpfs use.)

	A kernel compiled with this option cannot boot on a processor
	without PAE support.

	Enabling this also disables the (expert use only)
	CONFIG_VMSPLIT_[23]G_OPT options.

Does that seem reasonably user-oriented? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] random: fix folding
> Folding is done to minimize the theoretical possibility of systematic
> weakness in the particular bits of the SHA1 hash output. The result of
> this bug is that 16 out of 80 bits are un-folded. Without a major new
> vulnerability being found in SHA1, this is harmless, but still worth
> fixing.

Actually, even WITH a major new vulnerability found in SHA1, it's harmless. Sorry to put BUG in caps earlier; it actually doesn't warrant the sort of adjective I used. The purpose of the folding is to ensure that the feedback includes bits underivable from the output. Just outputting the first 80 bits and feeding back all 160 would achieve that effect; the folding is of pretty infinitesimal benefit. Note that the last five rounds have as major outputs e, d, c, b, and a, in that order. Thus, the first words are the "most hashed" and the ones most worth using as output... which happens naturally with no folding. The folding is a submicroscopic bit of additional mixing. Frankly, the code size savings probably makes it worth deleting it. (That would also give you more flexibility to select the output/feedback ratio in whatever way you like.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+
2,W[i+4]+K4); } digest[0] += a; digest[1] += b; digest[2] += c; digest[3] += d; digest[4] += e; } extern void sha_transform2(uint32_t digest[5], const char in[64]); extern void sha_transform3(uint32_t digest[5], const char in[64]); extern void sha_transform5(uint32_t digest[5], const char in[64]); extern void sha_stackwipe(void); void sha_init(uint32_t buf[5]) { buf[0] = 0x67452301; buf[1] = 0xefcdab89; buf[2] = 0x98badcfe; buf[3] = 0x10325476; buf[4] = 0xc3d2e1f0; } #include #include #include #include #if 1 void sha_stackwipe2(void) { uint32_t buf[90]; memset(buf, 0, sizeof buf); asm("" : : "r" (&buf)); /* Force the compiler to do the memset */ } #endif #define TEST_SIZE (10*1024*1024) int main(void) { uint32_t W[80]; uint32_t out[5]; char const text[64] = "Hello, world!\n"; char *buf; uint32_t *p; unsigned i; struct timeval start, stop; sha_init(out); sha_transform(out, text, W); printf(" One: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform4(out, text, W); printf(" Four: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform2(out, text); printf(" Two: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform3(out, text); printf("Three: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_init(out); sha_transform5(out, text); printf(" Five: %08x %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3], out[4]); sha_stackwipe(); #if 1 /* Set up a large buffer full of stuff */ buf = malloc(TEST_SIZE); p = (uint32_t *)buf; memcpy(p, W+80-16, 16*sizeof *p); for (i = 0; i < TEST_SIZE/sizeof *p - 16; i++) { uint32_t a = p[i+13] ^ p[i+8] ^ p[i+2] ^ p[i]; p[i+16] = rol32(a, 1); } sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform(out, buf+i, W); gettimeofday(&stop, 0); printf(" One: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform4(out, buf+i, W); gettimeofday(&stop, 0); printf(" Four: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform2(out, buf+i); gettimeofday(&stop, 0); printf(" Two: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform3(out, buf+i); gettimeofday(&stop, 0); printf("Three: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_init(out); gettimeofday(&start, 0); for (i = 0; i < TEST_SIZE; i += 64) sha_transform5(out, buf+i); gettimeofday(&stop, 0); printf(" Five: %08x %08x %08x %08x %08x -- %lu us\n", out[0], out[1], out[2], out[3], out[4], 100*(stop.tv_sec-start.tv_sec)+stop.tv_usec-start.tv_usec); sha_stackwipe(); #endif return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
msleep(1000) vs. schedule_timeout_uninterruptible(HZ+1)
I was looking at some of the stupider code that calls msleep(), particularly that which does msleep(jiffies_to_msecs(jiff)) and I noticed that msleep() just calls schedule_timeout_uninterruptible(). But it does it in a loop. The basic question is, when does the loop make a difference? Is it only when you're on a wait queue? Or are there other kinds of unexpected wakeups that can arrive? I see all kinds of uses of both kinds for simple "wait a while" operations, and I'm not sure if one is more correct than the other. (And, in drivers/media/video/cpia2/cpia2_v4l.c:cpia2_exit(), a lovely example of calling schedule_timeout() without set_current_state() first.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
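P.S. For reference, the implementation I'm puzzling over is, if I'm reading kernel/timer.c right, essentially this:

	void msleep(unsigned int msecs)
	{
		unsigned long timeout = msecs_to_jiffies(msecs) + 1;

		while (timeout)
			timeout = schedule_timeout_uninterruptible(timeout);
	}

So the question boils down to: can schedule_timeout_uninterruptible() ever return a nonzero remainder for a task that isn't sitting on a wait queue?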
Re: Why can't we sleep in an ISR?
Sleeping in an ISR is not fundamentally impossible - I could design a multitasker that permitted it - but has significant problems, and most multitaskers, including Linux, forbid it.

The first problem is the scheduler. "Sleeping" is actually a call into the scheduler to choose another process to run. There are times - so-called critical sections - when the scheduler can't be called. If an interrupt can call the scheduler, then every critical section has to disable interrupts. Otherwise, an interrupt might arrive and end up calling the scheduler. This increases interrupt latency. If interrupts are forbidden to sleep, then there's no need to disable interrupts in critical sections, so interrupts can be responded to faster. Most multitaskers find this worth the price.

The second problem is shared interrupts. You want to sleep until something happens. The processor hears about that event via an interrupt. Inside an ISR, interrupts are disabled. You have to somehow enable the interrupt that will wake up the sleeping ISR without enabling the interrupt that the ISR is in the middle of handling (or the handler will start a second time and make a mess). This is complicated and prone to error. And, in the case of shared interrupts (as allowed by PCI), it's possible that the interrupt you need to wait for is exactly the same interrupt as the one you're in the middle of handling. So it might be impossible!

The third problem is that you're obviously increasing the latency of the interrupt whose handler you're sleeping in.

Finally, if you're even *thinking* of wanting to sleep in an ISR, you probably have a deadlock waiting to happen. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.22-rcX Transmeta/APM regression
Hardware: Fujitsu Lifebook P-2040, TM5800 800 MHz processor 2.6.21: Closing the lid causes APM suspend. Opening it resumes just fine. 2.6.22-rc5/-rc6: On resume, backlight comes on, but system is otherwise frozen. Nothing happens until I hold the power button to force a power off. I'm trying to bisect, but there's a large range of commits which crash on boot in init_transmeta, which is slowing me down. However, I did manage to find a kernel version that gives an error message instead of a blank screen, which might be useful. I can even switch VTs and type into the shell afterwards, but actually trying to do anything hangs. Which includes anything like run a command to capture this to a file or another machine on the network, even if I took care to cache the necessary executables and libraries before suspending. So the following is transcribed by hand. general protection fault: [#1] Modules linked in: CPU:0 EIP:0060:[]Not tainted VLI EFLAGS: 00010246 (2.6.21-gba7cc09c #16) EIP is at get_fixed_ranges+0x9/0x60 eax: c0338d24 ebx: c03589a0 ecx: 0250 edx: esi: c0338d24 edi: 000a ebp: esp: cefa4f5c ds: 007b es: 007b fs: gs: 000 ss: 0068 Process kapmd (pid: 70, ti=cefa4000 task=cef89550 task.ti=cefa4000) Stack: c03589a0 c010b1a0 c0238e4d c010b1a0 000a c010addf c010b1a0 000a c010b5f1 cefa4fc4 cefa4fc0 cefa4fbc cefa4fb8 cefa4fb8 0001 e45a3b0f cef89550 c0110f20 c02f53e4 c02f5e34 c010b1a0 Call Trace: [] apm+0x0/0x500 [] __save_process_rstate+0xd/0x50 [] apm+0x0/0x500 [] suspend+0x1f/0xb0 [] apm+0x0/0x500 [] apm+0x451/0x500 [] default_wake_function+0x0/0x10 [] apm+0x0/0x500 [] apm+0x0/0x500 [] kthread+0x39/0x60 [] kthread+0x0/0x60 [] kernel_thread_helper+0x7/0x10 === Code: 46 83 c7 04 39 ee 0f 8c 40 ff ff ff 83 c4 3c 31 c0 5b 5e 5f 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 56 b9 50 02 00 00 53 89 c6 <0f> 32 89 06 89 d0 b1 58 31 d2 89 46 04 0f 32 89 46 08 89 d0 b1 EIP: [] get_fixed_ranges+0x9/0x60 SS:ESP 0068:cefa4f5c The init_transmeta crash looks like the following: Calibrating delay using timer specific routine.. 1630.69 BogoMIPS (lpj=8153474) Mount-cache hash table entries: 512 CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (32 bytes/line) CPU: L2 Cache: 512K (128 bytes/line) CPU: Processor revision 1.4.1.0, 800 MHz CPU: Code Morphing Software revision 4.2.6-8-168 CPU: 20010703 00:29 official release 4.2.6#2 general protection fault: [#1] Modules linked in: CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010286 (2.6.21-g1e7371c1 #18) EIP is at init_transmeta+0x1d5/0x230 eax: ebx: ecx: 80860004 edx: esi: edi: ebp: 80860004 esp: c030fed0 ds: 007b es: 007b fs: gs: ss: 0068 Process swapper (pid: 0, ti=c030f000 task=c02ed280 task.ti=c030f000) Stack: c02b5c52 c030ff1b 0002 0006 0008 00a8 cefc2600 0246 c030d2e0 0320 0020 c01b553f 3200 30313030 20333037 323a3030 666f2039 69636966 /* "20010703 00:29 offici" */ Call Trace: [] idr_get_new_above_int+0x10f/0x1f0 [] identify_cpu+0x20e/0x370 [] idr_get_new+0xd/0x30 [] proc_register+0x30/0xe0 [] identify_boot_cpu+0xd/0x20 [] check_bugs+0x8/0x100 [] start_kernel+0x203/0x210 [] unknown_bootoption+0x0/0x210 === Code: 00 c6 84 24 8b 00 00 00 00 89 7c 24 04 c7 04 24 52 5c 2b c0 ed 8d fc df ff bd 04 00 86 80 89 e9 0f 32 89 c6 93 c8 ff 89 d7 89 c2 <0f> 30 31 c9 b8 01 00 00 00 0f a2 8b 44 24 28 89 e9 89 50 0c b8 EIP: [] init_transmeta+0x1d5/0x230 SS:ESP 0068:c030fed0 Kernel panic - not syncing: Attempted to kill the idle task! 
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22-rcX Transmeta/APM regression
> .config and contents of /proc/cpuinfo would be helpful... Apologies! I'm still working on the bisection, but... The following is from 2.6.21-gae1ee11b, which works. $ cat /tmp/cpuinfo processor : 0 vendor_id : GenuineTMx86 cpu family : 6 model : 4 model name : Transmeta(tm) Crusoe(tm) Processor TM5800 stepping: 3 cpu MHz : 300.000 cache size : 512 KB fdiv_bug: no hlt_bug : no f00f_bug: no coma_bug: no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr cx8 sep cmov mmx longrun lrti constant_tsc bogomips: 1608.38 clflush size: 32 $ lspci -nn 00:00.0 Host bridge [0600]: Transmeta Corporation LongRun Northbridge [1279:0395] (rev 01) 00:00.1 RAM memory [0500]: Transmeta Corporation SDRAM controller [1279:0396] 00:00.2 RAM memory [0500]: Transmeta Corporation BIOS scratchpad [1279:0397] 00:02.0 USB Controller [0c03]: ALi Corporation USB 1.1 Controller [10b9:5237] (rev 03) 00:04.0 Multimedia audio controller [0401]: ALi Corporation M5451 PCI AC-Link Controller Audio Device [10b9:5451] (rev 01) 00:06.0 Bridge [0680]: ALi Corporation M7101 Power Management Controller [PMU] [10b9:7101] 00:07.0 ISA bridge [0601]: ALi Corporation M1533/M1535 PCI to ISA Bridge [Aladdin IV/V/V+] [10b9:1533] 00:0c.0 CardBus bridge [0607]: Texas Instruments PCI1410 PC card Cardbus Controller [104c:ac50] (rev 01) 00:0f.0 IDE interface [0101]: ALi Corporation M5229 IDE [10b9:5229] (rev c3) 00:12.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ [10ec:8139] (rev 10) 00:13.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link) [104c:8026] 00:14.0 VGA compatible controller [0300]: ATI Technologies Inc Rage Mobility P/M [1002:4c52] (rev 64) $ grep ^CONFIG /usr/src/linux/.config CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_IKCONFIG=y CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y CONFIG_BASE_SMALL=0 CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_KMOD=y CONFIG_BLOCK=y CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_CFQ=y CONFIG_DEFAULT_IOSCHED="cfq" CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_X86_PC=y CONFIG_MCRUSOE=y CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=5 CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_TSC=y CONFIG_HPET_TIMER=y CONFIG_PREEMPT_NONE=y CONFIG_X86_UP_APIC=y CONFIG_X86_UP_IOAPIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_VM86=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y CONFIG_NOHIGHMEM=y CONFIG_PAGE_OFFSET=0xC000 
CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_ZONE_DMA_FLAG=1 CONFIG_MTRR=y CONFIG_SECCOMP=y CONFIG_HZ_100=y CONFIG_HZ=100 CONFIG_PHYSICAL_START=0x10 CONFIG_PHYSICAL_ALIGN=0x10 CONFIG_PM=y CONFIG_PM_DEBUG=y CONFIG_SOFTWARE_SUSPEND=y CONFIG_PM_STD_PARTITION="/dev/hda2" CONFIG_APM=y CONFIG_APM_CPU_IDLE=y CONFIG_APM_DISPLAY_BLANK=y CONFIG_APM_RTC_IS_GMT=y CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y CONFIG_CPU_FREQ_DEBUG=y CONFIG_CPU_FREQ_STAT=y CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=y CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y CONFIG_X86_LONGRUN=y CONFIG_PCI=y CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_ISA_DMA_API=y CONFIG_PCCARD=y CONFIG_PCMCIA=y CONFIG_PCMCIA_LOAD_CIS=y CONFIG_PCMCIA_IOCTL=y CONFIG_CARDBUS=y CONFIG_YENTA=y CONFIG_YENTA_O2=y CONFIG_YENTA_RICOH=y CONFIG_YENTA_TI=y CONFIG_YENTA_ENE_TUNE=y CONFIG_YENTA_TOSHIBA=y CONFIG_PCCARD_NONSTATIC=
Re: 2.6.22-rcX Transmeta/APM regression
Okay, after a ridiculous amount of bisecting and recompiling and rebooting... First I had to find out that the kernel stops booting as of bf50467204: "i386: Use per-cpu GDT immediately on boot" (With this commit, it silently stops booting. The GP fault I posted earlier comes a little later, but I didn't bother tracking it down.) and starts again as of b0b73cb41d: "i386: msr.h: be paranoid about types and parentheses" However, one commit before the former suspends properly, and the latter fails to suspend (exactly the same problem at get_fixed_ranges+0x9/0x60), so I had to bisect further between the two, backporting the msr.h changes across the msr-index.h splitoff. Anyway, the patch which introduces the problem is the aptly named 3ebad: 3ebad59056: [PATCH] x86: Save and restore the fixed-range MTRRs of the BSP when suspending 2.6.22-rc6 plus that one commit reverted successfully does APM suspend (and resume) for me. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.22-rcX Transmeta/APM regression
Responding to various proposed fixes: > Index: linux/arch/i386/kernel/cpu/mtrr/main.c > === > --- linux.orig/arch/i386/kernel/cpu/mtrr/main.c > +++ linux/arch/i386/kernel/cpu/mtrr/main.c > @@ -734,8 +734,11 @@ void mtrr_ap_init(void) > */ > void mtrr_save_state(void) > { > - int cpu = get_cpu(); > + int cpu; > > + if (!cpu_has_mtrr) > + return; > + cpu = get_cpu(); > if (cpu == 0) > mtrr_save_fixed_ranges(NULL); > else This does not change the symptoms in any way. > --- a/arch/i386/kernel/cpu/mtrr/generic.c~i386-mtrr-crash-fix > +++ a/arch/i386/kernel/cpu/mtrr/generic.c > @@ -65,7 +65,8 @@ get_fixed_ranges(mtrr_type * frs) > > void mtrr_save_fixed_ranges(void *info) > { > - get_fixed_ranges(mtrr_state.fixed_ranges); > + if (cpu_has_mtrr) > + get_fixed_ranges(mtrr_state.fixed_ranges); > } > > static void print_fixed(unsigned base, unsigned step, const mtrr_type*types) This works great, thanks! Please consider the regression diagnosed and fixed. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
A simpler variant on sys_indirect?
I was just thinking, while sys_indirect is an interesting way to add features to a system call, the argument marshalling in user space is a bit of a pain. An alternate idea would be to instead have a "prefix system call" that sets some flags that apply to the next system call made by that thread only. They wouldn't be global mode flags that would mess up libraries. Maybe I've just been programming x86s too long, but this seems like a nicer mental model. The downsides are that you need to save and restore the prefix flags across signal delivery, and you have a second user/kernel/user transition. Most of the options seem to be applied to system calls that resolve path names. While that is certainly a very important code path, it's also of non-trivial length, even with the dcache. How much would one extra kernel entry bloat the budget? And if the kernel entry overhead IS a problem, wouldn't you want to batch together the non-prefix system calls as well, using something like the syslet ideas that were kicked around recently? That would allow less than 1 kernel entry per system call, even with prefixes. Oh! That suggests an interesting possibility that solves the signal handling problem as well: - Make a separate prefix system call, BUT - The flags are reset on each return to user space, THUS - You *have* to use a batch-system-call mechanism for the prefix system calls to do anything. Of course, this takes us right back to the beginning with respect to messy user-space argument marshalling. But at least it's only one indirect system call mechanism, not two. Wrapping indirect system call mechanism #1 (to set syscall options) in indirect system call mechanism #2 (to batch system calls) seems like a bit of a nightmare. I'm not at all sure that these are good ideas, but they're not obviously bad ones, to me. Is it worth looking for synergy between various "indirect system call" ideas? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
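To make the calling pattern above concrete, here is a minimal user-space sketch of the "prefix system call" idea. No such syscall exists in any kernel; __NR_prefix, PREFIX_FD_CLOEXEC and the semantics sketched in the comments are invented purely for illustration and are not taken from the sys_indirect proposal.

/*
 * Purely illustrative: one cheap extra syscall arms per-thread flags
 * that apply only to the next system call this thread makes, and the
 * kernel clears them again on the way back to user space.
 */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <unistd.h>

#define __NR_prefix        512            /* hypothetical syscall number */
#define PREFIX_FD_CLOEXEC  0x00000001     /* hypothetical: next new fd is close-on-exec */

static int socket_cloexec(int domain, int type, int protocol)
{
        /* Arm the flag for this thread's next syscall only. */
        syscall(__NR_prefix, PREFIX_FD_CLOEXEC);
        return socket(domain, type, protocol);
}

In the batched variant suggested at the end of the message, the prefix call and the call it modifies would instead be submitted together through the batching mechanism, which is what makes the reset-on-every-return-to-user-space rule workable.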
Re: [patch] change futex_wait() to hrtimers
> BTW. my futex man page says timeout's contents "describe the maximum duration > of the wait". Surely that should be *minimum*? Michael cc'ed. Er, the intent of the wording is to say "futex will wait until uaddr no longer contains val, or the timeout expires, whichever happens first". One option for selecting different clock resolutions is to use the clockid_t from the POSIX clock_gettime() family. That is, specify the clock that a wait uses, and then have a separate mechanism for turning a resolution requirement into a clockid_t. (And there can be default clocks for interfaces that don't specify one explicitly.) Although clockid_t is pretty generic, it's biased toward an enumerated list of clocks rather than a continuous resolution. Fortunately, that seems to match the implementation ideas. The question is how much the timeout gets rounded, and the choices are currently jiffies or microseconds. A related option may be whether rounding down is acceptable. For some applications (periodic polling for events), it's fine. For others, it's not. Thus, while it's okay to specify such clocks explicitly, it'd probably be a good idea to forbid selecting them as the default for interfaces that don't specify a clock explicitly. I had some code that suffered 1 ms buzz-loops on Solaris because poll(2) would round the timeout interval down, but the loop calling it would explicitly check whether the timeout had expired using gettimeofday() and would keep re-invoking poll(pollfds, npollfds, 1) until the timeout really did expire. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
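As a concrete illustration of that failure mode, here is a minimal sketch of the kind of loop described above; the helper names and the deadline bookkeeping are assumptions for illustration, not the original code. The caller never passes a zero timeout, so if poll() rounds its timeout interval down and returns early, the loop spins until the deadline really passes.

#include <poll.h>
#include <stddef.h>
#include <sys/time.h>

static long ms_until(const struct timeval *deadline)
{
        struct timeval now;

        gettimeofday(&now, NULL);
        return (deadline->tv_sec - now.tv_sec) * 1000 +
               (deadline->tv_usec - now.tv_usec) / 1000;
}

static int wait_for_events(struct pollfd *fds, nfds_t nfds,
                           const struct timeval *deadline)
{
        for (;;) {
                long remaining = ms_until(deadline);
                int n;

                if (remaining <= 0)
                        return 0;               /* the timeout really has expired */

                /* Never pass 0; that would mean "do not block at all". */
                n = poll(fds, nfds, remaining < 1 ? 1 : (int)remaining);
                if (n != 0)
                        return n;               /* events, or an error */
                /* n == 0: poll() gave up early; re-check the clock and retry. */
        }
}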
swapper: page allocation failure. order:0, mode:0x20
I'm not used to seeing order-0 allocation failures on lightly loaded 2 GB (amd64, so it's all low memory) machines. Can anyone tell me what happened? It happened just as I was transferring a large file to the machine for later crunching (the "sgrep" program is a local number-crunching application that was getting alignment errors in SSE code), and the network stopped working. The "NETDEV WATCHDOG" message happened a few minutes later, during the head-scratching phase. I ended up rebooting the machine to get on with the number-crunching, but this is a bit mysterious. The ethernet driver is forcedeth. Does it appear to be at fault? Here's a dmesg log, with /proc/*info and lspci appended. amd64 uniprocessor, with ECC memory. Stock 2.6.21 + linuxpps patches. Thanks for any suggestions! er [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12 serio: i8042 KBD port at 0x60,0x64 irq 1 serio: i8042 AUX port at 0x60,0x64 irq 12 mice: PS/2 mouse device common for all mice input: AT Translated Set 2 keyboard as /class/input/input2 input: PC Speaker as /class/input/input3 input: PS/2 Generic Mouse as /class/input/input4 i2c_adapter i2c-0: nForce2 SMBus adapter at 0x1c00 i2c_adapter i2c-1: nForce2 SMBus adapter at 0x1c40 it87: Found IT8712F chip at 0x290, revision 7 it87: in3 is VCC (+5V) it87: in7 is VCCH (+5V Stand-By) md: raid0 personality registered for level 0 md: raid1 personality registered for level 1 md: raid10 personality registered for level 10 raid6: int64x1 2052 MB/s raid6: int64x2 2606 MB/s raid6: int64x4 2579 MB/s raid6: int64x8 1838 MB/s raid6: sse2x12817 MB/s raid6: sse2x23738 MB/s raid6: sse2x44021 MB/s raid6: using algorithm sse2x4 (4021 MB/s) md: raid6 personality registered for level 6 md: raid5 personality registered for level 5 md: raid4 personality registered for level 4 raid5: automatically using best checksumming function: generic_sse generic_sse: 7089.000 MB/sec raid5: using function: generic_sse (7089.000 MB/sec) EDAC MC: Ver: 2.0.1 Apr 26 2007 netem: version 1.2 Netfilter messages via NETLINK v0.30. ip_tables: (C) 2000-2006 Netfilter Core Team TCP cubic registered Initializing XFRM netlink socket NET: Registered protocol family 1 NET: Registered protocol family 17 NET: Registered protocol family 15 802.1Q VLAN Support v1.8 Ben Greear <[EMAIL PROTECTED]> All bugs added by David S. Miller <[EMAIL PROTECTED]> powernow-k8: Found 1 AMD Athlon(tm) 64 Processor 3700+ processors (version 2.00.00) powernow-k8:0 : fid 0xe (2200 MHz), vid 0x6 powernow-k8:1 : fid 0xc (2000 MHz), vid 0x8 powernow-k8:2 : fid 0xa (1800 MHz), vid 0xa powernow-k8:3 : fid 0x2 (1000 MHz), vid 0x12 md: Autodetecting RAID arrays. md: autorun ... md: considering sdf4 ... md: adding sdf4 ... md: sdf3 has different UUID to sdf4 md: sdf2 has different UUID to sdf4 md: sdf1 has different UUID to sdf4 md: adding sde4 ... md: sde3 has different UUID to sdf4 md: sde2 has different UUID to sdf4 md: sde1 has different UUID to sdf4 md: adding sdd4 ... md: sdd3 has different UUID to sdf4 md: sdd2 has different UUID to sdf4 md: sdd1 has different UUID to sdf4 md: adding sdc4 ... md: sdc3 has different UUID to sdf4 md: sdc2 has different UUID to sdf4 md: sdc1 has different UUID to sdf4 md: adding sdb4 ... md: sdb3 has different UUID to sdf4 md: sdb2 has different UUID to sdf4 md: sdb1 has different UUID to sdf4 md: adding sda4 ... 
md: sda3 has different UUID to sdf4 md: sda2 has different UUID to sdf4 md: sda1 has different UUID to sdf4 md: created md5 md: bind md: bind md: bind md: bind md: bind md: bind md: running: raid5: device sdf4 operational as raid disk 5 raid5: device sde4 operational as raid disk 4 raid5: device sdd4 operational as raid disk 3 raid5: device sdc4 operational as raid disk 2 raid5: device sdb4 operational as raid disk 1 raid5: device sda4 operational as raid disk 0 raid5: allocated 6362kB for md5 raid5: raid level 5 set md5 active with 6 out of 6 devices, algorithm 2 RAID5 conf printout: --- rd:6 wd:6 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:1, dev:sde4 disk 5, o:1, dev:sdf4 md5: bitmap initialized from disk: read 11/11 pages, set 1 bits, status: 0 created bitmap (164 pages) for device md5 md: considering sdf3 ... md: adding sdf3 ... md: sdf2 has different UUID to sdf3 md: sdf1 has different UUID to sdf3 md: adding sde3 ... md: sde2 has different UUID to sdf3 md: sde1 has different UUID to sdf3 md: adding sdd3 ... md: sdd2 has different UUID to sdf3 md: sdd1 has different UUID to sdf3 md: adding sdc3 ... md: sdc2 has different UUID to sdf3 md: sdc1 has different UUID to sdf3 md: adding sdb3 ... md: sdb2 has different UUID to sdf3 md: sdb1 has different UUID to sdf3 md: adding sda3 ... md: sda2 has different UUID to sdf3 md: sda1 has different UUID to sdf3 md: created md4 md: bind md: bind md: bind md: bind md: bind md: bind md: running: raid10: raid set md4 active with 6 out of 6 devices md4: b
Re: increase Linux kernel address space 3.5 G memory on Redhat Enterprise
> Hi: > I am running Redhat Linux Enterprise version 4 update 4 on a dual-core > 4G memory machine. There are many references on the web talking about > increasing default user address space to 3.5 G however lacking specific > instructions. My questions: > > 1. What are the specific steps to be done for the kernel to support 3.5 G > address space? > 2. Do I need to re-compile kernel to make this happen? If so, any > specific instruction? 2. Yes, you need to re-compile the kernel. Instructions are all over the web. Basically, "cd /usr/src/linux", make sure a reasonable default .config file is installed (the distribution should supply one for its default kernel), "make menuconfig" or "make xconfig", change the options you want changed, then "make" and "make install". The latter *usually* works; the usual worst case is that it installs the kernel somewhere other than where your boot loader is looking, and rebooting will find the old kernel. The fun comes when you've left an option out of your new kernel that you need to boot - like the hard drive controller! Then you need to go back to an old, known working kernel. It's not at all difficult, but you do need to be careful; a mistake can be awkward to recover from if you don't plan ahead. 1. First of all, that's not necessarily a good idea. Doing that would limit you to 384 MB of kernel memory, after the usual 128 MB deduction for PCI devices. That has to fit the kernel binary, all page tables, inode cache, network buffers, and so on. For some workloads, that can be a bottleneck. If your application is heavily biased toward file data that the kernel doesn't have to look at, such as databases, it might be okay. A much better thing would be to take advantage of the fact that every multi-core processor I've heard of (IBM's POWER4, Sun Niagara, and a few by some companies you may not have heard of like Intel and AMD) is a 64-bit processor. So you can run a 64-bit kernel and get terabytes of user address space. Even 32-bit applications get a full 4G of address space, as the 64-bit kernel doesn't need to share. That would make your user application happier *and* the kernel happier. It would increase kernel data structure size, but it's still usually a net win. 1b. If you really want to do it, it's not a normally selectable option, but you can add it to arch/i386/Kconfig by following the pattern of the others. You need CONFIG_PAGE_OFFSET=0xE0000000, and you need to be sure that CONFIG_HIGHMEM64G is turned OFF. (If it's on, you're using PAE, and its 3-level page table structure requires using a 1G boundary.) Then configure the kernel, select CONFIG_EMBEDDED under "General setup", then your memory split under "Processor type and features". Compile, install the new kernel, and reboot. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
> 3 (I think) separate instances of this, each involving raid5. Is your > array degraded or fully operational? Ding! A drive fell out the other day, which is why the problems only appeared recently. md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0] 1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [_U] bitmap: 149/164 pages [596KB], 1024KB chunk H'm... this means that my alarm scripts aren't working. Well, that's good to know. The drive is being re-integrated now. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.6.20.3 AMD64 oops in CFQ code
c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c RIP [] cfq_dispatch_insert+0x18/0x68 RSP CR2: 0098 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
As an additional data point, here's a libata problem I'm having trying to rebuild the array. I have six identical 400 GB drives (ST3400832AS), and one is giving me hassles. I've run SMART short and long diagnostics, badblocks, and Seagate's "seatools" diagnostic software, and none of these find problems. It is the only one of the six with a non-zero reallocated sector count (it's 26). Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5 part. The RAID-10 part integrated successfully, but the RAID-5 got to about 60% and then puked: ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata5.00: configured for UDMA/100 ata5: EH complete SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB) sde: Write Protect is off SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata5: soft resetting port ata5: softreset failed (timeout) ata5: softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: follow-up softreset failed, retrying in 5 secs ata5: hard resetting port ata5: softreset failed (timeout) ata5: reset failed, giving up ata5.00: disabled ata5: EH complete sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 91795259 md: super_written gets error=-5, uptodate=0 raid10: Disk failure on sde3, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 481942994 raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices sd 4:0:0:0: SCSI error: return code = 0x0004 end_request: I/O error, dev sde, sector 481944018 md: md5: recovery done. 
RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 3, wo:1, o:0, dev:sde3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID10 conf printout: --- wd:5 rd:6 disk 0, wo:0, o:1, dev:sdb3 disk 1, wo:0, o:1, dev:sdc3 disk 2, wo:0, o:1, dev:sdd3 disk 4, wo:0, o:1, dev:sdf3 disk 5, wo:0, o:1, dev:sda3 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 4, o:0, dev:sde4 disk 5, o:1, dev:sdf4 RAID5 conf printout: --- rd:6 wd:5 disk 0, o:1, dev:sda4 disk 1, o:1, dev:sdb4 disk 2, o:1, dev:sdc4 disk 3, o:1, dev:sdd4 disk 5, o:1, dev:sdf4 The first error address is just barely inside the RAID-10 part (which ends at sector 91,795,410), while the second and third errors (at 481,942,994) look like where the reconstruction was working. Anyway, what's annoying is that I can't figure out how to bring the drive back on line without resetting the box. It's in a hot-swap enclosure, but power cycling the drive doesn't seem to help. I thought libata hotplug was working? (SiI3132 card, using the sil24 driver.) (H'm... after rebooting, reallocated sectors jumped from 26 to 39. Something is up with that drive.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.6.21
Today, 26 April, is the *21*st anniversary of the nuclear explosion at the Chernobyl station (ex-USSR). And Linux 2.6.*21* is released. Nice! - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> Yes, this will make msync(MS_ASYNC) more heavyweight again. But if an > application doesn't want to update the timestamps, it should just omit > this call, since it does nothing else. Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC) and doesn't care about timestamps. (In fact, sometimes it's configured to write to a raw device and there are no timestamps.) It's used as a poor man's portable async I/O. The application logs data to disk, and sometimes needs to sync it to disk to ensure it has all been written. To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC) as soon as a page is filled up to prompt asynchronous writeback. "I'm done writing this page and don't intend to write it again. Please start committing it to stable storage, but don't block me." Then, occasionally, there's an msync(MS_SYNC) call to be sure the data is synced to disk. This caused annoying hiccups before the MS_ASYNC calls were added. I agree that msync(MS_ASYNC) has no semantics if time is ignored. But it's a useful way to tell the OS that the page is not going to be dirtied again. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
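A minimal sketch of that logging pattern, assuming a page-aligned file-backed mapping obtained from mmap(); the structure and helper names are invented for illustration, not taken from the application described above.

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

struct maplog {
        char   *base;       /* page-aligned file-backed mapping from mmap() */
        size_t  pagesz;     /* page size, e.g. sysconf(_SC_PAGESIZE) */
        size_t  off;        /* next byte to write; caller keeps it in bounds */
};

static void maplog_append(struct maplog *log, const void *rec, size_t len)
{
        size_t old_page = log->off / log->pagesz;
        size_t new_page;

        memcpy(log->base + log->off, rec, len);
        log->off += len;
        new_page = log->off / log->pagesz;

        /*
         * Every page that has just been completely filled will not be
         * written again: hint that writeback can start now, without
         * blocking the logger.
         */
        if (new_page != old_page)
                msync(log->base + old_page * log->pagesz,
                      (new_page - old_page) * log->pagesz, MS_ASYNC);
}

static void maplog_checkpoint(struct maplog *log)
{
        /* The occasional "make sure it is really on disk" point. */
        msync(log->base, log->off, MS_SYNC);
}

The MS_ASYNC call is purely a writeback hint here; the occasional MS_SYNC is what actually guarantees the data is on stable storage, and it stays cheap only if the earlier hints kept the amount of dirty data small.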
Re: Why is NCQ enabled by default by libata? (2.6.20)
Here's some more data. 6x ST3400832AS (Seagate 7200.8) 400 GB drives. 3x SiI3232 PCIe SATA controllers 2.2 GHz Athlon 64, 1024k cache (3700+), 2 GB RAM Linux 2.6.20.4, 64-bit kernel Tested able to sustain reads at 60 MB/sec/drive simultaneously. RAID-10 is across 6 drives, first part of drive. RAID-5 most of the drive, so depending on allocation policies, may be a bit slower. The test sequence actually was: 1) raid5ncq 2) raid5noncq 3) raid10noncq 4) raid10ncq 5) raid5ncq 6) raid5noncq but I rearranged things to make it easier to compare. Note that NCQ makes writes faster (oh... I have write cacheing turned off; perhaps I should turn it on and do another round), but no-NCQ seems to have a read advantage. [EMAIL PROTECTED]@#ing bonnie++ overflows and won't print file read times; I haven't bothered to fix that yet. NCQ seems to have a pretty significant effect on the file operations, especially deletes. Update: added 7) wcache5noncq - RAID 5 with no NCQ but write cache enabled 8) wcache5ncq - RAID 5 with NCQ and write cache enabled RAID=5, NCQ Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP raid5ncq 7952M 31688 53 34760 10 25327 4 57908 86 167680 13 292.2 0 raid5ncq 7952M 30357 50 34154 10 24876 4 59692 89 165663 13 285.6 0 raid5noncq7952M 29015 48 31627 9 24263 4 61154 91 185389 14 286.6 0 raid5noncq7952M 28447 47 31163 9 23306 4 60456 89 198624 15 293.4 0 wcache5ncq7952M 32433 54 35413 10 26139 4 59898 89 168032 13 303.6 0 wcache5noncq 7952M 31768 53 34597 10 25849 4 61049 90 193351 14 304.8 0 raid10ncq 7952M 54043 89 110804 32 48859 9 58809 87 142140 12 363.8 0 raid10noncq 7952M 48912 81 68428 21 38906 7 57824 87 146030 12 358.2 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min/sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16:10:16/64 1351 25 + +++ 941 3 2887 42 31526 96 382 1 16:10:16/64 1400 18 + +++ 386 1 4959 69 32118 95 570 2 16:10:16/64 636 8 + +++ 176 0 1649 23 + +++ 245 1 16:10:16/64 715 12 + +++ 164 0 156 2 11023 32 2161 8 16:10:16/64 1291 26 + +++ 2778 10 2424 33 31127 93 483 2 16:10:16/64 1236 26 + +++ 840 3 2519 37 30366 91 445 2 16:10:16/64 1714 37 + +++ 1652 6 789 11 4700 14 12264 48 16:10:16/64 634 11 + +++ 1035 3 338 4 + +++ 1349 5 raid5ncq,7952M,31688,53,34760,10,25327,4,57908,86,167680,13,292.2,0,16:10:16/64,1351,25,+,+++,941,3,2887,42,31526,96,382,1 raid5ncq,7952M,30357,50,34154,10,24876,4,59692,89,165663,13,285.6,0,16:10:16/64,1400,18,+,+++,386,1,4959,69,32118,95,570,2 raid5noncq,7952M,29015,48,31627,9,24263,4,61154,91,185389,14,286.6,0,16:10:16/64,636,8,+,+++,176,0,1649,23,+,+++,245,1 raid5noncq,7952M,28447,47,31163,9,23306,4,60456,89,198624,15,293.4,0,16:10:16/64,715,12,+,+++,164,0,156,2,11023,32,2161,8 wcache5ncq,7952M,32433,54,35413,10,26139,4,59898,89,168032,13,303.6,0,16:10:16/64,1291,26,+,+++,2778,10,2424,33,31127,93,483,2 wcache5noncq,7952M,31768,53,34597,10,25849,4,61049,90,193351,14,304.8,0,16:10:16/64,1236,26,+,+++,840,3,2519,37,30366,91,445,2 raid10ncq,7952M,54043,89,110804,32,48859,9,58809,87,142140,12,363.8,0,16:10:16/64,1714,37,+,+++,1652,6,789,11,4700,14,12264,48 raid10noncq,7952M,48912,81,68428,21,38906,7,57824,87,146030,12,358.2,0,16:10:16/64,634,11,+,+++,1035,3,338,4,+,+++,1349,5 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is NCQ enabled by default by libata? (2.6.20)
>From [EMAIL PROTECTED] Tue Mar 27 16:25:58 2007 Date: Tue, 27 Mar 2007 12:25:52 -0400 (EDT) From: Justin Piszcz <[EMAIL PROTECTED]> X-X-Sender: [EMAIL PROTECTED] To: [EMAIL PROTECTED] cc: [EMAIL PROTECTED], [EMAIL PROTECTED], linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: Why is NCQ enabled by default by libata? (2.6.20) In-Reply-To: <[EMAIL PROTECTED]> References: <[EMAIL PROTECTED]> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed On Tue, 27 Mar 2007, [EMAIL PROTECTED] wrote: > Here's some more data. > > 6x ST3400832AS (Seagate 7200.8) 400 GB drives. > 3x SiI3232 PCIe SATA controllers > 2.2 GHz Athlon 64, 1024k cache (3700+), 2 GB RAM > Linux 2.6.20.4, 64-bit kernel > > Tested able to sustain reads at 60 MB/sec/drive simultaneously. > > RAID-10 is across 6 drives, first part of drive. > RAID-5 most of the drive, so depending on allocation policies, > may be a bit slower. > > The test sequence actually was: > 1) raid5ncq > 2) raid5noncq > 3) raid10noncq > 4) raid10ncq > 5) raid5ncq > 6) raid5noncq > but I rearranged things to make it easier to compare. > > Note that NCQ makes writes faster (oh... I have write cacheing turned off; > perhaps I should turn it on and do another round), but no-NCQ seems to have > a read advantage. [EMAIL PROTECTED]@#ing bonnie++ overflows and won't print > file > read times; I haven't bothered to fix that yet. > > NCQ seems to have a pretty significant effect on the file operations, > especially deletes. > > Update: added > 7) wcache5noncq - RAID 5 with no NCQ but write cache enabled > 8) wcache5ncq - RAID 5 with NCQ and write cache enabled > > > RAID=5, NCQ > Version 1.03 --Sequential Output-- --Sequential Input- > --Random- >-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec > %CP > raid5ncq 7952M 31688 53 34760 10 25327 4 57908 86 167680 13 292.2 > 0 > raid5ncq 7952M 30357 50 34154 10 24876 4 59692 89 165663 13 285.6 > 0 > raid5noncq7952M 29015 48 31627 9 24263 4 61154 91 185389 14 286.6 > 0 > raid5noncq7952M 28447 47 31163 9 23306 4 60456 89 198624 15 293.4 > 0 > wcache5ncq7952M 32433 54 35413 10 26139 4 59898 89 168032 13 303.6 > 0 > wcache5noncq 7952M 31768 53 34597 10 25849 4 61049 90 193351 14 304.8 > 0 > raid10ncq 7952M 54043 89 110804 32 48859 9 58809 87 142140 12 363.8 > 0 > raid10noncq 7952M 48912 81 68428 21 38906 7 57824 87 146030 12 358.2 > 0 > >--Sequential Create-- Random Create >-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files:max:min/sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec > %CP >16:10:16/64 1351 25 + +++ 941 3 2887 42 31526 96 382 1 >16:10:16/64 1400 18 + +++ 386 1 4959 69 32118 95 570 2 >16:10:16/64 636 8 + +++ 176 0 1649 23 + +++ 245 1 >16:10:16/64 715 12 + +++ 164 0 156 2 11023 32 2161 8 >16:10:16/64 1291 26 + +++ 2778 10 2424 33 31127 93 483 2 >16:10:16/64 1236 26 + +++ 840 3 2519 37 30366 91 445 2 >16:10:16/64 1714 37 + +++ 1652 6 789 11 4700 14 12264 48 >16:10:16/64 634 11 + +++ 1035 3 338 4 + +++ 1349 5 > > raid5ncq,7952M,31688,53,34760,10,25327,4,57908,86,167680,13,292.2,0,16:10:16/64,1351,25,+,+++,941,3,2887,42,31526,96,382,1 > raid5ncq,7952M,30357,50,34154,10,24876,4,59692,89,165663,13,285.6,0,16:10:16/64,1400,18,+,+++,386,1,4959,69,32118,95,570,2 > raid5noncq,7952M,29015,48,31627,9,24263,4,61154,91,185389,14,286.6,0,16:10:16/64,636,8,+,+++,176,0,1649,23,+,+++,245,1 > 
raid5noncq,7952M,28447,47,31163,9,23306,4,60456,89,198624,15,293.4,0,16:10:16/64,715,12,+,+++,164,0,156,2,11023,32,2161,8 > wcache5ncq,7952M,32433,54,35413,10,26139,4,59898,89,168032,13,303.6,0,16:10:16/64,1291,26,+,+++,2778,10,2424,33,31127,93,483,2 > wcache5noncq,7952M,31768,53,34597,10,25849,4,61049,90,193351,14,304.8,0,16:10:16/64,1236,26,+,+++,840,3,2519,37,30366,91,445,2 > raid10ncq,7952M,54043,89,110804,32,48859,9,58809,87,142140,12,363.8,0,16:10:16/64,1714,37,+,+++,1652,6,789,11,4700,14,12264,48 > raid10noncq,7952M,48912,81,68428,21,38906,7,57824,87,146030,12,358.2,0,16:10:16/64,634,11,+,+++,1035,3,338,4,+,+++,1349,5 > > I would try with write-caching enabled. I did. See the "wcache5" lines? > Also, the RAID5/RAID10 you mention seems lik
Re: Why is NCQ enabled by default by libata? (2.6.20)
> I meant you do not allocate the entire disk per raidset, which may alter > performance numbers. No, that would be silly. It does lower the average performance of the large RAID-5 area, but I don't know how ext3fs is allocating the blocks anyway, so > 04:00.0 RAID bus controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II > Controller (rev 01) > I assume you mean 3132 right? Yes; did I mistype? 02:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) 03:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) 04:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01) > I also have 6 seagates, I'd need to run one > of these tests on them as well, also you took the micro jumper off the > Seagate 400s in the back as well right? Um... no, I don't remember doing anything like that. What micro jumper? It's been a while, but I just double-checked the drive manual and it doesn't mention any jumpers. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add support for ITE887x serial chipsets
Minor point: the chip part numbers are actually IT887x, not ITE887x. I STFW for a data sheet, but didn't have immediate luck. Does anyone know where to find documentation? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67). Yes, I noticed. See http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html for a bug report on the subject from February 2006. That's why this application is still running on 2.4. As I mentioned at the time, the SUS says: (http://opengroup.org/onlinepubs/007908799/xsh/msync.html) "When MS_ASYNC is specified, msync() returns immediately once all the write operations are initiated or queued for servicing." You can argue that putting it on the dirty list constitutes "queued for servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed to start the I/O. Although strict standards-ese parsing says that either branch of an "or" is acceptable, it is a common English language convention that the first alternative is preferred and the second is a fallback. It makes sense in this case: start the write or, if that's not possible (the disk is already busy), queue it for service as soon as the disk is available. They perhaps didn't mandate it this strictly, but that's clearly the intent. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> Suggest you use msync(MS_ASYNC), then > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE). Thank you; I didn't know about that. And I can handle -ENOSYS by falling back to the old behavior. > We can fix your application, and we'll break someone else's. If you can point to an application that it'll break, I'd be a lot more understanding. Nobody did, last year. > I don't think it's solveable, really - the range of applications is so > broad, and the "standard" is so vague as to be useless. I agree that standards are sometimes vague, but that one seemed about as clear as it's possible to be without imposing unreasonably on the file system and device driver layers. What part of "The msync() function writes all modified data to permanent storage locations [...] For mappings to files, the msync() function ensures that all write operations are completed as defined for synchronised I/O data integrity completion." suggests that it's not supposed to do disk I/O? How is that uselessly vague? It says to me that msync's raison d'ĂȘtre is to write data from RAM to stable storage. If an application calls it too often, that's the application's fault just as if it called sync(2) too often. > This is why we've > been extending these things with linux-specific goodies which permit > applications to actually tell the kernel what they want to be done in a > more finely-grained fashion. Well, I still think the current Linux behavior is a bug, but there's a usable (and run-time compatible) workaround that doesn't unreasonably complicate the code, and that's good enough. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
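A minimal sketch of the suggested combination, with the -ENOSYS fallback mentioned above; the helper name and the idea of passing both the mapping address and the matching file offset are assumptions for illustration.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Start writeback of a dirty, page-aligned mapped range without waiting. */
static int start_writeback(int fd, void *addr, off_t offset, size_t len)
{
        /* Keep the portable hint (and the pre-2.6.17 behaviour). */
        if (msync(addr, len, MS_ASYNC) < 0)
                return -1;

        /* Explicitly kick off the I/O where sync_file_range() exists. */
        if (sync_file_range(fd, offset, len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) < 0) {
                if (errno == ENOSYS)
                        return 0;       /* older kernel: fall back to msync() alone */
                return -1;
        }
        return 0;
}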
Re: [patch resend v4] update ctime and mtime for mmaped write
> Linux _will_ write all modified data to permanent storage locations. > Since 2.6.17 it will do this regardless of msync(). Before 2.6.17 you > do need msync() to enable data to be written back. > > But it will not start I/O immediately, which is not a requirement in > the standard, or at least it's pretty vague about that. As I've said before, I disagree, but I'm not going to start a major flame war about it. The most relevant paragraph is: # When MS_ASYNC is specified, msync() returns immediately once all the # write operations are initiated or queued for servicing; when MS_SYNC is # specified, msync() will not return until all write operations are # completed as defined for synchronised I/O data integrity completion. # Either MS_ASYNC or MS_SYNC is specified, but not both. Note two things: 1) In the paragraphs before, what msync does is defined independently of the MS_* flags. Only the time of the return to user space varies. Thus, whatever the delay between calling msync() and the data being written, it should be the same whether MS_SYNC or MS_ASYNC is used. The implementation intended is: - Start all I/O - If MS_SYNC, wait for I/O to complete - Return to user space 2) "all the write operations are initiated or queued for servicing". It is a common convention in English (and most languages, I expect) that an "or" expresses a preference for the first alternative. The second is a permitted alternative if the first is not possible. And "queued for servicing", especially "initiated or queued for servicing", to me implies queuing while waiting for some resource. To have the resource being waited for be a timer expiry seems like rather a cheat to me. It perhaps doesn't break the letter of the standard, but it definitely bends it. It feels like a fiddle. Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished: "I don't expect to write this page any more, so now would be a good time to clean it." It would just make my life easier if the kernel procrastinated less. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch resend v4] update ctime and mtime for mmaped write
> But if you didn't notice until now, then the current implementation > must be pretty reasonable for your use as well. Oh, I definitely noticed. As soon as I tried to port my application to 2.6, it broke - as evidenced by my complaints last year. The current solution is simple - since it's running on dedicated boxes, leave them on 2.4. I've now got the hint on how to make it work on 2.6 (sync_file_range()), so I can try again. But the pressure to upgrade is not strong, so it might be a while. You may recall, this subthread started when I responded to "the only reason to use msync(MS_ASYNC) is to update timestamps" with a counterexample. I still think the purpose of the call is a hint to the kernel that writing to the specified page(s) is complete and now would be a good time to clean them. Which has very little to do with timestamps. Now, my application, which leaves less than a second between the MS_ASYNC and a subsequent MS_SYNC to check whether it's done, broke, but I can imagine similar cases where MS_ASYNC would remain a useful hint to reduce the sort of memory hogging generally associated with "dd if=/dev/zero" type operations. Reading between the lines of the standard, that seems (to me, at least) to obviously be the intended purpose of msync(MS_ASYNC). I wonder if there's any historical documentation describing the original intent behind creating the call. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is NCQ enabled by default by libata? (2.6.20)
> But when writing, what is the difference between queuing multiple tagged > writes, and sending down multiple untagged cached writes that complete > immediately and actually hit the disk later? Either way the host keeps > sending writes to the disk until its buffers are full, and the disk is > constantly trying to commit those buffers to the media in the most > optimal order. Well, theoretically it allows more buffering, without hurting read cacheing. With NCQ, the drive gets the command, and then tells the host when it wants the corresponding data. It can ask for the data in any order it likes, when it's decided which write will be serviced next. So it doesn't have to fill up its RAM with the write data. This leaves more RAM free for things like read-ahead. Another trick, that I know SCSI can do and I expect NCQ can do, is that the drive can ask for the data for a single write in different orders. This is particularly useful for reads, where a drive asked for blocks 100-199 can deliver blocks 150-199 first, then 100-149 when the drive spins around. This is, unfortunately, kind of theoretical. I don't actually know how hard drive cacheing algorithms work, but I assume it's mostly a readahead cache. The host has much more RAM than the drive, so any block that it's read won't be requested again for a long time. So the drive doesn't want to keep that in cache. But any sectors that the drive happens to read nearby requested sectors are worth keeping. I'm not sure it's a big deal, as 32 (tags) x 128K (largest LBA28 write size) is 4M, only half of a typical drive's cache RAM. But it's possible that there's some difference. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20.3 AMD64 oops in CFQ code
data 73728 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/70:10:62:30:ba/01:00:1c:00:00/40 tag 2 cdb 0x0 data 188416 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:18:d2:31:ba/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:20:d2:32:ba/01:00:1c:00:00/40 tag 4 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:28:d2:33:ba/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:30:d2:34:ba/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5.00: cmd 61/00:38:d2:35:ba/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out 14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) 14:56:13: ata5: soft resetting port 14:56:43: ata5: softreset failed (timeout) 14:56:43: ata5: softreset failed, retrying in 5 secs 14:56:48: ata5: hard resetting port 14:57:20: ata5: softreset failed (timeout) 14:57:20: ata5: follow-up softreset failed, retrying in 5 secs 14:57:25: ata5: hard resetting port 14:57:58: ata5: softreset failed (timeout) 14:57:58: ata5: reset failed, giving up 14:57:58: ata5.00: disabled 14:57:58: ata5: EH complete 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965522 14:57:58: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965266 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481965010 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964754 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964498 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481964130 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481963986 14:57:58: sd 4:0:0:0: SCSI error: return code = 0x0004 14:57:58: end_request: I/O error, dev sde, sector 481941210 14:57:58: md: md5: recovery done. 14:57:58: RAID5 conf printout: 14:57:58: --- rd:6 wd:5 14:57:58: disk 0, o:1, dev:sda4 14:57:58: disk 1, o:1, dev:sdb4 14:57:58: disk 2, o:1, dev:sdc4 14:57:58: disk 3, o:1, dev:sdd4 14:57:58: disk 4, o:0, dev:sde4 14:57:58: disk 5, o:1, dev:sdf4 14:57:58: RAID5 conf printout: 14:57:58: --- rd:6 wd:5 14:57:58: disk 0, o:1, dev:sda4 14:57:58: disk 1, o:1, dev:sdb4 14:57:58: disk 2, o:1, dev:sdc4 14:57:58: disk 3, o:1, dev:sdd4 14:57:58: disk 5, o:1, dev:sdf4 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 1/5] hwmon: (core) Inherit power properties to hdev
Quoting Nicolin Chen : The new hdev is a child device related to the original parent hwmon driver and its device. However, it doesn't support the power features, typically being defined in the parent driver. So this patch inherits three necessary power properties from the parent dev to hdev: power, pm_domain and driver pointers. Note that the dev->driver pointer is the place that contains a dev_pm_ops pointer defined in the parent device driver and the pm runtime core also checks this pointer: if (!cb && dev->driver && dev->driver->pm) Signed-off-by: Nicolin Chen --- Changelog v2->v3: * N/A v1->v2: * Added device pointers drivers/hwmon/hwmon.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c index 975c95169884..14cfab64649f 100644 --- a/drivers/hwmon/hwmon.c +++ b/drivers/hwmon/hwmon.c @@ -625,7 +625,12 @@ __hwmon_device_register(struct device *dev, const char *name, void *drvdata, hwdev->name = name; hdev->class = &hwmon_class; hdev->parent = dev; - hdev->of_node = dev ? dev->of_node : NULL; + if (dev) { + hdev->driver = dev->driver; + hdev->power = dev->power; + hdev->pm_domain = dev->pm_domain; + hdev->of_node = dev->of_node; + } We'll need to dig into this more; I suspect it may be inappropriate to do this. With this change, every hwmon driver supporting (runtime ?) suspend/resume will have the problem worked around in #5, and that just seems wrong. Guenter hwdev->chip = chip; dev_set_drvdata(hdev, drvdata); dev_set_name(hdev, HWMON_ID_FORMAT, id); -- 2.17.1
[PATCH] ftrace: Remove unused list 'ftrace_direct_funcs'
From: "Dr. David Alan Gilbert" Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped using 'ftrace_direct_funcs' (and the associated struct ftrace_direct_func). Remove them. Build tested only (on x86-64 with FTRACE and DYNAMIC_FTRACE enabled) Signed-off-by: Dr. David Alan Gilbert --- include/linux/ftrace.h | 1 - kernel/trace/ftrace.c | 8 2 files changed, 9 deletions(-) diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index 54d53f345d149..b01cca36147ff 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -83,7 +83,6 @@ static inline void early_trace_init(void) { } struct module; struct ftrace_hash; -struct ftrace_direct_func; #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_MODULES) && \ defined(CONFIG_DYNAMIC_FTRACE) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index da1710499698b..b18b4ece3d7c9 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -5318,14 +5318,6 @@ ftrace_set_addr(struct ftrace_ops *ops, unsigned long *ips, unsigned int cnt, #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -struct ftrace_direct_func { - struct list_headnext; - unsigned long addr; - int count; -}; - -static LIST_HEAD(ftrace_direct_funcs); - static int register_ftrace_function_nolock(struct ftrace_ops *ops); /* -- 2.45.0
[PATCH] virt: acrn: Remove unused list 'acrn_irqfd_clients'
From: "Dr. David Alan Gilbert" It doesn't look like this was ever used. Build tested only. Signed-off-by: Dr. David Alan Gilbert --- drivers/virt/acrn/irqfd.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c index d4ad211dce7a3..346cf0be4aac7 100644 --- a/drivers/virt/acrn/irqfd.c +++ b/drivers/virt/acrn/irqfd.c @@ -16,8 +16,6 @@ #include "acrn_drv.h" -static LIST_HEAD(acrn_irqfd_clients); - /** * struct hsm_irqfd - Properties of HSM irqfd * @vm:Associated VM pointer -- 2.45.0
[PATCH] ftrace: Remove unused global 'ftrace_direct_func_count'
From: "Dr. David Alan Gilbert" Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API") stopped setting the 'ftrace_direct_func_count' variable, but left it around. Clean it up. Signed-off-by: Dr. David Alan Gilbert --- include/linux/ftrace.h | 2 -- kernel/trace/fgraph.c | 11 --- kernel/trace/ftrace.c | 1 - 3 files changed, 14 deletions(-) diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index b01cca36147ff..e3a83ebd1b333 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -413,7 +413,6 @@ struct ftrace_func_entry { }; #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -extern int ftrace_direct_func_count; unsigned long ftrace_find_rec_direct(unsigned long ip); int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr); int unregister_ftrace_direct(struct ftrace_ops *ops, unsigned long addr, @@ -425,7 +424,6 @@ void ftrace_stub_direct_tramp(void); #else struct ftrace_ops; -# define ftrace_direct_func_count 0 static inline unsigned long ftrace_find_rec_direct(unsigned long ip) { return 0; diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index c83c005e654e3..a130b2d898f7c 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -125,17 +125,6 @@ int function_graph_enter(unsigned long ret, unsigned long func, { struct ftrace_graph_ent trace; -#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS - /* -* Skip graph tracing if the return location is served by direct trampoline, -* since call sequence and return addresses are unpredictable anyway. -* Ex: BPF trampoline may call original function and may skip frame -* depending on type of BPF programs attached. -*/ - if (ftrace_direct_func_count && - ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE)) - return -EBUSY; -#endif trace.func = func; trace.depth = ++current->curr_ret_depth; diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index b18b4ece3d7c9..adf34167c3418 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2538,7 +2538,6 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec) /* Protected by rcu_tasks for reading, and direct_mutex for writing */ static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH; static DEFINE_MUTEX(direct_mutex); -int ftrace_direct_func_count; /* * Search the direct_functions hash to see if the given instruction pointer -- 2.45.0
Re: [RFC] New kernel-message logging API
> I don't know. Compare the following two lines: > > printk(KERN_INFO "Message.\n"); > kprint_info("Message."); > > By dropping the lengthy macro (it's not like it's going to change > while we're running anyway, so why not make it a part of the function > name?) and the final newline, we actually end up with a net decrease > in line length. Agreed. In fact, you may want to write a header that implements the kprint_ functions in terms of printk for out-of-core driver writers to incorporate into their code bases, so they can upgrade their API while maintaining backward compatibility. (If it were me, I'd also give it a very permissive license, like outright public domain, to encourage use.) > I thought it would be nice to have something that looks familiar, > since that would ease an eventual transition. klog is a valid > alternative, but isn't kp a bit cryptic? Well, in context: kp_info("Message."); Even the "kp_" prefix is actually pretty unnecessary. It's "info" and a human-readable string that make it recognizable as a log message. Another reason to keep it short is just that it's going to get typed a LOT. Anyway, just MHO. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
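A sketch of what such a compatibility header might look like, assuming the kprint_<level>() names from the RFC being discussed (they are not an existing kernel API) and the usual printk severity prefixes; the trailing newline is appended by the macro, which requires the format string to be a literal, as it almost always is in printk-style calls.

/*
 * kprint-compat.h: thin printk() wrappers using the kprint_* spelling
 * proposed in the RFC, so out-of-tree code could adopt the new names
 * early while still building against kernels that only have printk().
 */
#ifndef _KPRINT_COMPAT_H
#define _KPRINT_COMPAT_H

#include <linux/kernel.h>

#define kprint_err(fmt, ...)     printk(KERN_ERR     fmt "\n", ##__VA_ARGS__)
#define kprint_warning(fmt, ...) printk(KERN_WARNING fmt "\n", ##__VA_ARGS__)
#define kprint_notice(fmt, ...)  printk(KERN_NOTICE  fmt "\n", ##__VA_ARGS__)
#define kprint_info(fmt, ...)    printk(KERN_INFO    fmt "\n", ##__VA_ARGS__)
#define kprint_debug(fmt, ...)   printk(KERN_DEBUG   fmt "\n", ##__VA_ARGS__)
/* ...and so on for whatever other levels the final API settles on. */

#endif /* _KPRINT_COMPAT_H */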
Re: [RFC] New kernel-message logging API
>> Even the "kp_" prefix is actually pretty unnecessary. It's "info" >> and a human-readable string that make it recognizable as a log message. > While I agree a prefix isn't necessary, info, warn, err > are already frequently #define'd and used. > > kp_ isn't currently in use. > > $ egrep -r -l --include=*.[ch] > "^[[:space:]]*#[[:space:]]*define[[:space:]]+(info|err|warn)\b" * | wc -l > 29 Sorry for being unclear. I wasn't seriously recommending no prefix, due to name collisions (exactly your point), but rather saying that no prefix is necessary for human understanding. Something to avoid the ambiguity is still useful. I was just saying that it can be pretty much anything withouyt confusing the casual reader. We're in violent agreement, I just didn't say it very well the first time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] New kernel-message logging API (take 2)
> Example: { > struct kprint_block out; > kprint_block_init(&out, KPRINT_DEBUG); > kprint_block(&out, "Stack trace:"); > > while(unwind_stack()) { > kprint_block(&out, "%p %s", address, symbol); > } > kprint_block_flush(&out); > } Assuming that kprint_block_flush() is a combination of kprint_block_printit() and kprint_block_abort(), you could make a macro wrapper for this to preclude leaks: #define KPRINT_BLOCK(block, level, code) \ do { \ struct kprint_block block; \ kprint_block_init(&block, KPRINT_##level); \ do { \ code ; \ kprint_block_printit(&block); \ } while (0); \ kprint_block_abort(&block); \ } while(0) The inner do { } while(0) region is so you can abort with "break". (Or you can split it into KPRINT_BEGIN() and KPRINT_END() macros, if that works out to be cleaner.) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
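A usage sketch for the wrapper above, reusing the stack-trace example from the quoted proposal; unwind_stack(), address and symbol are placeholders from that example, and kprint_block() itself is the RFC API rather than an existing kernel call. A bare break at the top level of the passed code skips kprint_block_printit(), so a partially built message is abandoned rather than printed.

static void dump_trace_example(void)
{
        KPRINT_BLOCK(out, DEBUG,
                kprint_block(&out, "Stack trace:");
                while (unwind_stack())
                        kprint_block(&out, "%p %s", address, symbol)
        );
}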
2.6.23-rc8 network problem. Mem leak? ip1000a?
Uniprocessor Althlon 64, 64-bit kernel, 2G ECC RAM, 2.6.23-rc8 + linuxpps (5.0.0) + ip1000a driver. (patch from http://marc.info/?l=linux-netdev&m=118980588419882) After a few hours of operation, ntp loses the ability to send packets. sendto() returns -EAGAIN to everything, including the 24-byte UDP packet that is a response to ntpq. -EAGAIN on a sendto() makes me think of memory problems, so here's meminfo at the time: ### FAILED state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 15332 kB Buffers:665608 kB Cached: 18212 kB SwapCached: 0 kB Active: 380384 kB Inactive: 355020 kB SwapTotal: 5855208 kB SwapFree: 5854552 kB Dirty: 28504 kB Writeback: 0 kB AnonPages: 51608 kB Mapped: 11852 kB Slab: 1285348 kB SReclaimable: 152968 kB SUnreclaim:1132380 kB PageTables: 3888 kB NFS_Unstable:0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 590528 kB VmallocTotal: 34359738367 kB VmallocUsed:265628 kB VmallocChunk: 34359472059 kB Killing and restarting ntpd gets it running again for a few hours. Here's after about two hours of successful operation. (I'll try to remember to run slabinfo before killing ntpd next time.) ### WORKING state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 20252 kB Buffers:242688 kB Cached: 41556 kB SwapCached:200 kB Active: 285012 kB Inactive: 147348 kB SwapTotal: 5855208 kB SwapFree: 5854212 kB Dirty: 36 kB Writeback: 0 kB AnonPages: 148052 kB Mapped: 12756 kB Slab: 1582512 kB SReclaimable: 134348 kB SUnreclaim:1448164 kB PageTables: 4500 kB NFS_Unstable:0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 689956 kB VmallocTotal: 34359738367 kB VmallocUsed:265628 kB VmallocChunk: 34359472059 kB # /usr/src/linux/Documentation/vm/slabinfo Name Objects ObjsizeSpace Slabs/Part/Cpu O/S O %Fr %Ef Flg :016 1478 1624.5K 6/3/1 256 0 50 96 * :024 170 24 4.0K 1/0/1 170 0 0 99 * :032 1339 3245.0K 11/2/1 128 0 18 95 * :040 102 40 4.0K 1/0/1 102 0 0 99 * :064 5937 64 413.6K 101/15/1 64 0 14 91 * :07256 72 4.0K 1/0/1 56 0 0 98 * :088 6946 88 618.4K151/0/1 46 0 0 98 * :096 23851 96 2.5M 616/144/1 42 0 23 90 * :128 730 128 114.6K 28/6/1 32 0 21 81 * :136 232 13636.8K 9/6/1 30 0 66 85 * :192 474 19298.3K 24/4/1 21 0 16 92 * :256 1385376 256 354.6M 86587/0/1 16 0 0 99 * :32012 304 4.0K 1/0/1 12 0 0 89 *A :384 359 384 180.2K44/23/1 10 0 52 76 *A :512 1384316 512 708.7M 173040/1/18 0 0 99 * :64072 61653.2K 13/5/16 0 38 83 *A :704 1870 696 1.3M170/0/1 11 1 0 93 *A :0001024 4271024 454.6K111/9/14 0 8 96 * :0001472 1501472 245.7K 30/0/15 1 0 89 * :00020481589912048 325.7M 39759/25/14 1 0 99 * :0004096514096 245.7K 30/9/12 1 30 85 * Acpi-State 51 80 4.0K 1/0/1 51 0 0 99 anon_vma 1032 1628.6K 7/5/1 170 0 71 57 bdev_cache 43 72036.8K 9/1/15 0 11 83 Aa blkdev_requests 42 28812.2K 3/0/1 14 0 0 98 buffer_head 59173 10411.1M2734/1690/1 39 0 61 54 a cfq_io_context 223 15240.9K 10/6/1 26 0 60 82 dentry 98641 19219.7M 4813/274/1 21 0 5 96 a ext3_inode_cache115690 68886.3M 10545/77/1 11 1 0 92 a file_lock_cache 23 168 4.0K 1/0/1 23 0 0 94 idr_layer_cache118 52869.6K 17/1/17 0 5 89 inode_cache 1365 528 798.7K195/0/17 0 0 90 a kmalloc-131072 1 131072 131.0K 1/0/11 5 0 100 kmalloc-163848 16384 131.0K 8/0/11 2 0 100 kmalloc-327681 3276832.7K 1/0/11 3 0 100 kmalloc-8 1535 812.2K 3/1/1 512 0 33 99 kmalloc-819