Re: [ANNOUNCE] block device interfaces changes
Richard B. Johnson writes:

> On Sun, 9 Jan 2000, Alexander Viro wrote:
> [SNIPPED]
>
> ... that we provide source to the end-user, they required that we
> supply a "current" distribution of Linux if the end-user requests it.

Oh. My. God. They are requiring you to do WHAT??? Do you mean that you
really ship 2.3.x to your customers? Arrggh. "Source" == "source of
what we are shipping", and not "anything that was written by other guys
who started from the same source". It's utter nonsense. _No_ license
can oblige you to include the modifications done by somebody else.
Otherwise you'd have those drivers in the main tree, BTW - _that_ much
should be clear even for your LD (legal department).

[snip]

> The obvious solution, given these constraints, is that we just ignore
> all changes until shipping time, then attempt to compile with the
> latest distribution, fixing all the problems at once. However, we
> then end up shipping untested software, which ends up being another
> problem. Checking to see if it "runs" isn't testing software in the
> cold cruel world of industry.

You do realize that the stability of the system doesn't exceed that of
the weakest link, don't you? You _are_ shipping untested software if
you are shipping 2.3.whatever + your drivers. It's called unstable for
a good reason. Ouch... OK, what if Linus puts a
pre-patch-2.3.39-dont-even-think-of-using-it-anywhere-near-your-data-3.gz
on ftp.kernel.org tomorrow? Will your LD require you to ship _that_?
No? Is the notion of "untested software" completely alien to them?

BTW, you could point them to Debian or RH - neither of them ships 2.3.x
in released versions, _and_ what they ship is not even the latest
existing 2.2.x. Hell, Debian 2.1 ships with 2.0 - the switch to 2.2 is
in potato (== Debian 2.2 to be). RH uses 2.2.12, AFAICS (with a lot of
patches). And all of them have darn good reasons to do so - stability
being the first one. Is there any chance to get your legal folks
talking with RH lawyers? Or Caldera or Corel ones...

> So, presently, I have 13 drivers I have to keep "current". Yesterday
> they all got broken again. A week before, half of them were broken
> because somebody didn't like a variable name!

Which might mean that a repository of pointers to 3rd-party drivers
(along with the contact info) might be a Good Thing(tm). I would
suggest the following: keep this information in DNS (an RBL-like
scheme; i.e. driver_name.author_or_company_name.drivers.linux.org
having a TXT record with the URL and kernel version(s) in the body).
Then all you need is (a) a standard address (e.g. [EMAIL PROTECTED])
aliased to a script; (b) said script verifying (PGP, GPG, whatever)
the source of the mail and updating the record. IOW, all it really
takes is somebody with a nameserver, clue and decent connectivity.
Any takers? (A lookup sketch follows below.)

Richard B. Johnson replies:

Again, since there was so much mail on this, I will answer this one
only, with a cc to linux-kernel. The idea is that once something gets
"released", it gets built with whatever distribution is available at
that time. That distribution is shipped (if required). It's just like
DOS/Windows/SunOS/Solaris, etc.: you ship with the "current"
distribution. Certainly a customer would be really pissed to get a new
product with a two-year-old version of software. Once the product is
shipped, we can make "service-pack" updates just like everybody else,
which are not free.
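Going back to the DNS-registry idea above: a minimal userspace sketch
of the proposed lookup, assuming the hypothetical drivers.linux.org
zone existed and served TXT records. The zone and the driver/vendor
labels are illustrative; the resolver calls (res_query, ns_initparse,
ns_parserr) are standard libresolv. Build with -lresolv.

    /* Sketch: fetch the TXT record for a hypothetical
     * driver_name.vendor.drivers.linux.org entry.  The zone does not
     * actually exist; this only illustrates the proposed scheme.
     */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <arpa/nameser.h>
    #include <resolv.h>

    int main(void)
    {
            unsigned char answer[NS_PACKETSZ];
            ns_msg msg;
            ns_rr rr;
            int len, i;

            len = res_query("frobdrv.somevendor.drivers.linux.org",
                            ns_c_in, ns_t_txt, answer, sizeof(answer));
            if (len < 0) {
                    fprintf(stderr, "no registry record found\n");
                    return 1;
            }
            if (ns_initparse(answer, len, &msg) < 0)
                    return 1;
            for (i = 0; i < ns_msg_count(msg, ns_s_an); i++) {
                    if (ns_parserr(&msg, ns_s_an, i, &rr) < 0)
                            continue;
                    /* TXT rdata: a length byte followed by that many
                     * bytes of text (URL + kernel versions here).   */
                    printf("%.*s\n", ns_rr_rdata(rr)[0],
                           (const char *)(ns_rr_rdata(rr) + 1));
            }
            return 0;
    }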
What this means for us is keeping our development software
sufficiently up to date that no radical changes are required once a
release is imminent. In no case do we intend to ship using
"development" kernels. But to keep up to date for a pending release,
we need to be using development kernels in engineering. I never
attempt to use any of the "pre-NN" intermediate stuff anyway.

A lot of people are going to have to do the same as Linux works its
way from being just a desktop OS to something used to replace Sun
software and Windows in industrial applications.

The embedded market doesn't have to worry about releases and version
numbers because the customer usually couldn't "upgrade" anyway. I have
a "platinum" project which uses some old stable version of Linux that
I happen to like. It's an embedded system that runs VXI bus
instruments. It works just fine. That kernel will never have to be
changed, even if bugs are found. They can be fixed with the reset
button.

There are two major interfaces that are critical to success:

(1) The POSIX/C-interface API.
(2) The module API.

Much care has been taken to assure that (1) stays stable. We need some
such care on (2). In particular, if major changes are made, they
should be terminating, i.e., made in such a way that they are unlikely
to ever have to be made again. For example, open(), read(), write(),
and ioctl() in the kernel seem to have reached such a terminating
form.
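For reference, here is what the stable half of interface (2) looks
like from a driver's point of view: a minimal sketch of a character
driver against the 2.2-era module API. All demo_* names and the major
number are illustrative, not from the original post, and this targets
a 2.2-era kernel tree only.

    /* Minimal 2.2-era character driver skeleton.  The pieces shown
     * here - struct file_operations, register_chrdev(), the module
     * use count - are exactly the interfaces the post asks to be
     * kept stable.
     */
    #include <linux/module.h>
    #include <linux/fs.h>

    #define DEMO_MAJOR 42          /* illustrative major number */

    static ssize_t demo_read(struct file *file, char *buf,
                             size_t count, loff_t *ppos)
    {
            return 0;              /* always EOF in this sketch */
    }

    static int demo_open(struct inode *inode, struct file *file)
    {
            MOD_INC_USE_COUNT;     /* pin the module while open */
            return 0;
    }

    static int demo_release(struct inode *inode, struct file *file)
    {
            MOD_DEC_USE_COUNT;
            return 0;
    }

    static struct file_operations demo_fops = {
            read:    demo_read,
            open:    demo_open,
            release: demo_release,
    };

    int init_module(void)
    {
            return register_chrdev(DEMO_MAJOR, "demo", &demo_fops);
    }

    void cleanup_module(void)
    {
            unregister_chrdev(DEMO_MAJOR, "demo");
    }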
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: (...) 3) The soft-raid backround rebuild code reads and writes through the buffer cache with no synchronisation at all with other fs activity. After a crash, this background rebuild code will kill the write-ordering attempts of any journalling filesystem. This affects both ext3 and reiserfs, under both RAID-1 and RAID-5. Interaction 3) needs a bit more work from the raid core to fix, but it's still not that hard to do. So, can any of these problems affect other, non-journaled filesystems too? Yes, 1) can: throughout the kernel there are places where buffers are modified before the dirty bits are set. In such places we will always mark the buffers dirty soon, so the window in which an incorrect parity can be calculated is _very_ narrow (almost non-existant on non-SMP machines), and the window in which it will persist on disk is also very small. This is not a problem. It is just another example of a race window which exists already with _all_ non-battery-backed RAID-5 systems (both software and hardware): even with perfect parity calculations, it is simply impossible to guarantee that an entire stipe update on RAID-5 completes in a single, atomic operation. If you write a single data block and its parity block to the RAID array, then on an unexpected reboot you will always have some risk that the parity will have been written, but not the data. On a reboot, if you lose a disk then you can reconstruct it incorrectly due to the bogus parity. THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the only way you can get bitten by this failure mode is to have a system failure and a disk failure at the same time. --Stephen thank you very much for these clear explanations, Last doubt: :-) Assume all RAID code - FS interaction problems get fixed, since a linux soft-RAID5 box has no battery backup, does this mean that we will loose data ONLY if there is a power failure AND successive disk failure ? If we loose the power and then after reboot all disks remain intact can the RAID layer reconstruct all information in a safe way ? The problem is that power outages are unpredictable even in presence of UPSes therefore it is important to have some protection against power losses. regards, Benno.
[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

This is a FAQ: I've answered it several times, but in different
places, so here's a definitive answer which will be my last one:
future questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

>> then raid can miscalculate parity by assuming that the buffer
>> matches what is on disk, and that can actually cause damage to
>> other data than the data being written if a disk dies and we have
>> to start using parity for that stripe.
>
> do you know if using soft RAID5 + regular ext2 causes the same sort
> of damage, or if the corruption chances are lower when using a
> non-journaled FS?

Sort of. See below.

> is the potential corruption caused by the RAID layer or by the FS
> layer? (does the FS code or the RAID code need to be fixed?)

It is caused by neither: it is an interaction effect.

> if it's caused by the FS layer, how do XFS (not here yet ;-) ) or
> ReiserFS behave in this case?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering. There is no policy to guide what gets
written and when: writeback caching can trickle data to disk at any
time, and other system components such as filesystems and the VM can
force a write-back of data to disk at any time. Journaling imposes
write-ordering constraints which insist that data in the buffer cache
*MUST NOT* be written to disk unless the filesystem explicitly says
so. RAID-5 needs to interact directly with the buffer cache in order
to be able to improve performance. There are three nasty interactions
which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the
data in a stripe gets written to disk at once. For RAID-5, this is
very much faster than dribbling the stripe back one disk at a time.
Unfortunately, this can result in dirty buffers being written to disk
earlier than the filesystem expected, with the result that on a crash,
the filesystem journal may not be entirely consistent. This
interaction hits ext3, which stores its pending transaction buffer
updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
order to calculate parity without reading all of the disks in a
stripe. If a journaling system tries to prevent modified data from
being flushed to disk by deferring the setting of the buffer dirty
flag, then RAID-5 will think that the buffer, being clean, matches the
state of the disk, and so it will calculate parity which doesn't
actually match what is on disk. If we crash and one disk fails on
reboot, wrong parity may prevent recovery of the lost data. This
interaction hits reiserfs, which stores its pending transaction buffer
updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have a raised b_count reference count, and by
making sure that the filesystems all hold that count raised whenever
the buffers are in an inconsistent or pinned state (a sketch follows
below).

3) The soft-raid background rebuild code reads and writes through the
buffer cache with no synchronisation at all with other fs activity.
After a crash, this background rebuild code will kill the
write-ordering attempts of any journalling filesystem. This affects
both ext3 and reiserfs, under both RAID-1 and RAID-5. Interaction 3)
needs a bit more work from the raid core to fix, but it's still not
that hard to do.
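A sketch of the b_count rule proposed for interactions 1) and 2),
assuming a 2.2-era struct buffer_head (where b_count is a plain
reference count); raid5_may_use_cached() is an illustrative name, not
actual kernel code:

    /* Fragment against a 2.2-era <linux/fs.h>.  RAID-5 would call
     * something like this before trusting cached buffer contents
     * for parity calculation or opportunistic stripe write-out.
     */
    #include <linux/fs.h>

    static int raid5_may_use_cached(struct buffer_head *bh)
    {
            /* One reference belongs to our own cache lookup.  Any
             * extra reference means a filesystem has pinned the
             * buffer - its contents may not match what is on disk -
             * so leave it alone and read the block from disk
             * instead.
             */
            if (bh->b_count > 1)
                    return 0;
            return 1;
    }

The filesystem side of the bargain is the converse: hold b_count
raised for the whole window in which a journalled buffer is pinned or
inconsistent, so the test above fails exactly when it should.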
> So, can any of these problems affect other, non-journaled
> filesystems too?

Yes, 1) can: throughout the kernel there are places where buffers are
modified before the dirty bits are set. In such places we will always
mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem. It is just another example of a race window
which already exists with _all_ non-battery-backed RAID-5 systems
(both software and hardware): even with perfect parity calculations,
it is simply impossible to guarantee that an entire stripe update on
RAID-5 completes in a single, atomic operation. If you write a single
data block and its parity block to the RAID array, then on an
unexpected reboot you will always have some risk that the parity will
have been written, but not the data. On a reboot, if you lose a disk
then you can reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and
the only way you can get bitten by this failure mode is to have a
system failure and a disk failure at the same time.

--Stephen
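A toy illustration of that window, not from the original mail: XOR
parity over three one-byte "disks", with a crash landing between the
data write and the matching parity write.

    /* Toy model of the RAID-5 write hole described above.  Plain
     * userspace C, one byte per disk; illustrative only.
     */
    #include <stdio.h>

    int main(void)
    {
            unsigned char d0 = 0xAA, d1 = 0x55, d2 = 0x0F;
            unsigned char parity = d0 ^ d1 ^ d2;   /* consistent */

            d0 = 0x11;  /* new data for disk 0 hits the platter... */
            /* ...crash: the matching parity update never happens. */

            /* After reboot, disk 1 dies; rebuild it from parity:  */
            unsigned char rebuilt = d0 ^ d2 ^ parity;

            printf("real d1 = 0x55, rebuilt d1 = 0x%02x\n", rebuilt);
            return 0;
    }

The rebuilt block (0xEE here) differs from the real one: stale parity
plus new data reconstructs garbage on a block that was not even being
written - exactly the multiple-failure mode Stephen describes.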
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 20:17:22 +0100, Benno Senoner [EMAIL PROTECTED]
said:

> Assume all the RAID code / FS interaction problems get fixed. Since
> a Linux soft-RAID5 box has no battery backup, does this mean that we
> will lose data ONLY if there is a power failure AND a subsequent
> disk failure? If we lose the power, and after reboot all disks
> remain intact, can the RAID layer reconstruct all information in a
> safe way?

Yes.

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: Hi, This is a FAQ: I've answered it several times, but in different places, SNIP THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the only way you can get bitten by this failure mode is to have a system failure and a disk failure at the same time. To try to avoid this kind of problem some brands do have additional logging (to disk which is slow for sure or to NVRAM) in place, which enables them to at least recognize the fault to avoid the reconstruction of invalid data or even enables them to recover the data by using redundant copies of it in NVRAM + logging information what could be written to the disks and what not. Heinz