Plugin for corruption resistance?
Anyone ever given a thought to adding support to reiserfs to store a cryptographic checksum along with a file? The idea is that files get a hidden attribute that contains their SHA1 hash. If the file is modified, the hash is marked as 'unclean'. A trusted cleaner comes by eventually and hashes the file, or the file is hashed right away if someone tries to read the attribute while the file is unclean. Fsck could optionally be told to check the hash on every file. Files could also be tested by a background process that randomly tests some files every night.

Why would this be useful?

1. Lots of applications today (such as P2P sharing systems) need the hashes of files; it's inefficient to keep recomputing them. The file system always knows when a file changes, so it can be set up to always return the correct hash.

2. Random disk corruption can go undetected (even if the drive's ECC is sufficient to prevent corruption, there could be memory, bus, or kernel issues that corrupt data; a hash would help detect it).

3. Although there are encrypted block devices available in Linux, none of them provide authentication. So it's possible for an attacker (with access to your disk) to replace chunks of files with random (and potentially chosen, depending on the chaining mode) data without detection.

4. It could greatly speed up casual verification of files for changes (if you don't trust the kernel to report the true hash, then you couldn't trust it to return the real file to some userspace file verifier either). It could also be used to help locate duplicates very efficiently.
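A userspace approximation of the scheme described above (hypothetical helper names; the real proposal would put this inside the filesystem, where the "mark unclean" step is free because the FS sees every write) would cache the SHA1 alongside the file and mark it unclean on modification:

```python
import hashlib

# Hypothetical attribute store standing in for the proposed hidden FS attribute.
# Each entry: {"sha1": hex digest, "clean": bool}
attrs = {}

def mark_unclean(path):
    """Called by the (hypothetical) write hook whenever the file is modified."""
    if path in attrs:
        attrs[path]["clean"] = False

def get_hash(path):
    """Return the cached SHA1, rehashing lazily if the cache is unclean."""
    entry = attrs.get(path)
    if entry is None or not entry["clean"]:
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        attrs[path] = {"sha1": h.hexdigest(), "clean": True}
    return attrs[path]["sha1"]
```

The lazy-rehash policy means a file rewritten a thousand times between reads of the attribute is only hashed once.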
Re: Plugin for corruption resistance?
On Thu, 17 Feb 2005 23:28:09 -0500, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > On the flip side, hash functions like MD5 or the SHA family are fairly > bulletproof, > but are essentially impossible to develop an incremental update for (if there > existed a fast incremental update for the hash function, that would imply a > very low preimage resistance, rendering it useless as a cryptographic hash). Tree hashes. Divide the file into blocks of N bytes and compute size/N leaf hashes. Group the hashes into pairs and hash each pair, yielding half as many second-level hashes; this is fast because hashes are small. Group the second-level hashes into pairs and hash again, and so on, until you reduce to a single root hash. A number of useful tradeoffs are possible: by enlarging N you improve the strength along various cryptographic dimensions; by changing the fanout, and deciding how many of the leaf and intermediate hashes you store at each level, you decide how easy it is to update the hash and what the smallest increment you can test is. You trade off storage (and a little computation) for this ease. > Also, there's another issue - unlike standard ECC codes that can actually > *fix* > the problem (for at least small number of bit errors), it's unclear what you > should > do if you find a mismatch between the hash of a block and the block contents, > as > you don't know whether it's the actual data or the hash that's corrupted In my initial suggestion I offered that hashes could be verified by a userspace daemon, or by fsck (since it's an expensive operation). Such policy could be controlled in the daemon; in most cases I'd like it to make the file inaccessible until I go and fix it by hand. It would also be useful to have the checker daemon watch the logs (or receive notifications through some kernel interface), and have any block-level errors (or smartd errors) backprojected up (through raid and lvm remappings) to the file system level. After identifying the potentially corrupted file, it could then test the file.
If the file has been corrupted, the configured action is taken. If this policy is in userspace, the level of action sophistication could be very high: for example, if I was on a distribution with package management, and the file was outside of /home, and the package flags didn't indicate it was a config file, then go fetch the package, replace the file, and send me an email so I don't forget how wonderful my OS is. :)
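The tree-hash construction described above can be sketched in a few lines (a toy illustration, not reiser4 code; the block size and choice of SHA1 are arbitrary):

```python
import hashlib

BLOCK = 4096  # N: leaf block size, chosen arbitrarily for illustration

def split(data):
    """Divide data into blocks of N bytes (at least one block)."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]

def tree_hash(blocks):
    """Hash each block, then repeatedly pair-and-hash until one root remains."""
    level = [hashlib.sha1(b).digest() for b in blocks]
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            # Odd leftover node is promoted by hashing it alone.
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else b"")
            paired.append(hashlib.sha1(pair).digest())
        level = paired
    return level[0]
```

An incremental update rehashes only the changed leaf and the O(log n) interior nodes above it, which is exactly why keeping the intermediate hashes makes updates cheap.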
Re: Plugin for corruption resistance?
On Fri, 18 Feb 2005 17:09:00 -0500, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > On Fri, 18 Feb 2005 08:36:51 EST, Gregory Maxwell said: > > > Tree hashes. > > Divide the file into blocks of N bytes. Compute size/N hashes. > > Group hashes into pairs. Compute N/2 N' hashes, this is fast because > > hashes are small. Group N' hashes into pairs compute N'/2 N'' hashes > > etc.. Reduce to a single hash. > > You get massively I/O bound real fast this way. You may want to re-evaluate > whether this *really* buys you anything, especially if you're not using some > sort of guarantee that you know what's actually b0rked... I brought up tree hashes because someone pointed out there was no way to incrementally update a normal hash. Tree hashes can easily be incrementally updated if you keep all the sub-parts. I don't think that would suddenly make it useful for frequently updated files. > > In my initial suggestion I offered that hashes could be verified by a > > userspace daemon, or by fsck (since it's an expensive operation)... > > Such policy could be controlled in the daemon. > > In most cases I'd like it to make the file inaccessible until I go and > > fix it by hand. > > You're still missing the point that in general, you don't have a way to tell > whether > the block the file lived in went bad, or the block the hash lived in went bad. I'm not missing the point. Compare the number of disk blocks a file takes vs. the hash. Compare the ease of atomically updating the hash vs. atomically updating the file data. If they don't match, it is far more likely that the file has been silently corrupted than that the hash has been. In either case, something seriously wrong has happened (i.e. *any* data has been corrupted without triggering alarms elsewhere). Wetware will be required to figure out what is going on, and perhaps correct a serious problem before it eats the whole file system...
Automagic correction of stuff that is automagically correctable is useful in that it might prevent something worse from happening... For example, if the corrupted file was /sbin/init, regardless of the cause of the problem I'd be glad if the system took some action while the wetware was in an uninterruptible sleep. ;) > Sure, if the file *happens* to be ascii text, you can use Wetware 1.5 to scan > the file and tell which one went bad. However, you'll need Wetware 2.0 to > do the same for your multi-gigabyte Oracle database... :) Such a proposed system would likely not be all that useful on a live database; the overhead of computing hashes would likely be too great. Rather, it would be useful if the database system used its knowledge of how data was stored to do this efficiently. If the database system were written with reiserfs in mind and, rather than using a couple of big opaque files, stored its data in tens of thousands of files... then perhaps such a hashing scheme might actually work out okay. > (And yes, I *have* seen cases where Tripwire went completely and totally > bananas > and claimed zillions of files were corrupted, when the *real* problem was that > the Tripwire database itself had gotten stomped on - so it's *not* a purely > theoretical issue The discussion is to store the hash in the file metadata... If that is getting stomped on, it's a *good* thing if the system goes totally bananas. In a great many situations I'd rather lose a file completely than have some random bytes in it silently corrupted. (And of course, attaching hashes doesn't mean you lose the file... it means it gets brought to your attention.) As things stand today, there are hundreds of ways a system could end up with files getting silently corrupted. Many of them would be fairly difficult to detect until it's far too late (to recover cleanly or even detect the root cause).
Right now most distros have a package management system that can detect changes in some system files, which is useful against a small subset of these problems, but not most, since it will only detect problems in files that almost never change. The proposed system of attaching hashes in metadata would protect all files that are not constantly updated (which rules out databases and single-file mailboxes), but could protect most everything else. And the things that can't be protected could be, with changes to their operation that would be worth making for reiserfs for other reasons anyway (there is no performance reason in reiserfs to make a mailbox a single file, for example). Furthermore, attached hashes could greatly speed up applications using hashes in a way that no userspace solution can: userspace solutions can't maintain a cache of the files' hashes because they have no way to be *sure* that the file wasn't modified since the hash was computed.
Re: reiser4 plugins
On 6/25/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > I wonder if Apple is a better > social environment for developers these days than Linux? It would be > fun to work with Steve Jobs, he has such a sense of vision and a delight > in new things. He hires good people too; Dominic Giampaolo is really > sharp. And trade freedom for this? In the current situation you can release code even if the powers that be do not agree. Others can choose to integrate and distribute your code, .. they have and they will.. If you were at apple and some authority decided against a feature you and your team would likely not be permitted to develop it, and if developed it would not be distributable. The situation right now is somewhat unfortunate, but it is not unsolvable. Trading away freedom is seldom a good deal.
Re: reiser4 plugins
On 6/26/05, Lincoln Dale <[EMAIL PROTECTED]> wrote: > the l-k community have asked YOU may times. any lack of response isn't > because of the kernel cabal .. its because YOU are refusing to entertain > any notion that what Reiser4 has implemented is unpalatable to the > kernel community. A lot of this is based on misconceptions. For example, in recent times reiser4 has been faulted for layering violations, but it doesn't have them: it neither duplicates nor modifies the VFS. It has also been requested that reiser4 be changed to move some of its operations above the VFS; not only would that not make sense for the currently provided functions, but merging was put off previously because of changes to the VFS. Now that it doesn't change the VFS, we are asking Hans to push it off until it does?? It's a filesystem, for God's sake. Hans and his team have worked hard to minimize its impact, and they are still willing to accept more guidance, even if their patience has started to run a little thin. The acceptance of reiser4 into the mainline shouldn't be any bigger a deal than any other filesystem, and yet it is...
Re: reiser4 plugins
On 6/27/05, Horst von Brand <[EMAIL PROTECTED]> wrote: > Wonderful! I carefully "transparently encrypt" my secret files, so > /everybody/ can read them! Now /that/ is progress! All of this side-feature argument is completely off-topic for the inclusion of reiser4, but oh well. In any case, the real use for encrypted files (vs. encrypted partitions) would be for doing things like tying keying into the login process, so that your files are only accessible while you are logged in. This would be a very nice feature on a multiuser system.
Re: reiser4 plugins
On 6/27/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > >. But nevertheless it didn't survive, as like V3, with time V4 became > >slower and slower. In this case no year was needed, but just one month or > >alike. So end of test...but in fact I'll give V4 another go in the near > >future. > Interesting that it got slower with time. It sounds like our online > repacker is much needed. It will be a priority for after the kernel merge. Where'd it go? I recall it being activated with: echo 1 >/sys/fs/reiser4/*/repacker/start
Re: How to user reiser4 and crypt plugin ??
On 6/28/05, Hubert Chan <[EMAIL PROTECTED]> wrote: > > Isn't dm-crypt the new way of doing this? > Yes and no. dm-crypt is recommended over cryptoloop. But there is also > loop-AES, which is more secure (in some modes) than dm-crypt (currently). There is now support for both LRW-AES and ESSIV in the mainstream kernel. With ESSIV the security should be the same as loop-AES, and with LRW-AES potentially better. For the highest security you should also use LUKS (via cryptsetup), because it provides hardening for passphrases.
Re: reiser4progs do not handle the reiser4 format changes
On 7/20/05, Edward Shishkin <[EMAIL PROTECTED]> wrote: > like other existing > ones. This will be a way to create cryptcompress files per superblock. > There is another > more flexible way (which is compatible with the previous one) to create > it per file/directory, > but it uses deprecated metas interface.. Per ... superblock? Excuse me? Nonselective use of this feature will be nearly useless. There must be an API to selectively control the feature. This sounds like a silly tantrum about the interface changes.
Re: reiser4 performance
On 8/8/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > I should add that fsync performance has not been worked on yet, which is > surely why postgres performance is poor. Hans, I'm on the postgresql hackers list (although I don't really have a voice there, so I can't really speak much for reiser4 there). One of the 'interesting' issues they face is that the postgresql database works with 8K pages. From a performance and reliability perspective they would benefit impressively from a file system (and VFS) which could atomically update their 8K pages. Without such a feature their performance is slaughtered when operating in the mode that provides the highest reliability, and their reliability is slaughtered when operating in the highest-performance configuration. I'm sure the PostgreSQL folks would be here themselves asking for help with this issue... if they weren't so oriented around FreeBSD. :) If ever you are looking for a killer app for Reiser4 that people who don't care about the visionary stuff will care about, you couldn't find one better than postgresql. If you could get postgresql working as reliably as a double-logged full-fsync configuration but performing as fast as a configuration with async writes, you'd have a lot more supporters.
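To make the tradeoff concrete, here is a toy sketch of the double-write dance a database has to do today when the OS offers no atomic 8K write (file names and layout are invented for illustration; PostgreSQL's real mechanism is its full-page-writes WAL logging):

```python
import os

PAGE = 8192

def safe_page_write(data_path, log_path, page_no, page):
    """Write an 8K page so a crash mid-write can't leave a torn page.

    The full page image is first durably logged, then written in place.
    Two fsyncs per page update: this is the cost an atomic-write
    filesystem interface could remove.
    """
    assert len(page) == PAGE
    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(log_fd, page_no.to_bytes(8, "little") + page)
        os.fsync(log_fd)          # page image is now recoverable
    finally:
        os.close(log_fd)
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(data_fd, page, page_no * PAGE)
        os.fsync(data_fd)         # in-place write is now durable
    finally:
        os.close(data_fd)
```

With a filesystem guarantee that an aligned 8K write is all-or-nothing, the log write and its fsync could be dropped entirely.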
Re: reiser4 performance
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote: > Gregory Maxwell wrote: > > On 8/8/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > > > If ever you are looking for a killer app for Reiser4 that people who > > don't care about the visionary stuff will care about: > > Define "visionary"? > > I can name a few things that work best in Reiser4, and very well in v3, > simply because of efficient storage of small files and lazy allocation: > > webserver -- lots of small files, very few large ones, mostly reads Given webserver horsepower from the same budget class as the Internet pipe it is attached to, it is utterly trivial to saturate the pipe with static content, even if the files are small, larger than core in total, and have poor locality (and those are very infrequently the case). So for most webserver cases, FS speed doesn't matter. For the few cases where it does, locality is usually fairly good... so who cares if the new FS is 2x faster, when it is still 200x slower than RAM? Add RAM. > mailserver -- especially IMAP+maildir, lots of small files, read/write > and so on... An interesting application space, no doubt, although you can cure a lot of sins by tossing solid-state disk at it. :) (Or just battery-backed cache on a hardware RAID controller.) > Gentoo box: > /usr/portage is over a hundred thousand very small files, updated via > rsync. Since they are updated all at once, you get a boost out of lazy [snip] The rsync algorithm or the network is going to be your bottleneck there for the sync... Are you really getting disk bound for compiles? If so, increase your -j N. > These are all things that Reiser4 already does better than anything > else. So now we're going to get Postgres to run faster. I can't wait > until we have more people hacking on the plugin interface -- then we'll > have some *real* killer apps. I agree. Still, it would be nice to have some really good bread-and-butter improvements,
and a sufficient level of 'transaction' support exposed to PostgreSQL could result in a huge performance improvement while improving reliability. "Your database is more reliable on reiser4" would be a compelling argument... even to those not convinced by plugins and small files.
Re: reiser4 performance
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote: > "Reiser4 would be great if..." is getting old. It is great, and it's > getting even better pretty fast. > > And, by the way, if the transaction interface gets done, it's not just > databases that will benefit, but also small files. After all, what kind > of transactions are used for your OpenOffice document? ... Well, there doesn't actually need to be a transaction interface for postgresql's needs; it just needs a fairly limited set of assurances from the VFS/FS that aren't usually provided. Beyond that, it already handles its own transactions. It looks from Hans' reply like reiser4 already provides everything needed, which I had suspected.
Re: reiser4 performance
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote: > Absolutely. I'm not knocking your idea, just wanted to clarify that > "Reiser4 would be great if..." is getting old. It is great, and it's > getting even better pretty fast. (Sorry for reply bloat.) I just wanted to point out that wasn't my intent. I think the only 'feature' reiser4 needs right now is mainstream inclusion. My ability to use it is severely hampered by only being able to use it on boxes running the test kernel of the day, which are laden with other issues unrelated to reiser4 that I don't have time to deal with.
Re: reiser4 performance
On 8/11/05, PFC <[EMAIL PROTECTED]> wrote: > > > Well, but then you have to tell postgres that it can assume these things > > about reiser4. > > you can already set the sync mode in the config file to a llot of > different choices, like fdatasync, fsync, O_SYNC, etc, so a reiser4 option > would be possibel I guess. Right, and the PostgreSQL team has already shown that they are willing to create platform specific options. Could someone familiar with the reiser4 internals provide some detailed information about what reiser4 currently provides in this regard? The specific concerns would be about controlling the ordering and atomicity of 8k writes and how to tell when they are fixed to the media.
Re: KDE integration with ReiserFS 4
On 8/16/05, Christian Iversen <[EMAIL PROTECTED]> wrote: > Are there independent implementations of QT? Well... Google "harmony project" "qt". :) I think in reality it would be reasonable to handle this like the Beagle search engine handles its metadata: either you enable extended attributes in your file system, or it uses a SQLite database. The performance with the SQLite database stinks, but it's there for those who can't turn on extended attributes on their filesystem.
Re: Basic interface for key management in reiser4 (DRAFT)
On 8/18/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > but the idea is to use keys instead of standard unix permissions > > I think you need to store keys in a per process place, and allow > specifying whether children of a process inherit the keys somehow. Oh, slick! I did not previously catch what the advantage would be to using crypto in the FS rather than just a crypted block device, minus some non-critical niceties (like being able to use a random, per-file IV, which is good from a security perspective). Now I see what is possible, and I'm really excited. It will be interesting to see how a system with many keys performs; most fast implementations of most crypto algorithms need a computationally expensive key setup which produces a fairly large working set of constants for encryption/decryption. With a per-process structure there should be a way to revoke all instances of a key from all the other running processes that carry it, or at least all processes of a specific user. Otherwise it will be too easy to accidentally leave keys lying around. This per-process crypto in the FS fits very nicely with a lot of the other recent security advances in the Linux world. Thanks for something new and exciting to talk about with my Linux-using friends! :)
Re: Basic interface for key management in reiser4 (DRAFT)
On 8/19/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > I think it would make sense to put the keys as files in /proc: > e.g. > touch /proc/1/keys/private/"0893410984328098094321" > would give init (aka process with id of 1) a new key that its children > would not inherit > touch /proc/1/keys/inheritable/"1893410984328098094321" > would give init a new key that is children WOULD inherit. > > Not sure what the permissions on that keys directory would be, I guess 700. Eh, that would make leaks of key information easy, since most applications don't assume that something visible in a file name could be highly confidential information. :) Better to have the file in proc be an abbreviated keyid (some kind of smaller lossy hash of the key). To add a key you might echo "label:real key data" > /proc/1/keys/private/keys, and a file would appear named /proc/1/keys/private/label-123abcd, which is a user-defined label plus the hash. Under no condition should a process be able to actually read the key data. It can get the ID, delete keys based on IDs, etc., but it can't get the data; otherwise a process could steal keys. If I take away a process's key from proc, there should be no way for it to get any further access to those files, no chance that it could have hidden away a copy of that key.
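The abbreviated-keyid idea amounts to exposing only a truncated hash of the key material (everything here is hypothetical; nothing like this exists in the reiser4 tree):

```python
import hashlib

def keyid(label, key_material, digest_bytes=8):
    """Derive the public file name for a key: the secret itself never
    appears in /proc, only a user-chosen label plus a short lossy hash."""
    h = hashlib.sha1(key_material).hexdigest()[: digest_bytes * 2]
    return "%s-%s" % (label, h)
```

Echoing "backup:<secret>" into the (hypothetical) control file would then create an entry named backup-<hash>; knowing the name tells an observer nothing about the key, yet it is stable enough to delete or revoke by ID.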
Re: Basic interface for key management in reiser4 (DRAFT)
On 8/19/05, Edward Shishkin <[EMAIL PROTECTED]> wrote: > Actually it is critical: > http://marc.theaimsgroup.com/?l=linux-kernel&m=107719798631935&w=2 > But why random? It is slowly.. I would prefer object-id-based one.. The IV doesn't need to be random, but it should be different for every instance of a file, different every time a file is deleted and recreated, not increment in any predictable way between files, and be impossible for a user to control. It should have a low possibility of reuse. Earlier Linux dm-crypt had a weakness where the IV incremented with every block in the file system; this led to some interesting watermarking attacks. It was possible to form a stream of data with changes that negated the XORs from the trivially incremented IV, and thus the first block of each sector could be used to form an electronic code book. This has since been corrected with a couple of options (one is to use the cryptographic hash of the block number). If the user has some way of trivially influencing differences in the object ID, for example if sequential files have sequential object IDs, then the object ID should be passed through a hash function, so that a user must know the full object ID in order to predict even a single bit flip.
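The "hash the identifier" fix mentioned above (ESSIV is the dm-crypt variant of this idea) can be illustrated like this; the object-ID naming and the salt are assumptions for the sketch:

```python
import hashlib

def iv_for_object(object_id, salt=b"per-filesystem secret salt"):
    """Derive a 16-byte IV by hashing a salted object ID, so sequential
    IDs do not produce predictably related IVs."""
    return hashlib.sha256(salt + object_id.to_bytes(8, "little")).digest()[:16]
```

A user who can influence the low bits of the object ID still cannot predict even a single bit of the IV without knowing the whole salted input, which is exactly the property the post asks for.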
WinFS beta out
http://it.slashdot.org/it/05/08/29/2241243.shtml?tid=109&tid=218 Looks like MSFT will be beating Linux to the nextgen FS punch after all. ;)
Wikipedia article
Someone might want to update the information on reiser4 at: http://en.wikipedia.org/wiki/Comparison_of_file_systems and http://en.wikipedia.org/wiki/Reiser4 A fair bit of information is missing/out of date it seems.
Re: Some questions about r4
On 9/1/05, Pysiak Satriani <[EMAIL PROTECTED]> wrote: > Hi, > Is there a r4-patch cooking for 2.6.13 ? > Is the requirement for not enabling 4k-stacks going away soon? > If I patch a vanilla kernel with r4, and sop using r4, would you say that the > changes I introduced by patching are rather safe to the rest of the kernel, > or would you recomend going back to vanilla just in case? On this point: 2.6.13 is out, is there going to be a pass at getting reiser4 into 2.6.14?
Re: journal size reiserfs vs reiser4
On 9/2/05, Łukasz Mierzwa <[EMAIL PROTECTED]> wrote: > Dnia Fri, 02 Sep 2005 09:19:55 +0200, Hans Reiser <[EMAIL PROTECTED]> napisał: > > It could probably be a lot less than 5%, 2% is more than enough I would > > guess, but we also need to reserve space to get good performance. > Maybe You can make it an mkfs.reiser4 option, set 5% to default so it won't > change anything to 99% of people using reiser4 but will make that 1% that > want some other value happy. What would be really nice would be to have two options at mkfs time: a reserved amount, and a root-reserved amount. The ext2/3 filesystems allow you to reserve some portion of the disk for root use. This prevents various unpleasant system failure modes when a user task goes nuts and fills the disk. It would be even nicer if, rather than root/non-root, it could be controlled by a capability/SELinux context, so you could make it so syslog can't write into that safe buffer. But it's kind of moot at this point; mainline inclusion must be the highest priority right now.
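The ext2/3-style root reserve is visible from userspace as the gap between free blocks and blocks available to unprivileged users; a quick way to inspect it (function name is just for this sketch):

```python
import os

def reserve_info(path="/"):
    """Report total/free/available space; on ext2/3, f_bfree minus
    f_bavail is roughly the root-reserved portion of the disk."""
    st = os.statvfs(path)
    return {
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_bytes": st.f_bfree * st.f_frsize,    # free, incl. reserve
        "avail_bytes": st.f_bavail * st.f_frsize,  # free to non-root
    }
```

On a filesystem with no reserve the two free figures are equal; on an ext3 volume created with the default 5% reserve they differ by about 5% of the disk.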
Re: journal size reiserfs vs reiser4
On 9/6/05, Tom Vier <[EMAIL PROTECTED]> wrote: > My vote: put the reserve % in the superblock (if it isn't already) and > give mkfs a sane default. Looking at the code it appears it would be easier to make it a mount option that defaults to 5%. Would that work okay for you?
Re: journal size reiserfs vs reiser4
On 9/6/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > Guys, I am sorry, but I just don't think this issue is a priority > compared to other issues. Sorry, too much else going on, honestly. I'd say so: Per Andrew Morton. reiser4 merge status: Stuck. Last time we discussed this I asked the reiser4 team to develop and negotiate a bullet-point list of things to be addressed. Once that's agreed to, implement it and then we can merge it. None of that has happened and as far as I know, all the review feedback which was provided was lost. :(
Re: why does reiserfs list get so much spam?
On 9/16/05, Tomasz Chmielewski <[EMAIL PROTECTED]> wrote: > yeah, but it is to be used by the end-user. > > the archive will be still filled with spam. > > not everyone who wants to know about reiser subscribes to the list; most > of the people would just use the archives. I actually subscribed to the list because I was frustrated by spam in the archive. Whoever runs the archive should at least pipe it through SpamAssassin. If some legitimate messages are lost from the web archive, it could not be worse than what we have now: an archive made nearly useless by excessive spam.
Re: Compression Plugin
On 9/20/05, Clay Barnes <[EMAIL PROTECTED]> wrote: > Forgive me if this has been answered recently, but I haven't gotten my > last two dozen e-mails for today yet. > > Regarding the compression plugin, what sort of compression can one > expect from it? [snip] Just a general answer, not a reiser4-specific one; since the compression isn't done yet I don't know the details of reiser4's performance. It is generally the case that disk-based compression performs somewhat worse than normal file-based compressors. This is because every block of data must be compressed alone in order to preserve the random-access semantics of the file system. This also means there is less to be gained by using alternative compression systems (such as bzip2 or, better, LZMA), because most only pick up their impressive performance as a result of having a much larger context, at a greatly increased cost in memory usage. For disk-based compression LZ77 is pretty good and is so widely used that people feel comfortable implementing it in kernel space. Another interesting player is LZO, because decompression requires very little memory and is VERY VERY fast (something like 8x faster than gzip in my testing). This means that decompression is effectively free. However, compression is perhaps 10% worse than LZ77, and on most hardware the disk is so slow that the decrease in compression outweighs the improvement in decompression performance. But on systems with a fast disk array, LZO may be a welcome tradeoff.
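The per-block penalty described above is easy to demonstrate with any stream compressor; here with zlib (the block size is an arbitrary stand-in for a filesystem block):

```python
import zlib

BLOCK = 4096  # compress each filesystem-sized block independently

def whole_file_size(data, level=6):
    """Size after compressing the data as one stream (file-based tools)."""
    return len(zlib.compress(data, level))

def per_block_size(data, level=6):
    """Sum of independently compressed blocks: each block starts with an
    empty dictionary, so cross-block redundancy is never exploited."""
    return sum(
        len(zlib.compress(data[i:i + BLOCK], level))
        for i in range(0, len(data), BLOCK)
    )
```

On redundant data the per-block total is noticeably larger than the whole-file result; that gap is the price any random-access disk compressor pays for seekability.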
Re: Compression Plugin
On 9/20/05, David Masover <[EMAIL PROTECTED]> wrote: > Probably lzo, which is already used for other things like network > connections (ssh, openvpn, and so on). The nice thing about lzo is that > it's fast, faster than gzip or bzip2, and gets decent compression -- not > great, but decent. I don't usually get gzip or bzip2 to compress at > disk speed, but then, I usually crank the compression way up, so YMMV. > The point of using a fast algorithm is that you not only save space, but > when you apply it to things like text files, it can actually make things > go faster. > > But I imagine it will be settable per-file. Files can be both encrypted > and compressed, and I think (I hope) it could be with a choice of > crypto/compression algorithms. I didn't know SSH supported LZO. Rsync does, though... Gzip compression is pretty darn quick at lower levels, though depending on the LZ77 implementation it can be fairly slow at higher compression levels. An interesting idea: select the algorithm and a range of compression levels per file, but select the actual compression level at flush time based on some estimate of how loaded the system is. :) Probably not worth it, even though the amount of compression and the speed differ greatly from -1 to -9... I hope no one wastes their time on it until the more important things are done, but perhaps it would be a nice touch.
Re: I request inclusion of reiser4 in the mainline kernel
On 9/20/05, Theodore Ts'o <[EMAIL PROTECTED]> wrote: > The script could be improved by select random locations to damage the > filesystem, instead of hard-coding the seek=7 value. Seek=7 is good > for testing ext2/ext3 filesystems, but it may not be ideal for other > filesystems. What would be interesting would be to overwrite random blocks in a sub-exponentially increasing fashion, fsck, and measure file loss at every step. You fail the test if the system panics reading an FS that passed fsck. It would be interesting to chart files lost and files silently corrupted over time... Another interesting thought would be to snapshot a file system over and over again while it's got a disk workout suite running on it, then fsck the snapshots and check for the amount of data loss and corruption. > There is a very interesting paper that I coincidentally just came > across today that talks about making filesystems robust against > various different forms of failures of modern disk systems. It is > going to be presented at the upcoming 2005 SOSP conference. > > http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf Very interesting indeed, although it almost seems silly to tackle the difficult problem of making filesystems highly robust against oddball failure modes while our RAID subsystem falls horribly on its face in the fairly common (and conceptually easy to handle) failure mode of a raid-5 where two disks have single unreadable blocks on differing parts of the disk. (The current raid system hits one bad block, fails the whole disk; then you attempt a rebuild, and while reading it hits the other bad block and downs the array.)
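A minimal version of that random-corruption harness (names invented; the mkfs/fsck/mount steps between rounds are elided) just stomps randomly chosen blocks of a filesystem image and records which ones it hit:

```python
import os
import random

BLOCK = 512

def corrupt_random_blocks(image_path, count, seed=None):
    """Overwrite `count` randomly chosen blocks of a filesystem image
    with random bytes; returns the block numbers hit, so the damage can
    be correlated with fsck results afterwards."""
    rng = random.Random(seed)
    size = os.path.getsize(image_path)
    nblocks = size // BLOCK
    hit = sorted(rng.sample(range(nblocks), min(count, nblocks)))
    with open(image_path, "r+b") as f:
        for b in hit:
            f.seek(b * BLOCK)
            f.write(bytes(rng.getrandbits(8) for _ in range(BLOCK)))
    return hit
```

Running fsck between sub-exponentially growing rounds (1, 2, 4, 8, ... corrupted blocks) and charting files lost or silently corrupted at each step gives the curve the post describes.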
Iron files
So the post about file system failure modes made me think of something interesting... We'd discussed in the past that it would be interesting to store cryptographic hashes of files as metadata, both to facilitate applications which require hashes and for data integrity. Of course, the challenge is making it perform well.. tree hashes make it possible but still messy. Another thought on the subject: when we're using the compression plugin it's quite likely that many blocks will be shrunk quite a bit on write. We could at that time add a strong checksum (or cryptographic hash)... It could just be stored as though it were part of the compressed data, the cost partly offset by the gains of compression. It would probably be useful to include the file identity and position offset in the hash for each sub-part of the file, so that if an upper-level data structure in the FS were corrupted you'd never end up with part of one file silently sitting in the middle of another. This would enable a policy where files could never be silently corrupted. Protection could be controlled on a file-by-file basis just like compression, and optionally operate in a mode where check data is only written but not tested (no substantial performance loss on read, but a risk of returning corrupted data to the application). Just another thought for the never ending list...
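Mixing the file identity and offset into each block's check data could look like this sketch (SHA1 as an example hash; the 64-bit file id and little-endian packing are assumptions for illustration):

```python
import hashlib

def block_digest(file_id, offset, data):
    """Hash the block together with its owning file and position, so a
    block of one file can never silently verify in the middle of another
    file, or at the wrong offset within the same file."""
    h = hashlib.sha1()
    h.update(file_id.to_bytes(8, "little"))
    h.update(offset.to_bytes(8, "little"))
    h.update(data)
    return h.digest()

data = b"x" * 4096
# The same bytes claimed at a different offset, or for a different file,
# produce a different digest -- so a misplaced block fails verification.
d_home = block_digest(42, 0, data)
d_misplaced = block_digest(42, 4096, data)
d_other_file = block_digest(43, 0, data)
```

The digest would be stored alongside the compressed block, where the space saved by compression partly pays for it.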
Re: I request inclusion of reiser4 in the mainline kernel
On 9/20/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > > I am not a big fan of formal committees, but would be happy to take > > part in any effort to standardize, code and test the result... > The committee could simply exchange a set of emails, and agree on > things. I doubt it needs to get all complicated. I suggest you contact > all the folks you want to be consistent with each other, send us an > email asking us to all try to work together, and then ask for proposals > on what we should all conform to. Distill the proposals, and then > suggest a common solution. With luck, we will all just say yes.:) Another goal of the group should be to formulate a requested set of changes or extensions to the makers of drives and other storage systems. For example, it might be advantageous to be able to disable bad block relocation and let the filesystem perform the action. The reason is that relocations slaughter streaming read performance, but the filesystem could still contiguously allocate around them... Perhaps a more implementable alternative is just a method to find out which sectors have been relocated, so the data can be moved off of them and they can be avoided (and potentially 'derelocated' to preserve the relocation space). Ditto for other layers.. if a filesystem has some internal integrity function and a raid sweep has found that the parity doesn't agree, it would be nice if the FS could check all possible decodings and see if there is one that is more sane than the others... This is even more useful with raid-6, where there are many more potential decodings. Also things like bubbling up to userspace.. If there is an unrecoverable read error in a file, found during operation or an automated scan, it should show up in syslog with some working complete path to the file (as canonical as the fs can provide), and hopefully an offset to the error.
Then my package manager could see if this is a file replaceable out of a package, or if it's user data... If it's user data, my backup scripts can check the access time on the file and silently restore it from backup if the file hasn't changed. ... only leaving me an email: "Dear operator, I saved your butt yet again --love, mr computer" And finally, operator policy.. I'd like corrupted user files to become permission denied until I run some command to make them accessible; don't let me hang my apps trying to access them..
Re: Will I need to re-format my partition for using the compression plugin?
On 9/22/05, Edward Shishkin <[EMAIL PROTECTED]> wrote: > Yes. It is impossible to implement all features in one file plugin. > Checksumming means low > performance: in order to read some bytes of such a file you will need > first to read the whole file > to check a checksum (isn't it?). So it will be suitable for a small > number of special files. > To write this new file plugin you will want to use already existing > cipher and compression > transform plugins (don't mix it with the cryptcompress file plugin which also > calls those plugins) > to compress and encrypt your checksummed file. For file data integrity it would actually be more useful to have a per-block hash or checksum. This solves the update problem. It would be useful if the file offset and some file identifier were also mixed into the calculation, so that a misplaced block will fail as well. This would fit quite nicely into the existing actions of the cryptcompress plugin, and could be accomplished as just another compression algo.. one that always adds 64-256 bits of check data per block.. At least as long as the error handling in the FS is robust enough to treat a decompression failure as an IO error. ... If it were desirable to produce a cryptographically strong checksum which can be handed to the user, what you would do is perform a hash per block and store that with each block, then a hash of the hashes, which is returned to the user. This is called a tree hash (google it); usually you have a deeper hierarchy than two, depending on the application. This makes incremental updates cheap enough (just hash the block, then ripple the changes up the tree). This would remove the ability to include the file id and offsets directly in the hash, but I would argue that they should still be used: for example, you could xor the hash value with them before writing it to disk and on reading it from disk.
This would still allow you to detect a misplaced block but would not make the tree value differ for multiple copies of the same file.
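A minimal two-level version of this scheme might look as follows (SHA1, 4K blocks, and the mask derivation are illustrative choices): the stored leaf hashes are xored with a mask derived from the file id and offset, while the root is computed over the unmasked leaves, so two copies of the same file share a tree value but a misplaced block still fails.

```python
import hashlib

BLOCK = 4096

def _h(data):
    return hashlib.sha1(data).digest()

def _mask(file_id, offset):
    # 20-byte mask from identity+offset, xored into the *stored* leaf only.
    return _h(file_id.to_bytes(8, "little") + offset.to_bytes(8, "little"))

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def build(file_id, blocks):
    leaves = [_h(b) for b in blocks]                    # per-block hashes
    stored = [_xor(l, _mask(file_id, i * BLOCK))        # what goes to disk
              for i, l in enumerate(leaves)]
    root = _h(b"".join(leaves))                         # hash of the hashes
    return stored, root

def update(file_id, stored, idx, new_block):
    # Incremental update: rehash only the changed block, then recompute
    # the root from the (unmasked) leaves -- cheap, as leaves are tiny.
    stored[idx] = _xor(_h(new_block), _mask(file_id, idx * BLOCK))
    leaves = [_xor(s, _mask(file_id, i * BLOCK)) for i, s in enumerate(stored)]
    return _h(b"".join(leaves))

blocks = [bytes([i]) * BLOCK for i in range(8)]
stored, root = build(7, blocks)
blocks[3] = b"new!" * (BLOCK // 4)
new_root = update(7, stored, 3, blocks[3])
```

A deeper hierarchy just repeats the hash-of-hashes step on groups of leaves; two levels keep the sketch short.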
Re: Will I need to re-format my partition for using the compression plugin?
On 9/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > 1) RSA is useless for this - you really need a symmetric block cipher of some > sort. Almost all block ciphers are best used with maximum-entropy input - if > the attacker can lop out a large part of the keyspace, a brute force attack > becomes a lot easier. This is somewhat related to the concept of "Hamming > Distance". If the attacker tries a brute force attack, and the first 8 bytes > of > the output look like valid HTML, or English text, or anything else > recognizable, he's almost certainly found the correct key. On the other > hand, well-compressed data has very high entropy - as a result, it becomes > harder to tell if a correct key has been found. If it's English text, but > 3 of the first 8 bytes have the high bit set, it's probably not a correct key. > If it's compressed, 3 flipped bits in the first 8 bytes will probably still > represent a valid compressed stream - just of something else wildly different. It would normally seem silly to use RSA for disk encryption... but there might be applications, although you'd still never use RSA directly on user-controlled data. For example, RSA could be used on a multi-user server to append mail to a mail file so that once written, the data is only accessible after the user logs on. The reiser4 crypto system will use the kernel keyring API, so it would be quite reasonable to tie encryption to user accounts. 'Write only' files and 'read only' files would be a simple logical extension, and would require asymmetric cryptography. Although for most compression algorithms not all inputs are valid outputs, so this may not work for you... It would be ideal (for disk encryption) if it were not possible to tell whether you have the right key without decrypting an entire sector. This requires careful selection of compression and chaining mode.
Alternatively, it may be possible to develop a good large-block cipher which, while much slower than a single block of a small-block cipher, is faster for a whole disk block. For example, Mercy is about 4x faster than AES on my system, but is still 16x slower than AES for the smallest unit of decryption. Unfortunately Mercy has security problems. > 2) Even though most modern block ciphers are designed to be fast, it's still > faster to apply a reasonably quick compression scheme to whomp 16K of data > down to 5-6K and encrypt/decrypt 5-6k than it is to encrypt/decrypt 16K. In general this is correct, but it depends on the compression mode and the cipher: a good AES implementation is around the same speed as an aggressive gzip.
Re: I request inclusion of reiser4 in the mainline kernel
On 9/23/05, David Greaves <[EMAIL PROTECTED]> wrote: > who's not keeping up with the linux-raid list then ;) > > David > PS I'm sure assistance would be appreciated in testing and reviewing > this few day old feature - or indeed the newer 'add a new disk to the > array' feature. After posting that I checked linux-raid, thanked the author, and patched a box, but got called out of town before I could test anything. :) This is an important development. ... and it's about darn time!
Re: reiser4 for 2.6.13 is available on our website
On 9/29/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > It works on non-amd64, will fix amd64 tomorrow I hope. No, it fails to compile on FC4/x86 here as well, with the same failure mode.
Re: latest 2.6.13 reiser4 patch
On 9/29/05, Artur Makówka <[EMAIL PROTECTED]> wrote: > Will it work also with 2.6.13.2 kernel ? or is it only for 2.6.13 ( or > 2.6.13.1 ) > > i couldn't find any information about this on page, and i want to be sure... I used it fine with 2.6.13.2... but my wireless card isn't supported there. (It's been included in 2.6.14x, so it was never patched against 2.6.13.. I tried backporting my wireless driver, but there were a lot of changes in both directions.) The patch also applies cleanly against 2.6.14-rc1/2, but 2.6.14-rc1/2 panic on initrd uncompress for me, and I can't troubleshoot it because the backtrace scrolls my screen and I can't do a serial console on my laptop. :( In any case I tested reiser4 out some on 13.2, and the only bug I hit was a panic on unmount.
Re: latest 2.6.13 reiser4 patch
On 10/1/05, Artur Makówka <[EMAIL PROTECTED]> wrote: > but is there some official statement about using 2.6.x reiser4 patches with > 2.6.x.x kernels? i mean, is the patch called the reiser4 2.6.13 patch also > intended to work with 2.6.13.2, for example, or only with 2.6.13? You aren't likely to get one. It will work if the things that reiser4 depends on haven't changed in how they operate. The official kernel doesn't work very hard to preserve external APIs, much less internal ones. If it works, it works. Since no one involved in reiser4 can say with confidence what will be in the next version, we can't say for sure.
Re: Reiser4 file recovery
On 10/10/05, Christian Iversen <[EMAIL PROTECTED]> wrote: > > I suggest you run Spinrite (grc.com, ~$50 IIRC) on the bad disk from a > > floppy or CD-ROM in DOS (the program makes images for you in Windows, > > if you have a working partition, or you can get images from the site > > IIRC once you've bought a copy) and see how much is recovered > > (assuming it's just bad sectors or something). Re-add it to the LVM, > > recover to a separate media, and then convert the whole thing to a > > RAID (maybe via tar?). I know it's not a free solution, but data > > recovery is nearly impossible w/o paying in one way, shape, or form. > > It's easier to have backups. > > As usual, Gibson "Research" is skimpy on details, so I'm not entirely sure > if spinrite is anything more than a disk imager. If not, just use the free > (gratis && libre) dd_rescue program instead. It will save you $50. A decade ago Spinrite would put the disk into a low-level mode where it could read the ADC output of complete raw sectors... then it performed something like PRML to recover the data. It was a miracle worker. Back then, between that and a huge box of spare drives to swap parts from, I was able to recover almost any dead drive that crossed my desk. No clue what Spinrite does today, as I doubt that sort of low-level access is still possible, and even if it is, drives have become smart enough to do a lot of that on their own.
Transactions faster than locking
Saw this on the Postgres list, and I thought this might be interesting for some of the users here. Interesting in general to think about expanding transaction orientation in software, with Reiser4 providing efficient transactions down to the block update level. -- Forwarded message -- From: Dann Corbit <[EMAIL PROTECTED]> Date: Oct 11, 2005 2:09 PM Subject: Re: [HACKERS] Spinlocks and CPU Architectures To: Simon Riggs <[EMAIL PROTECTED]>, Peter Eisentraut <[EMAIL PROTECTED]> Cc: pgsql-hackers@postgresql.org, Tom Lane <[EMAIL PROTECTED]>, [EMAIL PROTECTED] As an aside, here is a package that has recently been BSD re-licensed: http://sourceforge.net/projects/libltx/ It is a lightweight memory transaction package. It comes with a paper entitled "Cache Sensitive Software Transactional Memory" by Robert Ennals. In the paper, Robert Ennals suggests this form of concurrent programming as a replacement for lock based programming. A quote: "We have now reached the point where transactions are outperforming locks -- and people are starting to get interested." There are a number of interesting claims in the paper. Since the license is now compatible, it may have some interest for integration into the PostgreSQL core where appropriate. It would certainly be worthwhile to read the paper and fool around with the supplied test driver to compare the approaches. If nobody on the PostgreSQL team has time for the experimentations, it might be a good project for a PhD candidate at some university. > -Original Message- > From: [EMAIL PROTECTED] [mailto:pgsql-hackers- > [EMAIL PROTECTED] On Behalf Of Simon Riggs > Sent: Tuesday, October 11, 2005 10:56 AM > To: Peter Eisentraut > Cc: pgsql-hackers@postgresql.org; Tom Lane > Subject: Re: [HACKERS] Spinlocks and CPU Architectures > > On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote: > > Tom Lane wrote: > > > This seems pretty unworkable from a packaging standpoint. 
Even if > > > you teach autoconf how to tell which model it's running on, there's > > > no guarantee that the resulting executables will be used on that same > > > machine. > > > > A number of packages in the video area (and perhaps others) do compile > > "sub-architecture" specific variants. This could be done for > > PostgreSQL, but you'd probably need to show some pretty convincing > > performance numbers before people start the packaging effort. > > I completely agree, just note that we already have some cases where > convincing performance numbers exist. > > Tom is suggesting having different behaviour for x86 and x86_64. The x86 > will still run on x86_64 architecture would it not? So we'll have two > binaries for each OS, yes? > > In general, where we do find a clear difference, we should at very least > identify/record which variant the binary is most suitable for. At best > we could produce different executables, but I understand the packaging > effort required to do that. > > Best Regards, Simon Riggs > > > > > ---(end of broadcast)--- > TIP 4: Have you searched our list archives? > >http://archives.postgresql.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: reiser4 and laptop_mode
On 10/17/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > > In fact, if you have enough RAM, you won't ever touch the > >disk -- deleting a file before it's committed means it never touches disk. > > > >It is not as spindown-friendly as laptop_mode, which notices when the > >drive has to spin up anyway (maybe through a read) and flushes all > >writes. Don't know if they are compatible. > > > > > We should work to integrate well with it. Zam, can you look at that? > Thanks. Actually, laptop mode flushes when there is a write, and syncs all pending transactions just before spindown (delaying the write as long as possible, to hopefully get as much in one pass as possible). In the future, when enough API is exposed to make a nice interface for multi-syscall transactions with partial sync (i.e. only forced syncs of blocks related to transactions which demand physical fixation), it would be nice if the commit logic were smart enough to grab other nearby small transactions and batch them into the same commit.
My Dad suggests a redundant copies plugin
On 10/25/05, Hans Reiser <[EMAIL PROTECTED]> wrote: > This would not be (at least in theory) useful for RAID devices, but for > a user with a single disk drive, it might be useful to have a plugin > that creates two (or N) copies, and tries to allocate the two copies at > opposite ends of the disk. Anyone out there still looking for a plugin > to write? It would be more important to have redundancy for important filesystem data structures first, because if you lose your filesystem it becomes very difficult to read your data. With modern high-density drives, bad blocks have become more common than disk failures; redundancy makes a lot of sense, but the cost of duplicating every file is simply too great for many applications. The ability to control the level of redundancy and protection from the filesystem is becoming increasingly important. Hans, have you read the Iron filesystems paper (http://www.cs.wisc.edu/wind/Publications/iron-sosp05.ps) that was previously cited on this list? You should at least skim it.
Fwd: My Dad suggests a redundant copies plugin
On 10/25/05, Sander <[EMAIL PROTECTED]> wrote: > That will kill performance badly. First of all the two read/writes > needed, and second because you have to seek from one end to the disk to > the other every time you read/write something. Kill it worse for writes than a filesystem without wandering logs? I don't think so, since they double write in any case... Since reiser4 defers flushes you don't end up seeking all over the disk; you will write out a nice long queue, then a little seeking to ripple up the transaction(s). Reading will not be harmed, since we expect the underlying disk to report read failures. > And what is the advantage? You are not protected against a lot of disk > failures (only against bad blocks, right?). Bad blocks have become the most common non-transient failure mode of disks by far. > There is a lot more advantage in buying two disks and do raid over them. > This is (much) cheaper, gives better performance and gives more > protection. Cheaper how? Two disks are cheaper than one? > Anyway, no need for a plugin. You can just divide your disk in two > partitions and configure them as a raid1. > > Or am I missing something in your suggestion? If I do not want, and cannot afford, that level of protection for all my data, but only some, RAID-1 is a very wasteful solution.
Re: My Dad suggests a redundant copies plugin
On 10/25/05, Ingo Bormuth <[EMAIL PROTECTED]> wrote: > I agree, real backups are the major weapon against classical data loss due > to hardware failure. > Other quite annoying and common causes for data loss are accidentally deleted, > overwritten or modified files. A _simple_ versioning plugin would be very > nice to have (I'd definitely use it in /etc). It would be a sin to implement versioning in reiser4 without taking advantage of how transactions work, since done right they can provide this with almost zero overhead... plus it can be difficult to make sure that you've got a consistent copy of the file, because of how many applications update files (is a version a write() call? No, that won't work.. so do we only version files that get unlinked and replaced?). Between these two things a simple implementation wouldn't be so simple.
COW files
Any thought to making a file plugin that creates copy-on-write files? The operation would be something like a hardlink which is invisible to the user and broken as soon as either file is modified. Files could be COWed by a flag on the cp command (or really, perhaps that should be the default behavior) or with a utility (perhaps run as a periodic script to locate duplicates and COW them). This would greatly speed up the process of copying files. The behavior on break could be to duplicate the whole COWed file on the first write, or to allow a COWed file to have alternate choices for blocks. Files would remain COWed until deCOWed, which would likely be bad for performance (due to fragmentation of alternate versions causing gaps in sequential scans), so the repacker could be taught to deCOW files that have too many alternate blocks.
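The duplicate-locating half of that periodic script is easy to sketch; the actual COW call would be the hypothetical plugin interface, so this just groups identical files (size first to avoid hashing everything, then a full content hash):

```python
import hashlib
import os

def file_digest(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.digest()

def find_cow_candidates(root):
    """Group same-content regular files under root; each group could then
    be collapsed into one COWed extent by a (hypothetical) fs_cow() call."""
    by_size = {}
    for dirpath, _, names in os.walk(root):
        for n in names:
            p = os.path.join(dirpath, n)
            if os.path.isfile(p) and not os.path.islink(p):
                by_size.setdefault(os.path.getsize(p), []).append(p)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue                  # a unique size can't have a twin
        by_hash = {}
        for p in paths:
            by_hash.setdefault(file_digest(p), []).append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups

# Tiny demonstration on a scratch directory.
import tempfile
d = tempfile.mkdtemp()
for name, data in [("a", b"same"), ("b", b"same"), ("c", b"other")]:
    with open(os.path.join(d, name), "wb") as f:
        f.write(data)
groups = find_cow_candidates(d)
```

Checking size before content keeps the scan cheap; only files with at least one same-sized sibling ever get read.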
ZFS - Reiser team reactions?
Looks like ZFS is no longer vaporware. http://www.opensolaris.org/os/community/zfs/docs/ Any commentary from the Reiserfs team? A (supposedly) production-ready FS that provides the transactions (using a similar/same tree ripple technique as reiser4), compression, and snapshotting that we expected with reiser4 in the not too distant future... as well as checksumming and raid integration (which uses checksumming to aid reconstruction) that we've barely just talked about here. People have already started asking about Linux integration; if it were not for the fact that the CDDL is incompatible with the GPL, I'd expect it to make its way into Linux before Reiser4 as well. A positive point is that this might be a good chance to point out that Reiser4 has some of ZFS's features already and is the right framework for building the rest (and much more)... and get some more interest in getting it merged and completed.
Re: What's the state of Reiser4 inclusion in the mainline kernel ...
On 1/9/06, Giovanni A. Orlando <[EMAIL PROTECTED]> wrote: > Hans, > > Can you tell me please the status of Reiser4 in the Kernel? Here is the thread you were supposed to read... (I think): http://marc.theaimsgroup.com/?l=linux-kernel&m=113650213621940&w=2
Re: Reiser4 crash 2.6.16-mm1
On 3/28/06, Jonathan Briggs <[EMAIL PROTECTED]> wrote: > But for a production machine that is "producing" something of value, the > extra cost should not be an issue. RAM errors are so subtle and so hard > to find that ECC is of far more value than RAID. It is obvious when > your disk fails. > > An extra high bit in a credit transaction could cost you $16,384 and you > might not ever realize what happened. :) > > Anyway, off topic, but ECC is highly recommended. And with the amount of memory that people are putting in modern systems, single-bit events should be happening on an approximately weekly basis. ECC may be more expensive, but it doesn't make memory more expensive than it was just a few years ago; you really should have it. But this has gone far offtopic.
Linux and atomic write()
How does this http://marc.theaimsgroup.com/?t=11448928423 impact reiser4's atomic writes?
Re: reiserfs performance on ssd
On 4/27/06, Sander <[EMAIL PROTECTED]> wrote: > > I have a simple solid state disk to play with here. > > See http://nerv.eu.org/iram/ > > Interesting review, thanks. > > To get better reliability you could raid1 them. > I guess this is a 'must' anyway when used in servers (just like with > harddisks). > > Have to try this product myself.. Because they have no ECC, most failures will just be completely silent data corruption. A sadly useless device.
Re: reiserfs performance on ssd
On 4/27/06, Toby Thain <[EMAIL PROTECTED]> wrote: > Sure ECC would be nice, but how does this differ from disk? Silent > failures are certainly possible. > > The fact that error detection and propagation doesn't really happen > in modern disk subsystems is why systems like Sun's ZFS are coming > into being. Um. Because *every* cosmic ray hit (of which you can expect one or two every week or so with 2+ gigs of RAM) will result in data corruption. It's claimed that disks don't do a great job propagating hard errors, which is true to an extent. But they *do* manage to handle soft errors. Without the coding gain provided by block ECC, your modern high-density drive would be nearly useless. ZFS wouldn't make the iram seriously usable... because AFAIK raidz will not work on a single device... so even if it can detect a bad block, it can't correct it. The problem goes further than that, because the cost of computing block checksums in software will greatly reduce the performance of the fast ram device.. Not that better integrity features are bad.. The iron filesystem paper has a lot of great suggestions that go beyond what ZFS provides, and it would be wonderful to see them in reiser4 someday. But things need to progress one step at a time.
Re: reiser4: first impression (vs xfs and jfs)
On 5/23/06, Tom Vier <[EMAIL PROTECTED]> wrote: [snip] What i'm doing is rsyncing from a slower drive (on 1394) to the raid1 dev. When using r4 (xfs behaves similarly), after several seconds, reading from the source and writing to the destination stops for 3 or 4 seconds, then brief burst of writes to the r4 fs (the dest), a 1 second pause, and then reading and periodic writes resume, until it happens again. It seems that both r4 and xfs allow a large number of pages to be dirtied, before queuing them for writeback, and this has a negative effect on throughput. In my test (rsync'ing ~50gigs of flacs), r4 and xfs are almost 10 minutes slower than jfs. [snip] Have you tested a pure write load? It may be that rsync's combined reading and writing is triggering a corner case for FSes with delayed allocation. It may not be issuing its checksumming reads far enough ahead of time, and ends up disk-latency bound. It's interesting that you saw the same issues with XFS... I use XFS on my audio workstation computer because it (combined with a low latency patched kernel) had by far the lowest worst-case latencies of all the FSes I tested at the time.
Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On 7/31/06, Alan Cox <[EMAIL PROTECTED]> wrote: Its well accepted that reiserfs3 has some robustness problems in the face of physical media errors. The structure of the file system and the tree basis make it very hard to avoid such problems. XFS appears to have managed to achieve both robustness and better data structures. How reiser4 compares I've no idea. Citation? I ask because your claim differs from the only detailed research that I'm aware of on the subject[1]. In figure 2 of the iron filesystems paper, Ext3 is shown to ignore a great number of data-loss-inducing failure conditions that Reiser3 detects and panics under. Are you sure that you aren't commenting on cases where Reiser3 alerts the user to a critical data condition (via a panic), which leads to a trouble report, while ext3 ignores the problem, which suppresses the trouble report from the user? *1) http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf
Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On 8/1/06, David Masover <[EMAIL PROTECTED]> wrote: Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? Unless the disk ECC catches it, raid won't know anything is wrong. This is why ZFS offers block checksums... it can then try all the permutations of raid regens to find a solution which gives the right checksum. Every level of the system must be paranoid and take measures to avoid corruption if the system is to avoid it... it's a tough problem. It seems that the ZFS folks have addressed this challenge by building what are classically separate layers into one part.
Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion
On 8/3/06, Matthias Andree <[EMAIL PROTECTED]> wrote: Berkeley DB can, since version 4.1 (IIRC), write checksums (newer versions document this as SHA1) on its database pages, to detect corruptions and writes that were supposed to be atomic but failed (because you cannot write 4K or 16K atomically on a disk drive). The drive doesn't provide atomic writes for more than one sector (and I'd hardly call trashing a sector during a failed write atomic, even for one)... but an FS can provide such semantics.
Re: the " 'official' point of view" expressed by kernelnewbies.org
On 8/15/06, Edward Shishkin <[EMAIL PROTECTED]> wrote: checksumming is _not_ much easier than ecc-ing from an implementation standpoint, however it would be nice if some part of the errors got fixed without massive surgery performed by fsck We need checksumming even with eccing... ECCing on large spans of data is too computationally costly to do unless we know something is wrong (via a checksum). Let's pause for a minute: when you talk about ECC, what are you actually talking about? Are you talking about a Hamming code (used on RAM, http://en.wikipedia.org/wiki/Hamming_code) or a convolutional code (used on telecom links, http://en.wikipedia.org/wiki/Convolutional_code), or are you talking about an erasure code like RS coding (http://en.wikipedia.org/wiki/Reed-Solomon_code)? I assume in these discussions that you're not talking about an RS-like code... because RAID-5 and RAID-6 are, fundamentally, a form of RS coding. They don't solve bit errors, but when you know you've lost a block of data they can recover it. Non-RS forms of ECC are very slow in software (esp. decoding).. and really aren't that useful: most of the time HDDs will lose data in nice big chunks that erasure codes handle well but other codes do not. The challenge with erasure codes is that you must know that a block is bad... most of the time the drives will tell you, but sometimes corruption leaks through. This is where block-level checksums come into play... they allow you to detect bad blocks, and then your erasure code allows you to recover the data. The checksum must be fast because you must perform it on every read from disk... this makes ECC unsuitable: although it could detect errors, it is too slow. Also, the number of additional errors ECC could fix is very small.. It would simply be better to store more erasure code blocks. An optimal RS code which allows one block of N to fail (and requires one extra block of storage) is computationally trivial. We call it raid-5.
If your 'threat model' is bad sectors rather than bad disks (an increasingly realistic shift), then N needs to have nothing to do with the number of disks you have, and can instead be related to how much protection you want on a file. If 1:N isn't enough for you, RS can be generalized to any number of redundant blocks. Unfortunately, doing so requires modular arithmetic, which current CPUs are not too impressively fast at. However, the Linux raid-6 code demonstrates that two-part parity can be done quite quickly in software. As such, I think 'ecc' is useless.. checksums are useful because they are cheap and allow us to use cheap erasure coding (which could be in a lower-level raid driver, or implemented in the FS) to achieve data integrity. The question of including error coding in the FS or in a lower level is, as far as I'm concerned, so clear a matter that it is hardly worth discussing anymore. In my view it is absolutely idiotic to place redundancy in a lower level. The advantage of placing redundancy in a lower level is code simplicity and sharing. The problems with doing so, however, are manifold. The redundancy requirements for various parts of the file system differ dramatically; without tight FS integration, matching the need to the service is nearly impossible. The most important reason, however, is performance. Raid-5 (and raid-6) suffer a tremendous performance hit because of the requirement to write a full stripe OR execute a read-modify-write cycle. With FS-integrated erasure codes it is possible to adjust the layout of the written blocks to ensure that every write is a full stripe write; effectively you adjust the stripe width with every write to ensure that the write always spans all the disks. Alternatively you can reduce the number of stripe chunks (i.e. number of disks) in the parity computation to make the write fit (although doing so wastes space)... FS redundancy integration also solves the layout problem.
From my experience, most systems with hardware RAID are getting far below optimal performance because even when their FS is smart enough to do file allocation in a RAID-aware way (XFS and, to a lesser extent, ext2/3), this is usually foiled by the partition table at the beginning of the RAID device, resulting in 1 in N FS blocks actually spanning two disks (so reading such a block potentially incurs 2x disk latency).

Separated FS and redundancy layers are an antiquated concept. The FS's job is to provide reliable storage, full stop. It's shocking to see that a dinosaur like Sun has figured this out while the free software community still fights against it.
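The 1-in-N figure falls straight out of the arithmetic. A small sketch with illustrative numbers (64 KiB RAID chunk, 4 KiB FS blocks, and the classic DOS partition start at sector 63, i.e. a 31.5 KiB offset):

```python
chunk = 64 * 1024      # RAID chunk size per disk (illustrative)
fs_block = 4 * 1024    # FS block size (illustrative)
offset = 63 * 512      # legacy DOS partition table start at sector 63

# Count FS blocks, over one chunk-sized window, whose start and end
# fall in different chunks -- i.e. blocks straddling two disks.
blocks_per_chunk = chunk // fs_block
spanning = sum(
    1
    for b in range(blocks_per_chunk)
    if (offset + b * fs_block) // chunk
       != (offset + (b + 1) * fs_block - 1) // chunk
)
```

With these numbers, exactly one FS block in every sixteen crosses a chunk boundary and hence touches two disks; aligning the partition start to the chunk size drives that count to zero.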
Re: the "'official' point of view" expressed by kernelnewbies.org
On 8/15/06, Tom Reinhart <[EMAIL PROTECTED]> wrote:
> Of course, not everyone uses RAID. ECC would benefit some people in
> some cases... no argument there.

We can use RAID mechanisms (an RS erasure code) on a single disk. You could technically call it ECC, but if you do so you will confuse people. "Block-level parity" would be correct.
Re: Reiser4 and LZO compression
On 8/29/06, PFC <[EMAIL PROTECTED]> wrote:
> Anyone has a bench for lzf ?

This is on an Opteron 1.8GHz box. Everything tested hot-cache. Testing on a fairly repetitive but real test case (an SQL dump of one of the Wikipedia tables):

-rw-rw-r-- 1 gmaxwell gmaxwell 426162134 Jul 20 06:54 ../page.sql

$ time lzop -c ../page.sql > page.sql.lzo
real    0m8.618s
user    0m7.800s
sys     0m0.808s

$ time lzop -9c ../page.sql > page.sql.lzo-9
real    4m45.299s
user    4m44.474s
sys     0m0.712s

$ time gzip -1 -c ../page.sql > page.sql.gz
real    0m19.292s
user    0m18.545s
sys     0m0.748s

$ time lzop -d -c ./page.sql.lzo > /dev/null
real    0m3.061s
user    0m2.836s
sys     0m0.224s

$ time gzip -dc page.sql.gz > /dev/null
real    0m7.199s
user    0m7.020s
sys     0m0.176s

$ time ./lzf -d < page.sql.lzf > /dev/null
real    0m2.398s
user    0m2.224s
sys     0m0.172s

-rw-rw-r-- 1 gmaxwell gmaxwell 193853815 Aug 29 10:59 page.sql.gz
-rw-rw-r-- 1 gmaxwell gmaxwell 243497298 Aug 29 10:47 page.sql.lzf
-rw-rw-r-- 1 gmaxwell gmaxwell 259986955 Jul 20 06:54 page.sql.lzo
-rw-rw-r-- 1 gmaxwell gmaxwell 204930904 Jul 20 06:54 page.sql.lzo-9

(Decompression at the differing lzo levels is the same speed.)

None of them really decompress fast enough to keep up with the disks in this system, so losing lzf or lzo wouldn't be a big loss.

(Bonnie scores: floodlamp,64G,,,246163,52,145536,35,,,365198,42,781.2,2,16,4540,69,+,+++,2454,31,4807,76,+,+++,2027,36)
Re: Reiser4 and LZO compression
On 8/29/06, David Masover <[EMAIL PROTECTED]> wrote:
[snip]
> Conversely, compression does NOT make sense if:
> - You spend a lot of time with the CPU busy and the disk idle.
> - You have more than enough disk space.
> - Disk space is cheaper than buying enough CPU to handle compression.
> - You've tried compression, and the CPU requirements slowed you more
>   than you saved in disk access.
[snip]

It's also not always this simple... if you have a single-threaded workload that doesn't overlap CPU and disk well, (de)compression may be free even if you're CPU-bound a lot of the time, since the compression is using CPU cycles which would otherwise have been idle.
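A toy sketch of that overlap (zlib standing in for whatever codec the FS uses; the "disk/consumer" side is simulated, and all names are invented for illustration). A background thread decompresses the next chunk while the consumer works on the current one, so the codec's CPU time rides on cycles a purely sequential loop would leave idle:

```python
import queue
import threading
import zlib

def producer(compressed_chunks, q):
    """Decompress chunks ahead of the consumer; CPU work is overlapped."""
    for c in compressed_chunks:
        q.put(zlib.decompress(c))  # CPU-heavy step, off the consumer's path
    q.put(None)                    # end-of-stream sentinel

# Fake "on-disk" data: four compressed 4 KiB chunks.
chunks = [zlib.compress(bytes([i]) * 4096) for i in range(4)]

q = queue.Queue(maxsize=2)         # bounded read-ahead
threading.Thread(target=producer, args=(chunks, q), daemon=True).start()

out = []
while (item := q.get()) is not None:
    out.append(item)               # stand-in for the disk-bound consumer work
```

In the sequential version the consumer would pay for every `decompress` call directly; here the cost hides behind the consumer's own stalls, which is exactly when compression comes out "free".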