Plugin for corruption resistance?

2005-02-11 Thread Gregory Maxwell
Anyone ever given a thought to adding support to reiserfs to store a
cryptographic checksum along with a file?


The idea is that files get a hidden attribute that contains their SHA1 hash.
If the file is modified, the hash is marked as 'unclean'. A trusted
cleaner comes by eventually and hashes the file, OR the file is hashed
right away if someone tries to read the attribute while the file is
unclean.

Fsck could be optionally told to go check the hash on every file.
Files could also be tested via a background process that randomly
tests some files every night.
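
Roughly, in userspace terms, the idea looks something like the sketch
below (the attribute names and the 'unclean' flag are invented here for
illustration; a real plugin would keep this in the file metadata and flip
the flag from the write path):

    import hashlib, os

    HASH_ATTR = "user.sha1"            # invented attribute names
    DIRTY_ATTR = "user.sha1_unclean"

    def mark_unclean(path):
        # Called whenever the file is modified.
        os.setxattr(path, DIRTY_ATTR, b"1")

    def clean(path):
        # The "trusted cleaner": rehash the file, clear the dirty flag.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        os.setxattr(path, HASH_ATTR, h.hexdigest().encode())
        try:
            os.removexattr(path, DIRTY_ATTR)
        except OSError:
            pass

    def get_hash(path):
        # Rehash on demand if the stored value is stale or missing.
        try:
            os.getxattr(path, DIRTY_ATTR)
            clean(path)
        except OSError:
            pass
        try:
            return os.getxattr(path, HASH_ATTR).decode()
        except OSError:
            clean(path)
            return os.getxattr(path, HASH_ATTR).decode()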

Why would this be useful?

1. Lots of applications today (such as P2P sharing systems) need the
hashes of files.. it's inefficient to keep recomputing them.  The file
system always knows when a file changes, so it can be set up to always
return the correct hash.

2. Random disk corruption can go undetected (even if the drive's ECC is
sufficient to prevent corruption there, memory, bus, or kernel issues
could corrupt the data; a hash would help detect it).

3. Although there are encrypted block devices available in Linux, none
of them can provide authentication.. So it's possible for an attacker
(with access to your disk) to replace hunks of files with random (and
potentially chosen depending on the chaining mode) crud without
detection.

4. It could greatly speed up casual verification of files for changes
(if you don't trust the kernel to report the true hash, then you
couldn't trust it to return the real file to some userspace file
verifier either). It could also be used to help locate duplicates
in a very efficient manner..


Re: Plugin for corruption resistance?

2005-02-18 Thread Gregory Maxwell
On Thu, 17 Feb 2005 23:28:09 -0500, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> On the flip side, hash functions like MD5 or the SHA family are fairly bulletproof,
> but are essentially impossible to develop an incremental update for (if there
> existed a fast incremental update for the hash function, that would imply a
> very low preimage resistance, rendering it useless as a cryptographic hash).

Tree hashes.
Divide the file into blocks of N bytes and compute size/N block hashes.
Group the hashes into pairs and hash each pair, giving half as many
second-level (N') hashes; this is fast because hashes are small. Group
the N' hashes into pairs and compute N'' hashes, and so on, until you
have reduced everything to a single root hash.

A number of useful tradeoffs are possible: By enlarging N you improve
the strength along various cryptographic dimensions.  By changing the
fanout, and deciding how many of the N hashes you store, which N you
store, which N' you store, etc., you decide how easy it is to update the
hash and what the smallest increment you can test is... you trade off
storage (and a little computation) for this ease.
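
A minimal sketch of the construction (SHA1 leaves over fixed-size blocks,
pairwise reduction to a root; real schemes such as THEX also tag leaf and
internal nodes differently, which is omitted here):

    import hashlib

    def tree_hash(data, block_size=16384):
        # Leaf level: one hash per N-byte block (hash of "" for an empty file).
        level = [hashlib.sha1(data[i:i + block_size]).digest()
                 for i in range(0, len(data), block_size)]
        level = level or [hashlib.sha1(b"").digest()]
        # Reduce pairwise until a single root remains; odd nodes carry up as-is.
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [hashlib.sha1(b"".join(p)).digest() if len(p) == 2 else p[0]
                     for p in pairs]
        return level[0]

Updating one block then only means rehashing that leaf and the handful of
internal hashes above it, provided the intermediate hashes are kept.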

> Also, there's another issue - unlike standard ECC codes that can actually *fix*
> the problem (for at least small number of bit errors), it's unclear what you should
> do if you find a mismatch between the hash of a block and the block contents, as
> you don't know whether it's the actual data or the hash that's corrupted

In my initial suggestion I offered that hashes could be verified by a
userspace daemon, or by fsck (since it's an expensive operation)...
Such policy could be controlled in the daemon.
In most cases I'd like it to make the file inaccessible until I go and
fix it by hand.

It would also be useful to have the checker daemon watch the logs (or
receive notifications through some kernel interface)... and have any
block-level errors (or smartd errors) back-projected up (through RAID
and LVM remappings) to the file system level. After identifying the
potentially corrupted file, it could then test the file.  If the file
has been corrupted, the configured action is taken.

If this policy is in userspace, the level of action sophistication could
be very high: for example, if I were on a distribution with package
management, and the file was outside of /home, and the package flags
didn't indicate it was a config file.. then go fetch the package,
replace the file, and send me an email so I don't forget how wonderful
my OS is. :)


Re: Plugin for corruption resistance?

2005-02-18 Thread Gregory Maxwell
On Fri, 18 Feb 2005 17:09:00 -0500, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> On Fri, 18 Feb 2005 08:36:51 EST, Gregory Maxwell said:
> 
> > Tree hashes.
> > Divide the file into blocks of N bytes and compute size/N block hashes.
> > Group the hashes into pairs and hash each pair, giving half as many
> > second-level (N') hashes; this is fast because hashes are small. Group
> > the N' hashes into pairs and compute N'' hashes, and so on, until you
> > have reduced everything to a single root hash.
> 
> You get massively I/O bound real fast this way.  You may want to re-evaluate
> whether this *really* buys you anything, especially if you're not using some
> sort of guarantee that you know what's actually b0rked...

I brought up tree hashes because someone pointed out there was no way
to incrementally update a normal hash. Tree hashes can easily be
incrementally updated if you keep all the sub parts.

I don't think that would suddenly make it useful for frequently updated files.
 
> > In my initial suggestion I offered that hashes could be verified by a
> > userspace daemon, or by fsck (since it's an expensive operation)...
> > Such policy could be controlled in the daemon.
> > In most cases I'd like it to make the file inaccessible until I go and
> > fix it by hand.
> 
> You're still missing the point that in general, you don't have a way to tell whether
> the block the file lived in went bad, or the block the hash lived in went bad.

I'm not missing the point.  Compare the number of disk blocks a file
takes vs the hash. Compare the ease of atomically updating the file
data vs atomically updating the hash.
If they don't match, it is far more likely that the file has been
silently corrupted than that the hash has been. In either case, something
seriously wrong has happened (i.e. *any* data has been corrupted
without triggering alarms elsewhere).

Wetware will be required to figure out what is going on, and perhaps to
correct a serious problem before it eats the whole file system...

Automagic correction of stuff that is automagically correctable is
useful in that it might prevent something worse from happening... For
example, if the corrupted file was /sbin/init.. regardless of the
cause of the problem I'd be glad if the system took some action while
the wetware was in an uninterruptible sleep. ;)
 
> Sure, if the file *happens* to be ascii text, you can use Wetware 1.5 to scan
> the file and tell which one went bad.  However, you'll need Wetware 2.0 to
> do the same for your multi-gigabyte Oracle database... :)

Such a proposed system would likely not be all that useful on a live
database.. the overhead of computing hashes would likely be too
great..  Rather, it would be useful if the database system used its
knowledge of how data was stored to do this efficiently.

If the database system were written with reiserfs in mind and, rather
than using a couple of big opaque files, it stored its data in tens of
thousands of files... then perhaps such a hashing scheme might
actually work out okay.

> (And yes, I *have* seen cases where Tripwire went completely and totally bananas
> and claimed zillions of files were corrupted, when the *real* problem was that
> the Tripwire database itself had gotten stomped on - so it's *not* a purely
> theoretical issue

The discussion is to store the hash in the file metadata.  ... If that
is getting stomped on, it's a *good* thing if the system goes totally
bananas. In a great many situations I'd rather lose a file completely
than have some random bytes in it silently corrupted. (and of course,
attaching hashes doesn't mean you lose the file... it means it gets
brought to your attention)

As things stand today, there are hundreds of ways a system could end
up with files getting silently corrupted.  Many of them would be
fairly difficult to detect until it's far too late (to recover cleanly
or even detect the root cause).  Right now most distros have a package
management system that can detect changes in some system files, which
is useful against a small subset of these problems, but not most since
it will only detect problems in files that almost never change.

The proposed system of attaching hashes in metadata would protect all
files that are not constantly updated (so that rules out databases and
single-file mailboxes), but could protect most everything else.  And
the things that can't be protected could be, with changes to their
operation that would be worth making for reiserfs for other reasons
anyway (there is no performance reason in reiserfs to make a mailbox
a single file, for example).

Furthermore, attached hashes could greatly speed up applications using
hashes in a way that no userspace solution can:  Userspace solutions
can't maintain a cache of the files' hashes because they have no way to
be *sure* that the file wasn't modified since the hash was computed.

Re: reiser4 plugins

2005-06-24 Thread Gregory Maxwell
On 6/25/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> I wonder if Apple is a better
> social environment for developers these days than Linux?  It would be
> fun to work with Steve Jobs, he has such a sense of vision and a delight
> in new things.  He hires good people too; Dominic Giampaolo is really
> sharp.

And trade freedom for this? In the current situation you can release
code even if the powers that be do not agree. Others can choose to
integrate and distribute your code.. they have and they will..

If you were at Apple and some authority decided against a feature, you
and your team would likely not be permitted to develop it, and if
developed it would not be distributable.

The situation right now is somewhat unfortunate, but it is not
unsolvable. Trading away freedom is seldom a good deal.


Re: reiser4 plugins

2005-06-25 Thread Gregory Maxwell
On 6/26/05, Lincoln Dale <[EMAIL PROTECTED]> wrote:
> the l-k community have asked YOU may times.  any lack of response isn't
> because of the kernel cabal .. its because YOU are refusing to entertain
> any notion that what Reiser4 has implemented is unpalatable to the
> kernel community.

A lot of this is based on misconceptions; for example, in recent times
reiser4 has been faulted for layering violations.. but it doesn't have
them: it neither duplicates nor modifies the VFS.

It has also been requested that reiser4 be changed to move some of
its operations above the VFS... not only would that not make sense
for the currently provided functions, but merging was previously put
off because of changes to the VFS. Now that it doesn't change the VFS,
are we asking Hans to push it off until it does??

It's a filesystem, for god's sake. Hans and his team have worked hard to
minimize its impact and they are still willing to accept more
guidance, even if their patience has started to run a little thin.
The acceptance of reiser4 into the mainline shouldn't be any bigger a
deal than any other filesystem, and yet it is...


Re: reiser4 plugins

2005-06-26 Thread Gregory Maxwell
On 6/27/05, Horst von Brand <[EMAIL PROTECTED]> wrote:
> Wonderful! I carefully "transparently encrypt" my secret files, so
> /everybody/ can read them! Now /that/ is progress!

All of this side-feature argument is completely off-topic for the
inclusion of reiser4, but oh well.

In any case, the real use for encrypted files (vs encrypted
partitions) would be for doing things like tying keying into the login
process so that your files are only accessible while you are logged
in.  This would be a very nice feature on a multiuser system.


Re: reiser4 plugins

2005-06-27 Thread Gregory Maxwell
On 6/27/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> >. But nevertheless it didn't survive, as like V3, with time V4 became
> >slower and slower. In this case no year was needed, but just one month or
> >alike. So end of test...but in fact I'll give V4 another go in the near 
> >future.
> Interesting that it got slower with time.  It sounds like our online
> repacker is much needed.  It will be a priority for after the kernel merge.

Where'd it go? I recall it being activated with:
echo 1 >/sys/fs/reiser4/*/repacker/start


Re: How to user reiser4 and crypt plugin ??

2005-06-28 Thread Gregory Maxwell
On 6/28/05, Hubert Chan <[EMAIL PROTECTED]> wrote:
> > Isn't dm-crypt the new way of doing this?
> Yes and no.  dm-crypt is recommended over cryptoloop.  But there is also
> loop-AES, which is more secure (in some modes) than dm-crypt (currently).

There is now support for both LRW-AES and ESSIV in the mainstream kernel.
With ESSIV the security should be the same as loop-AES, and with
LRW-AES potentially better.

For the highest security you should also use LUKS with cryptsetup,
because it provides password hardening.


Re: reiser4progs do not handle the reiser4 format changes

2005-07-21 Thread Gregory Maxwell
On 7/20/05, Edward Shishkin <[EMAIL PROTECTED]> wrote:
> like other existing
> ones. This will be a way to create cryptcompress files per superblock.
> There is another
> more flexible way (which is compatible with the previous one) to create
> it per file/directory,
> but it uses deprecated metas interface..

Per ... superblock?
Excuse me?

Nonselective use of this feature will be nearly useless. There must be
an API to selectively control the feature.

This sounds like a silly tantrum about the interface changes.


Re: reiser4 performance

2005-08-08 Thread Gregory Maxwell
On 8/8/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> I should add that fsync performance has not been worked on yet, which is
> surely why postgres performance is poor.

Hans, I'm on the postgresql hackers list (although I don't really have
a voice there, so I can't really speak much for reiser4 there)..

One of the 'interesting' issues they face is that the postgresql
database works with 8K pages.  From a performance and reliability
perspective they would benefit impressively from a file system (and
VFS) which could atomically update their 8k pages. Without such a
feature their performance is slaughtered when operating in a mode that
provides the highest reliability, and their reliability is slaughtered
when operating in the highest performance configuration.

I'm sure the PostgreSQL folks would be here themselves asking for help
with this issue... if they weren't so oriented around FreeBSD. :)

If ever you are looking for a killer app for Reiser4 that people who
don't care about the visionary stuff will care about, you couldn't
find one better than postgresql. If you could get postgresql working
as reliably as a double-logged, full-fsync configuration but performing
as fast as a configuration with async writes, you'd have a lot more
supporters.


Re: reiser4 performance

2005-08-08 Thread Gregory Maxwell
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote:
> Gregory Maxwell wrote:
> > On 8/8/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> 
> > If ever you are looking for a killer app for Reiser4 that people who
> > don't care about the visionary stuff will care about:
> 
> Define "visionary"?
> 
> I can name a few things that work best in Reiser4, and very well in v3,
> simply because of efficient storage of small files and lazy allocation:
> 
> webserver -- lots of small files, very few large ones, mostly reads

Given webserver horsepower from the same budget class as the Internet
pipe it is attached to, it is utterly trivial to saturate the pipe
with static content.. even if the files are small, larger than core in
total, and have poor locality (and those conditions are very
infrequently the case).

So for most webserver cases, FS speed doesn't matter. For the few
cases where it does, locality is usually fairly good... so who cares
if the new FS is 2x faster, when it is still 200x slower than RAM? Add
RAM.

> mailserver -- especially IMAP+maildir, lots of small files, read/write
> and so on...

An interesting application space, no doubt. Although you can cure a
lot of sins by tossing solid-state disk at it. :) (or just battery-backed
cache on a hardware RAID controller).

> Gentoo box:
> /usr/portage is over a hundred thousand very small files, updated via
> rsync.  Since they are updated all at once, you get a boost out of lazy
[snip]

The rsync algorithm or the network is going to be your bottleneck there
for the sync... Are you really getting disk-bound during compiles? If
so, increase your -j N.


> These are all things that Reiser4 already does better than anything
> else.  So now we're going to get Postgres to run faster.  I can't wait
> until we have more people hacking on the plugin interface -- then we'll
> have some *real* killer apps.

I agree. Still it would be nice to have some really good bread and
butter improvements.. and a sufficient level of 'transaction' support
exposed to PostgreSQL could result in a huge performance improvement
while improving reliability. "Your database is more reliable on
reiser4" would be a compelling argument... even to those not convinced
by plugins and small files.


Re: reiser4 performance

2005-08-08 Thread Gregory Maxwell
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote:
> "Reiser4 would be great if..." is getting old.  It is great, and it's
> getting even better pretty fast.
> 
> And, by the way, if the transaction interface gets done, it's not just
> databases that will benefit, but also small files.  After all, what kind
> of transactions are used for your OpenOffice document?  
...

Well, there doesn't actually need to be a transaction interface for
postgresql's needs.. it just needs a fairly limited set of assurances
from the VFS/FS that ... aren't usually provided. Beyond that, it
already handles its own transactions.  It looks from Hans' reply like
reiser4 already provides everything needed.. which I had suspected.


Re: reiser4 performance

2005-08-08 Thread Gregory Maxwell
On 8/8/05, David Masover <[EMAIL PROTECTED]> wrote:

> Absolutely.  I'm not knocking your idea, just wanted to clarify that
> "Reiser4 would be great if..." is getting old.  It is great, and it's
> getting even better pretty fast.

(sorry for reply bloat)
I just wanted to point out.. that wasn't my intent. I think the only
'feature' reiser4 needs right now is mainstream inclusion.

My ability to use it is severely hampered by only being able to use it
on boxes running the test kernel of the day.. which are laden with other
issues unrelated to reiser4 that I don't have time to deal with.


Re: reiser4 performance

2005-08-11 Thread Gregory Maxwell
On 8/11/05, PFC <[EMAIL PROTECTED]> wrote:
> 
> > Well, but then you have to tell postgres that it can assume these things
> > about reiser4.
> 
> you can already set the sync mode in the config file to a llot of
> different choices, like fdatasync, fsync, O_SYNC, etc, so a reiser4 option
> would be possibel I guess.

Right, and the PostgreSQL team has already shown that they are willing
to create platform specific options.

Could someone familiar with the reiser4 internals provide some
detailed information about what reiser4 currently provides in this
regard?

The specific concerns would be about controlling the ordering and
atomicity of 8k writes and how to tell when they are fixed to the
media.


Re: KDE integration with ReiserFS 4

2005-08-16 Thread Gregory Maxwell
On 8/16/05, Christian Iversen <[EMAIL PROTECTED]> wrote:
> Are there independent implementations of QT?

Well... Google "harmony project" "qt". 

:)

I think in reality it would be reasonable to handle this like the
Beagle search engine handles its metadata.. Either you enable
extended attrs in your file system, or it uses an SQLite database.
The performance with the SQLite database stinks, but it's there for
those that can't turn on extended attrs on their filesystem.


Re: Basic interface for key management in reiser4 (DRAFT)

2005-08-18 Thread Gregory Maxwell
On 8/18/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> but the idea is to use keys instead of standard unix permissions
> 
> I think you need to store keys in a per process place, and allow
> specifying whether children of a process inherit the keys somehow.

Oh, slick!

I did not previously catch what the advantage would be to using crypto
in the FS rather than just a crypted block device... aside from some
non-critical niceties (like being able to use a random, per-file IV,
which is good from a security perspective).

Now I see what is possible, and I'm really excited.

It will be interesting to see how a system with many keys performs..
most fast implementations of most crypto algs need a computationally
expensive key setup which produces a fairly large working set of
constants for encryption/decryption.

With a per-process structure there should be a way to revoke all
instances of a key from all the other running processes that carry
it.. or at least all processes of a specific user. Otherwise it will
be too easy to accidentally leave keys lying around.

This per process crypto in the FS fits very nicely with a lot of the
other recent security advances in the Linux world.

Thanks for something new and exciting to talk about with my Linux
using friends! :)


Re: Basic interface for key management in reiser4 (DRAFT)

2005-08-19 Thread Gregory Maxwell
On 8/19/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> I think it would make sense to put the keys as files in /proc:
> e.g.
> touch  /proc/1/keys/private/"0893410984328098094321"
> would give init (aka process with id of 1) a new key that its children
> would not inherit
> touch  /proc/1/keys/inheritable/"1893410984328098094321"
> would give init a new key that is children WOULD inherit.
> 
> Not sure what the permissions on that keys directory would be, I guess 700.

Eh, that would make leaks of key information easy..  Since most
applications don't assume that something visible in a file name could
be highly confidential information. :)

Better to have the file in proc be an abbreviated keyid (some kind of
smaller lossy hash of the key). To add, you might echo "label:real key
data" > /proc/1/keys/private/keys, and a file would appear named
/proc/1/keys/private/label-123abcd, which is a user-defined label and
the hash.

Under no condition should a process be able to actually read the key
data.. they can get the ID.. delete keys based on IDs.. etc. But they
can't get the data.. otherwise a process could steal keys. If I take
away a process's key in proc, there should be no way for it to get
any further access to those files.. no chance that it could have
hidden away a copy of that key.
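
Something like the following is what I mean by an abbreviated keyid: only
a short, lossy fingerprint of the key is ever exposed, never the key
material itself (the label/format here is purely made up):

    import hashlib

    def key_entry_name(label, key_material):
        # Short, lossy fingerprint: enough to identify or delete a key,
        # useless for recovering it.
        fingerprint = hashlib.sha1(key_material).hexdigest()[:8]
        return "%s-%s" % (label, fingerprint)

    # e.g. key_entry_name("mailkey", b"0893410984328098094321")
    # yields an entry name like "mailkey-" followed by 8 hex digits.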


Re: Basic interface for key management in reiser4 (DRAFT)

2005-08-19 Thread Gregory Maxwell
On 8/19/05, Edward Shishkin <[EMAIL PROTECTED]> wrote:
> Actually it is critical:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107719798631935&w=2
> But why random? It is slowly.. I would prefer object-id-based one..

The IV doesn't need to be random, but it should be different for every
instance of a file, different every time a file is deleted and
recreated, not increment in any predictable way between files, and be
impossible for a user to control. It should have a low possibility of
reuse.

Earlier Linux dm-crypt had a weakness where the IV incremented with
every block in the file system; this led to some interesting
watermarking attacks. It was possible to form a stream of data with
changes that negated the XORs from the trivially incremented IV, and
thus the first block of each sector could be used to form an
electronic code book. This has since been corrected with a couple of
options (one is to use the cryptographic hash of the block number).

If the user has some way of trivially influencing differences in the
object ID, for example if sequentially created files have sequential
object IDs, then the object ID should be passed through a hash function
so that a user must know the full object ID in order to predict even a
single bit flip.
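
A sketch of the kind of IV derivation I mean (a keyed hash over the object
ID and block number; whether reiser4 would use exactly these inputs is an
assumption on my part):

    import hashlib, hmac, struct

    def block_iv(iv_salt, object_id, block_number, iv_len=16):
        # ESSIV-style: the per-block IV is a keyed hash of (object id,
        # block number), so it cannot be predicted without the salt even
        # if object ids are assigned sequentially.
        msg = struct.pack(">QQ", object_id, block_number)
        return hmac.new(iv_salt, msg, hashlib.sha256).digest()[:iv_len]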


WinFS beta out

2005-08-29 Thread Gregory Maxwell
http://it.slashdot.org/it/05/08/29/2241243.shtml?tid=109&tid=218

Looks like MSFT will be beating Linux to the nextgen FS punch after all. ;)


Wikipedia article

2005-08-29 Thread Gregory Maxwell
Someone might want to update the information on reiser4 at:

http://en.wikipedia.org/wiki/Comparison_of_file_systems
and
http://en.wikipedia.org/wiki/Reiser4

A fair bit of information is missing/out of date it seems.


Re: Some questions about r4

2005-09-01 Thread Gregory Maxwell
On 9/1/05, Pysiak Satriani <[EMAIL PROTECTED]> wrote:
> Hi,
> Is there a r4-patch cooking for 2.6.13 ?
> Is the requirement for not enabling 4k-stacks going away soon?
> If I patch a vanilla kernel with r4, and sop using r4, would you say that the 
> changes I introduced by patching are rather safe to the rest of the kernel, 
> or would you recomend going back to vanilla just in case?

On this point: 2.6.13 is out, is there going to be a pass at getting
reiser4 into 2.6.14?


Re: journal size reiserfs vs reiser4

2005-09-02 Thread Gregory Maxwell
On 9/2/05, Łukasz Mierzwa <[EMAIL PROTECTED]> wrote:
> On Fri, 02 Sep 2005 09:19:55 +0200, Hans Reiser <[EMAIL PROTECTED]> wrote:
> > It could probably be a lot less than 5%, 2% is more than enough I would
> > guess, but we also need to reserve space to get good performance.
> Maybe You can make it an mkfs.reiser4 option, set 5% to default so it won't 
> change anything to 99% of people using reiser4 but will make that 1% that 
> want some other value happy.

What would be really nice would be to have two options: a reserved
amount, and a root-reserved amount, both set at mkfs time. The behaviour
of the ext2/3 filesystem allows you to reserve some portion of the disk
for root use. This prevents various unpleasant system failure modes when
a user task goes nuts and fills the disk. It would be even nicer if,
rather than root/non-root, it could be controlled by a capability/SELinux
context, so you could make it so syslog can't write into that safe
buffer..

But it's kinda moot at this point, mainline inclusion must be the
highest priority right now.


Re: journal size reiserfs vs reiser4

2005-09-06 Thread Gregory Maxwell
On 9/6/05, Tom Vier <[EMAIL PROTECTED]> wrote:
> My vote: put the reserve % in the superblock (if it isn't already) and
> give mkfs a sane default.

Looking at the code it appears it would be easier to make it a mount
option that defaults to 5%. Would that work okay for you?


Re: journal size reiserfs vs reiser4

2005-09-06 Thread Gregory Maxwell
On 9/6/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> Guys, I am sorry, but I just don't think this issue is a priority
> compared to other issues.  Sorry, too much else going on, honestly.

I'd say so:

Per Andrew Morton. 

reiser4 merge status:
Stuck.  Last time we discussed this I asked the reiser4 team to
develop and negotiate a bullet-point list of things to be addressed. 
Once that's agreed to, implement it and then we can merge it.  None of
that has happened and as far as I know, all the review feedback which
was provided was lost.

:(


Re: why does reiserfs list get so much spam?

2005-09-16 Thread Gregory Maxwell
On 9/16/05, Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> yeah, but it is to be used by the end-user.
> 
> the archive will be still filled with spam.
> 
> not everyone who wants to know about reiser subscribes to the list; most
> of the people would just use the archives.

I actually subbed to the list because I was frustrated by spam in the archive.
Whoever runs the archive should at least pipe it through
SpamAssassin. If some legitimate messages are lost from the web archive it
could not be worse than what we have now, an archive made nearly
useless by excessive spam.


Re: Compression Plugin

2005-09-20 Thread Gregory Maxwell
On 9/20/05, Clay Barnes <[EMAIL PROTECTED]> wrote:
> Forgive me if this has been answered recently, but I haven't gotten my
> last two dozen e-mails for today yet.
> 
> Regarding the compression plugin, what sort of compression can one
> expect from it? 
[snip]

Just a general, non-reiser4-specific answer, since the compression
isn't done yet and I don't know the details of reiser4's performance:

It is generally the case that disk-based compression performs somewhat
worse than normal file-based compressors. This is because every block
of data must be compressed alone in order to preserve the random
access semantics of the file system.

This also means that there is less to be gained by using alternative
compression systems (such as bzip2 or, better, LZMA), because most only
pick up their impressive performance as a result of having a much
larger context, and at a greatly increased cost in memory usage.

For disk-based compression LZ77 is pretty good and is so widely used
that people feel comfortable implementing it in kernel space. Another
interesting player is LZO, because decompression requires very little
memory and is VERY VERY fast (I think something like 8x faster than
gzip in my testing). This means that decompression is effectively
free. However, compression is perhaps 10% worse than LZ77 ... on most
hardware the disk is so slow that the decrease in compression
outweighs the improvement in decompression performance. But on systems
with a fast disk array, LZO may be a welcome tradeoff.
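
The context effect is easy to see from userspace; a rough illustration
using zlib as a stand-in for whatever LZ77-family codec the plugin ends up
using (any large text file will do as input):

    import zlib

    def whole_stream(data, level=6):
        # Compress the file as one stream: full back-reference context.
        return len(zlib.compress(data, level))

    def per_block(data, level=6, block=4096):
        # Compress each 4 KiB block independently, as a filesystem must to
        # keep random access; the context resets at every block boundary.
        return sum(len(zlib.compress(data[i:i + block], level))
                   for i in range(0, len(data), block))

    data = open("/usr/share/dict/words", "rb").read()
    print("whole stream:", whole_stream(data))
    print("4K blocks   :", per_block(data))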


Re: Compression Plugin

2005-09-20 Thread Gregory Maxwell
On 9/20/05, David Masover <[EMAIL PROTECTED]> wrote:
> Probably lzo, which is already used for other things like network
> connections (ssh, openvpn, and so on).  The nice thing about lzo is that
> it's fast, faster than gzip or bzip2, and gets decent compression -- not
> great, but decent.  I don't usually get gzip or bzip2 to compress at
> disk speed, but then, I usually crank the compression way up, so YMMV.
> The point of using a fast algorithm is that you not only save space, but
> when you apply it to things like text files, it can actually make things
> go faster.
> 
> But I imagine it will be settable per-file.  Files can be both encrypted
> and compressed, and I think (I hope) it could be with a choice of
> crypto/compression algorithms.

I didn't know SSH supported LZO.  Rsync does though...

Gzip compression is pretty darn quick at lower levels, though
depending on the LZ77 implementation it can be fairly slow at higher
compression levels.

An interesting idea:  select the algo and a range of compression
levels per file, but select the actual compression level at flush time
based on some estimate of how loaded the system is.. :)
Probably not worth it even though the amount of compression and the
speed differ greatly from -1 to -9... I hope no one wastes their time
on it until the more important things are done.. but perhaps a nice
touch.
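
A toy userspace version of the idea, just to make it concrete (the load
threshold and the scaling are arbitrary):

    import os, zlib

    def pick_level(min_level=1, max_level=9, busy_load=4.0):
        # Choose a compression level within the file's configured range
        # based on how busy the machine looks right now.
        load1, _, _ = os.getloadavg()
        if load1 >= busy_load:
            return min_level              # busy: cheapest compression
        span = max_level - min_level
        return max_level - int(span * (load1 / busy_load))

    def flush_block(data):
        return zlib.compress(data, pick_level())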


Re: I request inclusion of reiser4 in the mainline kernel

2005-09-20 Thread Gregory Maxwell
On 9/20/05, Theodore Ts'o <[EMAIL PROTECTED]> wrote:
> The script could be improved by select random locations to damage the
> filesystem, instead of hard-coding the seek=7 value.  Seek=7 is good
> for testing ext2/ext3 filesystems, but it may not be ideal for other
> filesystems.

What would be interesting would be to overwrite random blocks in a
sub-exponentially increasing fashion, then fsck and measure file loss at
every step. You fail the test if the system panics reading a FS that
passed a fsck. It would be interesting to chart files lost and files
silently corrupted over time...

Another interesting thought would be to snapshot a file system over
and over again while it's got a disk workout suite running on it..
Then fsck the snapshots, and check for the amount of data loss and
corruption.

> There is a very interesting paper that I coincidentally just came
> across today that talks about making filesystems robust against
> various different forms of failures of modern disk systems.  It is
> going to be presented at the upcoming 2005 SOSP conference.
> 
> http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf

Very interesting indeed, although it almost seems silly to tackle the
difficult problem of making filesystems highly robust against oddball
failure modes while our RAID subsystem falls horribly on its face in
the fairly common (and conceptually easy to handle) failure mode of a
RAID-5 where two disks have single unreadable blocks on differing
parts of the disk. (The current RAID system hits one bad block, fails
the whole disk; then you attempt a rebuild, and while reading it hits
the other bad block and downs the array.)


Iron files

2005-09-20 Thread Gregory Maxwell
So the post about file system failure modes made me think of something
interesting...

We'd discussed in the past that it would be interesting to store
cryptographic hashes of files as metadata for facilitating
applications which require hashes as well as data integrity.  Of
course, the challenge is making it perform well.. tree hashes make it
possible but still messy.

Another thought on the subject: when we're using the compression plugin
it's quite likely that many blocks will be shrunk quite a bit on
write. We could at that time add a strong checksum (or cryptographic
hash)...  It could just be stored as though it were part of the
compressed data, the cost partly offset by the gains of compression.
It would probably be useful to include the file identity and position
offset in the hash for each sub-part of the file, so that if an
upper-level data structure in the FS were corrupted you'd never
end up with a part of one file silently sitting in the middle of
another.

This would enable a policy where files could never be silently
corrupted.  Protection could be controlled on a file-by-file basis,
just like compression, and could optionally operate in a mode where
check data is only written but not tested (no substantial performance
loss on read, but a risk of returning corrupted data to the application).

Just another thought for the never ending list...


Re: I request inclusion of reiser4 in the mainline kernel

2005-09-20 Thread Gregory Maxwell
On 9/20/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> > I am not a big fan of formal committees, but would be happy to take
> > part in any effort to standardize, code and test the result...
> The committee could simply exchange a set of emails, and agree on
> things.  I doubt it needs to get all complicated.  I suggest you contact
> all the folks you want to be consistent with each other, send us an
> email asking us to all try to work together, and then ask for proposals
> on what we should all conform to.  Distill the proposals, and then
> suggest a common solution.  With luck, we will all just say yes.:)

Another goal of the group should be to formulate a requested set of
changes or extensions to the makers of drives and other storage
systems.  For example, it might be advantageous to be able to disable
bad block relocation and allow the filesystem to perform that function.
The reason is that relocations slaughter streaming read
performance, but the filesystem could still contiguously allocate
around them...

Perhaps a more implementable alternative is just a method to find out
which sectors have been relocated so the data can be moved off of them
and they be avoided. (and potentially they be 'derelocated' to
preserve the relocation space)

Ditto for other layers.. if a filesystem has some internal integrity
function and a RAID sweep has found that the parity doesn't agree, it
would be nice if the FS could check all possible decodings and see if
there is one that is more sane than all the others... This is even
more useful when you have RAID-6 and there are a lot more potential
decodings.

Also things like bubbling up to userspace.. If there is an
unrecoverable read error in a file found during operation or an
automated scan, it should show up in syslog with some working complete
path to the file (as canonical as the fs can provide), and hopefully
an offset to the error. Then my package manager could see if this is a
file replaceable out of a package or if it's user data... If it's user
data, my backup scripts can check the access time on the file and
silently restore it from backup if the file hasn't changed. ... only
leaving me an email "Dear operator, I saved your butt yet again
--love, mr computer"

And finally operator policy.. I'd like corrupted user files to become
permission-denied until I run some command to make them accessible;
don't let my apps hang trying to access them..


Re: Will I need to re-format my partition for using the compression plugin?

2005-09-22 Thread Gregory Maxwell
On 9/22/05, Edward Shishkin <[EMAIL PROTECTED]> wrote:
> Yes. It is impossible to implement all features in one file plugin.
> Checksuming means a low
> performance: in order to read some bytes of such file you will need
> first to read the whole file
> to check a checksum (isnt it?). So it will be suitable for a small
> number of special files.
> To write this new file plugin you will want to use already existing
> cipher and compression
> transform plugins (dont mix it with cryptcompress file plugin which also
> calls those plugins)
> to compress and encrypt your checksumed file.

For file data integrity it would actually be more useful to have a
per-block hash or checksum. This solves the update problem.  It would be
useful if the file offset and some file identifier were also mixed
into the calculation so that a misplaced block will fail as well. This
would fit quite nicely into the existing actions of the cryptcompress
plugin, and could be accomplished as just another compression algo..
one that always adds 64-256 bits of check data per block.. At
least as long as the error handling in the FS is robust enough to be
able to treat a decompression failure as an IO error.  ...

If it were desirable to produce a cryptographically strong checksum
which can be handed to the user, what you would do is perform a hash
per block and store that with each block, then a hash of the hashes,
which is returned to the user. This is called a tree hash (google it);
usually you have a hierarchy deeper than two levels, depending on the
application. This makes incremental updates cheap enough (just hash
the block, then ripple the changes up the tree).  This would remove
the ability to include the file id and offsets directly in the hash,
but I would argue that they should still be used: for example, you
could XOR the hash value with them before writing it to disk and undo
that on reading it back. This would still allow you to detect a
misplaced block but would not make the tree value differ for multiple
copies of the same file.
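
A sketch of that mixing step (whether the object ID and byte offset would
be folded in exactly like this is my assumption):

    import hashlib, struct

    def position_mask(object_id, offset):
        # Value derived from where the block is supposed to live.
        return hashlib.sha1(struct.pack(">QQ", object_id, offset)).digest()

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def stored_block_hash(block, object_id, offset):
        # What goes to disk: the block hash XORed with the position mask.
        return xor(hashlib.sha1(block).digest(),
                   position_mask(object_id, offset))

    def verify_block(block, object_id, offset, stored):
        # Un-XOR on read; a block returned in the wrong place fails here,
        # yet identical files still yield identical un-XORed tree values.
        plain = xor(stored, position_mask(object_id, offset))
        return plain == hashlib.sha1(block).digest()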


Re: Will I need to re-format my partition for using the compression plugin?

2005-09-22 Thread Gregory Maxwell
On 9/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> 1) RSA is useless for this - you really need a symmetric block cipher of some
> sort.  Almost all block ciphers are best used with maximum-entropy input - if
> the attacker can lop out a large part of the keyspace, a brute force attack
> becomes a lot easier.  This is somewhat related to the concept of "Hamming
> Distance". If the attacker tries a brute force attack, and the first 8 bytes 
> of
> the output look like valid HTML, or English text, or anything else
> recognizable, he's almost certainly found found the correct key.  On the other
> hand, well-compressed data has very high entropy - as a result, it becomes
> harder to tell if a correct key has been found.  If it's English text, but
> 3 of the first 8 bytes have the high bit set, it's probably not a correct key.
> If it's compressed, 3 flipped bits in the first 8 bytes will probably still
> represent a valid compressed stream - just of something else wildly different.

It would normally seem silly to use RSA for disk encryption... but
there might be applications, although you'd still never use RSA
directly on user-controlled data.  For example, RSA could be used on a
multi-user server to append mail to a mail file so that once written
the data is only accessible once the user logs on.  The reiser4 crypto
system will use the kernel keyring API, so it would be quite
reasonable to tie encryption to user accounts. 'Write only' files and
'read only' files would be a simple logical extension, and would
require asymmetric cryptography.

Although for most compression algorithms not all inputs are valid
outputs, so this may not work for you... It would be ideal (for disk
encryption) if it were not possible to tell whether you have the right
key without decrypting an entire sector. This requires careful selection
of compression and chaining mode.  Alternatively, it may be possible
to develop a good large-block cipher which, while being much slower
than a single block of a small-block cipher, is faster for a whole disk
block.  For example, Mercy is about 4x faster than AES on my system
but is still 16x slower than AES for the smallest unit of decryption.
Unfortunately Mercy has security problems.

> 2) Even though most modern block ciphers are designed to be fast, it's still
> faster to apply a reasonably quick compression scheme to whomp 16K of data
> down to 5-6K and encrypt/decrypt 5-6k than it is to encrypt/decrypt 16K.

Depends on the compression mode and the cipher. A good AES
implementation is around the same speed as an aggressive gzip. In
general this is correct.


Re: I request inclusion of reiser4 in the mainline kernel

2005-09-23 Thread Gregory Maxwell
On 9/23/05, David Greaves <[EMAIL PROTECTED]> wrote:
> who's not keeping up with the linux-raid list then ;)
>
> David
> PS I'm sure assistance would be appreciated in testing and reviewing
> this few day old feature - or indeed the newer 'add a new disk to the
> array' feature.

After posting that I checked linux-raid, thanked the author,
patched a box, but got called out of town before I could test
anything. :)

This is an important development. ... and it's about darn time!


Re: reiser4 for 2.6.13 is available on our website

2005-09-29 Thread Gregory Maxwell
On 9/29/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> It works on non-amd64, will fix amd64 tomorrow I hope.

No, it fails to compile on FC4/x86 here as well, with the same failure mode.


Re: latest 2.6.13 reiser4 patch

2005-09-30 Thread Gregory Maxwell
On 9/29/05, Artur Makówka <[EMAIL PROTECTED]> wrote:
> Will it work also with 2.6.13.2 kernel  ? or is it only for 2.6.13 ( or
> 2.6.13.1 )
>
> i couldnt find any information about this on page, and i want to be sure...

I used it fine with 2.6.13.2... but my wireless card isn't supported
there (it's been included in 2.6.14.x, so it was never patched against
2.6.13.. I tried backporting my wireless driver but there were a lot
of changes in both directions).  The patch also applies cleanly
against 2.6.14-rc1/2, but 2.6.14-rc1/2 panic on initrd uncompress for
me, and I can't troubleshoot it because the backtrace scrolls my
screen and I can't do a serial console on my laptop. :(

In any case I tested reiser4 out some on 13.2, and the only bug I hit
was a panic on unmount.


Re: latest 2.6.13 reiser4 patch

2005-10-01 Thread Gregory Maxwell
On 10/1/05, Artur Makówka <[EMAIL PROTECTED]> wrote:
> but is there some official statment about using 2.6.x reiser4 patches with
> 2.6.x.x kernels? i mean, does patch called reiser 4 2.6.13 patch is also
> intended to work with 2.6.13.2 for example or only with 2.6.13 ?

You aren't likely to get one. It will work if the things that
reiser4 depends on are not changed in how they operate. The official
kernel doesn't work real hard to preserve external APIs, much less
internal ones. If it works, it works.  Since no one involved in
reiser4 can say with confidence what will be in the next version, we
can't say for sure.


Re: Reiser4 file recovery

2005-10-10 Thread Gregory Maxwell
On 10/10/05, Christian Iversen <[EMAIL PROTECTED]> wrote:

> > I suggest you run Spinrite (grc.com, ~$50 IIRC) on the bad disk from a
> > floppy or CD-ROM in DOS (the program makes images for you in Windows,
> > if you have a working partition, or you can get images from the site
> > IIRC once you've bought a copy) and see how much is recovered
> > (assuming it's just bad sectors or something).  Re-add it to the LVM,
> > recover to a seperate media, and then convert the whole thing to a
> > RAID (maybe via tar?).  I know it's not a free solution, but data
> > recovery is nearly impossible w/o paying in one way, shape, or form.
> > It's easier to have backups.
>
> As usually, Gibson "Research" is skimpy on details, so I'm not entirely sure
> if spinrite is anything more than a disk imager. If not, just use the free
> (gratis && libre) dd_rescue program instead. It will save you $50.

A decade ago SpinRite would put the disk into a low-level mode where
it could read the ADC output and complete raw sectors... then it
performed something like PRML to recover the data. It was a miracle
worker. Back then, between that and a huge box of spare drives to swap
parts from, I was able to recover almost any dead drive that
crossed my desk.

No clue what SpinRite does today, as I doubt that sort of low-level
access is still possible, and even if it is, drives have become smart
enough to do a lot of that on their own.


Transactions faster than locking

2005-10-11 Thread Gregory Maxwell
Saw this on the Postgres list, and I thought this might be interesting
for some of the users here. Interesting in general to think about
expanding transaction orientation in software, with Reiser4 providing
efficient transactions down to the block update level.

-- Forwarded message --
From: Dann Corbit <[EMAIL PROTECTED]>
Date: Oct 11, 2005 2:09 PM
Subject: Re: [HACKERS] Spinlocks and CPU Architectures
To: Simon Riggs <[EMAIL PROTECTED]>, Peter Eisentraut <[EMAIL PROTECTED]>
Cc: pgsql-hackers@postgresql.org, Tom Lane <[EMAIL PROTECTED]>,
[EMAIL PROTECTED]



As an aside, here is a package that has recently been BSD re-licensed:
http://sourceforge.net/projects/libltx/

It is a lightweight memory transaction package.  It comes with a paper
entitled "Cache Sensitive Software Transactional Memory" by Robert
Ennals.

In the paper, Robert Ennals suggests this form of concurrent programming
as a replacement for lock based programming.  A quote:
"We have now reached the point where transactions are outperforming
locks -- and people are starting to get interested."

There are a number of interesting claims in the paper.  Since the
license is now compatible, it may have some interest for integration
into the PostgreSQL core where appropriate.

It would certainly be worthwhile to read the paper and fool around with
the supplied test driver to compare the approaches.

If nobody on the PostgreSQL team has time for the experimentations, it
might be a good project for a PhD candidate at some university.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Simon Riggs
> Sent: Tuesday, October 11, 2005 10:56 AM
> To: Peter Eisentraut
> Cc: pgsql-hackers@postgresql.org; Tom Lane
> Subject: Re: [HACKERS] Spinlocks and CPU Architectures
>
> On Tue, 2005-10-11 at 18:45 +0200, Peter Eisentraut wrote:
> > Tom Lane wrote:
> > > This seems pretty unworkable from a packaging standpoint.  Even if
> > > you teach autoconf how to tell which model it's running on, there's
> > > no guarantee that the resulting executables will be used on that
> > > same machine.
> >
> > A number of packages in the video area (and perhaps others) do compile
> > "sub-architecture" specific variants.  This could be done for
> > PostgreSQL, but you'd probably need to show some pretty convincing
> > performance numbers before people start the packaging effort.
>
> I completely agree, just note that we already have some cases where
> convincing performance numbers exist.
>
> Tom is suggesting having different behaviour for x86 and x86_64. The x86
> will still run on x86_64 architecture would it not? So we'll have two
> binaries for each OS, yes?
>
> In general, where we do find a clear difference, we should at very least
> identify/record which variant the binary is most suitable for. At best
> we could produce different executables, but I understand the packaging
> effort required to do that.
>
> Best Regards, Simon Riggs
>


Re: reiser4 and laptop_mode

2005-10-17 Thread Gregory Maxwell
On 10/17/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> > In fact, if you have enough RAM, you won't ever touch the
> >disk -- deleting a file before it's committed means it never touches disk.
> >
> >It is not as spindown-friendly as laptop_mode, which notices when the
> >drive has to spin up anyway (maybe through a read) and flushes all
> >writes.  Don't know if they are compatible.
> >
> >
> We should work to integrate well with it.  Zam, can you look at that?
> Thanks.

Actually, laptop mode flushes when there is a write and syncs all
pending transactions just before spindown (delaying the write as long
as possible, to hopefully get as much done in one pass as possible).

In the future when enough API is exposed to make a nice interface for
multi-syscall transactions with partial sync (i.e. only forced syncs
of blocks related to transactions which demand physical fixation), it
would be nice if the commit logic were smart enough to grab other
nearby small transactions and batch them into the same commit.


My Dad suggests a redundant copies plugin

2005-10-26 Thread Gregory Maxwell
On 10/25/05, Hans Reiser <[EMAIL PROTECTED]> wrote:
> This would not be (at least in theory) useful for RAID devices, but for
> a user with a single disk drive, it might be useful to have a plugin
> that creates two (or N) copies, and tries to allocate the two copies at
> opposite ends of the disk.  Anyone out there still looking for a plugin
> to write?

It would be more important to have redundancy for important filesystem
datastructures first, because if you lose your filesystem it becomes
very difficult to read your data.

With modern high-density drives bad blocks have become more common
than whole-disk failures; redundancy makes a lot of sense, but the cost
of duplicating every file is simply too great for many applications. The
ability to control the level of redundancy and protection from the
filesystem is becoming increasingly important.

Hans, have you read the Iron filesystems paper
(http://www.cs.wisc.edu/wind/Publications/iron-sosp05.ps) that was
previously cited on this list?  You should at least skim it.


Fwd: My Dad suggests a redundant copies plugin

2005-10-26 Thread Gregory Maxwell
On 10/25/05, Sander <[EMAIL PROTECTED]> wrote:
> That will kill performance badly. First of all the two read/writes
> needed, and second because you have to seek from one end to the disk to
> the other every time you read/write something.

Kill it worse for writes than a filesystem without wandering logs? I
don't think so, since those double-write in any case... Since reiser4
defers flushes you don't end up seeking all over the disk; you will
write out a nice long queue, then do a little seeking to ripple up the
transaction(s).

Reading will not be harmed since we will expect the underlying disk to
report read failures.

> And what is the advantage? You are not protected against a lot of disk
> failures (only against bad blocks, right?).

Bad blocks have become the most common non-transient failure mode of
disks by far.

> There is a lot more advantage in buying two disks and do raid over them.
> This is (much) cheaper, gives better performance and gives more
> protection.

Cheaper how? Two disks are cheaper than one?

> Anyway, no need for a plugin. You can just divide your disk in two
> partitions and configure them as a raid1.
>
> Or am I missing something in your suggestion?

If I do not want, and cannot afford, that level of protection for all
my data, but only for some, RAID-1 is a very wasteful solution.


Re: My Dad suggests a redundant copies plugin

2005-10-26 Thread Gregory Maxwell
On 10/25/05, Ingo Bormuth <[EMAIL PROTECTED]> wrote:
> I agree, real backups are the major weappon against classical data loss due
> to hardware failure.
> Other quite anoying and common causes for data loss are accidentally deleted,
> overwritten or modified files. A _simple_ versioning plugin would be very
> nice to have (I'd definitly use it in /etc).

It would be a sin to implement versioning in reiser4 without taking
advantage of how transactions work, since done right they can provide
this with almost zero overhead... plus it can be difficult to make
sure that you've got a consistent copy of the file because of how many
applications update files (is a version a write() call? no, that won't
work.. so do we only version files that get unlinked and replaced?).
Between these two things a simple implementation wouldn't be so
simple.


COW files

2005-10-29 Thread Gregory Maxwell
Any thought to making a file plugin that creates copy on write files?
The operation would be something like a hardlink which is invisible to
the user and broken as soon as either file is modified.

Files could be COWed by a flag on the cp command (or really, perhaps
that should be the default behavior) or with a utility (perhaps run as
a periodic script to locate duplicates and COW them). This would
greatly speed up the process of copying files.

The behavior on break could be to duplicate the whole COWed file on
the first write, or to allow a COWed file to have alternate choices for
blocks. Files would remain COWed until de-COWed, which would likely be
bad for performance (due to fragmentation of alternate versions
causing gaps in sequential scans), so the repacker could be taught to
de-COW files that have too many alternate blocks.


ZFS - Reiser team reactions?

2005-11-17 Thread Gregory Maxwell
Looks like ZFS is no longer vaporware.
http://www.opensolaris.org/os/community/zfs/docs/

Any commentary from the Reiserfs team?

A (supposedly) production-ready FS that provides the transactions
(using a similar/same tree ripple technique as reiser4), compression,
and snapshotting that we expected with reiser4 in the not too distant
future... as well as checksumming and raid integration (which uses
checksumming to aid reconstruction) that we've barely just talked
about here.

People have already started asking about Linux integration; if it were
not for the fact that the CDDL is incompatible with the GPL, I'd expect
it to make its way into Linux before Reiser4 as well.

A positive point is that this might be a good chance to point out that
Reiser4 has some of ZFS's features already and is the right framework
for building the rest (and much more)... and get some more interest in
getting it merged and completed.


Re: What's the state of Reiser4 inclusion in the mainline kernel ...

2006-01-09 Thread Gregory Maxwell
On 1/9/06, Giovanni A. Orlando <[EMAIL PROTECTED]> wrote:
> Hans,
>
> Can you tell me please the status of Reiser4 in the Kernel?

Here is the thread you were supposed to read... (I think):
http://marc.theaimsgroup.com/?l=linux-kernel&m=113650213621940&w=2


Re: Reiser4 crash 2.6.16-mm1

2006-03-28 Thread Gregory Maxwell
On 3/28/06, Jonathan Briggs <[EMAIL PROTECTED]> wrote:

> But for a production machine that is "producing" something of value, the
> extra cost should not be an issue.  RAM errors are so subtle and so hard
> to find that ECC is of far more value than RAID.  It is obvious when
> your disk fails.
>
> An extra high bit in a credit transaction could cost you $16,384 and you
> might not ever realize what happened. :)
>
> Anyway, off topic, but ECC is highly recommended.

And with the amount of memory that people are putting in modern
systems, single-bit events should be happening on an approximately
weekly basis.

ECC may be more expensive, but it doesn't make memory more expensive
than it was just a few years ago; you really should have it.

But this has gone far off topic.


Linux and atomic write()

2006-04-15 Thread Gregory Maxwell
How does this http://marc.theaimsgroup.com/?t=11448928423 impact
reiser4's atomic writes?


Re: reiserfs performance on ssd

2006-04-27 Thread Gregory Maxwell
On 4/27/06, Sander <[EMAIL PROTECTED]> wrote:
> > I have a simple solid state disk to play with here.
> > See http://nerv.eu.org/iram/
>
> Interesting review, thanks.
>
> To get better reliability you could raid1 them.
> I guess this is a 'must' anyway when used in servers (just like with
> harddisks).
>
> Have to try this product myself..

Because they have no ECC, most failures will just be completely silent
data corruption.
A sadly useless device.


Re: reiserfs performance on ssd

2006-04-27 Thread Gregory Maxwell
On 4/27/06, Toby Thain <[EMAIL PROTECTED]> wrote:
> Sure ECC would be nice, but how does this differ from disk? Silent
> failures are certainly possible.
>
> The fact that error detection and propagation doesn't really happen
> in modern disk subsystems is why systems like Sun's ZFS are coming
> into being.

Um. Because *every* cosmic ray hit (of which you can expect one to two
every week or so with 2+ gigs of ram) will result in data corruption.

It's claimed that disks don't do a great job propagating hard errors,
which is true to an extent. But they *do* manage to handle soft
errors. Without the coding gain provided by block ECC, your modern
high-density drive would be nearly useless.

ZFS wouldn't make iram seriously usable... because AFAIK raidz will
not work on a single device... so even if it can detect a bad block, it
can't correct it.

The problem goes further than that, because the cost of computing block
checksums in software will greatly reduce the performance of the fast
RAM device...

Not that better integrity features are bad... The iron filesystem paper
has a lot of great suggestions that go beyond what ZFS provides, and
it would be wonderful to see them in reiser4 someday. But things need
to progress one step at a time.


Re: reiser4: first impression (vs xfs and jfs)

2006-05-24 Thread Gregory Maxwell

On 5/23/06, Tom Vier <[EMAIL PROTECTED]> wrote:
[snip]

> What i'm doing is rsyncing from a slower drive (on 1394) to the raid1 dev.
> When using r4 (xfs behaves similarly), after several seconds, reading from
> the source and writing to the destination stops for 3 or 4 seconds, then a
> brief burst of writes to the r4 fs (the dest), a 1 second pause, and then
> reading and periodic writes resume, until it happens again.
>
> It seems that both r4 and xfs allow a large number of pages to be dirtied,
> before queuing them for writeback, and this has a negative effect on
> throughput. In my test (rsync'ing ~50gigs of flacs), r4 and xfs are almost
> 10 minutes slower than jfs.

[snip]

Have you tested a pure write load? It may be that rsync's combined
reading and writing is triggering a corner case for FSes with delayed
allocation. It may not be issuing its checksumming reads far enough
ahead of time and ends up disk-latency bound.

It's interesting that you saw the same issues with XFS... I use XFS on
my audio workstation computer because it (combined with a low-latency
patched kernel) had by far the lowest worst-case latencies of all the
FSes I tested at the time.


Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

2006-07-31 Thread Gregory Maxwell

On 7/31/06, Alan Cox <[EMAIL PROTECTED]> wrote:

> It's well accepted that reiserfs3 has some robustness problems in the
> face of physical media errors. The structure of the file system and the
> tree basis make it very hard to avoid such problems. XFS appears to have
> managed to achieve both robustness and better data structures.
>
> How reiser4 compares I've no idea.


Citation?

I ask because your claim differs from the only detailed research that
I'm aware of on the subject[1]. Figure 2 of the iron filesystems
paper shows Ext3 ignoring a great number of data-loss-inducing
failure conditions that Reiser3 detects and panics under.

Are you sure that you aren't commenting on cases where Reiser3 alerts
the user to a critical data condition (via a panic), which leads to a
trouble report, while ext3 ignores the problem, which suppresses the
trouble report from the user?

*1) http://www.cs.wisc.edu/adsl/Publications/iron-sosp05.pdf


Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

2006-08-01 Thread Gregory Maxwell

On 8/1/06, David Masover <[EMAIL PROTECTED]> wrote:

> Yikes.  Undetected.
>
> Wait, what?  Disks, at least, would be protected by RAID.  Are you
> telling me RAID won't detect such an error?

Unless the disk's ECC catches it, RAID won't know anything is wrong.

This is why ZFS offers block checksums... it can then try all the
permutations of RAID reconstruction to find a solution which gives the
right checksum.

Every level of the system must be paranoid and take measures to avoid
corruption if the system is to avoid it... it's a tough problem. It
seems that the ZFS folks have addressed this challenge by collapsing
what are classically separate layers into one.
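[Editor's note: a minimal sketch of what that reconstruct-and-verify loop
looks like for a single-parity stripe. This is an illustration, not ZFS code;
block_checksum() is a stand-in (here a toy FNV-1a hash) for whatever checksum
the filesystem actually stores with the block.]

#include <stdint.h>
#include <string.h>
#include <stddef.h>

static uint32_t block_checksum(uint8_t **c, int n, size_t len)
{
    uint32_t h = 2166136261u;                 /* FNV-1a as a stand-in */
    for (int d = 0; d < n; d++)
        for (size_t i = 0; i < len; i++)
            h = (h ^ c[d][i]) * 16777619u;
    return h;
}

/* chunks[0..n-1] hold the data, parity is their XOR, 'want' is the
   checksum recorded at write time.  Assume one chunk is corrupt,
   rebuild each candidate from parity, and accept the combination
   whose checksum verifies.  Returns the repaired index, or -1. */
int repair_block(uint8_t **chunks, uint8_t *parity,
                 uint32_t want, int n, size_t len)
{
    uint8_t rebuilt[4096];
    if (len > sizeof rebuilt || block_checksum(chunks, n, len) == want)
        return -1;                            /* too big, or nothing to do */

    for (int suspect = 0; suspect < n; suspect++) {
        uint8_t *saved = chunks[suspect];
        memcpy(rebuilt, parity, len);         /* parity ^ other data chunks */
        for (int d = 0; d < n; d++)
            if (d != suspect)
                for (size_t i = 0; i < len; i++)
                    rebuilt[i] ^= chunks[d][i];

        chunks[suspect] = rebuilt;            /* try this permutation */
        if (block_checksum(chunks, n, len) == want) {
            memcpy(saved, rebuilt, len);      /* verified: write repair back */
            chunks[suspect] = saved;
            return suspect;
        }
        chunks[suspect] = saved;
    }
    return -1;   /* more than one chunk is bad, or the parity is bad too */
}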


Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

2006-08-03 Thread Gregory Maxwell

On 8/3/06, Matthias Andree <[EMAIL PROTECTED]> wrote:

> Berkeley DB can, since version 4.1 (IIRC), write checksums (newer
> versions document this as SHA1) on its database pages, to detect
> corruptions and writes that were supposed to be atomic but failed
> (because you cannot write 4K or 16K atomically on a disk drive).


The drive doesn't provide atomic writes for > 1 sector (and I'd hardly
call trashing a sector during a failed write atomic for even one)...
but an FS can provide such semantics.
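[Editor's note: to make the distinction concrete, a minimal userspace sketch,
with made-up names, of the classic way a filesystem already hands applications
an atomic multi-sector update: write a complete new copy, flush it, then
switch over with a single atomic rename.]

#include <stdio.h>
#include <unistd.h>

/* Replace 'path' with 'len' bytes from 'buf'; readers see either the
   old contents or the new, never a half-written mixture. */
int atomic_replace(const char *path, const char *tmp,
                   const void *buf, size_t len)
{
    FILE *f = fopen(tmp, "w");
    if (!f) return -1;
    if (fwrite(buf, 1, len, f) != len) { fclose(f); return -1; }
    if (fflush(f) || fsync(fileno(f)))  { fclose(f); return -1; }
    fclose(f);
    return rename(tmp, path);    /* the atomic switch, provided by the FS */
}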


Re: the " 'official' point of view" expressed by kernelnewbies.org

2006-08-15 Thread Gregory Maxwell

On 8/15/06, Edward Shishkin <[EMAIL PROTECTED]> wrote:

> checksumming is _not_ much easier than ECC-ing from an implementation
> standpoint; however, it would be nice if some part of the errors could
> get fixed without massive surgery performed by fsck


We need checksumming even with ECC-ing... ECC on large spans of data
is too computationally costly to do unless we know something is wrong
(via a checksum).

Let's pause for a minute: when you talk about ECC, what are you actually
talking about? Are you talking about a Hamming code (used on RAM,
http://en.wikipedia.org/wiki/Hamming_code), a convolutional code (used
on telecom links, http://en.wikipedia.org/wiki/Convolutional_code), or
an erasure code like RS coding
(http://en.wikipedia.org/wiki/Reed-Solomon_code)?

I assume in these discussions that you're not talking about an RS-like
code... because RAID-5 and RAID-6 are, fundamentally, a form of RS
coding. They don't correct bit errors, but when you know you've lost a
block of data they can recover it.

Non-RS forms of ECC are very slow in software (especially decoding)...
and really aren't that useful: most of the time HDDs will lose data in
nice big chunks that erasure codes handle well but other codes do not.

The challenge with erasure codes is that you must know that a block is
bad... most of the time the drive will tell you, but sometimes
corruption leaks through. This is where block-level checksums come
into play... they allow you to detect bad blocks, and then your erasure
code allows you to recover the data. The checksum must be fast
because you must perform it on every read from disk... this makes ECC
unsuitable, because although it could detect errors, it is too slow.
Also, the number of additional errors ECC could fix is very small...
it would simply be better to store more erasure-code blocks.
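[Editor's note: as an example of how cheap such a per-block checksum can be,
a sketch in the style of the Fletcher-family checksums ZFS uses for this job;
the function name and interface are invented here. One pass over the block, a
few additions per word, no multiplies, so it can run on every read.]

#include <stdint.h>
#include <stddef.h>

/* Four running sums over the block viewed as 32-bit words. */
void fletcher4(const uint32_t *words, size_t nwords, uint64_t out[4])
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (size_t i = 0; i < nwords; i++) {
        a += words[i];
        b += a;
        c += b;
        d += c;
    }
    out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}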

An optimal RS code which allows one block of N to fail (and requires
one extra block of storage) is computationally trivial: we call it
RAID-5. If your 'threat model' is bad sectors rather than bad disks
(an increasingly realistic shift), then N needs to have nothing to do
with the number of disks you have and can instead be related to how
much protection you want on a file.

If 1:N isn't enough for you, RS can be generalized to any number of
redundant blocks. Unfortunately, doing so requires modular arithmetic
which current CPUs are not too impressively fast at. However, the
Linux RAID-6 code demonstrates that two-part parity can be done quite
quickly in software.
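[Editor's note: for concreteness, a rough sketch (not the kernel's actual
implementation) of the two-part parity the Linux RAID-6 code computes: P is
the plain XOR used by RAID-5, and Q is a Reed-Solomon syndrome over GF(2^8)
with the 0x11d polynomial, needing only XORs, a shift, and a conditional
reduction per byte. Disk count and chunk size are whatever the caller passes.]

#include <stdint.h>
#include <stddef.h>

static uint8_t gf2_mul2(uint8_t x)     /* multiply by the generator 2 */
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/* data[d][i]: byte i of the chunk on data disk d (d = 0 .. ndisks-1) */
void raid6_pq(uint8_t **data, int ndisks, size_t len,
              uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t pp = 0, qq = 0;
        for (int d = ndisks - 1; d >= 0; d--) {   /* Horner's rule */
            qq = gf2_mul2(qq) ^ data[d][i];
            pp ^= data[d][i];
        }
        p[i] = pp;                 /* XOR parity, same as RAID-5 */
        q[i] = qq;                 /* Reed-Solomon syndrome      */
    }
}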

As such, I think 'ECC' is useless... checksums are useful because they
are cheap and allow us to use cheap erasure coding (which could be in
a lower-level raid driver, or implemented in the FS) to achieve data
integrity.

The question of including error coding in the FS or in a lower level
is, as far as I'm concerned, so clear a matter that it is hardly worth
discussing anymore.  In my view it is absolutely idiotic to place
redundancy in a lower level.

The advantage of placing redundancy in a lower level is code
simplicity and sharing.

The problem with doing so, however, is manifold.

The redundancy requirements for various parts of the file system
differ dramatically; without tight FS integration, matching the need to
the service is nearly impossible.

The most important reason, however, is performance.  RAID-5 (and
RAID-6) suffer a tremendous performance hit because of the requirement
to either write a full stripe OR execute a read-modify-write cycle.
With FS-integrated erasure codes it is possible to adjust the layout of
the written blocks to ensure that every write is a full-stripe write;
effectively you adjust the stripe width with every write to ensure
that the write always spans all the disks.  Alternatively you can
reduce the number of stripe chunks (i.e. number of disks) in the
parity computation to make the write fit (although doing so wastes
space)...
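[Editor's note: to put made-up but representative numbers on that: on a 4+1
RAID-5 with 64 KiB chunks, a 128 KiB write that does not cover the whole
256 KiB stripe forces the array either to read the two untouched data chunks
so it can recompute parity, or to read the old contents of the two chunks
being overwritten plus the old parity, roughly doubling the I/O for that
write. A filesystem that controls both allocation and parity could instead
lay that 128 KiB out as its own 2+1 stripe and issue nothing but writes.]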

FS redundancy integration also solves the layout problem. From my
experience, most systems with hardware RAID are getting far below
optimal performance, because even when their FS is smart enough to do
file allocation in a RAID-aware way (XFS and, to a lesser extent,
EXT2/3), this is usually foiled by the partition table at the beginning
of the RAID device, resulting in 1 in N FS blocks actually spanning two
disks! (Thus reading such a block incurs potentially 2x disk latency.)

Separated FS and redundancy layers are an antiquated concept... The
FS's job is to provide reliable storage, full stop.  It's shocking to
see that a dinosaur like Sun has figured this out while the free
software community still fights against it.


Re: the " 'official' point of view" expressed by kernelnewbies.org

2006-08-15 Thread Gregory Maxwell

On 8/15/06, Tom Reinhart <[EMAIL PROTECTED]> wrote:

> Of course, not everyone uses RAID.  ECC would benefit some people in some
> cases... no argument there.


We can use RAID mechanisms (an RS erasure code) on a single disk. You
could technically call it ECC, but if you do so you will confuse
people.  "Block-level parity" would be correct.


Re: Reiser4 und LZO compression

2006-08-29 Thread Gregory Maxwell

On 8/29/06, PFC <[EMAIL PROTECTED]> wrote:

> Anyone have a bench for lzf?


This is on an Opteron 1.8GHz box. Everything was tested with a hot cache.

Testing on a fairly repetitive but real test case (an SQL dump of one
of the Wikipedia tables):
-rw-rw-r-- 1 gmaxwell gmaxwell 426162134 Jul 20 06:54 ../page.sql

$time lzop -c ../page.sql > page.sql.lzo
real    0m8.618s
user    0m7.800s
sys     0m0.808s

$time lzop -9c ../page.sql > page.sql.lzo-9
real    4m45.299s
user    4m44.474s
sys     0m0.712s

$time gzip -1 -c ../page.sql > page.sql.gz
real    0m19.292s
user    0m18.545s
sys     0m0.748s

$time lzop -d -c ./page.sql.lzo > /dev/null
real    0m3.061s
user    0m2.836s
sys     0m0.224s

$time gzip -dc page.sql.gz >/dev/null
real    0m7.199s
user    0m7.020s
sys     0m0.176s

$time ./lzf -d  < page.sql.lzf > /dev/null
real    0m2.398s
user    0m2.224s
sys     0m0.172s

-rw-rw-r-- 1 gmaxwell gmaxwell 193853815 Aug 29 10:59 page.sql.gz
-rw-rw-r-- 1 gmaxwell gmaxwell 243497298 Aug 29 10:47 page.sql.lzf
-rw-rw-r-- 1 gmaxwell gmaxwell 259986955 Jul 20 06:54 page.sql.lzo
-rw-rw-r-- 1 gmaxwell gmaxwell 204930904 Jul 20 06:54 page.sql.lzo-9

(decompression of the differing lzo levels is the same speed)

None of them really decompress fast enough to keep up with the disks
in this system, though lzf or lzo wouldn't be a big loss. (Bonnie scores:
floodlamp,64G,,,246163,52,145536,35,,,365198,42,781.2,2,16,4540,69,+,+++,2454,31,4807,76,+,+++,2027,36)


Re: Reiser4 und LZO compression

2006-08-29 Thread Gregory Maxwell

On 8/29/06, David Masover <[EMAIL PROTECTED]> wrote:
[snip]

> Conversely, compression does NOT make sense if:
>    - You spend a lot of time with the CPU busy and the disk idle.
>    - You have more than enough disk space.
>    - Disk space is cheaper than buying enough CPU to handle compression.
>    - You've tried compression, and the CPU requirements slowed you more
> than you saved in disk access.

[snip]

It's also not always this simple... if you have a single-threaded
workload that doesn't overlap CPU and disk well, (de)compression may
be free even if you're still largely CPU-bound, as the compression is
using CPU cycles which would otherwise have been idle.