Re: Journaled filesystem in CURRENT

2002-10-04 Thread Alexander Leidinger

On Fri, 27 Sep 2002 13:06:00 -0400 (EDT)
Garrett Wollman [EMAIL PROTECTED] wrote:

  On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote:
  Yes, bg-fsck isn't really usable at the moment.
 
  They work fine for me for quite a while.  The last buildworld on my
  server was Sept 15th.
 
 Worked fine for me on my home desktop as well -- but I know that fsck
 had little to do in all of the instances I've seen it work, and there
 was no significant disk activity.

Disk activity was/is the key in this case... I haven't tried it with a
recent kernel yet.

Bye,
Alexander.

-- 
   Speak softly and carry a cellular phone.

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message



Re: Journaled filesystem in CURRENT

2002-09-28 Thread Zhihui Zhang



On Thu, 26 Sep 2002, Claus Assmann wrote:

 On Thu, Sep 26, 2002, Zhihui Zhang wrote:
  On Thu, 26 Sep 2002, Claus Assmann wrote:
 
   If someone is interested:
   http://www.sendmail.org/~ca/email/sm-9-rfh.html
 
   Just as a small data point: I get message acceptance rates of
   400msgs/s on a journalling file system (using a normal PC) that
   writes the data into the journal too. AFAICT that's due to the fact
   that fsync() is much faster for this kind of storage.
   
   The important part for mailservers here is the rate at which content
   files can be safely written to disk. From my limited experience,
   journalling file systems are much better here than softupdates.
 
  Can you tell me the approximate sizes of these mails and how they are
  stored?
 
 The tests for sendmail 9 were made with small sizes (1-4KB).  They
 were stored in flat files using 16 directories.
 
 The performance tests for sendmail 8 were done with sizes from 1
 to 40 KB, in a single queue directory (AFAIR).

Hope I can bother you with two more questions (I know nothing about
sendmail beyond its name):

(1) Can sendmail be configured to generate automatic messages for the
purpose of performance testing?

(2) Is each mail stored in its own file?

Thanks,

-Zhihui





Re: Journaled filesystem in CURRENT

2002-09-28 Thread Claus Assmann

On Sat, Sep 28, 2002, Zhihui Zhang wrote:

 Hope I can bother you with two more questions (I know nothing about
 sendmail beyond its name):
 
 (1) Can sendmail be configured to generate automatic messages for the
 purpose of performance testing?

No. sendmail is an MTA, not a performance testing tool...  There
are different load generators available: Netscape has something
(I forgot the name), postal is a test program (which I never got
to compile), and smtp-source/smtp-sink from postfix. I have also
written some load generators, but they require a specific environment.

 (2) Is each mail stored in its own file?

The format of the sendmail mail queue is documented in doc/op/op.*
(should be in /usr/share/doc/smm/08.sendmailop on FreeBSD). In
brief: it uses two files: data (message body) and envelope / headers
/ some other routing data.





Re: Journaled filesystem in CURRENT

2002-09-27 Thread David Schultz

Thus spake Alexander Leidinger [EMAIL PROTECTED]:
 On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED]
 wrote:
 
   We have something better than those. SoftUpdates. Much faster than
   jfs in metadata intensive operations.
  
  If you can stand the 20 minutes of severely degraded performance while
  the background fsck runs after a crash, and the loss of any files
 
 Sometimes it's better to have 20 minutes (or how long it takes to do the
 bg-fsck on your FS) degraded performance, than no performance at all
 (you can have this too, just configure the system to make an fg-fsck
 instead of a bg-fsck)... (how long does it take to check the journal and
 to do some appropriate actions depending on the journal?)

A journalling FS is slightly slower in day-to-day ``everything
works'' operation because all metadata is written twice.  However,
it can recover from a crash with less work, because all it needs
to do is commit the operations in the log following the last
checkpoint.  It doesn't have to look at every cylinder group to
figure out exactly where it might have been interrupted.
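
As a toy illustration of why log replay is cheap, here is a sketch in
Python. It is not any real filesystem's on-disk format: the record
layout and the names `replay` and `checkpoint` are invented for the
example. The point is only that the work depends on the number of
records after the checkpoint, not on the size of the filesystem.

```python
def replay(journal, checkpoint, metadata):
    """Re-apply committed journal records that follow the last checkpoint.

    journal    -- list of (sequence_number, key, value, committed) tuples
    checkpoint -- sequence number of the last checkpoint already on disk
    metadata   -- dict standing in for the on-disk metadata
    """
    for seq, key, value, committed in journal:
        if seq <= checkpoint:
            continue          # already reflected in the on-disk metadata
        if not committed:
            break             # incomplete record: the crash happened here
        metadata[key] = value
    return metadata


journal = [
    (1, "inode 7 size", 4096, True),
    (2, "inode 9 link count", 1, True),
    (3, "inode 9 size", 512, True),
    (4, "inode 12 size", 100, False),   # crash interrupted this record
]
state = replay(journal, checkpoint=1, metadata={"inode 7 size": 4096})
print(state)
```

Records 2 and 3 are applied, record 4 is discarded, and nothing outside
the journal is examined.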

The speed at which background fsck operates should be tunable.  If
you're willing to allow it to take a really long time, you can
make its impact on the system minimal.  I think Kirk's paper on
the topic discusses a way of moderating its speed automatically,
but I forget the details.

  created up to 30 seconds (by default) before the crash.
 
 There's no guarantee with a journaled fs, that the data before the crash
 is on the disk. A journaled fs is in the same boat with SO here.

Any service that really cares about its data being committed to
stable storage should use fsync(2).  Mail servers do this, for
example[1].  If the filesystem implements fsync correctly, then
there is no additional risk.  If you don't use fsync, you can lose
data in a crash with journalling or softupdates, but the metadata
will make it to disk sooner with journalling.
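
For concreteness, a minimal sketch of the fsync pattern in Python,
assuming a POSIX system. The helper name `durable_write` and the
temporary queue directory are made up for the example; this is not how
any particular MTA is written.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and return only after fsync(2) has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:                       # os.write may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # flush the file's data and metadata
    finally:
        os.close(fd)
    # Sync the directory too so the new name itself survives a crash.
    # (Assumption: a POSIX system where a directory can be opened read-only.)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)


queue_dir = tempfile.mkdtemp()
msg_path = os.path.join(queue_dir, "msg-0001")
durable_write(msg_path, b"Subject: test\r\n\r\nhello\r\n")
```

A server would acknowledge the message to the sender only after
`durable_write` returns, which is why the cost of fsync dominates the
acceptance rate.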


[1] I'm not so sure about qm***.  DJB seems to think that anything
but sync mounts are unreliable for mail, so maybe he expects
that fsync(2) is broken on these filesystems and therefore doesn't
use it.




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Alexander Leidinger

On Thu, 26 Sep 2002 12:40:49 -0700 Terry Lambert
[EMAIL PROTECTED] wrote:

   Journalling has advantages that a non-journalling FS with soft
   updates does not -- can not -- have, particularly since it is
   not possible to distinguish a power failure from a hardware
   failure from (some) software failures, and those cases need to
  
  Power failure:
 No problem for both.
  Hardware failure (I assume you think about a HDD failure):
 Read failure: doesn't matter here
 Write failure: either the sector gets remapped (no problem
for both), or the disk is in self destruct
mode (both can't cope with this)
  Software failure:
 Are you talking about bugs in the FS code? Or about a nasty
 person which writes some bad data into the FS structures?
  
   be treated differently for the purposes of recovery.  The soft
  
  Sorry, I don't get it. Can you please be more verbose?
 
 This has been discussed to death before, and Kirk McKusick has
 already posted the definitive post on the topic to FreeBSD-FS.

Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL?

 The upshot is that it is important to distinguish between an
 FS that had only bad cylinder group bitmap contents, and an FS
 that needs a more thorough consistency checking.
 
 You can not do this if the failure reason for the system is not
 recorded in non-volatile memory somewhere.  For a power failure,
 this is practically impossible, unless you have AC loss notification
 with a sufficient DC holdup time (e.g. like in the InterJet II
 power supply).
 
 Note that recent disk drives (I *will not* call them modern)
 will potentially trash sectors, if a power failure occurs during
 writes.

They don't have a power reservoir large enough to write the entire
content of their cache to disk? Damn. But I shouldn't wonder, the current
economy is the result of letting marketing people make decisions.

 One way to handle Scott Dodson's problem (for example) is to add
 a softcheck started flag in the superblock, so that if a crash
 occurs during the abbreviated check, then the full check is done

I asked Kirk a while ago what happens if we have a power failure while
we do a bg-fsck. He told me that this isn't harmful, the actual code
DTRT.

   JFS that journals both data and metadata can recover from all
   three, to a consistent state, and one that journals only
   metadata can recover from two of them.
  
  SO writes the data directly to free sectors in the target
  filesystem. I don't see where journaled data is an improvement in
  fs-consistency here.
 
 The write occurs, or it does not.  The journal entry timestamp
 gets updated after the write completes, or it does not.
 
 Thus, you can always recover a JFS to a consistent state almost
 instantaneously, simply by finding the most recent valid journal
 entry timestamp, and ignoring anything else -- as long as data is
 journalled, and not just metadata.

I'm with Matthias Schündehütte here. SO writes the data and then it
writes the metadata. So either the just written blocks get referenced by
metadata or it does not. So we can recover to a consistent state almost
instantaneously too. The only problem is: when you delete some files,
and the metadata (directory entries) is written, but the free blocks
information isn't updated yet. Then you have to use (bg-)fsck to correct
the free block information. But if you need to go online as fast as
possible with a consistent FS, SO doesn't hold you back from this.

Bye,
Alexander.

-- 
   I believe the technical term is Oops!

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Alexander Leidinger

On Thu, 26 Sep 2002 20:06:00 -0700 David O'Brien [EMAIL PROTECTED]
wrote:

 On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote:
  Yes, bg-fsck isn't really usable at the moment.
 
 They work fine for me for quite a while.  The last buildworld on my
 server was Sept 15th.

It depends on your usage pattern while doing a bg-fsck. I haven't
checked it lately (I have a -current from Aug 26 which I am trying to
update to a current known-good -current).

Bye,
Alexander.

-- 
   Reboot America.

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Tony Finch

Terry Lambert [EMAIL PROTECTED] wrote:
Claus Assmann wrote:
[ ... out of order answer, not related to main topic ... ]
 Per domain doesn't work easily if you have multiple recipients.
 Anyway, the new design clearly distinguishes between the content
 files and the data that is necessary for delivery.

Actually, it works fine, since it performs queue entry splitting,
in the case of multiple recipients.  That yields a 100% hit rate
for per domain queue traversals, since they contain only messages
destined for the domain in question.  But back to JFS...

Exim doesn't do per-domain queue runs; when it successfully delivers
mail to a host it checks its hints database for any queued mail that
can go to the same place and shoves them down the same connection --
no scanning of multiple files involved.
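
A toy model of that idea in Python, for illustration only. It is not
Exim's actual hints database format; the names `waiting`, `defer` and
`after_successful_delivery` are invented here.

```python
from collections import defaultdict

waiting = defaultdict(list)   # host -> message ids queued for that host


def defer(msg_id, host):
    """Record that msg_id could not be delivered to host yet."""
    waiting[host].append(msg_id)


def after_successful_delivery(host):
    """Return (and forget) everything else queued for the same host."""
    return waiting.pop(host, [])


defer("msg-1", "mx.example.org")
defer("msg-2", "mx.example.net")
defer("msg-3", "mx.example.org")
batch = after_successful_delivery("mx.example.org")
print(batch)
```

The lookup is keyed by host, so finding the follow-on messages does not
require opening any queue files.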

Tony (exim bigot).
-- 
f.a.n.finch [EMAIL PROTECTED] http://dotat.at/
ROCKALL: SOUTHWEST BACKING SOUTH 3 OR 4, OCCASIONALLY 5 IN NORTH. MAINLY FAIR.
MODERATE OR GOOD.




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Garrett Wollman

On Thu, 26 Sep 2002 20:06:00 -0700, David O'Brien [EMAIL PROTECTED] said:

 On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote:
 Yes, bg-fsck isn't really usable at the moment.

 They work fine for me for quite a while.  The last buildworld on my
 server was Sept 15th.

Worked fine for me on my home desktop as well -- but I know that fsck
had little to do in all of the instances I've seen it work, and there
was no significant disk activity.

-GAWollman





Re: Journaled filesystem in CURRENT

2002-09-27 Thread Terry Lambert

Tony Finch wrote:
 Exim doesn't do per-domain queue runs; when it successfully delivers
 mail to a host it checks its hints database for any queued mail that
 can go to the same place and shoves them down the same connection --
 no scanning of multiple files involved.

So how does it implement ETRN and ATRN?

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Terry Lambert

Alexander Leidinger wrote:
   Sorry, I don't get it. Can you please be more verbose?
 
  This has been discussed to death before, and Kirk McKusick has
  already posted the definitive post on the topic to FreeBSD-FS.
 
 Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL?

McKusick AND fsck finds the following in the FreeBSD-arch
archives:

http://www.FreeBSD.org/cgi/getmsg.cgi?fetch=278083+282035+/usr/local/www/db/text/2001/freebsd-arch/20010401.freebsd-arch


  Note that recent disk drives (I *will not* call them modern)
  will potentially trash sectors, if a power failure occurs during
  writes.
 
 They don't have a power reservoir large enough to write the entire
 content of their cache to disk? Damn. But I shouldn't wonder, the current
 economy is the result of letting marketing people make decisions.

No, they do not.  At one time Quantum manufactured a 7200 RPM drive
that could do a write/seek/write, using the rotational energy of
the disk, but they quit manufacturing these.  The main reason was that
multimedia disk drives valued the ability to store quickly over
the ability to store correctly (e.g. no thermal recalibration, etc.).


  One way to handle Scott Dodson's problem (for example) is to add
  a softcheck started flag in the superblock, so that if a crash
  occurs during the abbreviated check, then the full check is done
 
 I asked Kirk a while ago what happens if we have a power failure while
 we do a bg-fsck. He told me that this isn't harmful, the actual code
 DTRT.

I'm aware of this posting as well.

The issue here is the answer to the question "What happens
when I fail in such a way as to need a full fsck?".

If your automated default is to do a background fsck, your
kernel will potentially panic as a result of you running on
the FS with only a background fsck in progress (the panic
which could occur would occur because of normal FS operations
in progress, not as a result of the background fsck operation
itself).

After the panic, if you are in a background fsck mode, you
come up and you panic again, for the same reason, because the
underlying condition of the FS is not related to relatively
harmless overallocations in the cylinder group bitmap, which
is the base assumption the background fsck makes.

Thus, to "correct" Scott's problem, you need to mark the start
and end of a background fsck cycle, such that if there is a
failure in the middle of a background fsck, when the system is
rebooted, the failure is dealt with via a full non-background
fsck, if the disk is in the state "background fsck started but
not yet completed".

To deal with intermittent power outages as the source of
failure, you could, as a tunable parameter, set a count of the
number of times a fatal failure must occur before a background
fsck is no longer an option (e.g. 3 failures during a background
fsck, and you go to a foreground fsck).

I have quoted "correct" here, because it's really a workaround,
not a fix, for the underlying problem.
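
The boot-time decision described above could be sketched as follows, in
Python. The superblock field names and the function names are
hypothetical; nothing like this exists in the actual fsck code as far
as this thread says.

```python
MAX_BG_FAILURES = 3   # tunable; 3 is the example value given in the text


def choose_fsck_mode(superblock):
    """Decide between background and foreground fsck at boot.

    superblock is a dict with two hypothetical fields:
      bg_fsck_in_progress -- set when a background fsck starts,
                             cleared when it completes
      bg_fsck_failures    -- crashes seen while that flag was set
    """
    if superblock["bg_fsck_in_progress"]:
        superblock["bg_fsck_failures"] += 1   # last boot died mid-check
    if superblock["bg_fsck_failures"] >= MAX_BG_FAILURES:
        return "foreground"
    superblock["bg_fsck_in_progress"] = True
    return "background"


def bg_fsck_completed(superblock):
    """Record a clean finish so the next crash starts counting from zero."""
    superblock["bg_fsck_in_progress"] = False
    superblock["bg_fsck_failures"] = 0


sb = {"bg_fsck_in_progress": False, "bg_fsck_failures": 0}
# Four boots in a row; each background fsck crashes before completing.
modes = [choose_fsck_mode(sb) for _ in range(4)]
print(modes)
```

Setting `MAX_BG_FAILURES` to 1 gives the stricter variant, where any
crash during a background fsck forces a full foreground check.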


  The write occurs, or it does not.  The journal entry timestamp
  gets updated after the write completes, or it does not.
 
  Thus, you can always recover a JFS to a consistent state almost
  instantaneously, simply by finding the most recent valid journal
  entry timestamp, and ignoring anything else -- as long as data is
  journalled, and not just metadata.
 
 I'm with Matthias Schündehütte here. SO writes the data and then it
 writes the metadata. So either the just written blocks get referenced by
 metadata or it does not. So we can recover to a consistent state almost
 instantaneously too.

Recovering to a consistent state is uninteresting.

Let me explain: you do not want to recover to a consistent state,
per se, you want to recover to *the* consistent state that the FS
*would have been in*, had the failure not occurred, and the
operations which can be rolled forward *had been successful*, and
the operations which can not be rolled forward *had not been
attempted*.

This is the same argument against recovery of an async mounted FS,
following a crash: the number of outstanding operations minus one
is the number of potential operations that left the disk in its
current end state.  Thus, if operations are not ordered, the number
of potential start states grows exponentially.

For example, say I had N related operations in progress; therefore,
the number of consistent states that could have led to the state of
the disk at the time the recovery is attempted is (2^(N-1)).  For
ordered operations, N is always 1, and the result is always (2^0),
or (1) -- therefore it is always possible to recover to *the*
consistent state, rather than *a* consistent state.
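
To put numbers on that formula, a few lines of Python (the function
name `candidate_states` is only for this example):

```python
def candidate_states(n_ops):
    """Number of prior states consistent with the disk, per the 2^(N-1) count."""
    if n_ops < 1:
        raise ValueError("need at least one operation in progress")
    return 2 ** (n_ops - 1)


for n in (1, 2, 4, 8, 16):
    print(n, candidate_states(n))
```

With N = 1 there is exactly one candidate; with only 16 unordered
operations in flight there are already 32768.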

Only recovering to *a* consistent state loses implied metadata
(e.g. related updates to record and index files in a relational
database, etc.).  This is unacceptable.


 The only problem is: when you delete some files,
 and the metadata (directory entries) is written, but the free blocks
 information isn't updated yet. 

Re: Journaled filesystem in CURRENT

2002-09-27 Thread Tony Finch

Terry Lambert [EMAIL PROTECTED] wrote:
Tony Finch wrote:
 Exim doesn't do per-domain queue runs; when it successfully delivers
 mail to a host it checks its hints database for any queued mail that
 can go to the same place and shoves them down the same connection --
 no scanning of multiple files involved.

So how does it implement ETRN and ATRN?

They're both sufficiently unimportant not to make it worth complicating
the MTA to optimise them. Exim lets you specify a shell command that is
run in order to implement these SMTP commands, so it's up to you whether
this involves a queue run (with exim -R) or not. For example, you might
route incoming mail to a dial-up host and use the appendfile transport
to dump it in a directory with use_bsmtp, and cause ETRN commands to
run over that directory. (Although the latter requires extra code.)

I'm interested that you think ETRN is important, because to me it seems
the wrong solution given POP with the *ENV extension, or decent IMAP.

Tony.
-- 
f.a.n.finch [EMAIL PROTECTED] http://dotat.at/
FASTNET: SOUTHEASTERLY 3 OR 4 INCREASING 5, OCCASIONALLY 6 LATER. MAINLY FAIR.
MODERATE OR GOOD.




Re: Journaled filesystem in CURRENT

2002-09-27 Thread Terry Lambert

Tony Finch wrote:
 Terry Lambert [EMAIL PROTECTED] wrote:
 Tony Finch wrote:
  Exim doesn't do per-domain queue runs; when it successfully delivers
  mail to a host it checks its hints database for any queued mail that
  can go to the same place and shoves them down the same connection --
  no scanning of multiple files involved.
 
 So how does it implement ETRN and ATRN?
 
 They're both sufficiently unimportant not to make it worth complicating
 the MTA to optimise them.

AKA It doesn't.  8-) 8-).


 Exim lets you specify a shell command that is
 run in order to implement these SMTP commands, so it's up to you whether
 this involves a queue run (with exim -R) or not. For example, you might
 route incoming mail to a dial-up host and use the appendfile transport
 to dump it in a directory with use_bsmtp, and cause ETRN commands to
 run over that directory. (Although the latter requires extra code.)
 
 I'm interested that you think ETRN is important, because to me it seems
 the wrong solution given POP with the *ENV extension, or decent IMAP.

POP3 is uninteresting because you don't get the opportunity to
reject the email before taking responsibility for delivering it.

IMAP4 is uninteresting for that reason, and because of the amount
of storage space required.  I understand Mark Crispin's goals in
designing IMAP4, but, practically, it's not very usable in a
traditional ISP setting, any more than POP3 is, when what you are
dealing with isn't local users (and implementing Sieve is generally
not an option, both because it's computationally expensive to run
filters on the ISP side, and because the mail clients aren't really
built to replicate filter information to a mail server, even if you
accepted the computational overhead).  Generally, IMAP4 is not used
in commercial mail services, because of where the value proposition
is loaded in commercial mail services -- both because of how mail
clients have been traditionally designed, and because of the service
overhead.  To be blunt, IMAP4 costs money, and POP3 makes money.

ETRN (and ATRN) solve a totally different problem, actually.  They
solve the MTA store-and-forward problem for differential queue run
latencies.  The most common case of this is a transiently connected
terminal email server for a domain where there are permanently
connected MX's which only store and forward.

In the context of the FS discussion -- which is the context in which
this mail server discussion is taking place -- the issue is one of
the ability of the mail server to permit email to transit it.

There are really only three cases where this happens in high enough
volume that anyone cares about FS performance:

1)  Store and forwarding of queue contents for transiently
connected mail servers (e.g. ETRN, ATRN, finger-based
queue running, RADIUS accounting record triggered queue
running, etc.).

2)  Local delivery to a local maildrop (POP3/IMAP4/etc.), in
which case queue performance is a heck of a lot less
important than that of the local delivery agent -- but
that's not what we were discussing.

3)  If you are an open relay for SPAM.

Frankly, if we are talking about #3, I'd just as soon your machines
became so damaged that they could not be rebooted.  If it could catch
on fire, and burn down the hosting facility that was willing to sell
connectivity to a SPAM'mer, well, that would just be gravy.  8-).

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Alexander Leidinger

On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis
[EMAIL PROTECTED] wrote:

  Does CURRENT support journaled filesystem ?
 
 There are no journaling file systems in current at this time.
 Efforts to port both xfs and jfs are underway.

We have something better than those. SoftUpdates. Much faster than jfs
in metadata intensive operations.

Bye,
Alexander.

-- 
  Loose bits sink chips.

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Claus Assmann

On Thu, Sep 26, 2002, Alexander Leidinger wrote:
 On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis
 [EMAIL PROTECTED] wrote:
 
 Does CURRENT support journaled filesystem ?
  
  There are no journaling file systems in current at this time.
  Efforts to port both xfs and jfs are underway.
 
 We have something better than those. SoftUpdates. Much faster than jfs
 in metadata intensive operations.

But much slower in some other applications.

When we tested several filesystems for mailservers (to store the
mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates
by about a factor of 2.




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Alexander Leidinger

On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED]
wrote:

  We have something better than those. SoftUpdates. Much faster than
  jfs in metadata intensive operations.
 
 If you can stand the 20 minutes of severely degraded performance while
 the background fsck runs after a crash, and the loss of any files

Sometimes it's better to have 20 minutes (or how long it takes to do the
bg-fsck on your FS) degraded performance, than no performance at all
(you can have this too, just configure the system to make an fg-fsck
instead of a bg-fsck)... (how long does it take to check the journal and
to do some appropriate actions depending on the journal?)

 created up to 30 seconds (by default) before the crash.

There's no guarantee with a journaled fs, that the data before the crash
is on the disk. A journaled fs is in the same boat with SO here.

Bye,
Alexander.

-- 
Secret hacker rule #11: hackers read manuals.

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Dan Nelson

In the last episode (Sep 26), Alexander Leidinger said:
 On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis
 [EMAIL PROTECTED] wrote:
 Does CURRENT support journaled filesystem ?
  
  There are no journaling file systems in current at this time.
  Efforts to port both xfs and jfs are underway.
 
 We have something better than those. SoftUpdates. Much faster than
 jfs in metadata intensive operations.

If you can stand the 20 minutes of severely degraded performance while
the background fsck runs after a crash, and the loss of any files
created up to 30 seconds (by default) before the crash.

-- 
Dan Nelson
[EMAIL PROTECTED]




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Terry Lambert

Claus Assmann wrote:
  Does CURRENT support journaled filesystem ?
  
   There are no journaling file systems in current at this time.
   Efforts to port both xfs and jfs are underway.
 
  We have something better than those. SoftUpdates. Much faster than jfs
  in metadata intensive operations.
 
 But much slower in some other applications.
 
 When we tested several filesystems for mailservers (to store the
 mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates
 by about a factor of 2.

Hi Claus!  Nice to hear from someone who actually tests things!

I think that what you were probably testing was directory entry
layout and O(N) (linear) vs. O(log2(N)+1) search times for both
non-existent entries on creates, and for any entry on lookup
( / 2 on lookup) .

The best answer for inbound mail is to go to per domain mail
queues, and the best for outbound is to go to hashed outbound
domains (as we discussed at the 2000 Sendmail MOTM gathering).
Per domain mail queues inbound give you a 100% hit rate on
a directory traversal for a queue flush; using hashed outbound
directories isn't a 100% hit rate, but you can keep it above
85% with the right hashing structure, which makes the miss
rate have only 1-2% impact on processing.
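
A rough sketch of hashing outbound mail into queue directories, in
Python. The bucket count of 16, the `q00`-style directory names and the
use of CRC32 are assumptions made for the example, not sendmail's
actual layout.

```python
import zlib

NUM_QUEUE_DIRS = 16   # assumption; chosen only for the example


def queue_dir_for(recipient):
    """Map a recipient address to one of NUM_QUEUE_DIRS queue directories."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    bucket = zlib.crc32(domain.encode("utf-8")) % NUM_QUEUE_DIRS
    return "q%02d" % bucket


a = queue_dir_for("alice@Example.org")
b = queue_dir_for("bob@example.org")
print(a, b)
```

Messages for the same domain always land in the same directory, so a
queue run for that domain scans one bucket only; other domains that
hash to the same bucket are the misses.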


That said, journalling and Soft Updates are totally orthogonal
technologies, just as btree and linear directory structures are
two orthogonal things.

Journalling has advantages that a non-journalling FS with soft
updates does not -- can not -- have, particularly since it is
not possible to distinguish a power failure from a hardware
failure from (some) software failures, and those cases need to
be treated differently for the purposes of recovery.  The soft
updates background recovery cannot do this; the foreground
recovery can, but only if it's not the abbreviated version.  A
JFS that journals both data and metadata can recover from all
three, to a consistent state, and one that journals only
metadata can recover from two of them.

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-26 Thread David Malone

On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote:
 I think that what you were probably testing was directory entry
 layout and O(N) (linear) vs. O(log2(N)+1) search times for both
 non-existent entries on creates, and for any entry on lookup
 ( / 2 on lookup) .

Though dirhash should eliminate most of this...

David.




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Terry Lambert

David Malone wrote:
 On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote:
  I think that what you were probably testing was directory entry
  layout and O(N) (linear) vs. O(log2(N)+1) search times for both
  non-existent entries on creates, and for any entry on lookup
  ( / 2 on lookup) .
 
 Though dirhash should eliminate most of this...

Everybody always says that, and then backs off, when they realize
that a traversal of a mail queue of 100,000 entries, in which the
destination is known by the contents of the file, rather than the
file name, is involved.  8-).

IMO, dirhash is useful in small cases, particularly where locality
of reference is important... which means not during linear traversals
of 100% of a directory on create/iterate and not during linear
traversals of 50% of a directory on lookup of a specific file which
exists or 100% of a directory for a specific file that ends up not
existing.

Cranking the size of the hash up only works to a certain point.

Claus would have to answer this, but I'm pretty sure that the
machines he tested on would have had dirhash, and still ended
up getting bad results for his application (sendmail queue
directories).

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-26 Thread David Malone

  On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote:
   I think that what you were probably testing was directory entry
   layout and O(N) (linear) vs. O(log2(N)+1) search times for both
   non-existent entries on creates, and for any entry on lookup
   ( / 2 on lookup) .
  
  Though dirhash should eliminate most of this...

 Everybody always says that, and then backs off, when they realize
 that a traversal of a mail queue of 100,000 entries, in which the
 destination is known by the contents of the file, rather than the
 file name, is involved.  8-).

If you are searching based on contents of a file, then any directory
layout scheme will require mean N/2 probes on success and N on
failure surely? And if these probes are linear (ie. in the order
they are in the directory) then this really is O(N) both with and
without dirhash 'cos the probes will be O(1).
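
A quick check of the N/2 and N figures, as a Python sketch. The queue
here is just a list of made-up domain names, one per file, and reading
a list entry stands in for opening a queue file.

```python
def probes_to_find(queue, wanted_domain):
    """Count files opened before a message for wanted_domain turns up.

    queue is a list of destination domains, one per queue file, in
    directory order.
    """
    for probes, domain in enumerate(queue, start=1):
        if domain == wanted_domain:
            return probes
    return len(queue)          # miss: every file had to be opened


n = 1000
queue = ["host%d.example" % i for i in range(n)]
hits = [probes_to_find(queue, d) for d in queue]
mean_hit = sum(hits) / n
miss = probes_to_find(queue, "absent.example")
print(mean_hit, miss)
```

The mean for a hit comes out at (N+1)/2 and a miss costs N, whatever
structure the directory itself uses for name lookups.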

David.




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Claus Assmann

On Thu, Sep 26, 2002, Terry Lambert wrote:
 Claus Assmann wrote:

  When we tested several filesystems for mailservers (to store the
  mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates
  by about a factor of 2.
 
 Hi Claus!  Nice to hear from someone who actually tests things!
 
 I think that what you were probably testing was directory entry
 layout and O(N) (linear) vs. O(log2(N)+1) search times for both
 non-existent entries on creates, and for any entry on lookup
 ( / 2 on lookup) .

I doubt it. The number of files in the queue directories was fairly
small during the runs.  Moreover, ReiserFS showed fairly poor
performance, even though it should be good for directory lookups,
right?

 The best answer for inbound mail is to go to per domain mail
 queues, and the best for outbound is to go to hashed outbound
 domains (as we discussed at the 2000 Sendmail MOTM gathering).
 Per domain mail queues inbound give you a 100% hit rate on
 a directory traversal for a queue flush; using hashed outbound
 directories isn't a 100% hit rate, but you can keep it above
 85% with the right hashing structure, which makes the miss
 rate have only 1-2% impact on processing.

Per domain doesn't work easily if you have multiple recipients.
Anyway, the new design clearly distinguishes between the content
files and the data that is necessary for delivery.

If someone is interested:
http://www.sendmail.org/~ca/email/sm-9-rfh.html

Just as a small data point: I get message acceptance rates of
400msgs/s on a journalling file system (using a normal PC) that
writes the data into the journal too. AFAICT that's due to the fact
that fsync() is much faster for this kind of storage.

The important part for mailservers here is the rate at which content
files can be safely written to disk. From my limited experience,
journalling file systems are much better here than softupdates.




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Zhihui Zhang



On Thu, 26 Sep 2002, Claus Assmann wrote:

 On Thu, Sep 26, 2002, Terry Lambert wrote:
  Claus Assmann wrote:
 
   When we tested several filesystems for mailservers (to store the
   mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates
   by about a factor of 2.
  
  Hi Claus!  Nice to hear from someone who actually tests things!
  
  I think that what you were probably testing was directory entry
  layout and O(N) (linear) vs. O(log2(N)+1) search times for both
  non-existent entries on creates, and for any entry on lookup
  ( / 2 on lookup) .
 
 I doubt it. The number of files in the queue directories was fairly
 small during the runs.  Moreover, ReiserFS showed fairly poor
 performance, even though it should be good for directory lookups,
 right?
 
  The best answer for inbound mail is to go to per domain mail
  queues, and the best for outbound is to go to hashed outbound
  domains (as we discussed at the 2000 Sendmail MOTM gathering).
  Per domain mail queues inbound give you a 100% hit rate on
  a directory traversal for a queue flush; using hashed outbound
  directories isn't a 100% hit rate, but you can keep it above
  85% with the right hashing structure, which makes the miss
  rate have only 1-2% impact on processing.
 
 Per domain doesn't work easily if you have multiple recipients.
 Anyway, the new design clearly distinguishes between the content
 files and the data that is necessary for delivery.
 
 If someone is interested:
 http://www.sendmail.org/~ca/email/sm-9-rfh.html
 
 Just as a small data point: I get message acceptance rates of
 400msgs/s on a journalling file system (using a normal PC) that
 writes the data into the journal too. AFAICT that's due to the fact
 that fsync() is much faster for this kind of storage.
 
 The important part for mailservers here is the rate at which content
 files can be safely written to disk. From my limited experience,
 journalling file systems are much better here than softupdates.

Can you tell me the approximate sizes of these mails and how they are
stored?

-Zhihui





Re: Journaled filesystem in CURRENT

2002-09-26 Thread Scott Dodson

I've been having loads of problems with the bg-fsck. After recovering from
a crash/power failure my machine will boot and start the check.  If there's
moderate activity during the time it's checking it will panic and reboot, getting
stuck in a loop most of the time.  I've not seen anyone mention this on the
list, but I was wondering if anyone's experienced this?  This has been ongoing
across many cvsups and buildworlds.

Thanks,
Scott

 Alexander Leidinger[EMAIL PROTECTED] 09/26/02 12:23PM 
On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED]
wrote:

  We have something better than those. SoftUpdates. Much faster than
  jfs in metadata intensive operations.
 
 If you can stand the 20 minutes of severely degraded performance while
 the background fsck runs after a crash, and the loss of any files

Sometimes it's better to have 20 minutes (or how long it takes to do the
bg-fsck on your FS) degraded performance, than no performance at all
(you can have this too, just configure the system to make an fg-fsck
instead of a bg-fsck)... (how long does it take to check the journal and
to do some appropriate actions depending on the journal?)





Re: Journaled filesystem in CURRENT

2002-09-26 Thread Terry Lambert

David Malone wrote:
   On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote:
    I think that what you were probably testing was directory entry
    layout and O(N) (linear) vs. O(log2(N)+1) search times for both
    non-existent entries on creates, and for any entry on lookup
    ( / 2 on lookup) .
  
   Though dirhash should eliminate most of this...
 
  Everybody always says that, and then backs off, when they realize
  that a traversal of a mail queue of 100,000 entries, in which the
  destination is known by the contents of the file, rather than the
  file name, is involved.  8-).
 
 If you are searching based on contents of a file, then any directory
 layout scheme will require mean N/2 probes on success and N on
 failure surely? And if these probes are linear (ie. in the order
 they are in the directory) then this really is O(N) both with and
 without dirhash 'cos the probes will be O(1).

~O((N^4)/8)/2, actually.

You linearly traverse for the queue element files, and then the
queue element files tell you the name of the queue content file,
which you have to look up.  So it's a combined traversal
and lookup on the same directory (in fact: a dirhash buster,
with some of the least optimal behaviour possible).  There are
two additional lookups, which occur to unlink the queue entry
file, and the message file, so it's really (for n queue entries,
which means twice that many directory entries):

 N = n*2
  ---
  \
   >   O(N) * O(N/2) * O(N/2) * O((N-1)/2)
  /
  ---
 N = 0

(Assuming successful delivery and removal of the queue files
on each element iterated).

The way this is fixed in ext3 or most JFS implementations
(both XFS and IBM's OS/2 JFS for Linux) is that the linear
traversal is linear... meaning you don't restart the scan
each time... and the explicit file lookup is O(log2(N+1)).

N * log2(N+1)^3 is significantly smaller than (N^4)/8 (in case
you were wondering about the /8, it's because statistically,
you only have to traverse 50% of the directory entries, on
average, for a linear lookup that results in a hit, but that
only applies to explicit lookups, not the traversal); the result
is (again for n queue entries):

 N = n*2
  ---
  \
   >   O(N) * O(log2(N+1)) * O(log2(N+1)) * O(log2(N))
  /
  ---
 N = 0
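
To get a feel for how far apart the two closed forms quoted above are, here
is a rough sketch that simply evaluates (N^4)/8 and N * log2(N+1)^3 for a
few queue sizes.  It takes the expressions at face value and does not
evaluate the sums; the function names are invented for illustration:

```python
import math


def linear_directory_cost(n):
    """(N^4)/8 with N = 2*n directory entries, as quoted for a linear directory."""
    N = 2 * n
    return N ** 4 / 8


def btree_directory_cost(n):
    """N * log2(N+1)^3 with N = 2*n, as quoted for a btree directory."""
    N = 2 * n
    return N * math.log2(N + 1) ** 3


for n in (100, 10_000, 50_000):
    ratio = linear_directory_cost(n) / btree_directory_cost(n)
    print(f"n={n}: linear/btree ratio is roughly {ratio:.3g}")
```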

There are other data structures that could reduce this further
than btree, FWIW, but implementing them in directories is
moderately hard because of metadata ordering guarantees, and
directory entry locking.  Still, it's probably worth doing, if
you can figure out a way to eliminate the need for directory
vnode locking for modification operations (or can make them
over into range-lock operations, instead).

One obvious fix is to time-order file creations, to try and
keep block locality close to time locality (i.e., if you are
going to create 2 files (f1,f2) in some time interval [t1..t2],
then you try to guarantee that the directory entry block that
contains f2 is after the one that contains f1, so that a linear
progressive search from the current linear traversal location
that resulted in f1 being found is likely to find f2, either
immediately, or at least within the next block or two).  The
problem with doing this is the inability to ensure that the
file you are creating does not exist... without a full traversal.

This requires that there is some cooperation involved, so
that the lookup traversal is picked up following the current
offset of the linear traversal in progress.  It also fails,
if simultaneous traversals occur... assuming that the offset is
not maintained per-process, but instead, per directory.  If it's
per-process, it actually works out, but only because each process
in the sendmail case only has one queue run going on at a time.
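
A small sketch of the "pick up the lookup at the traversal offset" idea,
assuming the directory is a list of names in creation order and that each
control file (qf...) is created just before its content file (df...).  The
helper names are hypothetical:

```python
def probes_from_start(names, wanted):
    """Linear lookup that always restarts at the first entry (wanted must exist)."""
    return names.index(wanted) + 1


def probes_from_offset(names, wanted, offset):
    """Linear lookup that resumes at offset and wraps around once."""
    n = len(names)
    for step in range(n):
        if names[(offset + step) % n] == wanted:
            return step + 1
    return n  # miss: the whole directory was examined


# Directory laid out in creation order: control file, then its content file.
names = []
for i in range(1000):
    names += ["qf%04d" % i, "df%04d" % i]

offset = names.index("qf0700") + 1   # the traversal has just returned qf0700
print(probes_from_start(names, "df0700"))
print(probes_from_offset(names, "df0700", offset))
```

With time-ordered creation the resumed lookup finds the content file in the
very next slot, while the restart-from-zero lookup walks everything before it.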

Note that none of this accounts for the queue entry creation of
two additional files; that's a O(4*N+1), since create requires
that the file not already exist (and is not helped by dirhash
at all, being linear, by definition).  For a btree, that's only
O(2*log2(N+1)+2) for the two insertions.
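
The create-time difference can be sketched by counting name comparisons.
This uses a sorted list with binary search as a stand-in for a btree (it is
not a btree implementation), and the function names are made up:

```python
def linear_create_probes(names, new_name):
    """Unsorted directory: every entry is checked to prove the name is new."""
    probes = 0
    for name in names:
        probes += 1
        if name == new_name:
            raise FileExistsError(new_name)
    return probes


def sorted_create_probes(sorted_names, new_name):
    """Sorted directory: binary search proves the name is new."""
    lo, hi, probes = 0, len(sorted_names), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_names[mid] == new_name:
            raise FileExistsError(new_name)
        if sorted_names[mid] < new_name:
            lo = mid + 1
        else:
            hi = mid
    return probes


directory = sorted("f%06d" % i for i in range(100_000))
print(linear_create_probes(directory, "zzz"))   # one probe per existing entry
print(sorted_create_probes(directory, "zzz"))   # on the order of log2(N+1)
```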

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Alexander Leidinger

On Thu, 26 Sep 2002 10:36:27 -0700 Terry Lambert
[EMAIL PROTECTED] wrote:

 That said, journalling and Soft Updates are totally orthogonal
 technologies, just as btree and linear directory structures are
 two orthogonal things.
 
 Journalling has advantages that a non-journalling FS with soft
 updates does not -- can not -- have, particularly since it is
 not possible to distinguish a power failure from a hardware
 failure from (some) software failures, and those cases need to

Power failure:
   No problem for both.
Hardware failure (I assume you think about a HDD failure):
   Read failure: doesn't matter here
   Write failure: either the sector gets remapped (no problem
  for both), or the disk is in self destruct
  mode (both can't cope with this)
Software failure:
   Are you talking about bugs in the FS code? Or about a nasty
   person who writes some bad data into the FS structures?

 be treated differently for the purposes of recovery.  The soft

Sorry, I don't get it. Can you please be more verbose?

 updates background recovery can not do this; the foreground
 recovery can, but only if it's not the abbreviated version.  A

What are you talking about? Did you manage to get an unexpected
softupdates inconsistency after the last bugfix?

I don't see a difference in the power or hardware failure cases for a
journaled fs and SO. The only reason for a fg-fsck instead of a bg-fsck
(assuming there's no bug in the bg-fsck code path) is if someone
damages the fs-structures on disk (I assume there are no bugs in SO
anymore which result in an unexpected SO inconsistency).

Note: I don't think the actual code path for bg-fsck is bugfree at the
moment (read: I don't trust it at the moment).

 JFS that journals both data and metadata can recover from all
 three, to a consistent state, and one that journals only
 metadata can recover from two of them.

SO writes the data directly to free sectors in the target filesystem. I
don't see where journaled data is an improvement in fs-consistency here.

Bye,
Alexander.

-- 
   It's not a bug, it's tradition!

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Alexander Leidinger

On Thu, 26 Sep 2002 14:54:00 -0400 Scott Dodson
[EMAIL PROTECTED] wrote:

 I've been having loads of problems with the bg-fsck. After recovering
 from a crash/power failure my machine will boot and start the check. 
 If there's moderate activity during the time it's checking it will
 panic and reboot, getting stuck in a loop most of the time.  I've not
 seen anyone mention this on the list, but I was wondering if anyone's
 experienced this?  This has been ongoing across many cvsups and
 buildworlds.

Yes, bg-fsck isn't really usable at the moment.

Bye,
Alexander.

-- 
   Press every key to continue.

http://www.Leidinger.net   Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Terry Lambert

Claus Assmann wrote:
[ ... out of order answer, not related to main topic ... ]
 Per domain doesn't work easily if you have multiple recipients.
 Anyway, the new design clearly distinguishes between the content
 files and the data that is necessary for delivery.

Actually, it works fine, since it performs queue entry splitting,
in the case of multiple recipients.  That yields a 100% hit rate
for per domain queue traversals, since they contain only messages
destined for the domain in question.  But back to JFS...
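
A rough sketch of queue entry splitting, assuming an envelope is just a
content id plus a list of recipient addresses.  The data layout and the
function name are invented for illustration and are not sendmail's:

```python
from collections import defaultdict


def split_by_domain(content_id, recipients):
    """Return one queue entry per recipient domain, all sharing one content file."""
    per_domain = defaultdict(list)
    for rcpt in recipients:
        domain = rcpt.rsplit("@", 1)[1].lower()
        per_domain[domain].append(rcpt)
    return {
        domain: {"content": content_id, "recipients": rcpts}
        for domain, rcpts in per_domain.items()
    }


entries = split_by_domain(
    "df000001",
    ["a@example.org", "b@Example.org", "c@example.net"],
)
for domain, entry in sorted(entries.items()):
    print(domain, entry["recipients"])
```

Each per-domain queue then holds only entries for that domain, which is
where the 100% hit rate on a per-domain traversal comes from.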


[ ... ]
 I doubt it. The number of files in the queue directories was fairly
 small during the runs.  Moreover, ReiserFS showed fairly poor
 performance, even though it should be good for directory lookups,
 right?
[ ... ]
 Just as a small data point: I get message acceptance rates of
 400msgs/s on a journalling file system (using a normal PC) that
 writes the data into the journal too. AFAICT that's due to the fact
 that fsync() is much faster for this kind of storage.
 
 The important part for mailservers here is the rate at which content
 files can be safely written to disk. From my limited experience,
 journalling file systems are much better here than softupdates.

I didn't realize you were running in safe mode; I should have
realized that, since it was supposed to be the only possibility
in a future revision, the last time I looked at the particular
code in question.  I guess I had a stale cache.  8-) 8-).


Note that fsync() is a data operation, not a metadata operation,
in this case, and what we are talking about is queue contents
being committed to stable storage (prior to the 250 Accepted
response, presumably).
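
For reference, the pattern being measured looks roughly like the sketch
below: write the content file, fsync() it, and only then acknowledge.  This
assumes a POSIX system; store_message and the file name are hypothetical,
and the directory fsync() is an extra precaution of mine, not something
claimed in this thread:

```python
import os


def store_message(queue_dir, name, data):
    """Write data to a new queue file and return only once fsync() has returned."""
    path = os.path.join(queue_dir, name)
    # O_EXCL: fail instead of overwriting an existing queue file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:                      # os.write() may write less than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # ask the OS to commit the file data
    finally:
        os.close(fd)
    dir_fd = os.open(queue_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                 # also commit the new directory entry
    finally:
        os.close(dir_fd)
    return path
```

Only after store_message() returns would the server send its 250 response.
Whether the bytes are truly on the platter at that point still depends on
the filesystem and the drive.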


Yes, soft updates does nothing for user data, it is a metadata
technology.  Journalling is implementation dependent; not all
JFS implementations will journal data which is not metadata, so
your results would depend on the JFS.


Yes, if your data is journalled, too, then what it means is that
an fsync() is, effectively, a noop, since the commit to the stable
journal entry is (supposedly) guaranteed before the write call
returns.  That's a *big* supposedly, though.

Note that this is potentially not a real commit, though, and you
would be better off testing with power disconnects on very large
queues.  The reason for this is that you need to verify that the
drives are not, in fact, lying to you, by enabling write caching,
and then returning that the data has been committed, when in fact
it has not.

The difference you are seeing might be attributable to the drive
setting for write caching, in the various OSs (e.g. one with it
disabled, the other with it enabled).

Journalling does not always mean data integrity (it was only ever
intended to mean transactional data integrity, in any case, meaning
you can and sometimes do lose transactions in event of a failure).

If you want to compare apples and apples, you should verify that
the data is in fact journalled, that the fsync() actually does
what it's supposed to do, if the data is not, and that the code
path all the way to the disk supports real commits to stable
storage (#1 thing here is: turn off drive write caching in all
cases).
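
One rough way to spot a path that acknowledges writes early is to time
fsync() itself.  A sketch, assuming a POSIX system; the function name and
the 1 KB payload are made up for illustration:

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, count=50, payload=b"x" * 1024):
    """Append payload and fsync() count times; return completed syncs per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed if elapsed > 0 else float("inf")


print(fsyncs_per_second(tempfile.gettempdir()))
```

A rate far above what the disk can physically do (a 7200 rpm disk turns 120
times a second) hints that something on the path is caching.  It proves
nothing about safety, so the power-disconnect test is still the real check.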


Large queue testing would show the effects that I've discussed in
other emails.  I don't think large throughput with short queue
depths is representative of mail servers (unless you are an open
relay, of course ;^)).  I understand the desire for this, though,
if you are comparing a 2-file queue to a 1-file queue, given the
other effects on deeper queues.  8-(.

-- Terry




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Matthias Schuendehuette

Terry Lambert wrote:
 Yes, soft updates does nothing for user data, it is a metadata
 technology.  Journalling is implementation dependent; not all
 JFS implementations will journal data which is not metadata, so
 your results would depend on the JFS.

I think you are not correct here. If I understand Kirk's paper correctly,
Soft Updates do a sorting/nesting of data and metadata within the
buffer cache. As far as I know, most of the journaling
implementations do metadata journaling and do not guarantee data
consistency (ext3 with data=journal is the only exception I know of),
whereas SU *does* guarantee data consistency (admittedly with a time
lag) because of that nesting of data with metadata.

I'm far from being able to follow this discussion in every detail,
but please correct me if I'm wrong...
-- 
Ciao/BSD - Matthias

Matthias Schuendehuette msch [at] snafu.de, Berlin (Germany)
Powered by FreeBSD 4.7-RC





Re: Journaled filesystem in CURRENT

2002-09-26 Thread David O'Brien

On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote:
 Yes, bg-fsck isn't really usable at the moment.

It has worked fine for me for quite a while.  The last buildworld on my
server was Sept 15th.




Re: Journaled filesystem in CURRENT

2002-09-26 Thread Claus Assmann

On Thu, Sep 26, 2002, Zhihui Zhang wrote:
 On Thu, 26 Sep 2002, Claus Assmann wrote:

  If someone is interested:
  http://www.sendmail.org/~ca/email/sm-9-rfh.html

  Just as a small data point: I get message acceptance rates of
  400msgs/s on a journalling file system (using a normal PC) that
  writes the data into the journal too. AFAICT that's due to the fact
  that fsync() is much faster for this kind of storage.
  
  The important part for mailservers here is the rate at which content
  files can be safely written to disk. From my limited experience,
  journalling file systems are much better here than softupdates.

 Can you tell me the approximate sizes of these mails and how they are
 stored?

The tests for sendmail 9 were made with small sizes (1-4KB).  They
were stored in flat files using 16 directories.

The performance tests for sendmail 8 were done with sizes from 1
to 40 KB, in a single queue directory (AFAIR).




Re: Journaled filesystem in CURRENT

2002-09-25 Thread Brooks Davis

On Wed, Sep 25, 2002 at 04:19:34PM +0300, Anton Yudin wrote:
 
   Does CURRENT support journaled filesystem ?

There are no journaling file systems in current at this time.  Efforts
to port both xfs and jfs are underway.

-- Brooks

-- 
Any statement of the form X is the one, true Y is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529  9BF0 5D8E 8BE9 F238 1AD4





Re: Journaled filesystem in CURRENT

2002-09-25 Thread Matthias Schuendehuette

If I may add a comment here...

You already *have* a kind of journaled filesystem for some time now.

Please read Soft Updates vs. Journalling Filesystems from M.K. 
McKusick (www.mckusick.com).

It really saddens me to see the effort being spent on porting JFS to
FreeBSD, when JFS already performs more than poorly under Linux.

The only reason for porting JFS is IMHO to be able to mount JFS Volumes 
under FreeBSD - if that's worth the effort...

Why beg for 'Journaling' if you have 'Journaling next generation'?
-- 
Ciao/BSD - Matthias

Matthias Schuendehuette msch [at] snafu.de, Berlin (Germany)
Powered by FreeBSD 4.7-RC





Re: Journaled filesystem in CURRENT

2002-09-25 Thread Juli Mallett

* From: Matthias Schuendehuette [EMAIL PROTECTED] [ Date: 2002-09-25 ]
[ Subject: Re: Journaled filesystem in CURRENT ]
 If I may add a comment here...
 
 You already *have* a kind of journaled filesystem for some time now.
 
 Please read Soft Updates vs. Journalling Filesystems from M.K. 
 McKusick (www.mckusick.com).
 
 It really saddens me to see the effort being spent on porting JFS to
 FreeBSD, when JFS already performs more than poorly under Linux.
 
 The only reason for porting JFS is IMHO to be able to mount JFS Volumes 
 under FreeBSD - if that's worth the effort...
 
 Why beg for 'Journaling' if you have 'Journaling next generation'?

People concentrating on interoperability uses of filesystems are out
of their minds to be writing them in-kernel, when they could be running
them from userland as userland nfs servers, accessing the raw disks.
All we need is to make this a default part of the system, add a libuserfs
to provide an abstraction layer, and tada.

Let me know when there's a JFS4NFS and I'll give a damn, cause then I
can use it everywhere I'd need to.

FWIW, background fsck and softdep and ufs2 will give you all of the good
stuff you *see* from using a journaled fs, without the corruption of a
whole disk, like my girlfriend's laptop went through, running with Linnex
XFS.

juli.
-- 
Juli Mallett [EMAIL PROTECTED]   | FreeBSD: The Power To Serve
Will break world for fulltime employment. | finger [EMAIL PROTECTED]
http://people.FreeBSD.org/~jmallett/  | Support my FreeBSD hacking!
