Re: Journaled filesystem in CURRENT
On Fri, 27 Sep 2002 13:06:00 -0400 (EDT) Garrett Wollman [EMAIL PROTECTED] wrote: On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote: Yes, bg-fsck isn't really usable at the moment. They work fine for me for quite a while. The last buildworld on my server was Sept 15th. Worked fine for me on my home desktop as well -- but I know that fsck had little to do in all of the instances I've seen it work, and there was no significant disk activity. Disk activity was/is the key in this case... I haven't tried it with a recent kernel yet. Bye, Alexander. -- Speak softly and carry a cellular phone. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002, Claus Assmann wrote: On Thu, Sep 26, 2002, Zhihui Zhang wrote: On Thu, 26 Sep 2002, Claus Assmann wrote: If someone is interested: http://www.sendmail.org/~ca/email/sm-9-rfh.html Just as a small data point: I get message acceptance rates of 400msgs/s on a journalling file system (using a normal PC) that writes the data into the journal too. AFAICT that's due to the fact that fsync() is much faster for this kind of storage. The important part for mailservers here is the rate at which content files can be safely written to disk. From my limited experience, journalling file systems are much better here than softupdates. Can you tell me the approximate sizes of these mails and how they are stored? The tests for sendmail 9 were made with small sizes (1-4KB). They were stored in flat files using 16 directories. The performance tests for sendmail 8 were done with sizes from 1 to 40 KB, in a single queue directory (AFAIR). Hope I can bother you with two more questions (I know nothing about sendmail beyond its name): (1) Can sendmail be configured to generate automatic messages for the purpose of performance tests? (2) Is each mail stored in its own file? Thanks, -Zhihui
Re: Journaled filesystem in CURRENT
On Sat, Sep 28, 2002, Zhihui Zhang wrote: Hope I can bother you with two more questions (I know nothing about sendmail beyond its name): (1) Can sendmail be configured to generate automatic messages for the purpose of performance tests? No. sendmail is an MTA, not a performance testing tool... There are different load generators available: Netscape has something (I forgot the name), postal is a test program (which I never got to compile), and smtp-source/smtp-sink from postfix. I have also written some load generators, but they require a specific environment. (2) Is each mail stored in its own file? The format of the sendmail mail queue is documented in doc/op/op.* (should be in /usr/share/doc/smm/08.sendmailop on FreeBSD). In brief: it uses two files per message: one for the data (message body) and one for the envelope / headers / some other routing data.
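The two-files-per-message layout described above can be sketched as follows. This is a minimal illustration assuming the classic sendmail 8 naming convention (a qfXXXX control/envelope file paired with a dfXXXX data file); it is not taken from sendmail's actual source:

```python
import os

def pair_queue_entries(queue_dir):
    """Pair sendmail-style control (qf*) and data (df*) files by queue ID.

    Assumes the sendmail 8 convention: qfXXXX holds the envelope/headers,
    dfXXXX holds the message body. Returns {queue_id: (control, data)}.
    """
    entries = {}
    for name in os.listdir(queue_dir):
        if name.startswith(("qf", "df")):
            qid = name[2:]
            slot = entries.setdefault(qid, [None, None])
            slot[0 if name.startswith("qf") else 1] = os.path.join(queue_dir, name)
    # keep only complete pairs; an unmatched qf or df suggests an
    # interrupted enqueue or delivery
    return {qid: tuple(pair) for qid, pair in entries.items() if None not in pair}
```

A queue runner built on this would open the control file of each pair to decide where the corresponding data file must be delivered.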
Re: Journaled filesystem in CURRENT
Thus spake Alexander Leidinger [EMAIL PROTECTED]: On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED] wrote: We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. If you can stand the 20 minutes of severely degraded performance while the background fsck runs after a crash, and the loss of any files Sometimes it's better to have 20 minutes (or however long the bg-fsck of your FS takes) of degraded performance than no performance at all (you can have that too: just configure the system to do an fg-fsck instead of a bg-fsck)... (how long does it take to check the journal and take the appropriate actions based on the journal?) A journalling FS is slightly slower in day-to-day ``everything works'' operation because all metadata is written twice. However, it can recover from a crash with less work, because all it needs to do is commit the operations in the log following the last checkpoint. It doesn't have to look at every cylinder group to figure out exactly where it might have been interrupted. The speed at which background fsck operates should be tunable. If you're willing to allow it to take a really long time, you can make its impact on the system minimal. I think Kirk's paper on the topic discusses a way of moderating its speed automatically, but I forget the details. created up to 30 seconds (by default) before the crash. There's no guarantee with a journaled fs that the data from before the crash is on the disk. A journaled fs is in the same boat as SO here. Any service that really cares about its data being committed to stable storage should use fsync(2). Mail servers do this, for example[1]. If the filesystem implements fsync correctly, then there is no additional risk. If you don't use fsync, you can lose data in a crash with journalling or softupdates, but the metadata will make it to disk sooner with journalling. [1] I'm not so sure about qm***. 
DJB seems to think that anything but sync mounts is unreliable for mail, so maybe he expects that fsync(2) is broken on these filesystems and therefore doesn't use it.
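The fsync(2) discipline the thread keeps returning to can be sketched as follows. This is a minimal illustration of the general pattern (durable write, then durable directory entry), not any particular MTA's actual code:

```python
import os

def commit_message(path, data):
    """Durably write a mail message before acknowledging it.

    An MTA must not report a message as accepted until both the file
    data AND its directory entry are on stable storage -- this holds
    whether the filesystem uses journalling or soft updates.
    """
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)             # push the file's data (and its metadata) to disk
    finally:
        os.close(fd)
    os.rename(tmp, path)         # atomic on POSIX filesystems
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)            # make the new directory entry durable too
    finally:
        os.close(dfd)
```

Claus's observation that journalling filesystems accept messages faster comes down to how cheaply they can satisfy the fsync() calls in a pattern like this.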
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 12:40:49 -0700 Terry Lambert [EMAIL PROTECTED] wrote: Journalling has advantages that a non-journalling FS with soft updates does not -- can not -- have, particularly since it is not possible to distinguish a power failure from a hardware failure from (some) software failures, and those cases need to Power failure: No problem for both. Hardware failure (I assume you mean a HDD failure): Read failure: doesn't matter here. Write failure: either the sector gets remapped (no problem for both), or the disk is in self-destruct mode (neither can cope with this). Software failure: Are you talking about bugs in the FS code? Or about a nasty person who writes some bad data into the FS structures? be treated differently for the purposes of recovery. The soft Sorry, I don't get it. Can you please be more verbose? This has been discussed to death before, and Kirk McKusick has already posted the definitive post on the topic to FreeBSD-FS. Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL? The upshot is that it is important to distinguish between an FS that had only bad cylinder group bitmap contents, and an FS that needs more thorough consistency checking. You can not do this if the failure reason for the system is not recorded in non-volatile memory somewhere. For a power failure, this is practically impossible, unless you have AC loss notification with a sufficient DC holdup time (e.g. like in the InterJet II power supply). Note that recent disk drives (I *will not* call them modern) will potentially trash sectors if a power failure occurs during writes. They don't have a power reservoir large enough to write the entire contents of their cache to disk? Damn. But I shouldn't wonder; the current economy is the result of letting marketing people make decisions. 
One way to handle Scott Dodson's problem (for example) is to add a softcheck started flag in the superblock, so that if a crash occurs during the abbreviated check, then the full check is done I asked Kirk a while ago what happens if we have a power failure while we do a bg-fsck. He told me that this isn't harmful; the current code DTRT. A JFS that journals both data and metadata can recover from all three, to a consistent state, and one that journals only metadata can recover from two of them. SO writes the data directly to free sectors in the target filesystem. I don't see where journaled data is an improvement in fs-consistency here. The write occurs, or it does not. The journal entry timestamp gets updated after the write completes, or it does not. Thus, you can always recover a JFS to a consistent state almost instantaneously, simply by finding the most recent valid journal entry timestamp, and ignoring anything else -- as long as data is journalled, and not just metadata. I'm with Matthias Schündehütte here. SO writes the data and then it writes the metadata. So either the just-written blocks get referenced by metadata or they do not. So we can recover to a consistent state almost instantaneously too. The only problem is: when you delete some files, and the metadata (directory entries) is written, but the free blocks information isn't updated yet. Then you have to use (bg-)fsck to correct the free block information. But if you need to go online as fast as possible with a consistent FS, SO doesn't hold you back from this. Bye, Alexander. -- I believe the technical term is Oops! http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
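The soft-updates ordering argument above (data committed before the metadata that references it) can be captured in a toy model. This is an illustrative sketch of the argument, not FreeBSD's actual soft updates code:

```python
def crash_state(data_written, meta_written):
    """Classify the on-disk state after a crash, given soft-updates-style
    ordering: data is always committed before the metadata referencing it.

    Toy model of the argument above -- the only reachable "bad" state is
    leaked free space (written but unreferenced blocks), which bg-fsck
    can reclaim while the filesystem is live.
    """
    if meta_written and not data_written:
        # metadata pointing at unwritten data: ruled out by the ordering
        raise AssertionError("unreachable under soft updates ordering")
    if data_written and not meta_written:
        return "leaked blocks: data on disk but unreferenced (bg-fsck reclaims)"
    if data_written and meta_written:
        return "fully consistent"
    return "operation never happened: consistent"
```

The unreachable branch is exactly what distinguishes soft updates from an async mount, where metadata may land before data and the crash state can dangle.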
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 20:06:00 -0700 David O'Brien [EMAIL PROTECTED] wrote: On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote: Yes, bg-fsck isn't really usable at the moment. They work fine for me for quite a while. The last buildworld on my server was Sept 15th. It depends on your usage pattern while doing a bg-fsck. I haven't checked it lately (I have a -current from Aug 26 which I'm trying to get updated to a recent known-good -current). Bye, Alexander. -- Reboot America. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
Re: Journaled filesystem in CURRENT
Terry Lambert [EMAIL PROTECTED] wrote: Claus Assmann wrote: [ ... out of order answer, not related to main topic ... ] Per domain doesn't work easily if you have multiple recipients. Anyway, the new design clearly distinguishes between the content files and the data that is necessary for delivery. Actually, it works fine, since it performs queue entry splitting, in the case of multiple recipients. That yields a 100% hit rate for per domain queue traversals, since they contain only messages destined for the domain in question. But back to JFS... Exim doesn't do per-domain queue runs; when it successfully delivers mail to a host it checks its hints database for any queued mail that can go to the same place and shoves them down the same connection -- no scanning of multiple files involved. Tony (exim bigot). -- f.a.n.finch [EMAIL PROTECTED] http://dotat.at/ ROCKALL: SOUTHWEST BACKING SOUTH 3 OR 4, OCCASIONALLY 5 IN NORTH. MAINLY FAIR. MODERATE OR GOOD.
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 20:06:00 -0700, David O'Brien [EMAIL PROTECTED] said: On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote: Yes, bg-fsck isn't really usable at the moment. They work fine for me for quite a while. The last buildworld on my server was Sept 15th. Worked fine for me on my home desktop as well -- but I know that fsck had little to do in all of the instances I've seen it work, and there was no significant disk activity. -GAWollman
Re: Journaled filesystem in CURRENT
Tony Finch wrote: Exim doesn't do per-domain queue runs; when it successfully delivers mail to a host it checks its hints database for any queued mail that can go to the same place and shoves them down the same connection -- no scanning of multiple files involved. So how does it implement ETRN and ATRN? -- Terry
Re: Journaled filesystem in CURRENT
Alexander Leidinger wrote: Sorry, I don't get it. Can you please be more verbose? This has been discussed to death before, and Kirk McKusick has already posted the definitive post on the topic to FreeBSD-FS. Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL? McKusick AND fsck finds the following in the FreeBSD-arch archives: http://www.FreeBSD.org/cgi/getmsg.cgi?fetch=278083+282035+/usr/local/www/db/text/2001/freebsd-arch/20010401.freebsd-arch Note that recent disk drives (I *will not* call them modern) will potentially trash sectors if a power failure occurs during writes. They don't have a power reservoir large enough to write the entire contents of their cache to disk? Damn. But I shouldn't wonder; the current economy is the result of letting marketing people make decisions. No, they do not. At one time Quantum manufactured a 7200 RPM drive that could do a write/seek/write, using the rotational energy of the disk, but they quit manufacturing these. The main reason was that multimedia disk drives valued the ability to store quickly over the ability to store correctly (e.g. no thermal recalibration, etc.). One way to handle Scott Dodson's problem (for example) is to add a softcheck started flag in the superblock, so that if a crash occurs during the abbreviated check, then the full check is done I asked Kirk a while ago what happens if we have a power failure while we do a bg-fsck. He told me that this isn't harmful; the current code DTRT. I'm aware of this posting as well. The issue here is the answer to the question What happens when I fail in such a way as to need a full fsck?. If your automated default is to do a background fsck, your kernel will potentially panic as a result of you running on the FS with only a background fsck in progress (the panic which could occur would occur because of normal FS operations in progress, not as a result of the background fsck operation itself). 
After the panic, if you are in a background fsck mode, you come up and you panic again, for the same reason, because the underlying condition of the FS is not related to relatively harmless overallocations in the cylinder group bitmap, which is the base assumption the background fsck makes. Thus, to "correct" Scott's problem, you need to mark the start and end of a background fsck cycle, such that if there is a failure in the middle of a background fsck, when the system is rebooted, the failure is dealt with via a full non-background fsck, if the disk is in the state background fsck started but not yet completed. To deal with intermittent power outages as the source of failure, you could, as a tunable parameter, set a count of the number of times a fatal failure must occur before a background fsck is no longer an option (e.g. 3 failures during a background fsck, and you go to a foreground fsck). I have put "correct" in quotes here, because it's really a workaround, not a fix, for the underlying problem. The write occurs, or it does not. The journal entry timestamp gets updated after the write completes, or it does not. Thus, you can always recover a JFS to a consistent state almost instantaneously, simply by finding the most recent valid journal entry timestamp, and ignoring anything else -- as long as data is journalled, and not just metadata. I'm with Matthias Schündehütte here. SO writes the data and then it writes the metadata. So either the just-written blocks get referenced by metadata or they do not. So we can recover to a consistent state almost instantaneously too. Recovering to a consistent state is uninteresting. Let me explain: you do not want to recover to a consistent state, per se; you want to recover to *the* consistent state that the FS *would have been in*, had the failure not occurred, and the operations which can be rolled forward *had been successful*, and the operations which can not be rolled forward *had not been attempted*. 
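The boot-time policy Terry sketches (mark the start and end of a bg-fsck cycle, and fall back to a full foreground check after repeated failures) can be written out as a small decision function. The field names here are hypothetical stand-ins for flags that would live in the superblock; this is a sketch of the proposal, not FreeBSD code:

```python
def choose_fsck_mode(state, max_bg_failures=3):
    """Decide between background and full foreground fsck at boot.

    'state' stands in for superblock flags; the names are illustrative.
    Policy per the proposal above: fall back to a full check if a prior
    background check started but never finished, or after too many
    repeated fatal failures (max_bg_failures is the tunable).
    """
    if state.get("bgfsck_started") and not state.get("bgfsck_completed"):
        # crashed mid-bg-fsck: assume damage beyond cg-bitmap overallocations
        return "foreground"
    if state.get("failure_count", 0) >= max_bg_failures:
        # repeated crashes: stop trusting the abbreviated check
        return "foreground"
    return "background"
```

The point of the counter is to break the panic/reboot/panic loop: after the threshold, the system stops betting that the damage is of the harmless kind background fsck assumes.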
This is the same argument against recovery of an async mounted FS, following a crash: the number of outstanding operations minus one is the number of potential operations that left the disk in its current end state. Thus, if operations are not ordered, the number of potential start states grows exponentially. For example, say I had N related operations in progress; then the number of consistent states that could have led to the state of the disk at the time the recovery is attempted is (2^(N-1)). For ordered operations, N is always 1, and the result is always (2^0), or (1) -- therefore it is always possible to recover to *the* consistent state, rather than *a* consistent state. Only recovering to *a* consistent state loses implied metadata (e.g. related updates to record and index files in a relational database, etc.). This is unacceptable. The only problem is: when you delete some files, and the metadata (directory entries) is written, but the free blocks information isn't updated yet.
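The arithmetic in the argument above is small enough to state directly. A sketch of the counting, following Terry's formula:

```python
def candidate_prior_states(n_outstanding, ordered):
    """Count on-disk states that could have preceded a crash with
    n_outstanding in-flight operations.

    Unordered (async) writes: any subset of the other n-1 operations may
    or may not have reached the disk, giving 2**(n-1) candidates, as in
    the argument above. Ordered writes (soft updates / journalling):
    there is effectively one in-flight operation at a time, so exactly
    one prior state is possible and recovery is to *the* consistent state.
    """
    if ordered:
        return 1
    return 2 ** (n_outstanding - 1)

# e.g. 5 related unordered operations leave 16 candidate start states
```
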
Re: Journaled filesystem in CURRENT
Terry Lambert [EMAIL PROTECTED] wrote: Tony Finch wrote: Exim doesn't do per-domain queue runs; when it successfully delivers mail to a host it checks its hints database for any queued mail that can go to the same place and shoves them down the same connection -- no scanning of multiple files involved. So how does it implement ETRN and ATRN? They're both sufficiently unimportant not to make it worth complicating the MTA to optimise them. Exim lets you specify a shell command that is run in order to implement these SMTP commands, so it's up to you whether this involves a queue run (with exim -R) or not. For example, you might route incoming mail to a dial-up host and use the appendfile transport to dump it in a directory with use_bsmtp, and cause ETRN commands to run over that directory. (Although the latter requires extra code.) I'm interested that you think ETRN is important, because to me it seems the wrong solution given POP with the *ENV extension, or decent IMAP. Tony. -- f.a.n.finch [EMAIL PROTECTED] http://dotat.at/ FASTNET: SOUTHEASTERLY 3 OR 4 INCREASING 5, OCCASIONALLY 6 LATER. MAINLY FAIR. MODERATE OR GOOD.
Re: Journaled filesystem in CURRENT
Tony Finch wrote: Terry Lambert [EMAIL PROTECTED] wrote: Tony Finch wrote: Exim doesn't do per-domain queue runs; when it successfully delivers mail to a host it checks its hints database for any queued mail that can go to the same place and shoves them down the same connection -- no scanning of multiple files involved. So how does it implement ETRN and ATRN? They're both sufficiently unimportant not to make it worth complicating the MTA to optimise them. AKA It doesn't. 8-) 8-). Exim lets you specify a shell command that is run in order to implement these SMTP commands, so it's up to you whether this involves a queue run (with exim -R) or not. For example, you might route incoming mail to a dial-up host and use the appendfile transport to dump it in a directory with use_bsmtp, and cause ETRN commands to run over that directory. (Although the latter requires extra code.) I'm interested that you think ETRN is important, because to me it seems the wrong solution given POP with the *ENV extension, or decent IMAP. POP3 is uninteresting because you don't get the opportunity to reject the email before taking responsibility for delivering it; IMAP4 is uninteresting for that reason, and because of the amount of storage space required. I understand Mark Crispin's goals in designing IMAP4, but, practically, it's not very usable in a traditional ISP setting, any more than POP3 is, when what you are dealing with isn't local users (and implementing Sieve is generally not an option, both because it's computationally expensive to run filters on the ISP side, and because the mail clients aren't really built to replicate filter information to a mail server, even if you accepted the computational overhead). Generally, IMAP4 is not used in commercial mail services, because of where the value proposition is loaded in commercial mail services -- both because of how mail clients have been traditionally designed, and because of the service overhead. 
To be blunt, IMAP4 costs money, and POP3 makes money. ETRN (and ATRN) solve a totally different problem, actually. They solve the MTA store-and-forward problem for differential queue run latencies. The most common case of this is a transiently connected terminal email server for a domain where there are permanently connected MX's which only store and forward. In the context of the FS discussion -- which is the context in which this mail server discussion is taking place -- the issue is one of the ability of the mail server to permit email to transit it. There are really only three cases where this happens in high enough volume that anyone cares about FS performance: 1) Store and forwarding of queue contents for transiently connected mail servers (e.g. ETRN, ATRN, finger-based queue running, RADIUS accounting record triggered queue running, etc.). 2) Local delivery to a local maildrop (POP3/IMAP4/etc.), in which case queue performance is a heck of a lot less important than that of the local delivery agent -- but that's not what we were discussing. 3) If you are an open relay for SPAM. Frankly, if we are talking about #3, I'd just as soon your machines became so damaged that they could not be rebooted. If they could catch on fire, and burn down the hosting facility that was willing to sell connectivity to a SPAM'mer, well, that would just be gravy. 8-). -- Terry
Re: Journaled filesystem in CURRENT
On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis [EMAIL PROTECTED] wrote: Does CURRENT support journaled filesystem ? There are no journaling file systems in current at this time. Efforts to port both xfs and jfs are underway. We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. Bye, Alexander. -- Loose bits sink chips. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002, Alexander Leidinger wrote: On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis [EMAIL PROTECTED] wrote: Does CURRENT support journaled filesystem ? There are no journaling file systems in current at this time. Efforts to port both xfs and jfs are underway. We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. But much slower in some other applications. When we tested several filesystems for mailservers (to store the mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates by about a factor of 2.
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED] wrote: We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. If you can stand the 20 minutes of severely degraded performance while the background fsck runs after a crash, and the loss of any files Sometimes it's better to have 20 minutes (or however long the bg-fsck of your FS takes) of degraded performance than no performance at all (you can have that too: just configure the system to do an fg-fsck instead of a bg-fsck)... (how long does it take to check the journal and take the appropriate actions based on the journal?) created up to 30 seconds (by default) before the crash. There's no guarantee with a journaled fs that the data from before the crash is on the disk. A journaled fs is in the same boat as SO here. Bye, Alexander. -- Secret hacker rule #11: hackers read manuals. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7
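The fg-fsck-instead-of-bg-fsck setup Alexander mentions is, assuming the rc.conf knob of this era, a one-line configuration change (check /etc/defaults/rc.conf on your system for the exact variable name and default):

```shell
# /etc/rc.conf -- run the traditional foreground fsck at boot instead of
# mounting dirty filesystems immediately and checking them in the background
background_fsck="NO"
```

With this set, a crashed box stays in single-user fsck until the check completes, trading boot time for the guarantee that the filesystem is fully checked before services start.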
Re: Journaled filesystem in CURRENT
In the last episode (Sep 26), Alexander Leidinger said: On Wed, 25 Sep 2002 11:12:34 -0700 Brooks Davis [EMAIL PROTECTED] wrote: Does CURRENT support journaled filesystem ? There are no journaling file systems in current at this time. Efforts to port both xfs and jfs are underway. We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. If you can stand the 20 minutes of severely degraded performance while the background fsck runs after a crash, and the loss of any files created up to 30 seconds (by default) before the crash. -- Dan Nelson [EMAIL PROTECTED]
Re: Journaled filesystem in CURRENT
Claus Assmann wrote: Does CURRENT support journaled filesystem ? There are no journaling file systems in current at this time. Efforts to port both xfs and jfs are underway. We have something better than those. SoftUpdates. Much faster than jfs in metadata intensive operations. But much slower in some other applications. When we tested several filesystems for mailservers (to store the mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates by about a factor of 2. Hi Claus! Nice to hear from someone who actually tests things! I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times, both for non-existent entries on creates, and for any entry on lookup (/2 on average for a successful lookup). The best answer for inbound mail is to go to per domain mail queues, and the best for outbound is to go to hashed outbound domains (as we discussed at the 2000 Sendmail MOTM gathering). Per domain mail queues inbound give you a 100% hit rate on a directory traversal for a queue flush; using hashed outbound directories isn't a 100% hit rate, but you can keep it above 85% with the right hashing structure, which makes the miss rate have only a 1-2% impact on processing. That said, journalling and Soft Updates are totally orthogonal technologies, just as btree and linear directory structures are two orthogonal things. Journalling has advantages that a non-journalling FS with soft updates does not -- can not -- have, particularly since it is not possible to distinguish a power failure from a hardware failure from (some) software failures, and those cases need to be treated differently for the purposes of recovery. The soft updates background recovery can not do this; the foreground recovery can, but only if it's not the abbreviated version. A JFS that journals both data and metadata can recover from all three, to a consistent state, and one that journals only metadata can recover from two of them. 
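The "hashed outbound domains" layout Terry describes can be sketched as a stable mapping from destination domain to queue subdirectory, so all mail for one domain lands in one bucket and a queue run over that bucket sees mostly related messages. The hash choice and bucket count here are illustrative, not sendmail's actual scheme:

```python
import hashlib
import os

def outbound_queue_dir(queue_root, domain, n_buckets=64):
    """Map a destination domain to one of n_buckets queue directories.

    Uses a stable hash (md5 here, an arbitrary illustrative choice) so
    the mapping survives restarts; lowercases the domain so case
    variants of the same domain share a bucket.
    """
    digest = hashlib.md5(domain.lower().encode("ascii")).digest()
    bucket = digest[0] % n_buckets
    return os.path.join(queue_root, "q%02d" % bucket)
```

A queue run for one destination then scans a single bucket directory instead of the whole queue, which is where the claimed 85%+ hit rate comes from.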
-- Terry
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote: I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times, both for non-existent entries on creates, and for any entry on lookup (/2 on average for a successful lookup). Though dirhash should eliminate most of this... David.
Re: Journaled filesystem in CURRENT
David Malone wrote: On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote: I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times, both for non-existent entries on creates, and for any entry on lookup (/2 on average for a successful lookup). Though dirhash should eliminate most of this... Everybody always says that, and then backs off when they realize that a traversal of a mail queue of 100,000 entries, in which the destination is known by the contents of the file, rather than the file name, is involved. 8-). IMO, dirhash is useful in small cases, particularly where locality of reference is important... which means not during linear traversals of 100% of a directory on create/iterate, and not during linear traversals of 50% of a directory on lookup of a specific file which exists, or 100% of a directory for a specific file that ends up not existing. Cranking the size of the hash up only works to a certain point. Claus would have to answer this, but I'm pretty sure that the machines he tested on would have had dirhash, and still ended up getting bad results for his application (sendmail queue directories). -- Terry
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote: I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times, both for non-existent entries on creates, and for any entry on lookup (/2 on average for a successful lookup). Though dirhash should eliminate most of this... Everybody always says that, and then backs off when they realize that a traversal of a mail queue of 100,000 entries, in which the destination is known by the contents of the file, rather than the file name, is involved. 8-). If you are searching based on the contents of a file, then any directory layout scheme will require a mean of N/2 probes on success and N on failure, surely? And if these probes are linear (i.e. in the order they are in the directory), then this really is O(N) both with and without dirhash, because the individual probes will be O(1). David.
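The probe-count distinction David and Terry are arguing about can be made concrete. A sketch (not UFS's actual dirhash code): dirhash turns a by-name lookup from a linear scan into a constant-time hash probe, but if the queue runner must open every file to read its destination, it scans the whole directory regardless:

```python
def linear_probes(names, target):
    """Probe count for a linear directory scan: a hit costs its position,
    a miss (e.g. checking a new queue file name on create) costs N."""
    for i, name in enumerate(names):
        if name == target:
            return i + 1
    return len(names)

def hashed_probes(index, target):
    """With a dirhash-style hash over the directory (modeled here as a
    set built once per directory), hit and miss both cost O(1) expected
    probes, independent of directory size."""
    _ = target in index   # single expected-constant-time bucket lookup
    return 1

# e.g. Terry's 100,000-entry queue: by-name miss costs 100,000 linear
# probes but 1 hashed probe; a content-based traversal still opens every
# file either way, which is David's point.
```
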
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002, Terry Lambert wrote: Claus Assmann wrote: When we tested several filesystems for mailservers (to store the mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates by about a factor of 2. Hi Claus! Nice to hear from someone who actually tests things! I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times, both for non-existent entries on creates, and for any entry on lookup (/2 on average for a successful lookup). I doubt it. The number of files in the queue directories was fairly small during the runs. Moreover, ReiserFS showed fairly poor performance, even though it should be good for directory lookups, right? The best answer for inbound mail is to go to per domain mail queues, and the best for outbound is to go to hashed outbound domains (as we discussed at the 2000 Sendmail MOTM gathering). Per domain mail queues inbound give you a 100% hit rate on a directory traversal for a queue flush; using hashed outbound directories isn't a 100% hit rate, but you can keep it above 85% with the right hashing structure, which makes the miss rate have only a 1-2% impact on processing. Per domain doesn't work easily if you have multiple recipients. Anyway, the new design clearly distinguishes between the content files and the data that is necessary for delivery. If someone is interested: http://www.sendmail.org/~ca/email/sm-9-rfh.html Just as a small data point: I get message acceptance rates of 400msgs/s on a journalling file system (using a normal PC) that writes the data into the journal too. AFAICT that's due to the fact that fsync() is much faster for this kind of storage. The important part for mailservers here is the rate at which content files can be safely written to disk. From my limited experience, journalling file systems are much better here than softupdates.
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002, Claus Assmann wrote: On Thu, Sep 26, 2002, Terry Lambert wrote: Claus Assmann wrote: When we tested several filesystems for mailservers (to store the mail queue), JFS and ext3 (in journal mode) beat UFS with softupdates by about a factor of 2. Hi Claus! Nice to hear from someone who actually tests things! I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times for both non-existent entries on creates, and for any entry on lookup ( / 2 on lookup). I doubt it. The number of files in the queue directories was fairly small during the runs. Moreover, ReiserFS showed fairly poor performance, even though it should be good for directory lookups, right? The best answer for inbound mail is to go to per-domain mail queues, and the best for outbound is to go to hashed outbound domains (as we discussed at the 2000 Sendmail MOTM gathering). Per-domain mail queues inbound give you a 100% hit rate on a directory traversal for a queue flush; using hashed outbound directories isn't a 100% hit rate, but you can keep it above 85% with the right hashing structure, which makes the miss rate have only a 1-2% impact on processing. Per-domain doesn't work easily if you have multiple recipients. Anyway, the new design clearly distinguishes between the content files and the data that is necessary for delivery. If someone is interested: http://www.sendmail.org/~ca/email/sm-9-rfh.html Just as a small data point: I get message acceptance rates of 400 msgs/s on a journalling file system (using a normal PC) that writes the data into the journal too. AFAICT that's due to the fact that fsync() is much faster for this kind of storage. The important part for mailservers here is the rate at which content files can be safely written to disk. From my limited experience, journalling file systems are much better here than softupdates. Can you tell me the approximate sizes of these mails and how they are stored? 
-Zhihui To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
I've been having loads of problems with the bg-fsck. After recovering from a crash/power failure my machine will boot and start the check. If there's moderate activity during the time it's checking, it will panic and reboot, getting stuck in a loop most of the time. I've not seen anyone mention this on the list, but I was wondering if anyone's experienced this? This has been ongoing across many cvsups and buildworlds. Thanks, Scott Alexander Leidinger [EMAIL PROTECTED] 09/26/02 12:23PM On Thu, 26 Sep 2002 10:52:18 -0500 Dan Nelson [EMAIL PROTECTED] wrote: We have something better than those. SoftUpdates. Much faster than jfs in metadata-intensive operations. If you can stand the 20 minutes of severely degraded performance while the background fsck runs after a crash, and the loss of any files Sometimes it's better to have 20 minutes (or however long the bg-fsck takes on your FS) of degraded performance than no performance at all (you can have that too; just configure the system to do an fg-fsck instead of a bg-fsck)... (how long does it take to check the journal and to take the appropriate actions based on the journal?) To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
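(As an aside: on FreeBSD the bg-fsck vs. fg-fsck choice Alexander mentions is an rc.conf knob; the variable below is the one found in the stock defaults/rc.conf of 5.x-era systems, so check your own release:)

```
# /etc/rc.conf -- disable the background check and force a full
# foreground fsck at boot (trades boot time for certainty).
background_fsck="NO"
```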
Re: Journaled filesystem in CURRENT
David Malone wrote: On Thu, Sep 26, 2002 at 10:36:27AM -0700, Terry Lambert wrote: I think that what you were probably testing was directory entry layout and O(N) (linear) vs. O(log2(N)+1) search times for both non-existent entries on creates, and for any entry on lookup ( / 2 on lookup). Though dirhash should eliminate most of this... Everybody always says that, and then backs off when they realize that a traversal of a mail queue of 100,000 entries, in which the destination is known by the contents of the file, rather than the file name, is involved. 8-). If you are searching based on the contents of a file, then any directory layout scheme will require a mean of N/2 probes on success and N on failure, surely? And if these probes are linear (i.e. in the order they are in the directory) then this really is O(N) both with and without dirhash, 'cos the probes will be O(1). ~O((N^4)/8)/2, actually. You linearly traverse for the queue element files, and then the queue element files tell you the name of the queue content file, which you have to look up. So it's a combined traversal and lookup on the same directory (in fact: a dirhash buster, with some of the least optimal behaviour possible). There are two additional lookups, which occur to unlink the queue entry file and the message file, so it's really (for n queue entries, which means twice that many directory entries) the sum from N = 0 to N = n*2 of O(N) * O(N/2) * O(N/2) * O((N-1)/2) (assuming successful delivery and removal of the queue files on each element iterated). The way this is fixed in ext3 or most JFS implementations (both XFS and IBM's OS/2 JFS for Linux) is that the linear traversal is linear... meaning you don't restart the scan each time... and the explicit file lookup is O(log2(N+1)). 
N * log2(N+1)^3 is significantly smaller than (N^4)/8 (in case you were wondering about the /8, it's because statistically, you only have to traverse 50% of the directory entries, on average, for a linear lookup that results in a hit, but that only applies to explicit lookups, not the traversal); the result is (again for n queue entries) the sum from N = 0 to N = n*2 of O(N) * O(log2(N+1)) * O(log2(N+1)) * O(log2(N)). There are other data structures that could reduce this further than btree, FWIW, but implementing them in directories is moderately hard because of metadata ordering guarantees and directory entry locking. Still, it's probably worth doing, if you can figure out a way to eliminate the need for directory vnode locking for modification operations (or can make them over into range-lock operations, instead). One obvious fix is to time-order file creations, to try and keep block locality close to time locality (i.e., if you are going to create 2 files (f1,f2) in some time interval [t1..t2], then you try to guarantee that the directory entry block that contains f2 is after the one that contains f1, so that a linear progressive search from the current linear traversal location that resulted in f1 being found is likely to find f2, either immediately, or at least within the next block or two). The problem with doing this is the inability to ensure that the file you are creating does not exist... without a full traversal. This requires that there is some cooperation involved, so that the lookup traversal is picked up following the current offset of the linear traversal in progress. It also fails if simultaneous traversals occur... assuming that the offset is not maintained per-process, but instead per directory. If it's per-process, it actually works out, but only because each process in the sendmail case only has one queue run going on at a time. 
Note that none of this accounts for the queue entry creation of the two additional files; that's O(4*N+1), since create requires that the file not already exist (and is not helped by dirhash at all, being linear, by definition). For a btree, that's only O(2*log2(N+1)+2) for the two insertions. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
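Terry's asymptotic argument can be sanity-checked with a rough back-of-the-envelope sketch. This is a deliberate simplification, not his exact summation: constants are dropped, and each queue element is charged one content-file lookup plus two unlink lookups on top of the traversal.

```python
import math

def linear_lookup_cost(n_entries):
    # UFS-style linear directory: a successful lookup scans half
    # the entries on average.
    return n_entries / 2

def btree_lookup_cost(n_entries):
    # B-tree style indexed directory (XFS, JFS, ext3 htree):
    # roughly log2(N+1) probes per lookup.
    return math.log2(n_entries + 1)

def queue_run_cost(n_queue, lookup_cost):
    # Two directory entries (queue file + content file) per element.
    n_dir = 2 * n_queue
    # Per element: look up the content file, then unlink both files.
    per_element = 3 * lookup_cost(n_dir)
    return n_queue * per_element

print(queue_run_cost(100_000, linear_lookup_cost))  # ~3e10 probes
print(queue_run_cost(100_000, btree_lookup_cost))   # ~5e6 probes
```

Even with the rough constants, a 100,000-entry queue run costs around four orders of magnitude fewer probes with an indexed directory, which matches the thrust of the argument above.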
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 10:36:27 -0700 Terry Lambert [EMAIL PROTECTED] wrote: That said, journalling and Soft Updates are totally orthogonal technologies, just as btree and linear directory structures are two orthogonal things. Journalling has advantages that a non-journalling FS with soft updates does not -- can not -- have, particularly since it is not possible to distinguish a power failure from a hardware failure from (some) software failures, and those cases need to Power failure: No problem for both. Hardware failure (I assume you are thinking of a HDD failure): Read failure: doesn't matter here. Write failure: either the sector gets remapped (no problem for both), or the disk is in self-destruct mode (both can't cope with this). Software failure: Are you talking about bugs in the FS code? Or about a nasty person who writes some bad data into the FS structures? be treated differently for the purposes of recovery. The soft Sorry, I don't get it. Can you please be more verbose? updates background recovery can not do this; the foreground recovery can, but only if it's not the abbreviated version. A What are you talking about? Did you manage to get an unexpected softupdates inconsistency after the last bugfix? I don't see a difference in the power or hardware failure cases for a journaled fs and SU. The only reason for a fg-fsck instead of a bg-fsck (in the "there's no bug in the bg-fsck code path" case) is if someone damages the fs-structures on disk (I assume there are no bugs in SU anymore which result in an unexpected SU inconsistency). Note: I don't think the actual code path for bg-fsck is bug-free at the moment (read: I don't trust it at the moment). JFS that journals both data and metadata can recover from all three, to a consistent state, and one that journals only metadata can recover from two of them. SU writes the data directly to free sectors in the target filesystem. I don't see where journaled data is an improvement in fs-consistency here. Bye, Alexander. 
-- It's not a bug, it's tradition! http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
On Thu, 26 Sep 2002 14:54:00 -0400 Scott Dodson [EMAIL PROTECTED] wrote: I've been having loads of problems with the bg-fsck. After recovering from a crash/power failure my machine will boot and start the check. If there's moderate activity during the time its checking it will panic and reboot, getting stuck in a loop most of the time. I've not seen anyone mention this on the list, but I was wondering if anyone's experienced this? This has been ongoing across many cvsups and buildworlds. Yes, bg-fsck isn't really usable at the moment. Bye, Alexander. -- Press every key to continue. http://www.Leidinger.net Alexander @ Leidinger.net GPG fingerprint = C518 BC70 E67F 143F BE91 3365 79E2 9C60 B006 3FE7 To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
Claus Assmann wrote: [ ... out of order answer, not related to main topic ... ] Per domain doesn't work easily if you have multiple recipients. Anyway, the new design clearly distinguishes between the content files and the data that is necessary for delivery. Actually, it works fine, since it performs queue entry splitting in the case of multiple recipients. That yields a 100% hit rate for per-domain queue traversals, since they contain only messages destined for the domain in question. But back to JFS... [ ... ] I doubt it. The number of files in the queue directories was fairly small during the runs. Moreover, ReiserFS showed fairly poor performance, even though it should be good for directory lookups, right? [ ... ] Just as a small data point: I get message acceptance rates of 400 msgs/s on a journalling file system (using a normal PC) that writes the data into the journal too. AFAICT that's due to the fact that fsync() is much faster for this kind of storage. The important part for mailservers here is the rate at which content files can be safely written to disk. From my limited experience, journalling file systems are much better here than softupdates. I didn't realize you were running in safe mode; I should have realized that, since it was supposed to be the only possibility in a future revision, the last time I looked at the particular code in question. I guess I had a stale cache. 8-) 8-). Note that fsync() is a data operation, not a metadata operation, in this case, and what we are talking about is queue contents being committed to stable storage (prior to the 250 Accepted response, presumably). Yes, soft updates does nothing for user data; it is a metadata technology. Journalling is implementation dependent; not all JFS implementations will journal data which is not metadata, so your results would depend on the JFS. 
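The queue-entry splitting Terry describes can be sketched in a few lines. This is an illustration of the idea with hypothetical names, not sendmail's actual code:

```python
from collections import defaultdict

def split_envelope(recipients):
    # One queue entry per destination domain, so a per-domain queue
    # traversal only ever sees mail deliverable to that domain --
    # the 100% hit rate described above.
    per_domain = defaultdict(list)
    for rcpt in recipients:
        domain = rcpt.rsplit("@", 1)[1].lower()
        per_domain[domain].append(rcpt)
    return dict(per_domain)
```

A message addressed to a@example.com and b@example.org would become two queue entries, one in each domain's queue directory, rather than one entry that every traversal has to inspect.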
Yes, if your data is journalled, too, then what it means is that an fsync() is, effectively, a noop, since the commit to the stable journal entry is (supposedly) guaranteed before the write call returns. That's a *big* supposedly, though. Note that this is potentially not a real commit, though, and you would be better off testing with power disconnects on very large queues. The reason for this is that you need to verify that the drives are not, in fact, lying to you, by enabling write caching, and then returning that the data has been committed, when in fact it has not. The difference you are seeing might be attributable to the drive setting for write caching in the various OSs (e.g. one with it disabled, the other with it enabled). Journalling does not always mean data integrity (it was only ever intended to mean transactional data integrity, in any case, meaning you can and sometimes do lose transactions in the event of a failure). If you want to compare apples to apples, you should verify that the data is in fact journalled, that the fsync() actually does what it's supposed to do if the data is not, and that the code path all the way to the disk supports real commits to stable storage (the #1 thing here is: turn off drive write caching in all cases). Large queue testing would show the effects that I've discussed in other emails. I don't think large throughput with short queue depths is representative of mail servers (unless you are an open relay, of course ;^)). I understand the desire for this, though, if you are comparing a 2-file queue to a 1-file queue, given the other effects on deeper queues. 8-(. -- Terry To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
Terry Lambert wrote: Yes, soft updates does nothing for user data; it is a metadata technology. Journalling is implementation dependent; not all JFS implementations will journal data which is not metadata, so your results would depend on the JFS. I think you are not correct here. If I understand Kirk's paper right, Soft Updates do a sorting/nesting of data and metadata within the buffer cache. My knowledge is that most of the journaling implementations do metadata journaling and do not guarantee data consistency (ext3 with data=journal is the only exception I know of), whereas SU *does* guarantee data consistency (admittedly with a time lag) because of that nesting of data with metadata. I'm far away from being able to follow this discussion in every detail, but please correct me if I'm wrong... -- Ciao/BSD - Matthias Matthias Schuendehuette msch [at] snafu.de, Berlin (Germany) Powered by FreeBSD 4.7-RC To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002 at 09:13:41PM +0200, Alexander Leidinger wrote: Yes, bg-fsck isn't really usable at the moment. It has worked fine for me for quite a while. The last buildworld on my server was Sept 15th. To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
On Thu, Sep 26, 2002, Zhihui Zhang wrote: On Thu, 26 Sep 2002, Claus Assmann wrote: If someone is interested: http://www.sendmail.org/~ca/email/sm-9-rfh.html Just as a small data point: I get message acceptance rates of 400 msgs/s on a journalling file system (using a normal PC) that writes the data into the journal too. AFAICT that's due to the fact that fsync() is much faster for this kind of storage. The important part for mailservers here is the rate at which content files can be safely written to disk. From my limited experience, journalling file systems are much better here than softupdates. Can you tell me the approximate sizes of these mails and how they are stored? The tests for sendmail 9 were made with small sizes (1-4KB). They were stored in flat files using 16 directories. The performance tests for sendmail 8 were done with sizes from 1 to 40 KB, in a single queue directory (AFAIR). To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
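The "16 directories" layout Claus mentions is a common way to keep each spool directory small; a minimal sketch of the idea follows (hypothetical path scheme, not sendmail's actual naming):

```python
import hashlib
import os

NQUEUE_DIRS = 16  # as in the sendmail 9 tests described above

def queue_path(spool_root, msg_id):
    # Hash the message id into one of 16 subdirectories so that no
    # single directory grows large enough for even linear lookups
    # to become expensive.
    h = int(hashlib.md5(msg_id.encode()).hexdigest(), 16) % NQUEUE_DIRS
    return os.path.join(spool_root, "q%02d" % h, "df" + msg_id)
```

With n messages spread uniformly over 16 directories, each directory holds only n/16 entries, which directly reduces the per-lookup cost discussed earlier in the thread.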
Re: Journaled filesystem in CURRENT
On Wed, Sep 25, 2002 at 04:19:34PM +0300, Anton Yudin wrote: Does CURRENT support journaled filesystem ? There are no journaling file systems in CURRENT at this time. Efforts to port both xfs and jfs are underway. -- Brooks -- Any statement of the form X is the one, true Y is FALSE. PGP fingerprint 655D 519C 26A7 82E7 2529 9BF0 5D8E 8BE9 F238 1AD4 msg43385/pgp0.pgp Description: PGP signature
Re: Journaled filesystem in CURRENT
If I may add a comment here... You have already had a kind of journaled filesystem for some time now. Please read Soft Updates vs. Journalling Filesystems from M.K. McKusick (www.mckusick.com). I'm really sad to see the efforts made especially for porting JFS to FreeBSD, which already has more than poor performance under Linux. The only reason for porting JFS is IMHO to be able to mount JFS volumes under FreeBSD - if that's worth the effort... Why beg for 'Journaling' if you have 'Journaling next generation'? -- Ciao/BSD - Matthias Matthias Schuendehuette msch [at] snafu.de, Berlin (Germany) Powered by FreeBSD 4.7-RC To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
Re: Journaled filesystem in CURRENT
* De: Matthias Schuendehuette [EMAIL PROTECTED] [ Data: 2002-09-25 ] [ Subjecte: Re: Journaled filesystem in CURRENT ] If I may add a comment here... You already *have* a kind of journaled filesystem for some time now. Please read Soft Updates vs. Journalling Filesystems from M.K. McKusick (www.mckusick.com). I'm really sad if see the efforts done especially for porting JFS to FreeBSD, which has already under Linux a more than poor performance. The only reason for porting JFS is IMHO to be able to mount JFS Volumes under FreeBSD - if that's worth the effort... Why begging for 'Journaling' if you have 'Journaling next generation'? People concentrating on interoperability uses of filesystems are out of their minds to be writing them in-kernel, when they could be running them from userland as userland nfs servers, accessing the raw disks. All we need is to make this a default part of the system, add a libuserfs to provide an abstraction layer, and tada. Let me know when there's a JFS4NFS and I'll give a damn, cause then I can use it everywhere I'd need to. FWIW, background fsck and softdep and ufs2 will give you all of the good stuff you *see* from using a journaled fs, without the corruption of a whole disk, like my girlfriend's laptop went through, running with Linnex XFS. juli. -- Juli Mallett [EMAIL PROTECTED] | FreeBSD: The Power To Serve Will break world for fulltime employment. | finger [EMAIL PROTECTED] http://people.FreeBSD.org/~jmallett/ | Support my FreeBSD hacking! To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message