[loewis@informatik.hu-berlin.de: Bug#131512: Need UTF-8 archives]
I received a UTF-8 feature request/patch [1] for MHonArc from a Debian GNU/Linux user. Any comments? Is this something that MHonArc might consider incorporating directly? Cheers, Jeff 1. http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=131512&repeatmerged=yes
big5 l10n
Any comments on MHonArc compatibility with the big5 traditional Chinese character set? I recently received a big5 localization (http://mail-archive.com/rcfile.tw) but am not having particularly good luck with it so far, e.g. http://www.mail-archive.com/control%40anc.dyndns.org/msg00129.html

Any suggestions or known incompatibilities? Do I need to somehow escape the big5 data in the rcfile? I get lots of warnings suggesting that something is not going well with the parsing, like:

Warning: Unrecognized variable: "H: "
Warning: Unrecognized variable: " <"

Any comments appreciated, especially from those successfully using MHonArc with big5 or a similar character set. Cheers, Jeff
request, Debian sid now running mhonarc 2.5.1
Hi all, FYI, two Debian/MHonArc notes: 1) I've got an interesting feature request regarding and default configurations and want to pass it on; see bug #115991 (http://bugs.debian.org) for commentary. 2) MHonArc 2.5.1 is now part of Debian/Sid (unstable) and should make its way into Debian/Woody (testing) in ~14 days. -Jeff

--- Start of forwarded message ---
Envelope-to: [EMAIL PROTECTED]
From: Debian Installer <[EMAIL PROTECTED]>
To: Jeff Breidenbach <[EMAIL PROTECTED]>
X-Katie: $Revision: 1.59 $
Subject: mhonarc_2.5.1-1_i386.changes INSTALLED
Sender: James Troup <[EMAIL PROTECTED]>
Date: Sun, 18 Nov 2001 15:05:17 -0500

Installing:
mhonarc_2.5.1-1.dsc
  to pool/main/m/mhonarc/mhonarc_2.5.1-1.dsc
mhonarc_2.5.1-1_all.deb
  to pool/main/m/mhonarc/mhonarc_2.5.1-1_all.deb
mhonarc_2.5.1-1.diff.gz
  to pool/main/m/mhonarc/mhonarc_2.5.1-1.diff.gz
mhonarc_2.5.1.orig.tar.gz
  to pool/main/m/mhonarc/mhonarc_2.5.1.orig.tar.gz

Announcing to [EMAIL PROTECTED]
Closing bugs:

Thank you for your contribution to Debian.
--- End of forwarded message ---
upgrade/downgrade adventure
Ok, I think mha-dbrecover is the final missing piece, and I am now getting back in business. It seems to completely fix smaller archives, and is currently chugging along automatically (woohoo!) on a 50k-message archive. -Jeff
patch applied to Debian/MHonARC
The OTHERINDEXES fix has been uploaded to Debian autobuilders and should hit Debian unstable tomorrow, at least on i386. Cheers, Jeff
Re: mhonarc 2.5.0 infinite loop on index creation?
>Jeff, do you use OTHERINDEXES? If so, this could be a culprit in
>your performance problems.

Yes, I use OTHERINDEXES for a very short rdf/rss index. Please note that I've not had any symptoms comparable to what Cygnus is reporting -- although at this point I'm in no position to confirm anything for sure. -Jeff
corruption problem
Mhonarc gurus, I seem to be experiencing a systematic corruption problem. For example, in the half-dozen most recent entries of this date index [1], the message pages are nonexistent and listed under the same URL. The version of mhonarc running on this archive has been 2.early -> 2.4.9 -> 2.5.0 -> 2.4.9 -> 2.5.0. MHonArc 2.5.0 is not generating any warnings and is returning a good return code. What steps are suggested for diagnosis, and are there any suggestions for a fix? Rebuilding the archive from scratch is possible but not desirable, due to the large number of archives affected. Currently my top priority is stabilizing the system. Thanks in advance for any suggestions. Cheers, Jeff [1] http://www.mail-archive.com/gossip%40jab.org/maillist.html
Re: [Gossip] Re: Mhonarc problems at mail-archive.com
>> I downgraded back to mhonarc 2.4.9 to see if it would help with
>> performance problems.
>
>Was there a difference?

I think so, although there were a lot of other things making the determination unclear. For starters: a nearly full filesystem, a runaway process consuming one of the CPUs, corrupted index pages, etc. One thing I notice is that 2.4.9 seems to have a bounded time (~10 seconds) for putting one new message in a 1000 message archive. 2.5.0 seems to be less bounded. I also notice very long thread slices on some message pages. For example, see the bottom of: http://www.mail-archive.com/mhonarc%40ncsa.uiuc.edu/msg02482.html This makes me suspect I am taking a performance hit from 2.5.0, at least the way I have it configured. I've just gotten to the point where I've repaired the index pages and am ready to start running 2.5.0 again; I'll keep you posted. -Jeff

The rcfile is accessible off http://mail-archive.com/faq.html (src/rcfile.int)
Re: [Gossip] Re: Mhonarc problems at mail-archive.com
>Version v2.5 avoids this problem since HEADER and FOOTER resources
>are no longer supported.

I downgraded back to mhonarc 2.4.9 to see if it would help with performance problems. In fact, the time sequence went like this:

1) 2.4.9 + 2.5.0 config
2) 2.4.9 + 2.5.0 config
3) 2.5.0 + 2.5.0 config
4) 2.4.9 + 2.5.0 config

So I'm not shocked if there are a few hiccups... -Jeff
new maintainer, Debian package of MHonArc
This is a quick heads up that I'm now the maintainer for Debian's MHonArc package. Cheers, Jeff
ANNOUNCE: MHonArc v2.4.8
This point release is extremely helpful and addresses several real world problems. Earl, you kick butt. Jeff
image Content-Type statistics
Here are some rough statistics as promised -- essentially I am seeing that MIME type correctness for images varies a lot across different mailing lists. For example:

One list archive with about 6000 image attachments has:

    51 application/octet-stream
  1134 image/pjpeg
  4866 image/jpeg
    18 image/jpg

Another list archive with about 100 image attachments has:

    75 application/octet-stream
     0 image/pjpeg
    15 image/jpeg
     0 image/jpg

As you can see, some lists are better than others in terms of proper Content-Type labeling. Interestingly, I didn't see much weirdness beyond application/octet-stream -- maybe I just didn't look hard enough. Jeff
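To make the comparison between the two archives concrete, here is a small sketch that turns the counts above into percentages. The counts are copied from the message; the dictionary layout and the helper name are assumptions made for illustration.

```python
# Counts copied from the message above; the data layout is an assumption.
archive_a = {
    "application/octet-stream": 51,
    "image/pjpeg": 1134,
    "image/jpeg": 4866,
    "image/jpg": 18,
}
archive_b = {
    "application/octet-stream": 75,
    "image/pjpeg": 0,
    "image/jpeg": 15,
    "image/jpg": 0,
}


def share(counts, label):
    """Return the percentage of attachments carrying the given Content-Type."""
    total = sum(counts.values())
    return 100.0 * counts[label] / total


for name, counts in (("first archive", archive_a), ("second archive", archive_b)):
    pct = share(counts, "application/octet-stream")
    print("%s: %.1f%% labeled application/octet-stream" % (name, pct))
```

In the first archive under one percent of images carry the generic label, while in the second archive the large majority do.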
Re: mhexternal.pl switch, message deletion switch
>You could easily do the following to exclude images:

Ok, I just set the image/jpeg filter to mh2_null. I'll report in a few days what percent of .jpeg and .jpg files this actually kills off.

>BTW, what do you do about the index pages? I.e. The message will
>still be "removed" wrt index page generation, so the only way to get
>to the files would be from search results.

That's exactly the situation, and so far, it is working very well. Cheers, Jeff
mhexternal.pl switch, message deletion switch
Earl, 1) Are you still thinking about modifying the mhexternal.pl to support exclusion of files based on filename regexp and/or content type? I'm about ready to declare a vendetta on image/jpeg. 2) Any chance of adding a resource to disable message deletion? (i.e. I can have a constant size .mhonarc.db through MAXSIZE but unlimited message files). I'm currently using patched code to achieve this. Jeff
Re: Idea for the future
> I have thought of this a long time ago. It is a change on some of
> the functional goals of MHonArc: no dynamic system is required to view
> archives. I.e. One can read MHonArc archives w/o the need of a server.

Let me count the ways I like static HTML files.

* simplicity
* simplicity
* their benefit from internet caching infrastructure
* computational cheapness of serving them (disk is cheaper than CPU)
* manipulability with many, many tools
* benefit from OS level improvements

Some people, like me, despise databases and like static files a lot. Some people are the other way around. Anyway, us static-file lovers are a valid user set. Jeff
Re: Invisible threads in lyx-devel@lists.lyx.org
Hi Rae, I think you are seeing an issue with MHonArc, where the URL of a thread index page with a particular message will change out from under you if it is set to show newest threads at the top of the page. In essence, I think I can't help you except to say "link to messages, not the index page." I don't think there's an easy solution from the MHonArc side, but am CCing them just in case. Cheers, Jeff

--
Envelope-to: [EMAIL PROTECTED]
Sender: [EMAIL PROTECTED]
Date: Mon, 24 Jul 2000 20:29:41 +1000
From: Allan Rae <[EMAIL PROTECTED]>
X-Accept-Language: en
To: [EMAIL PROTECTED]
Subject: Invisible threads in [EMAIL PROTECTED]
Content-Type: text/plain; charset=us-ascii

Hi, I'm a part of the LyX Team and am writing the LyX Development News. As such I like to provide references to threads as well as individual emails in the archive. I've noticed that some threads just aren't appearing in the threaded list at all. However the individual emails can be searched for and found. For example msg12127.html is the start of a thread; however, if I try to link to it as: http://www.mail-archive.com/lyx-devel@lists.lyx.org/#12127 the page that is shown doesn't have this email or this thread visible anywhere on it. If I follow the "[earlier emails]" link a couple of times I eventually reach a point where the email should be visible (since these threads are ordered by message number of opening message), but it's not there. There are a couple of other threads doing the same thing. Regards, Allan. (ARRae)
Re: why no META tag for charset?
>A potential solution would be to put the different message parts into
>different files in the archive, and use the remainder of the message as
>a container for URLs to those files, mimicking the MIME message
>structure in HTML. I haven't looked to see how/if one can do that in
>MHonarc, but this seems like a problem similar to archiving a multi-part
>with multiple graphic inclusions.

Sounds like that would work. I guess the competition is the "do-nothing-and-let-break" approach, and the "convert-to-unicode" approach. Each has distinct advantages.

>Rely on a graphical interface instead of text?

Grumble, grumble, grumble. I'd rather help make programs and protocols smart enough to deal with these sorts of (tractable) issues.

>BTW, while I applaud the desire to display localized headers I hope that
>any reply/follow-up interface is sending the canonical RFC 822 & later
>headers and keywords "on the wire", and not helping create messages like:
>
>Subject: Re: Sv: Re: Ab: Re:

Not a problem for me. Mail-Archive.com supplies a mailto: URL with an embedded, unadulterated subject to the user's existing MUA.
Re: why no META tag for charset?
>This is the problem, HTML does not support mixed character sets.
>Also, the charset affects the entire HTML document. Therefore, your
>resource settings would have to conform with the charset, and this
>can be a big problem if messages existing in the archive have different
>specified charsets. It would be hard to guarantee that all messages
>will use the same charset.

I think I understand ... is this right? If a single email contains two different character sets, you're screwed; I understand that. If two emails are received, each with a different character set:

1) you are screwed on index pages, which will have a bunch of subject lines from different character sets
2) you are screwed on message pages, because navigational aids like the word "follow-ups" will be in a different character set from the messages.

Ok, so I see how unicode would magically fix everything. But imagine that wasn't available, and I get a message in an unknown character set. The result is a message page without a meta tag (which will default to either iso-8859-1 or some browser heuristic). Assuming iso-8859-1, we get good navigational aids and an unreadable message. Had we used a meta tag, the message would be readable and we'd lose the navigational aids. Yuck, yuck, yuck, it's a choice between two evils. Given just those options, I think a message page meta tag (generated from the corresponding email's character set) would be better, though.

Converting to unicode won't be graceful either. If one converts everything unknown to unicode, I bet in practice a lot of iso-8859-1 messages will go to unicode and be unrenderable by legacy browsers. I guess legacy browsers will have to be replaced. Jeff
Re: MHonarc archives with internationalisation
I wrote a large (English) resource file that overrode almost everything. Then I wrote a short sed script to create a derived resource file, replacing all the English words with the localized language. As far as I can tell, this approach is the most reasonable way to support localization, if you need to support multiple languages. My files wouldn't be good for a "cookbook example" though, because I also do a lot of custom formatting in the resource file.

>I am also interested in a localization, for the german language. Is there any
>"cookbook example" which describes only those rcfile settings which are
>necessary for this job? There are examples for Mhonarc in non-english languages
>(e.g. dutch), but I would like to see just the "minimum requirements" for a
>localization.
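The derive-by-substitution idea can be sketched roughly as follows. This is a Python stand-in for the sed script, which is not shown in the thread; the German word table and the sample template text are hypothetical and only illustrate the mechanism.

```python
import re

# Hypothetical English -> German table; the real word list is not in the thread.
GERMAN = {"Subject": "Betreff", "From": "Von", "Date": "Datum"}


def localize(resource_text, table):
    """Replace whole English words with their localized equivalents.

    Matching is case-sensitive, so upper-case resource variables such as
    $FROMNAME$ are left alone.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda match: table[match.group(1)], resource_text)


# Illustrative fragment of an English template, not an actual rcfile.
sample = "Subject: $SUBJECTNA$\nFrom: $FROMNAME$\nDate: $DATE$\n"
print(localize(sample, GERMAN))
```

Keeping one master file and generating the others means a formatting change only has to be made once, which seems to be the main attraction of the approach.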
why no META tag for charset?
Recently, I've been on an internationalization/localization kick. I just read the relevant portion of the HTML specification and found it refreshingly clear. Let's assume I want to process an email with some weird character set, like ISO646-SE. It appears the right thing for MHonArc to do is produce an HTML document that includes:

<META http-equiv="Content-Type" content="text/html; charset=ISO646-SE">

But as far as I can tell, MHonArc won't produce that meta tag. Thus the character set information is lost, which can result in a difficult to render web page. I suspect there is a reason for this, but I'm not sure what it is. (I know there will be an issue if email contains multiple character sets, since this is not supported in HTML documents.) Jeff

-
HTML4 specification, character sets: http://www.w3.org/TR/html4/charset.html
IANA Registry of character sets: ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
Character Set Converters Resource (MHonArc): http://www.mhonarc.org/MHonArc/doc/resources/charsetconverters.html
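As a rough illustration of what is being asked for, the sketch below reads the charset parameter from a message's Content-Type header and emits a matching meta tag. It uses Python's standard email package rather than anything in MHonArc; the function name and the us-ascii fallback are assumptions.

```python
from email import message_from_string
from html import escape


def charset_meta_tag(raw_message, default="us-ascii"):
    """Build an HTML meta tag from the message's declared charset.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    message = message_from_string(raw_message)
    charset = message.get_content_charset(failobj=default)
    return '<META http-equiv="Content-Type" content="text/html; charset=%s">' % (
        escape(charset, quote=True)
    )


raw = "Content-Type: text/plain; charset=ISO646-SE\nSubject: test\n\nbody\n"
print(charset_meta_tag(raw))
```

Note that get_content_charset() lower-cases the charset name, and that this does nothing for the harder case of a single message mixing several character sets.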
MHonarc archives with internationalisation
Regarding customization, you will have to create an rcfile to do the job. I chose to override and put the To/From/Subject customizations in there. Coincidentally, I'm also working on a French localization right now. My translator suggested Subject --> Sujet, but you mentioned Subject --> Objet. Which is better? Cheers, Jeff
Re: Namazu & MHonArc
For the record, htdig can index through the local filesystem (bypassing the HTTP protocol.) You are correct about htdig not supporting multi-byte characters.
report from the trenches, perl compiler
Hi all, My archive of linux-kernel hit 112,000 messages. It was way slow. It was causing out of memory errors on a 256MB machine. Rapidly advancing computer hardware wasn't enough. I finally broke down and switched it to monthly indexing. Works like a charm, of course. Anyway, just wanted to report from the trenches that things are well (or at least the line is holding, albeit with effort). It will be interesting to see how long I can stay in the scalability game. Jeff
is windowing still kicking around?
A long time ago, there was a discussion of making MHonArc work with windowing - for example, having new messages only thread against the most recent thousand messages in an archive. The idea was that things might go quicker for small additions to large archives. Is this idea still kicking around? The reason I ask is that I am thinking about MHonArc performance again, and I see the possible improvements as:

a) switch to a better filesystem like reiserfs
b) faster hardware, both storage and processor(s)
c) split up big archives

If windowing is a possibility in the long term, I'll let a and b keep me busy. Jeff
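For readers unfamiliar with the idea, windowed threading might look something like the sketch below: a reply is attached to its parent's thread only if the parent is still among the most recent N messages. The class and method names are hypothetical, and this is not how MHonArc is implemented.

```python
from collections import OrderedDict


class ThreadWindow:
    """Thread new messages against only the most recent `size` messages."""

    def __init__(self, size=1000):
        self.size = size
        self.recent = OrderedDict()  # message-id -> id of the thread root

    def add(self, msg_id, in_reply_to=None):
        # A parent outside the window is treated as unknown, so the
        # message starts a new thread.
        root = self.recent.get(in_reply_to, msg_id)
        self.recent[msg_id] = root
        if len(self.recent) > self.size:
            self.recent.popitem(last=False)  # forget the oldest message
        return root
```

The work per added message then depends on the window size rather than on the size of the whole archive, at the cost of occasionally splitting a long-running thread.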
Re: how do I make sure that a message will be
>A lot of mail clients don't even know how to parse the
>"&subject=" argument; to my knowledge there are none which attempt to
>add arbitrary headers.

Lynx 2.8.2rel.1 can parse mailto: URLs in accordance with RFC 2368. I'm using lynx right now, from the MHonArc produced mailto: at http://mail-archive.com/mhonarc%40ncsa.uiuc.edu/msg01506.html and I think it should pick up the embedded In-Reply-To: field fine. Jeff
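For reference, a mailto: URL carrying extra header fields in the RFC 2368 style can be assembled like this. The helper name and the example addresses are made up; only the URL shape (address, then "?header=value" pairs joined by "&", with values percent-encoded) comes from the RFC.

```python
from urllib.parse import quote


def mailto_url(address, subject, in_reply_to):
    """Build a mailto: URL with Subject and In-Reply-To header fields."""
    return "mailto:%s?subject=%s&In-Reply-To=%s" % (
        quote(address, safe="@"),
        quote(subject, safe=""),
        quote(in_reply_to, safe=""),
    )


# Hypothetical list address and message-id, for illustration only.
print(mailto_url("list@example.org", "Re: hello", "<1@example.org>"))
```

Whether a given mail client honors the In-Reply-To field is up to the client, which is the point under discussion in the thread.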
Re: rcfile confusion?
>BTW, you can use $MSG$ to get the filename for a given message if >for some reason $A_HREF$ does not work for you. Ah, perfect. The problem with $A_HREF$ was not the relative URL, but rather the inclusion of the word "HREF=" in the output. Jeff
Re: rcfile confusion?
Hmmm... XML question #2. Is it possible to produce something like the following: http://host/msg5.html I see $A_ATTR$ and $A_HREF$ available for the LITEMPLATE resource, but nothing that appears to give the unadulterated message URL. My guess is that it is not an available resource variable, right? As far as I can tell, this is the only thing holding MHonArc back from being able to produce RDF files. Jeff
Re: rcfile confusion?
>It looks like the bug fix is much easier than I was expecting. >I have attached patch to mhindex.pl >(SCCS ID: mhindex.pl 1.4 99/06/25 14:21:22), that hopefully fixes >the problem. Brilliant - I applied the patch and it works great. Thanks, Earl! Jeff
Re: higher memory requirements in 2.4.0 ? [isolated]
>Note, my system configuration is different from yours. I am running
>Perl 5.005_03 on RH 5.2 w/2.2.9 kernel. There is a possibility that
>Perl 5.004 has some memory leaks exposed by MHonArc v2.4.0.

The perl version that is recommended in RedHat 5.2's errata notes is:

# rpm -q perl MHonArc
perl-5.004m7-1

Running under this perl, all memory is consumed every time I try asking MHonArc 2.4.0 to add to the big database. So, I tried upgrading perl. Scrounging the net for a slightly newer version of perl, I found this one in ftp://contrib.redhat.com:

# rpm -q perl
perl-5.005_02-1
# perl -v
This is perl, version 5.005_02 built for i386-linux-thread [...]

MHonArc 2.4.0 worked fine under this perl and contained itself to about 143 meg. (More or less, I get the number from watching top during the run.) Switching back to the old version of perl caused the problem to reappear. I'm not going to draw any broad sweeping conclusions from this experiment, but it's safe to say I will be sticking with the newer version of perl. Jeff
higher memory requirements in 2.4.0 ?
I've been putting 2.4.0 through some paces. Upgrading to 2.4.0 caused out of memory errors when adding to a 60,000 document archive. In version 2.3.3, similar operations only required about 130 meg, or half of available memory. Running out of memory is serious, because the lack of memory effectively stops the kernel from starting new processes (causing commands like 'ps' to coredump). This makes the machine unusable until MHonArc finally exits with return code 137. Smaller archives appear to work correctly. No changes were made to the (complicated) rcfile, except the removal of a timezone resource. No changes were made to the (complicated) command line arguments. I'd be happy to supply the core file, database file, rcfile, command line options, run profiling or debugging tools (need instructions) or provide remote access to a machine that demonstrates the problem. Here's some system information. Jeff

---
# mhonarc -V
MHonArc v2.4.0 (Perl 5.00405)
Copyright (C) 1995-1999 Earl Hood, [EMAIL PROTECTED]
MHonArc comes with ABSOLUTELY NO WARRANTY and MHonArc may be copied only under the terms of the GNU General Public License, which may be found in the MHonArc distribution.

# rpm -q perl MHonArc
perl-5.004m7-1
MHonArc-2.4.0-1

# uname -a
Linux marmot.jab.org 2.0.36 #1 Tue Dec 29 13:11:13 EST 1998 i586 unknown

# free
             total       used       free     shared    buffers     cached
Mem:        257048      41856     215192      16804      14348      14736
-/+ buffers/cache:      12772     244276
Swap:        72256       7832      64424

From system logs, MHonArc output is reported as 'mailme'. The interpreter messages are from me trying to run programs as root and then as a regular user during the problem time.

Jun 26 20:57:54 marmot mailme: Warning: Database (2.3.3) != program (2.4.0) version.
Jun 26 20:59:27 marmot mailme: Out of memory!
Jun 26 21:00:20 marmot kernel: Unable to load interpreter
Jun 26 21:00:52 marmot kernel: Unable to load interpreter
Jun 26 21:01:58 marmot last message repeated 3 times
Jun 26 21:02:41 marmot last message repeated 4 times
Jun 26 21:03:20 marmot last message repeated 3 times
Jun 26 21:03:21 marmot PAM_pwdb[18495]: (su) session closed for user root
Jun 26 21:03:24 marmot kernel: Unable to load interpreter
Jun 26 21:03:36 marmot last message repeated 2 times
Jun 26 21:04:09 marmot mailme: MHonArc returned exit code 137 for [EMAIL PROTECTED]
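A note on reading the exit code: by common shell convention a status above 128 means the process was terminated by signal (status - 128). Assuming the 'mailme' wrapper reports a shell-style status (which the log does not confirm), 137 would correspond to signal 9. The helper below is a sketch for decoding such values, not part of MHonArc.

```python
import signal


def describe_exit(code):
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return "exited with status %d" % code


print(describe_exit(137))
```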
language detection
I was thinking about automatic language detection. If mailing list traffic was predominantly Icelandic, I would like to automatically ask MHonArc to switch over to a resource file localized for Icelandic. Being completely naive, I pulled up a few non-English emails and looked for some line in the headers that identified the language. How incredibly depressing. The only relevant header I found was the character set, which appears common to dozens of languages. The only other header clue was the domain of the list server, which is hardly a sure thing, given the pervasiveness of both the English language and the .com domain name. What do people do for automatic language detection for email? Are they stuck with scanning the body for common dictionary words? Bleah!! So the question is:

a) Am I missing something obvious?
b) Are there any languages that are easily detected (perhaps by a unique character set)? If so, are those languages supported by MHonArc? Oh, and what are they?

I guess I'll have to scuttle the whole thing; if so that's too bad, since I really think it would be great to automatically customize to a particular language. Jeff

PS Typical non-English language email headers appended.
Return-Path: [EMAIL PROTECTED]
Delivery-Date: Tue May 25 07:50:36 1999
Return-Path: <[EMAIL PROTECTED]>
Received: from jab.org (u251.varesearch.com [209.81.8.251])
    by marmot.jab.org (8.8.7/8.8.7) with ESMTP id HAA28750
    for <[EMAIL PROTECTED]>; Tue, 25 May 1999 07:50:35 -0700
Received: from mars.mmedia.is (mars.mmedia.is [193.4.192.20])
    by jab.org (8.8.7/8.8.7) with ESMTP id KAA16373
    for ; Tue, 25 May 1999 10:48:37 -0400
Received: (from mail@localhost)
    by mars.mmedia.is (8.9.0/8.9.0-MMEDIA) id AAA03873
    for kde-isl-list; Tue, 25 May 1999 00:23:36 GMT
Received: from mailer.isholf.is (pop.isholf.is [194.105.226.2])
    by mars.mmedia.is (8.9.0/8.9.0-MMEDIA) with ESMTP id AAA03857
    for <[EMAIL PROTECTED]>; Tue, 25 May 1999 00:23:31 GMT
Received: from [157.157.168.204]
    by mailer.isholf.is (NTMail 4.20.0009/NU2631.00.d894e447) with ESMTP id kgkacaaa
    for <[EMAIL PROTECTED]>; Tue, 25 May 1999 15:43:15 +
Message-ID: <[EMAIL PROTECTED]>
Date: Tue, 25 May 1999 15:41:44 +
From: Jn Gumundsson <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
X-Mailer: Mozilla 3.04 (Win95; I)
MIME-Version: 1.0
To: [EMAIL PROTECTED]
Subject: [kde-isl]: Forritun fyrir KDE Hvar eru grunnkarnir!!
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Sender: [EMAIL PROTECTED]
Precedence: normal
Organization: Skgrkt R Einarsson
[...]
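The "scan the body for common dictionary words" fallback mentioned above could be sketched as below. The word lists are tiny, hand-picked samples chosen for illustration (an assumption, not a vetted stopword list), and a real detector would need far more data than this.

```python
import re

# Tiny hand-picked samples of very common words; illustrative only.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "icelandic": {"og", "að", "er", "ekki", "það", "sem", "fyrir", "við"},
}


def guess_language(body):
    """Return the language whose common words appear most often, or None."""
    words = re.findall(r"\w+", body.lower())
    scores = {
        language: sum(word in common for word in words)
        for language, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("Forritun fyrir KDE er ekki flókin og það er gott"))
```

A guess made per message could then be aggregated over a whole list before choosing which localized resource file to use.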
successfully processed 10^6 emails
MHonArc just processed its one millionth email on my computer. As you can imagine, I'm extremely pleased. What great software! It's been a lot of fun scaling up. Here's what I learned from my experience over the last year and a half, from a technical standpoint. The system is a single PC with an AMD K6-II processor, 256 megs of ram, and two 16 gig IBM IDE disks.

MHonArc:
* MHonArc, in batch mode, provides O(n) performance no matter how big an archive gets (tested to 60k)
* The risk of an orphaned lock file is too great. It was better to use -nolock and manage concurrency myself.
* It was better to buy RAM than to use -savemem
* Once in a blue moon, perl processes can crash and core dump. No big deal if you remember to check process return values.
* htdig makes for an excellent search engine for MHonArc pages

Stock redhat linux 5.2:
* The default open files limit (1000) is too low.
* Mounting a hard disk takes minutes, e2fsck can take an hour, and ls can take quite a few seconds.
* IDE disk throughput increased when I tweaked settings with hdparm
* When you do a lot of writing to system logs, syslogd starts hogging 25% of the processor. Rotating logfiles daily fixes this problem.
* Better to put some limits on 'updated'
* People can break into a stock system (due to security holes in software bundled with the OS)

Other:
* There are certain emails which can kill nmh.

---
* My friend Brian Semmes was a math major. For some reason he took the introductory electrical engineering class, and it wasn't pretty. The professor said things like, "This resistor has 10^6 ohms; heck, 10^6 is practically infinity, so we'll just substitute infinity into this equation..." It drove him crazy, and I think of him whenever I hear the word "million".
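Two of the MHonArc lessons above (manage concurrency yourself, and check process return values) can be combined in a small wrapper. This is a Unix-only sketch using an advisory flock() lock, which the kernel releases when the process exits, so a crash cannot leave a stale lock behind. The function name and lock path are hypothetical; the thread does not say how the actual setup was done.

```python
import fcntl
import subprocess


def run_locked(lock_path, command):
    """Run `command` while holding an exclusive lock; return its exit status."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock. The lock goes away
        # when the file is closed, even if this process dies abnormally.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        result = subprocess.run(command)
    return result.returncode


# Example call (flags as mentioned in this thread; the lock path is made up):
#   status = run_locked("/tmp/archive.lock",
#                       ["mhonarc", "-add", "-nolock", "-rcfile", "rcfile"])
#   if status != 0:
#       ...  # log the failure instead of assuming the message was archived
```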
along with the spam proofing
http://validator.w3.org doesn't give MHonArc message pages a thumbs up because the first line has to be a DOCTYPE declaration, such as:

I think the comments MHonArc places at the top of the page are invalid for any version of HTML above 2.0. This has no practical significance, and I understand that a fixed position at the top of the file is helpful for machine readability. However, since the format of these comments is rumored to be changing (due to spam shielding) in future versions of MHonArc, this might be a window of opportunity. Jeff
Anti-Spam Measures (RFC)
> My preference for the archives I maintain would be to > have a hook in mhonarc that would allow me to apply > my own address-modifying subroutine. Rot13 is probably > sufficient to stop the harvesters, but doesn't provide > real privacy. Are you thinking of hooks for strong encryption in place of rot13? Perhaps running the list through an anonymizing remailer would be just as effective. > I would also like to leave the message bodies in my archives > untouched. I would, too. By the way, I benchmarked MHonArc the other day as it rebuilt a 60,000 message archive from raw email, 870 messages at a time. It took about a day on a K6-2 300 with 256 MB of ram. Interestingly, each chunk of 870 messages took about the same amount of time to run, so in this mode of operation I'd say time requirements are O(n) where n is the number of messages. Jeff
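The "hook" being requested could look roughly like this: a function that finds addresses and hands each one to a caller-supplied rewriting routine, with rot13 as one possible routine. This is only an illustration of the idea in Python; MHonArc itself is Perl, and the regular expression and function names here are assumptions. Applying it to header text only would leave message bodies untouched.

```python
import codecs
import re

# Deliberately simple address pattern; real-world addresses are messier.
ADDRESS = re.compile(r"[\w.+-]+@[\w.-]+")


def rewrite_addresses(text, hook):
    """Replace every address found in `text` with hook(address)."""
    return ADDRESS.sub(lambda match: hook(match.group(0)), text)


def rot13(address):
    """Example hook: obscures the address, but offers no real privacy."""
    return codecs.encode(address, "rot_13")


print(rewrite_addresses("From: nobody@foo.com", rot13))
```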
[PATCH] Address kerflundering
I received the following patch to MHonArc after offering a small bounty at the Free Software Bazaar. Overall, I got two offers to do the patch within a week of the offer being posted. I'm impressed. This patch was written by Alexis Mikhailov. Jeff

Alexis Mikhailov <[EMAIL PROTECTED]> writes:
>Hello Jeff!
>
>Here is a patch against version 2.3.3 of MHonArc. I've checked it to some
>extent.
>
>Alexis

diff -ru MHonArc2.3.3/lib/mhamain.pl MH/lib/mhamain.pl
--- MHonArc2.3.3/lib/mhamain.pl Sun Nov 8 21:06:23 1998
+++ MH/lib/mhamain.pl Fri Jun 11 17:18:44 1999
@@ -238,7 +238,8 @@
 ## Get here, we are processing mail folders
-local($mesg, $tmp, $index, $sub, $from, $i, $date, $fh);
+local($mesg, $tmp, $index, $sub, $from, $i, $date, $fh ,$fromaddrname,
+ $fromaddrdomain);
 local(%fields);
 $i = $NumOfMsgs;
@@ -255,7 +256,7 @@
 $handle = $ADD;
 ## Read mail head
- ($index,$from,$date,$sub,$header) =
+ ($index,$from,$date,$sub,$header,$fromaddrname,$fromaddrdomain) =
 &read_mail_header($handle, *mesg, *fields);
 if ($index ne '') {
@@ -303,7 +304,7 @@
 }
 print STDOUT "." unless $QUIET;
 $mesg = '';
- ($index,$from,$date,$sub,$header) =
+ ($index,$from,$date,$sub,$header,$fromaddrname,$fromaddrdomain) =
 &read_mail_header($fh, *mesg, *fields);
 # Process message if valid
@@ -347,7 +348,7 @@
 MBOX: while (!eof($fh)) {
 print STDOUT "." unless $QUIET;
 $mesg = '';
- ($index,$from,$date,$sub,$header) =
+ ($index,$from,$date,$sub,$header,$fromaddrname,$fromaddrdomain) =
 &read_mail_header($fh, *mesg, *fields);
 if ($index ne '') {
@@ -667,6 +668,23 @@
 print STDOUT "\n" unless $QUIET;
 }
+sub split_address {
+local($from) = @_;
+local($fromaddrname, $fromaddrdomain);
+local(@machines);
+
+$from =~ s/^.*\<(.*)\>.*$/$1/;
+$from =~ s/\(.*\)//;
+$from =~ s/^\s+//;
+$from =~ s/\s+$//;
+
+@machines = split /\!/, $from;
+if ($machines[-1] =~ /[@%]/)
+{
+ return split /[@%]/, $machines[-1];
+}
+return ($machines[-1], $machines[-2]);
+}
 ##---
 ## read_mail_header() is responsible for parsing the header of
 ## a mail message.
@@ -674,7 +692,7 @@
 sub read_mail_header {
 local($handle, *mesg, *fields) = @_;
 my(%l2o, $header, $index, $date, $tmp, @refs, @array);
-local($from, $sub, $msgid);
+local($from, $sub, $msgid, $fromaddrname, $fromaddrdomain);
 local($_);
 $header = &readmail::MAILread_file_header($handle, *fields, *l2o);
@@ -759,6 +777,7 @@
 foreach (@FromFields) {
 next unless $fields{$_};
 $from = $fields{$_};
+ ($fromaddrname, $fromaddrdomain) = split_address($from);
 last;
 }
 $from = 'No Author' unless $from;
@@ -802,7 +821,7 @@
 &remove_dups(*refs); # Remove duplicate msg-ids
 $Refs{$index} = join($X, @refs) if (@refs);
-($index,$from,$date,$sub,$header);
+($index,$from,$date,$sub,$header,$fromaddrname,$fromaddrdomain);
 }
 ##---
diff -ru MHonArc2.3.3/lib/mhrcvars.pl MH/lib/mhrcvars.pl
--- MHonArc2.3.3/lib/mhrcvars.pl Sun Nov 8 21:06:23 1998
+++ MH/lib/mhrcvars.pl Fri Jun 11 17:12:21 1999
@@ -139,13 +139,17 @@
 "";
 last REPLACESW;
 }
- my($cnd1, $cnd2, $cnd3) = (0,0,0);
+ my($cnd1, $cnd2, $cnd3, $cnd4, $cnd5) = (0,0,0,0,0);
 if (($cnd1 = ($var eq 'FROM')) || ## Message "From:"
 ($cnd2 = ($var eq 'FROMADDR')) || ## Message from mail address
- ($cnd3 = ($var eq 'FROMNAME'))) { ## Message from name
+ ($cnd3 = ($var eq 'FROMNAME')) || ## Message from name
+ ($cnd4 = ($var eq 'FROMADDRNAME')) || ## Message from user name
+ ($cnd5 = ($var eq 'FROMADDRDOMAIN'))) { ## Message from domain
 my $esub = $cnd1 ? sub { $_[0]; } :
 $cnd2 ? \&extract_email_address :
- \&extract_email_name;
+ $cnd3 ? \&extract_email_name :
+ $cnd4 ? \&extract_email_addr_name :
+ \&extract_email_addr_domain;
 $canclip = 1; $raw = 1;
 ($lref, $key, $pos) = compute_msg_pos($index, $var, $arg);
 $tmp = defined($key) ? &$esub($From{$key}) : "(nil)";
diff -ru MHonArc2.3.3/lib/mhutil.pl MH/lib/mhutil.pl
--- MHonArc2.3.3/lib/mhutil.pl Sat Oct 3 23:07:54 1998
+++ MH/lib/mhutil.pl Fri Jun 11 17:12:42 1999
@@ -44,6 +44,28 @@
 $ret;
 }
+sub extract_email_addr_name
address kerflundering
I decided to try something a bit unusual, and submit an offer for a small bounty at "The Free Software Bazaar" for a patch to MHonArc. Mainly, I'm curious to see if and how their bounty system works. Anyway, I thought it would be common courtesy to carbon copy to this list. I hope I didn't offend anyone with this experiment. Cheers, Jeff

--
MHonArc is a popular GPL'd email to HTML converter written in Perl. I want a patch to add two new resource variables to MHonArc. Patch must follow guidelines below. Patch must be created with 'diff -uNr', be shorter than 100 lines and apply cleanly to MHonArc 2.3.3 or later. Patch may not destroy any existing functionality in MHonArc. Final condition: submit patch to MHonArc mailing list. Offer expires midnight, December 31, 1999, GMT.

Helpful references:
http://www.mail-archive.com/mhonarc@ncsa.uiuc.edu/msg01047.html
http://www.oac.uci.edu/indiv/ehood/mhonarc.html

$25 to developer.

Jeff Breidenbach
[EMAIL PROTECTED]
phone: 908 210 9135 home
phone: 908 938 9600 x3010 work
http://www.jab.org (homepage)

= The following advice is quoted from Earl Hood: =

The two main functions to target are:

mailUrl() in mhutil.pl: This function is used in the conversion of address in converted message headers.

replace_li_var() in mhrcvars.pl: This is the general purpose function for doing resource variable interpolation. If any new resource variables are desired, this function would have to handle them.

What I propose is the following new resource variables:

$FROMADDRNAME$ The username portion of the email address
$FROMADDRDOMAIN$ The domain name of the email address

Example: [EMAIL PROTECTED]
$FROMADDRNAME$ => "nobody"
$FROMADDRDOMAIN$ => "foo.com"

-ewh
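To show what the two proposed variables would contain, here is a short Python sketch that splits a From: value into its user and domain parts. It relies on the standard library's parseaddr and is not the Perl that a MHonArc patch would use; the function name is hypothetical, and unlike a full implementation it ignores UUCP-style bang paths.

```python
from email.utils import parseaddr


def split_address(from_header):
    """Return (user, domain) for the address in a From: header value."""
    _, address = parseaddr(from_header)
    user, _, domain = address.partition("@")
    return user, domain


print(split_address("Nobody <nobody@foo.com>"))
```

With the example from the proposal, the first element corresponds to $FROMADDRNAME$ and the second to $FROMADDRDOMAIN$.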
rcfile - passing arguments to
Check out the -title and -ttitle command line options. Cheers, Jeff > I am using a resource file and I would like to be > able to pass arguments to it. I'd like to specify the listname on the > command line and then use it in the rcfile - is it possible to do this?
delmsg + MAXSIZE windowing
Hi all, I commented out the following lines in delmsg in mhamain.pl, then added a MAXSIZE of 3000 to the rcfile of an archive which had about 1 messages. The goal is a windowing effect, where a small MHonArc database runs a big archive containing lots and lots of HTML message files. The results were unusual. First, the time required to add new messages dropped from 20 minutes to less than three minutes. (GREAT!) On the flip side, the MHonArc indexes did not work as hoped. The initial thread index page showed no change (and continues to not change, even as more messages are added) The first date index page now shows a rather old set of messages, and also does not appear to update as new messages are added. Did I make an obvious mistake? Thank you, Jeff PS This is obviously an unsupported topic and I don't want to waste people's time - so please send me on my way if this topic is too esoteric. -- #&file_remove($filename); #foreach $filename (split(/$X/o, $Derived{$key})) { # $pathname = (&OSis_absolute_path($filename)) ? # $filename : # join($DIRSEP, $OUTDIR, $filename); # if (-d $pathname) { # &dir_remove($pathname); # } else { # &file_remove($pathname); # } #} % mhonarc -v MHonArc v2.3.3 (Perl 5.00404) Copyright (C) 1995-1998 Earl Hood, [EMAIL PROTECTED] MHonArc comes with ABSOLUTELY NO WARRANTY and MHonArc may be copied only under the terms of the GNU General Public License, which may be found in the MHonArc distribution. The full rcfile is rather long... so I'm just including the bits that seem most relevant. 1 300 x-archive-with-date:received:date 3000 Relevant command line options were -add -nolock -savemem -quiet -rcfile rcfile -tidxfname index.html
MAXSIZE - poor man's windowing
So here's a crazy high level idea about how to implement windowing (i.e. having MHonArc only consider recent parts of the archive rather than the whole thing when indexing new messages). What do people think? Here's the scenario, with a hypothetical WINDOW setting of 3 and an IDXSIZE of 100. So I start adding messages to the archive, and it grows and grows... at 101 messages we have two date index pages (due to IDXSIZE and MULTIPG). At 201 we get three date index pages. So far everything is normal. However, when we get to message 301, it gets more interesting. The database shrinks to size 200 (IDXSIZE * (WINDOW - 1)). The shrinkage is like a MAXSIZE shrinkage; however, the existing html message files do not get deleted, nor does the index file we just orphaned. Both stick around as legacy html files, i.e. perfectly good HTML files that have no representation in the MHonArc database. Then, as messages get added to the archive, eventually the database size again exceeds (WINDOW * IDXSIZE) and we repeat the process. - Here are the advantages/disadvantages I thought of; I'm sure there are others. Advantages -- * MHonArc would have the ability to handle large archives with a small database (saving on both memory and processing time) * It might be sane to implement Disadvantages -- * It's not a perfectly generalized solution (i.e. your window size is quantized by IDXSIZE) * It may not offer enough advantages over straight MAXSIZE to be worth adding complexity to the code.
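The shrink rule in this proposal is easy to check with a toy model. The sketch below is Python, not MHonArc code; add_message, the list-based "database", and the parameter names idxsize and window are all assumptions standing in for IDXSIZE and the proposed WINDOW setting.

```python
def add_message(db, msg, idxsize, window):
    """Append msg to db and apply the proposed windowing rule.

    When the database grows past idxsize * window entries, it is cut
    back to the newest idxsize * (window - 1). The dropped entries are
    returned; in the proposal their HTML files would stay on disk.
    """
    if window < 2:
        raise ValueError("window must be at least 2 for this rule")
    db.append(msg)
    dropped = []
    if len(db) > idxsize * window:
        keep = idxsize * (window - 1)
        dropped = db[:-keep]   # oldest entries leave the database
        del db[:-keep]
    return dropped


if __name__ == "__main__":
    db = []
    for number in range(1, 302):
        orphaned = add_message(db, number, idxsize=100, window=3)
    print(len(db), len(orphaned))
```

With IDXSIZE 100 and WINDOW 3 the model stays untouched through message 300 and drops to 200 entries on message 301, matching the description above.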
Re: MAXSIZE - poor man's windowing
>You can deal with the confusion of a half-indexed corpus of messages, >because you have your hands on the construction of the site and know >its structure inside out. Is there anybody else who needs to access the >site? Having indices that work for some messages and not for others sounds >like a gold plated way to convince would-be users that the resource is >broken and unusable. Just a thought. The site is http://www.mail-archive.com and has quite a few users. I suspect nobody would notice if I pruned the MHonArc indexes to the 5000 most recent messages. Given the current user interface, do you think I will alienate users? Jeff
MAXSIZE - poor man's windowing
As we know, it can take a fair amount of time to add messages to larger archives. I've seen filing times upwards of 20 minutes. For me, long filing times are starting to become a bottleneck. The usual solution for MHonArc is to split archives up, for example, putting each month in a separate directory. Another possibility is to use MAXSIZE to keep the archives from growing too large. However, I'm thinking of another possibility. With MAXSIZE, new messages are added and, if necessary, old ones are deleted. However, if a MAXSIZE variant were to only delete entries in the database, and not erase the html files, we'd still get the speed advantages of a small database. It would also allow me to keep the old message pages around. Of course, there would not be any index pages for those old message pages. In my case that's ok, since I use a search engine to find old messages anyway. Any comments or thoughts? Jeff
Re: how to make gifs not inline?
Earl, I see it now - there was a Content-Disposition in the headers after all. I checked four times in the past, but I was stupidly looking at the message headers as opposed to the MIME headers. (I don't know what came over me!) Thank you for the suggestion about MIMEargs settings, and I'm very sorry for wasting your time on a false question. Jeff
how to make gifs not inline?
Looking at the docs, it appears image inlining is instigated by the MIMEARGS defaults. I tried to override that for a mail with an attached .gif and no Content-Disposition: header. It still got inlined. What am I missing? Thank you in advance, Jeff -- MHonArc v2.3.3 (Perl 5.00404) m2h_external::filter; usename useicon subdir iconurl="../attachment.gif" image/gif;
Time Zones (RFC)
The Adventists' list filled in most of the blanks. I did not look for discrepancies, but noticed some anyway in the New Zealand timezones. Don't know which list is correct. Jeff
---
%Zone = (
    'ACDT', '-1030',  # Australian Central Daylight
    'ACST', '-0930',  # Australian Central Standard
    'ADT',  '0300',   # (US) Atlantic Daylight
    'AEDT', '-1100',  # Australian East Daylight
    'AEST', '-1000',  # Australian East Standard
    'AHDT', '0900',
    'AHST', '1000',
    'AST',  '0400',   # (US) Atlantic Standard
    'AT',   '0200',   # Azores
    'AWDT', '-0900',  # Australian West Daylight
    'AWST', '-0800',  # Australian West Standard
    'BAT',  '-0300',  # Baghdad
    'BDST', '-0200',  # British Double Summer
    'BET',  '1100',   # Bering Standard
    'BST',  '0300',   # Brazil Standard
#   'BST',  '-0100',  # British Summer
    'BT',   '-0300',  # Baghdad
    'BZT2', '0300',   # Brazil Zone 2
    'CADT', '-1030',  # Central Australian Daylight
    'CAST', '-0930',  # Central Australian Standard
    'CAT',  '1000',   # Central Alaska
    'CCT',  '-0800',  # China Coast
    'CDT',  '0500',   # (US) Central Daylight
    'CED',  '-0200',  # Central European Daylight
    'CET',  '-0100',  # Central European
    'CST',  '0600',   # (US) Central Standard
    'EAST', '-1000',  # Eastern Australian Standard
    'EDT',  '0400',   # (US) Eastern Daylight
    'EED',  '-0300',  # Eastern European Daylight
    'EET',  '-0200',  # Eastern Europe
    'EEST', '-0300',  # Eastern Europe Summer
    'EST',  '0500',   # (US) Eastern Standard
    'FST',  '-0200',  # French Summer
    'FWT',  '-0100',  # French Winter
    'GMT',  '',       # Greenwich Mean
    'GST',  '-1000',  # Guam Standard
#   'GST',  '0300',   # Greenland Standard
    'HDT',  '0900',   # Hawaii Daylight
    'HST',  '1000',   # Hawaii Standard
    'IDLE', '-1200',  # International Date Line East
    'IDLW', '1200',   # International Date Line West
    'IST',  '-0530',  # Indian Standard
    'IT',   '-0330',  # Iran
    'JST',  '-0900',  # Japan Standard
    'JT',   '-0700',  # Java
    'MDT',  '0600',   # (US) Mountain Daylight
    'MED',  '-0200',  # Middle European Daylight
    'MET',  '-0100',  # Middle European
    'MEST', '-0200',  # Middle European Summer
    'MEWT', '-0100',  # Middle European Winter
    'MST',  '0700',   # (US) Mountain Standard
    'MT',   '-0800',  # Moluccas
    'NDT',  '0230',   # Newfoundland Daylight
    'NFT',  '0330',   # Newfoundland
    'NT',   '1100',   # Nome
    'NST',  '-0630',  # North Sumatra
#   'NST',  '0330',   # Newfoundland Standard
    'NZ',   '-1100',  # New Zealand
    'NZST', '-1200',  # New Zealand Standard #-1300 NEW ZEALAND STD. SUMMER
    'NZDT', '-1300',  # New Zealand Daylight
    'NZT',  '-1200',  # New Zealand #NEW ZEALAND STD.
    'PDT',  '0700',   # (US) Pacific Daylight
    'PST',  '0800',   # (US) Pacific Standard
    'ROK',  '-0900',  # Republic of Korea
    'SAD',  '-1000',  # South Australia Daylight
    'SAST', '-0900',  # South Australia Standard
    'SAT',  '-0900',  # South Australia Standard
    'SDT',  '-1000',  # South Australia Daylight
    'SST',  '-0200',  # Swedish Summer
    'SWT',  '-0100',  # Swedish Winter
    'USZ3', '-0400',  # USSR Zone 3
    'USZ4', '-0500',  # USSR Zone 4
    'USZ5', '-0600',  # USSR Zone 5
    'USZ6', '-0700',  # USSR Zone 6
    'UT',   '',       # Universal Coordinated
    'UTC',  '',       # Universal Coordinated
    'UZ10', '-1100',  # USSR Zone 10
    'WAT',  '0100',   # West Africa
    'WET',  '',       # West European
    'WST',  '-0800',  # West Australian Standard
    'YDT',  '0800',   # Yukon Daylight
    'YST',  '0900',   # Yukon Standard
    'ZP4',  '-0400',  # USSR Zone 3
    'ZP5',  '-0500',  # USSR Zone 4
    'ZP6',  '-0600',  # USSR Zone 5
);
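One thing worth noting about the table is its sign convention: judging from the entries themselves (EST is '0500', JST is '-0900'), a positive value appears to mean hours and minutes west of Greenwich. The Python sketch below converts such a value to the more familiar minutes east of UTC. The function name is hypothetical and the sign convention is an inference from the table, not something confirmed from MHonArc's documentation.

```python
def zone_to_utc_offset_minutes(value):
    """Convert a table value like '0500' or '-1030' to minutes east of UTC.

    Assumes (from the table's entries) that positive values lie west of
    Greenwich, so '0500' (EST) becomes -300 and '-0900' (JST) becomes 540.
    """
    sign = -1 if value.startswith("-") else 1
    digits = value.lstrip("+-")
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("expected an HHMM value, got %r" % value)
    minutes_west = sign * (int(digits[:2]) * 60 + int(digits[2:]))
    return -minutes_west


if __name__ == "__main__":
    for acronym, value in [("EST", "0500"), ("ACDT", "-1030"), ("JST", "-0900")]:
        print(acronym, zone_to_utc_offset_minutes(value))
```

Half-hour zones such as ACDT are the reason the table needs HHMM values rather than whole hours.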
rcfiles cascade - wow!
Hi all, Did I mention I was impressed by MHonArc? I just checked to see if rcfiles would cascade, i.e. mhonarc -rcfile a -rcfile b mbox It appears to work just like cascading stylesheets; i.e. rcfile a is used, except where augmented or overridden by rcfile b. I am totally, totally impressed. Wow. Sorry for cluttering the list with praise, but that just knocked my socks off. Jeff
Time Zones (RFC)
I recommend mentioning your philosophy with respect to your proposed changes. Possible philosophies could be:
1) We support the minimum number of timezones required by RFC 822 by default.
2) We support a subset of popular timezones, in order to cover many common cases, by default.
3) We support all timezones listed in official standard X by default.
a) We are extensible in full hour increments.
b) We are extensible in minute increments.
c) We extensibly deal with acronym name space collisions.
d) We extensibly deal with the historical changes of timezones; i.e. mail sent from Libya in 1913 will be time zone corrected differently than mail sent from Libya in 1953.
Quite frankly, getting it "right" requires a huge historical lookup table, timezone offsets to the second, and other nightmares. The government timezone documents I was looking at (referenced from http://www.bsdi.com/date) really were mind numbingly complex and required constant revision. I personally think philosophy 2a is quite adequate. I suspect nobody would ever take advantage of philosophy 2b (which appears to be what you are suggesting), but I have no fundamental opposition. For what it's worth, I was getting maybe 3 emails out of 1000 with an unrecognized timezone offset, and they were almost invariably MET and AEST. I think the more obscure the timezone, the more likely we would get a numerical offset in the email. Jeff
TIMEZONE defaults
I looked for official sounding timezone code and documents at http://www.bsdi.com/date, and found it incomprehensible. Instead I just used the list from the Adventists (mentioned earlier). Java 1.2 has a method java.util.TimeZone.getAvailableIDs() which may be a good source. Java tends to be pretty uptight about following standards for this sort of thing. (And if you run this on a recent Solaris maybe you'll at least get a POSIX list.) Of course Java 1.2 final isn't out yet and I can't find out any more from the documentation. Anyway, at least now my logs will be a little less cluttered (AEST and MET were the most common offenders). Jeff AHDT:9 AHST:10 AST:4 ACDT:-10 ACST:-9 AEDT:-11 AEST:-10 AWDT:-9 AWST:-8 AT:2 BAT:-3 BET:11 BZT2:3 BDST:-2 BST:-1 CDT:5 CED:-2 CET:-1 CST:6 CCT:-8 EDT:4 EED:-3 EET:-2 EST:5 GMT:0 GST:-10 HDT:9 HST:10 IST:-5 IDLE:-12 IDLW:12 IT:-3 JST:-9 JT:-7 MED:-2 MET:-1 MT:-8 MDT:6 MST:7 NZST:-13 NZT:-12 NZS:-12 NZ:-11 NT:11 NST:-6 PDT:7 PST:8 SAD:-10 SDT:-10 SAST:-9 SAT:-9 SST:-7 UZ10:-11 USZ3:-4 USZ4:-5 USZ5:-6 USZ6:-7 UT:0 WAT:1 YDT:8 YST:9
TIMEZONE defaults
Just curious: why are the timezones recognized by default not more comprehensive? Are timezone acronyms not standardized? Do RFCs only recommend knowing a few timezones? Or is this a potential area for improvement? Jeff PS I didn't find an RFC or ISO standard during five minutes of poking around, but did find some informal timezone listings at: http://sonne.net/Vicious/time.html http://news.adventist.org/sun/
2.3.3 RPM users : please update
The 2.3.3 RPM you *really* want is MHonArc-2.3.3-2.noarch.rpm or later. Don't settle for less, or you'll run into a path glitch I made during packaging. Sorry, Jeff
web based email services
I think there are a few such systems kicking about. www.freshmeat.com has a few listed (look under appindex on their site) I think one is called "atdot" or something similar. Anyway, I haven't used any of these myself and do not know if any utilize MHonArc. Jeff
uploaded RPM is really v2.3.3
Sorry, I made a typo; the RPM I uploaded is not 2.3.2; it is the latest 2.3.3. Expect to see it at ftp://contrib.redhat.com/noarch in a few days. Jeff
v2.3.2 RPM uploaded
Presumably it will be available in a few days from ftp://contrib.redhat.com/noarch The RPM itself was revised (simplified) to take advantage of MHonArc's improved install script. Jeff
Re: Anyway to use include files (for nav bars) in a RCfile?
>(BTW, the search engine on the mhonarc list on mail-archive.com is >currently down, so I tried to search the archive before asking this!) I upgraded mail-archive.com today, and it took several hours for the new search engine to re-index everything. Anyway, things are back up and running; you might even notice slight improvements in performance when searching. Also, for what it is worth, the HTML 4.0 specification (www.w3.org) has a tag for "include this bit of HTML from a file right in here"; I forget the tag name. The upside is it's exactly what you want; the downside is I don't know of any browser that implements it. Perhaps it's a good choice if you want to be, ahem, ahead of the curve. Jeff
Re: Return code 139 (bug?)
>BTW, could you compress some of the data you put up on your FTP >site? The .mhonarc.db file is huge, and compressing it will make >download much quicker. Done. ftp://jab.org/db.tgz (It's still pretty huge.) I will attempt the other diagnostics you suggested. I'd like to do some more observation on the 2.3.0 archives as well, to get a better characterization of whether the problem is deterministic or not. Jeff PS You may want to wait until I've done another analysis before cracking open the DB file... the preliminary results for 2.3.0 are promising.
Return code 139 (bug?)
Hi all, Here's an update to the error code 139 incident described earlier. (Like you care! But hey...) After digging around more on the return code 139, it again appears to be database corruption, with a missing apostrophe in the DB file. I don't know why two of my databases periodically get corrupted. In high hopes, I have just upgraded from 2.2.0 -> 2.3.0, and am now using the famous -nolock feature. We'll see what luck I have. Jeff
2.3.0 RPM uploaded
FYI, A RedHat linux RPM has been uploaded and should show up in the next few days. No changes were made to the RPM except a version update from 2.2.0 to 2.3.0 ftp://ftp.redhat.com/pub/crontib/noarch Jeff
Return code 139 (bug?)
Earl et al., I have been getting problems every few days or less on two of the 125 lists I have identically configured with MHonArc. These two lists are high traffic lists (other high traffic lists are doing ok, so far). From today's logs of stderr and stdout, we have about twenty successful runs, followed by a failure, which leaves a .mhonarc.lck lying around. Shown are portions of the last successful run, and the first failure. Subsequent failures are due to the .mhonarc.lck file lying around and have an exit code of 255. The comments about return values are from my wrapper script. I have placed a copy of the DB at ftp://jab.org/pub/.mhonarc.db I have placed a copy of the rcfile at ftp://jab.org/pub/rcfile I have placed the affected mail at ftp://jab.org/pub/catastrophe/ noting that some of the error code 255 mail also gets diverted to this directory. The call to MHonArc is made from a shell script; the relevant lines are attached. You may assume all the shell variables are assigned reasonable values.
MHONARC=/usr/bin/mhonarc
FLAGS="-reverse -treverse -tidxfname index.html"
FLAGS="$FLAGS -rcfile $HOME/rcfile -savemem -tlevels 1"
FLAGS="$FLAGS -ttitle $ESCAPED_NAME -title $NICKNAME"
FLAGS="$FLAGS -idxsize 300 -multipg -add -nomailto"
$MHONARC $FLAGS $HOME/Mail/$FILTER
At no point are two MHonArc processes called at the same time. Suggestions? Comments on this particular return code? Jeff
Reading database ...
Reading resource file: /home/archive/rcfile ...
Adding messages to .
Reading /home/archive/Mail/filter.1998.10.20-11:10:33-27409 ..
Writing mail
Writing ./thrd81.html ...
Writing ./thrd82.html ...
Writing ./thrd83.html ...
Writing database ...
24804 messages
Successfully ran Mhonarc
Now creating HTML archives.
Reading database ...
Reading resource file: /home/archive/rcfile ...
Adding messages to .
Reading /home/archive/Mail/filter.1998.10.20-11:10:01-2601 ..ERROR: MhonArc returned exit code 139.
[jeff@multivac jeff]# mhonarc -v
MHonArc v2.2.0 Copyright (C) 1995-1998 Earl Hood, [EMAIL PROTECTED] MHonArc comes with ABSOLUTELY NO WARRANTY and MHonArc may be copied only under the terms of the GNU General Public License, which may be found in the MHonArc distribution.
[jeff@multivac jeff]# uname -a
Linux multivac.jab.org 2.0.32 #1 Wed Nov 19 00:46:45 EST 1997 i586 unknown
[jeff@multivac jeff]# perl -v
This is perl, version 5.004_04 built for i386-linux Copyright 1987-1997, Larry Wall Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5.0 source kit.
[jeff@multivac jeff]# rpm -q perl
perl-5.004-4
[jeff@multivac Mail]# free
             total       used       free     shared    buffers     cached
Mem:        126788     124096       2692      29420      57324      27432
-/+ buffers/cache:      39340      87448
Swap:        16092        116      15976
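On the question about the return code: by the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux. That is consistent with a crashing interpreter, though it is only an assumption that the wrapper script reports the shell's status unchanged. A small Python sketch of the decoding (describe_exit_status is a hypothetical helper name):

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status such as 0, 139 or 255."""
    if status > 128:
        try:
            # 128 + N is how shells report death by signal N.
            name = signal.Signals(status - 128).name
            return "terminated by " + name
        except ValueError:
            # Not a known signal number (e.g. 255 - 128 = 127):
            # fall through and treat it as an ordinary exit status.
            pass
    return "exited with status %d" % status


if __name__ == "__main__":
    for code in (0, 139, 255):
        print(code, describe_exit_status(code))
```

The 255 seen on the later runs does not map to a signal, which fits the message's own explanation that those runs simply exited with an error because of the leftover .mhonarc.lck file.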
Re: stale lockfile
Regarding the stale lockfile - good news. I finally managed to capture the error message (stderr is being logged... now), which was indicative of database corruption. Perhaps this wasn't as intermittent as I thought. Here's the message (MHonArc v2.2): Reading database Can't find string terminator "'" anywhere before \ EOF at ./.mhonarc.db line 59552. Thank you for the tip about -savemem not being helpful for archiving single messages. My setup uses a daemon which sorts and processes an inbox queue every so often. It's surprisingly easy to pull off with the MH commands. In times of light traffic, it does one message at a time. During heavy traffic, they pile up and it is much more of a batch job. This seems to work well performance wise. As for fcntl() and friends, I personally am planning to move to a distributed filesystem in the medium/near future. I've been very impressed with Coda, a free and much improved descendant of AFS, and suspect it will grow quite popular with time. Thus I might prefer to use the -nolock feature in the future, and trust my wrapper script to keep things from happening concurrently. Jeff
stale lockfile
Hello, I am finding that MHonArc is intermittently leaving a lockfile around, suggesting abnormal termination. I manually delete the file, and about a day or two later it happens again. And again. The weird part is, I am archiving 100 identically configured lists, and the problem consistently (although intermittently, every few days) only occurs for one list. The list isn't my biggest, but it does have high traffic (about 20 messages a day) and just shy of 20,000 messages archived. Larger list archives (albeit with lower traffic these days) seem to be doing fine. Only one MHonArc runs at a time, and I believe I have enough system resources. I use -savemem and -add, along with a lot of other customizations. Could I have a corrupt database? Do I need to rebuild it? Any thoughts on troubleshooting? I haven't caught this behavior during a manual run yet. Jeff
Re: attachment names and "Message Not Available"
Hi Earl, Wow! Thank you. The suggestions worked great! The emacsesque extensibility of MHonArc continues to impress me. Jeff
attachment names and "Message Not Available"
Hi all, Guess I'm on a roll (rut?) with MHonArc suggestions; I'm feeling guilty not having contributed any code at all, yet making all sorts of suggestions. Here goes with two more. Do these make sense? (Note I looked through the archive and didn't see discussion on either of these, but may have missed it.) Jeff (1) It would be nice to be able to stifle "Message Not Available" messages in the thread index. This would slightly beautify thread indexes, and many archive perusers don't actually care if some message is not available. (2) MIME handling is great with MHonArc. However, many MIME objects get stored in files named things like 000432.bin. I would prefer, from a usability standpoint, to have the files stored under their attachment name, for a couple of reasons. First, some OSes like Windows put great significance on file name extensions. Imagine a Windows person browsing a set of archives. Having the .doc extension on a Microsoft Word document renamed .bin turns a one step click-and-view into a multistep renaming process. I guess my first preference would be actually keeping the attachment names, so putting attachments in a subdirectory per message would be required to avoid name space collisions. Failing that, I'd much rather see a naming scheme like 000432.doc 000433.xls so at least browser and server MIME typing will work correctly.
no more
I always get the feeling that if I look hard enough in the Mhonarc documentation, the answer to any question is sitting there. But I couldn't find these: 1) I would like my index to be more confident about threading. That means not using the disclaimer when things are unsure. Let mistakes be made! Is there any way to turn off the disclaimer? 2) I've noticed there are many date resource variables for use in the web pages, anything from 01/02/98 to the ISO whateverwhatever official date string. But it would be nice if there was a date string that said something like "Jan 4, 1998" which has the advantage of being short, unambiguous to those of us who can't remember whether month comes first or day comes first in 02/02/02, oh, and year 2000 compliant! You know you are working with a polished system when the questions get this finely detailed! Sorry if the answers are already sitting in the documentation - I didn't see them. Jeff
Re: reproducible URLs
>8 base-46 characters is sufficient to have a minuscule collision >probability for archives of any reasonable size. That's still only 44 bits of namespace. I guess it depends on what you call reasonable risk; to me it feels a little high.
Risk        # of Messages (approximately)
1:100,000   18,000
1:3000      100,000
3:100       1,000,000
>A 100,000 message archive seems two orders of magnitude too high for >MHonArc's basic design; anything that large using a filesystem as its >database needs to be organized hierarchically. That would add a >subdirectory namespace into the quota. Two orders of magnitude? I am running two archives that will exceed 100,000 messages in the next two years, at the rate they are growing. Their current size, 50,000 messages apiece, works fine under the ext2 (linux native) filesystem. I think a statistical limit of one million is better, as that better reflects the largest lists out there stored over many years. While many filesystems bog down with a large number of files in a particular directory, not all do. Performance with lots of files in a directory is not an inherent problem; it is directly tied to the design of the filesystem. An argument could be made that it doesn't make sense to compensate for broken filesystems, whether due to some crazy 8.3 namespace limitation, or due to braindead performance with lots of files. The place for the fix would be in the filesystem and/or underlying OS, not MHonArc. (Kind of like it didn't really make sense to convolute Java applet code, just so the applet would work on a broken Netscape 2.01 browser.) Jeff
reproducible URLs
A while back there was a discussion of reproducible URLs (to avoid messing up search engines when mail gets re-archived) and issues surrounding randomness, probabilities, MD-5, message-ID, and 8.3 filenames. Anyway, sorry I didn't jump in then, but the kind-of-fun question was implicitly raised: how many bits of randomness do you need for reproducible URLs in MHonArc? (Hey, it's not every day that real life questions can be tackled like problem sets!) We know that the more messages there are, the more likely that there will be a duplicated filename. So, let's assume file names have x bits of randomness and there are n messages. The probability of collision is $\sum_{i=0}^{n} \frac{i}{2^x}$, which is approximately $\int_0^n \frac{i}{2^x}\,di = \frac{n^2}{2^{x+1}}$, which is the total likelihood of collisions over the total number of possibilities (often called the sample space). So we want a low chance of collision for any likely size of n. If n=10^6 (about 2^20) and we want one-in-a-hundred-thousand odds of collision for such an extreme case, then x comes out to about 56. That's 56 bits of randomness. Well, in an 8.3 filename, with no case sensitivity, and only using numbers and letters, we get about 57 bits of randomness to play with, using all 11 characters. No problem. Now if we are restricted to ending the filenames with something like .htm, then there are only about 41 bits of randomness, and then we run about a 0.2% risk of collision for a puny n=100,000 message archive. That's pushing it. Ok, one last note. If we use a real filesystem, with upper and lower case letters in the filenames, we'd still need 10 characters in the filename to meet/exceed the acceptable safety margin (57 bits). So those lower case letters don't help us much in the region we are interested in. Using MD-5 checksums for filenames is complete overkill, statistically speaking. They are 128 bits, and would consume 20-odd characters in the filename. 10 character filenames would do the trick nicely.
There is certainly no need to combine MD-5 and message-IDs from a statistical standpoint. Okay, sorry for the babbling! It was the repressed student inside me. Jeff
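The arithmetic in this message can be re-run with a few lines of Python. This is a sketch of the n^2 / 2^(x+1) approximation described above, not an exact birthday-problem calculation, and the function names are hypothetical.

```python
import math


def collision_risk(n, bits):
    """Approximate collision probability for n names drawn from 2**bits."""
    return n * n / 2 ** (bits + 1)


def bits_needed(n, risk):
    """Smallest whole number of bits keeping the approximate risk <= risk."""
    return math.ceil(math.log2(n * n / (2 * risk)))


def filename_bits(alphabet_size, length):
    """Bits of namespace in a name of `length` characters."""
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    # One million messages at 1-in-100,000 odds.
    print(bits_needed(10**6, 1e-5))
    # 8.3 names, digits plus case-insensitive letters: 11 and 8 characters.
    print(round(filename_bits(36, 11), 1), round(filename_bits(36, 8), 1))
    # Mixed-case alphanumerics, 10 characters.
    print(round(filename_bits(62, 10), 1))
```

The approximation is only meaningful while the result is well below 1; for very large n relative to 2^x it overshoots.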
reproducible URLs
I had been wracking my brains, trying to think of what could possibly be improved with the wonderful MHonArc program. And then it hit me. When I use MHonArc, I tend to think of it as a renderer - I feed it email and it renders a bunch of HTML files. On occasion, I will change something in the rcfile (or whatnot) and rerender all the email messages. Now sometimes that results in what was msg00587.html turning into msg00560.html. Not a big deal, except that it might leave a dozen major internet search engines (and my little minor search engine) with a bad idea of what is where, at least until the pages can be re-indexed. So, one potential feature for the future might be an option to use reproducible filenames for messages. Like naming the file after the MD5 checksum of a message, or the message ID, or something else that is statistically likely to be unique. Anyway, it's just a thought I wanted to throw out. It would be yet another stripe of icing on a wonderful cake. Jeff
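To make the idea concrete, here is one way a reproducible filename could be derived from a Message-ID, sketched in Python. Everything specific here is an assumption for illustration: the function name stable_filename, the "msg" prefix, the 10-character base-36 encoding and the ".html" suffix are not MHonArc features.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 36 symbols, case-insensitive


def stable_filename(message_id, length=10, suffix=".html"):
    """Derive a filename that depends only on the Message-ID.

    The MD5 digest is read as one big integer and its low-order
    base-36 digits are used, so re-archiving the same mail always
    yields the same name regardless of processing order.
    """
    digest = hashlib.md5(message_id.strip().encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "msg" + "".join(chars) + suffix


if __name__ == "__main__":
    print(stable_filename("<example-1@lists.example.org>"))
```

Two different Message-IDs can still map to the same name, just with very low probability, so a real implementation would need a fallback for that case.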
Re: removing address from archives
>Can you provide more information on what you are trying to achieve? I am trying to achieve web pages where header fields for a message are displayed normally (by default) except for the From: header field, which is displayed without an email address. The goal is to continue thwarting spambots yet still display some information about who sent the mail. Most specifically, I'd like to add a line "From: Earl Hood" just before the Subject: header field in the archived message http://www.mail-archive.com/mhonarc@ncsa.uiuc.edu/msg00569.html Currently, I am excluding the From: field entirely using the following markup in the rcfile. subect date Thanks for any insights, Jeff
removing address from archives
Excluding seems straightforward - EXCS or FIELDORDER will happily exclude a field. But... I don't know how to get the name in there; From: $FROMNAME$ doesn't do the trick since FROMNAME is not available to FIELDSBEG What am I missing? .> 1. Remove the e-mail address from the archives. So instead of showing: .> From: "Matthew Andersen" <[EMAIL PROTECTED]> I could have it show up as: .> From: Matthew Andersen Eliminating the actual address completely. . . You can use EXCS to exclude the field, and then use the $FROMNAME$ . resource variable to specify author.