On 2015-05-25 20:14:50 -0500, David Wright wrote:
> Well, the discussion in these threads has ranged widely over trying to
> speed up the reading of directories and large numbers of files. Every
> so often, I think about what you're doing with that huge directory of
> emails, all 145k of them.

Well, now with the sort by inode, this is mostly OK. It is not
immediate, whether the data are in the cache or not, but it is not too
slow, especially given that I do not make changes very often. Of
course, I would prefer it to be faster, but not at the cost of more
complex code that is less readable and maintainable.

> AIUI, and correct me if I'm wrong, you have to be able to read them
> with a mail client (mutt). You have to check that (all) the header
> lines are correctly formed and that each email has a single unique
> message-id.

Yes, and the first check is a prerequisite for the second one: checking
whether an e-mail message has a single Message-Id makes sense only when
the headers are well-formed.

I have two archive mailboxes: one very big (146k messages currently)
for old messages, and one small (not more than 3k messages) for recent
messages. I use the same script on both. The validation is useful
mainly on the latter one, and I fix messages manually for the *rare*
ones that don't have a single Message-Id; otherwise I could do that
automatically with procmail since I'm using it.

BTW, I also check that all the Message-Ids in the mailbox are distinct.

> Not being conversant with the maildir format, I took a look at
> http://wiki2.dovecot.org/MailboxFormat/Maildir to see how filenames
> are used, and how flags are implemented. I see one also might have to
> be careful about preserving timestamps.

Well, tools that are based on timestamps are poorly designed, at least
for mailboxes that are not incoming mailboxes. Timestamps are not used
in the standard specification:

  http://cr.yp.to/proto/maildir.html
  http://en.wikipedia.org/wiki/Maildir

> Anyway, the questions that pop into my head are things like:
>
> If an email doesn't have a message-id, why not give it one with a
> X-header that you recognise as your own? (You could process duplicates
> similarly.)

I add a Message-Id header, no need for an X-header.

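For illustration only, the checks described above could be sketched in
Python roughly as follows. This is not the actual script: the function
names files_by_inode and check_message_ids are made up for the example,
the directory passed in is assumed to be one maildir subdirectory such
as cur/, and header well-formedness is not verified here.

```python
import os
from collections import defaultdict
from email.parser import BytesHeaderParser


def files_by_inode(directory):
    """Return the paths of regular files in `directory`, sorted by inode."""
    with os.scandir(directory) as it:
        entries = [(e.inode(), e.path) for e in it if e.is_file()]
    return [path for _, path in sorted(entries)]


def check_message_ids(paths):
    """Return (bad, duplicates).

    bad: paths whose header does not contain exactly one Message-Id.
    duplicates: Message-Id values mapped to the list of paths sharing them.
    """
    parser = BytesHeaderParser()  # stops parsing at the end of the header
    seen = defaultdict(list)
    bad = []
    for path in paths:
        with open(path, "rb") as f:
            headers = parser.parse(f)
        ids = headers.get_all("Message-Id", [])
        if len(ids) != 1:
            bad.append(path)
            continue
        seen[str(ids[0]).strip()].append(path)
    duplicates = {mid: ps for mid, ps in seen.items() if len(ps) > 1}
    return bad, duplicates
```

Reading the files in inode order is only a heuristic to reduce seeks on
some file systems; the checks themselves do not depend on the order.
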
> Why not put your X-header as the first line in the file? (In most
> cases, it would be a copy of the original message-id.) Then you only
> have to read one line to get at your X-header/message-id on every
> subsequent occasion that you process the files.

I prefer to keep the full header exactly as it is, if it doesn't need
to be modified. This sometimes gives some information, e.g. when the
Message-Id was added in the chain of servers (though this is not
completely reliable). In any case, to ensure that the Message-Id is
unique, I need to parse the whole header.

> If a header line is malformed, why not fix it up straight away as best
> you can (rather than die), perhaps flagging the fact.

The goal of this script is to check whether this occurs. In practice,
this shouldn't happen, and I prefer to look and fix it manually on the
rare occasions where this occurs. Any fix normally occurs in the
"recent" mailbox, before the messages go to the "old" one. Thus an
error in the "old" one means a serious problem (IIRC this has never
occurred yet).

> BTW I couldn't help being amused by this paragraph in the dovecot wiki:
> "Issues with the specification
>
> Locking
>
> Although maildir was designed to be lockless, Dovecot locks the
> maildir while doing modifications to it or while looking for new
> messages in it. This is required because otherwise Dovecot might
> temporarily see mails incorrectly deleted, which would cause
> trouble. Basically the problem is that if one process modifies the
> maildir (eg. a rename() to change a message's flag), another process
> in the middle of listing files at the same time could skip a file. The
> skipping happens because readdir() system call doesn't guarantee that
> all the files are returned if the directory is modified between the
> calls to it. This problem exists with all the commonly used
> filesystems.
> "

That's one of the problems discussed in another part of the thread.
With a POSIX-compliant file system (thus not like ext3), one can't miss
a file; however, one can get an entry under the old file name, but this
is unavoidable and doesn't necessarily matter (and if it does because
one needs to open the file, one can detect it[*] and re-read the
directory).

[*] i.e. if opening the file fails. To be completely safe, after
opening the file, if this succeeds, one probably needs to check the
inode with fstat() to detect a double rename which has changed the file
(= inode); however, such a double rename shouldn't occur in a maildir,
i.e. a new e-mail message should always get a new filename (otherwise
this would break other things anyway, yielding data loss or data
corruption).

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Archive: https://lists.debian.org/20150526112444.ga32...@ypig.lip.ens-lyon.fr
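
For illustration, the detection described in the footnote [*] could be
sketched in Python as follows. This is only a sketch under assumptions:
open_listed and read_all are hypothetical names, the inode recorded
during the listing is compared with the one reported by fstat() after
opening, and a bounded number of re-reads is an arbitrary choice made
for the example.

```python
import os


def open_listed(directory, name, expected_inode=None):
    """Try to open an entry that a directory listing returned.

    Returns a binary file object, or None if the listing looks stale
    (the name has vanished, or it now refers to a different inode).
    In the None case the caller should re-read the directory.
    """
    try:
        f = open(os.path.join(directory, name), "rb")
    except FileNotFoundError:
        return None  # renamed or removed since the listing
    if expected_inode is not None:
        if os.fstat(f.fileno()).st_ino != expected_inode:
            f.close()
            return None  # same name, but a different file now
    return f


def read_all(directory, max_passes=5):
    """Read every regular file, re-listing the directory on a stale entry."""
    for _ in range(max_passes):
        with os.scandir(directory) as it:
            listing = [(e.name, e.inode()) for e in it if e.is_file()]
        contents = {}
        stale = False
        for name, inode in listing:
            f = open_listed(directory, name, inode)
            if f is None:
                stale = True
                break
            with f:
                contents[name] = f.read()
        if not stale:
            return contents
    raise RuntimeError("directory kept changing while being read")
```
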