Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)

Vincent Lefevre Tue, 26 May 2015 04:25:53 -0700

On 2015-05-25 20:14:50 -0500, David Wright wrote:
> Well, the discussion in these threads has ranged widely over trying to
> speed up the reading of directories and large numbers of files. Every
> so often, I think about what you're doing with that huge directory of
> emails, all 145k of them.


Well, now with the sort by inode, this is mostly OK. It is not
immediate, whether or not the data are in the cache, but it is not
too slow either, mainly because I do not make changes very often. Of
course, I would prefer it to be faster, but not at the cost of more
complex code that is less readable and maintainable.
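
For readers who want to see what the sort by inode amounts to, here
is a minimal sketch in Python (not the Perl script discussed in this
thread). The function name files_by_inode is hypothetical. It lists a
directory and returns the regular files ordered by inode number, on
the assumption that reading them in that order reduces disk seeks on
file systems where inode order roughly follows on-disk layout.

```python
import os

def files_by_inode(directory):
    """Return the paths of regular files in `directory`, sorted by inode."""
    with os.scandir(directory) as entries:
        # On POSIX systems the inode number normally comes from the
        # directory entry itself, so no stat() call is needed to get it.
        found = [(entry.inode(), entry.path)
                 for entry in entries
                 if entry.is_file(follow_symlinks=False)]
    found.sort()  # tuples sort by inode number first
    return [path for _, path in found]
```

The messages would then be opened in the order returned, e.g.
"for path in files_by_inode('Mail/old/cur'): ..." (the path is only
an example).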

> AIUI, and correct me if I'm wrong, you have to be able to read them
> with a mail client (mutt). You have to check that (all) the header
> lines are correctly formed and that each email has a single unique
> message-id.

Yes, and the first check is a consequence of the second one: checking
whether an e-mail message has a single Message-Id makes sense only
when the headers are well-formed.

I have two archive mailboxes: one very big (146k messages currently)
for old messages, and one small (not more than 3k messages) for recent
messages. I use the same script on both (the validation is useful
mainly on the latter one, and I fix messages manually for the *rare*
ones that don't have a single Message-Id, otherwise I could do that
automatically with procmail since I'm using it).

BTW, I also detect that all the Message-Id's in the mailbox are
different.
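
The checks described above can be sketched as follows, again in
Python rather than Perl and only as an illustration under stated
assumptions: the names FIELD_RE, parse_header and check_mailbox are
hypothetical, "well-formed" is taken to mean that every header line
is either a "Name: value" field line or a folded continuation line
starting with a space or tab, and each message is given as an
iterable of lines.

```python
import re

# Field name: printable ASCII characters except ':' (as in RFC 5322).
FIELD_RE = re.compile(r'^([!-9;-~]+):(.*)$')

def parse_header(lines):
    """Return the header as a list of (name, value); raise ValueError if malformed."""
    fields = []
    for number, line in enumerate(lines, 1):
        line = line.rstrip('\r\n')
        if line == '':
            break  # an empty line ends the header
        if line[0] in ' \t':
            # Folded continuation of the previous field.
            if not fields:
                raise ValueError(f'line {number}: continuation before any field')
            name, value = fields[-1]
            fields[-1] = (name, value + line)
            continue
        match = FIELD_RE.match(line)
        if match is None:
            raise ValueError(f'line {number}: malformed header line')
        fields.append((match.group(1), match.group(2)))
    return fields

def check_mailbox(messages):
    """`messages` maps a file name to its lines; return a list of problems."""
    seen = {}      # Message-Id -> first file name where it was found
    problems = []
    for filename, lines in messages.items():
        try:
            fields = parse_header(lines)
        except ValueError as err:
            problems.append(f'{filename}: {err}')
            continue
        ids = [value.strip() for name, value in fields
               if name.lower() == 'message-id']
        if len(ids) != 1:
            problems.append(f'{filename}: {len(ids)} Message-Id fields')
        elif ids[0] in seen:
            problems.append(f'{filename}: duplicate of {seen[ids[0]]}')
        else:
            seen[ids[0]] = filename
    return problems
```

Note that the whole header has to be parsed before a message can be
accepted, since a second Message-Id field could appear anywhere
before the empty line.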

> Not being conversant with the maildir format, I took a look at
> http://wiki2.dovecot.org/MailboxFormat/Maildir to see how filenames
> are used, and how flags are implemented. I see one also might have to
> be careful about preserving timestamps.

Well, tools that are based on timestamps are poorly designed, at least
for mailboxes that are not incoming mailboxes. Timestamps are not used
in the standard specification:
  http://cr.yp.to/proto/maildir.html
  http://en.wikipedia.org/wiki/Maildir

> Anyway, the questions that pop into my head are things like:
> 
> If an email doesn't have a message-id, why not give it one with a
> X-header that you recognise as your own? (You could process duplicates
> similarly.)

I add a Message-Id header, no need for an X-header.

> Why not put your X-header as the first line in the file? (In most
> cases, it would be a copy of the original message-id.) Then you only
> have to read one line to get at your X-header/message-id on every
> subsequent occasion that you process the files.

I prefer to keep the full header exactly as it is, if it doesn't need
to be modified. This sometimes gives some information, e.g. when the
Message-Id was added in the chain of servers (though this is not
completely reliable).

In any case, to ensure that the message-id is unique, I need to parse
the whole header.

> If a header line is malformed, why not fix it up straight away as best
> you can (rather than die), perhaps flagging the fact.

The goal of this script is to check whether this occurs. In practice,
this shouldn't happen, and I prefer to look and fix it manually on
the rare occasions when this occurs. Any fix normally occurs in the
"recent" mailbox, before the messages go to the "old" one. Thus an
error in the "old" one means a serious problem (IIRC this has never
occurred yet).

> BTW I couldn't help being amused by this paragraph in the dovecot wiki:
> "Issues with the specification
> 
>  Locking
> 
>  Although maildir was designed to be lockless, Dovecot locks the
>  maildir while doing modifications to it or while looking for new
>  messages in it. This is required because otherwise Dovecot might
>  temporarily see mails incorrectly deleted, which would cause
>  trouble. Basically the problem is that if one process modifies the
>  maildir (eg. a rename() to change a message's flag), another process
>  in the middle of listing files at the same time could skip a file. The
>  skipping happens because readdir() system call doesn't guarantee that
>  all the files are returned if the directory is modified between the
>  calls to it. This problem exists with all the commonly used
>  filesystems. 
> "

That's one of the problems discussed in another part of the thread.
With a POSIX-compliant file system (thus not like ext3), one can't
miss a file; however one can get an entry under the old file name,
but this is unavoidable and doesn't necessarily matter (and if it
does because one needs to open the file, one can detect it[*] and
re-read the directory).

[*] i.e. if opening the file fails. To be completely secure, after
opening the file, if this succeeds, one probably needs to check the
inode with fstat() to detect a double rename which has changed the
file (= inode); however such a double rename shouldn't occur in a
maildir, i.e. a new e-mail message should always get a new filename
(otherwise this would break other things anyway, yielding data loss
or data corruption).
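
The detect-and-re-read approach from the footnote can be sketched as
follows, in Python and only as an illustration. The function name
open_message and the retries parameter are hypothetical. The sketch
assumes that the part of a maildir file name before the first ':'
stays the same when flags change, and it leaves out the fstat()
inode check, since a double rename shouldn't occur in a maildir.

```python
import os

def open_message(directory, name, retries=3):
    """Open a maildir message whose flags may be changed concurrently."""
    base = name.split(':', 1)[0]  # unique part, before the ":2,<flags>" suffix
    for _ in range(retries):
        try:
            return open(os.path.join(directory, name), 'rb')
        except FileNotFoundError:
            # The entry was renamed (or deleted) after the directory was
            # listed: re-read the directory and look for the same unique part.
            candidates = [n for n in os.listdir(directory)
                          if n.split(':', 1)[0] == base]
            if not candidates:
                raise  # the message is really gone
            name = candidates[0]
    raise FileNotFoundError(name)
```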

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

