On 8/31/2014 2:45 PM, Tim Chase wrote:
Tinkering around with a little script, I found myself with the need
to walk a directory tree and process mail messages found within.
Sometimes these end up being mbox files (with multiple messages
within), sometimes it's a Maildir structure with messages in each
individual file and extra holding directories, and sometimes it's a
MH directory.  To complicate matters, there's also the possibility of
non-{mbox,maildir,mh} files such as binary MUA caches appearing
alongside these messages.

I know nothing about the formats of such files, but I will make a couple of assumptions.

Python knows how to handle each just fine as long as I tell it what
type of file to expect.

By instantiating mailbox.mbox or mailbox.Maildir.

 But is there a straight-forward way to
distinguish them?  (FWIW, the *nix "file" utility is just reporting
"ASCII text", sometimes "with very long lines", and sometimes
erroneously flags them as C or C++ files‽).

All I need is "is it maildir, mbox, mh, or something else" (I don't
have to get more complex for the "something else") inside an os.walk
loop.

Simple method: try to parse with mbox and then with Maildir, and if either succeeds, assume that the file was in the corresponding format.

try:
  <process as mbox>
except mailbox.FormatError:
  try:
    <process as Maildir>
  except mailbox.FormatError:
    pass

If a format is detectable in the first line or two, you could try to write a detect(path) that would return the corresponding class. That would perhaps be a good addition to the mailbox module. On the other hand, if you are interested in just those two classes, and not any of the Maildir subclasses, the above might be good enough. I am assuming that FormatError is raised without reading the whole file and doing a lot else before detecting the mismatch.
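As a rough illustration of what such a detect(path) might look like, here is a minimal sketch. It is not part of the mailbox module, and the heuristics are my assumptions rather than anything the module guarantees: a Maildir is taken to be a directory containing cur/, new/ and tmp/; an MH mailbox a directory with a .mh_sequences file or numerically named message files; and an mbox a regular file that begins with "From ". Anything else yields None.

```python
import mailbox
import os


def detect(path):
    """Guess the mailbox class for path, or return None.

    Hypothetical helper: the checks below are heuristics, not a
    validation of the mailbox contents.
    """
    if os.path.isdir(path):
        # Assumed Maildir signature: cur/, new/ and tmp/ subdirectories.
        if all(os.path.isdir(os.path.join(path, sub))
               for sub in ("cur", "new", "tmp")):
            return mailbox.Maildir
        # Assumed MH signature: a .mh_sequences file, or message files
        # whose names consist only of digits.
        names = os.listdir(path)
        if ".mh_sequences" in names or any(
                name.isdigit() and os.path.isfile(os.path.join(path, name))
                for name in names):
            return mailbox.MH
        return None
    # Assumed mbox signature: the file starts with a "From " separator line.
    try:
        with open(path, "rb") as f:
            if f.read(5) == b"From ":
                return mailbox.mbox
    except OSError:
        pass
    return None
```

Inside an os.walk loop one could call detect() on each directory and file, instantiate the returned class when it is not None, and skip everything else. For a directory recognized as Maildir or MH, it would probably make sense to prune it from further walking so its message files are not examined a second time.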

--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list
