On Sun, Jan 02, 2022 at 01:23:38PM +0100, Jaroslaw Rafa <r...@rafa.eu.org> wrote:

> Dnia  1.01.2022 o godz. 11:40:37 Frank Hwa pisze:
> > 
> > For a multipart message, is text/plain part always in the first
> > location?
> > I just want to extract the plain text body of a message. I use the
> > code below (python), but was not very sure.
> 
> I have a perl script that extracts all text/plain parts from multipart
> messages, up to 5 levels nesting of multipart messages one inside another
> (that level is configurable via a parameter in the script).
> 
> If you want to look at it, it's here: http://rafa.eu.org/media/textconv.pl
> -- 
> Regards,
>    Jaroslaw Rafa
>    r...@rafa.eu.org
> --
> "In a million years, when kids go to school, they're gonna know: once there
> was a Hushpuppy, and she lived with her daddy in the Bathtub."

Another thing that might help is my "textmail" program
which is a mail filter that converts non-text
attachments into text attachments where possible (using
external translation programs), and deletes attachments
that can't be translated to text (like images).

It replaces multipart/alternative parts with the
text/plain part unless that part looks vestigial, in
which case it replaces them with the other alternative
part converted to text. This is often much better than
just grabbing the text/plain part, since that part
might only say something like "Your email client does
not support HTML email". There are a few builtin tests
to identify vestigial text/plain parts, and you can add
new ones if necessary.
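
For illustration only, the same idea can be sketched in
Python. This is not textmail's code (textmail is perl)
and these are not its builtin tests: the hint phrases,
the crude HTML-to-text step, and the names
looks_vestigial() and best_text() are all assumptions
made up for this example.

```python
import email
import email.policy
import re
from html.parser import HTMLParser

# Hypothetical markers of a vestigial text/plain part (NOT textmail's tests).
VESTIGIAL_HINTS = (
    "does not support html",
    "html-capable",
    "view this email in your browser",
)

class _TextExtractor(HTMLParser):
    """Very crude HTML-to-text: keep character data, skip script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()

def looks_vestigial(text):
    """Guess whether a text/plain part is only a placeholder."""
    lowered = text.lower()
    return not text.strip() or any(h in lowered for h in VESTIGIAL_HINTS)

def best_text(msg):
    """Prefer text/plain, but fall back to converted HTML if it looks vestigial.

    msg must have been parsed with policy=email.policy.default so that
    get_body() and get_content() are available.
    """
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    if plain is not None:
        text = plain.get_content()
        if html is None or not looks_vestigial(text):
            return text
    if html is not None:
        return html_to_text(html.get_content())
    return ""
```

Real HTML mail needs a much better converter than this
one, which is why textmail hands that job to external
translation programs.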

It can also save attachments with particular mimetypes.

A command like this does something like what you want:

  cat msg | textmail | textmail -F text/plain -G /path/for/attachments >/dev/null

That performs the default transformations, then saves
all resulting text/plain attachments to a directory,
and discards the resulting mail message.

  https://raf.org/textmail
  https://github.com/raforg/textmail

However, it requires multiple external processes
(textmail/perl itself and the translators), and so
probably only works on UNIX-like systems.

If you need it to be pure Python, and aren't expecting
any vestigial text/plain parts, you could modify your
existing script to recursively examine all parts
looking for text/plain. Something like this:

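    import email

    def get_text_parts(msg):
        """Recursively collect the text of every text/plain part."""
        parts = []
        if msg.is_multipart():
            for part in msg.get_payload():
                parts.extend(get_text_parts(part))
        elif msg.get_content_type() == 'text/plain':
            # decode=True undoes base64/quoted-printable and returns bytes
            payload = msg.get_payload(decode=True)
            charset = msg.get_content_charset() or 'utf-8'
            parts.append(payload.decode(charset, errors='replace'))
        return parts

    # x is the raw message text, as in your existing script
    text_parts = get_text_parts(email.message_from_string(x))
    print('%r' % text_parts)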
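
If you only want the single "main" plain text body
rather than every text/plain part, the standard
library's newer email API may be enough. Below is a
minimal sketch, assuming Python 3.6 or later and a
message parsed with policy=email.policy.default. The
function name first_plain_text() and the sample message
are made up for the example. Note that the sample puts
text/html before text/plain, so it does not rely on
text/plain being in the first location.

```python
import email
import email.policy

def first_plain_text(raw):
    """Return the decoded text/plain body of a raw message string, or None."""
    # policy=default yields an EmailMessage, which provides get_body().
    msg = email.message_from_string(raw, policy=email.policy.default)
    part = msg.get_body(preferencelist=("plain",))
    # get_content() undoes the transfer encoding and decodes the charset.
    return None if part is None else part.get_content()

# Made-up sample: the HTML alternative comes first, the plain one second.
raw = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>hello</p>\n"
    "--XX\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8K\n"
    "--XX--\n"
)
print(first_plain_text(raw))
```

get_body() returns at most one part and skips
attachments, so it is not a replacement for the
recursive version if you need all text/plain parts.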

cheers,
raf
