Re: Filter script to remove html, fullquotes and header lines

2022-03-21 Thread Martin Trautmann

On 2022-03-22 at 01:56, raf wrote:
> On Mon, Mar 21, 2022 at 01:28:28PM +0100, Martin Trautmann wrote:
> > On 2022-03-21 at 12:56, raf wrote:
> > > textmail can probably do at least some of what you want:
> > >
> > >   https://raf.org/textmail
> > >   https://github.com/raforg/textmail
> > >
> > > and it has some extensibility so you can supply external
> > > translation programs for the bits it doesn't do.
> > >
> > > it can operate on individual mail messages or mbox files.
> > > check the output carefully. :-)
> >
> > Thanks, that looks very helpful to strip attachments, to remove headers
> > and to convert message bodies - but it lacks the option to perform a
> > search and replace on message bodies!?
>
> The -C option lets you supply a custom external
> "attachment translation" program. That might help.
> Something like "-C text/plain:txt:cmd". You just need
> to write the "cmd" program in the language of your
> choosing. If it doesn't work, you could modify textmail
> to do what you need (rather than writing the whole
> thing) if you like perl.

Thanks, I didn't expect it there.

I guess some simple sed commands would be good enough here.

> If you try it and run into problems, let me know (off list)
> and we can probably get it to work.

Would I still need mutt to do the mails one by one and pipe them to
textmail?
I expected textmail to work on full mbox files, but it handled the first
mail only.



Re: Filter script to remove html, fullquotes and header lines

2022-03-21 Thread raf
On Mon, Mar 21, 2022 at 01:28:28PM +0100, Martin Trautmann wrote:

> On 2022-03-21 at 12:56, raf wrote:
> > textmail can probably do at least some of what you want:
> > 
> >   https://raf.org/textmail
> >   https://github.com/raforg/textmail
> > 
> > and it has some extensibility so you can supply external
> > translation programs for the bits it doesn't do.
> > 
> > it can operate on individual mail messages or mbox files.
> > check the output carefully. :-)
> 
> Thanks, that looks very helpful to strip attachments, to remove headers
> and to convert message bodies - but it lacks the option to perform a
> search and replace on message bodies!?

The -C option lets you supply a custom external
"attachment translation" program. That might help.
Something like "-C text/plain:txt:cmd". You just need
to write the "cmd" program in the language of your
choosing. If it doesn't work, you could modify textmail
to do what you need (rather than writing the whole
thing) if you like perl.

And it might not work. "Converting" to the same
mimetype might send it into an infinite loop. :-) I
don't think so, but custom translations are done before
the built-in ones, so you might need to run textmail
twice.
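
A custom translator of the kind described above could be sketched in
Python roughly as follows. This is only an illustration: the names
strip_trailing_quote and main, the 10-line default threshold, and above
all the assumption that the command receives the message body on standard
input and writes the result to standard output are not taken from
textmail's documentation, so check how -C actually invokes the command
before relying on it.

```python
import sys


def strip_trailing_quote(body: str, min_lines: int = 10) -> str:
    """Drop a block of '>' lines at the very end of body.

    The block is removed only if it is at least min_lines long;
    otherwise the body is returned unchanged.
    """
    lines = body.splitlines(keepends=True)

    # Ignore blank lines after the quote block.
    end = len(lines)
    while end > 0 and not lines[end - 1].strip():
        end -= 1

    # Walk backwards over the run of quoted lines.
    start = end
    while start > 0 and lines[start - 1].startswith(">"):
        start -= 1

    if end - start < min_lines:
        return body
    return "".join(lines[:start])


def main() -> None:
    # Assumed interface (hypothetical): body on stdin, result on stdout.
    sys.stdout.write(strip_trailing_quote(sys.stdin.read()))

# Call main() at the end of the script when using it as a filter.
```

The sketch leaves the attribution line ("On ... wrote:") in place and
does nothing if a signature follows the quoted block.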

I just had a look at your original post. textmail does
(1) well. It even detects vestigial text alternatives
and doesn't replace html with them. (2) would require a
custom translator (but it wouldn't have access to the
headers so it can't do the "bonus" part (as is)). it
can't delete headers based on their content (as is),
only their names, so (3) can't be done as stated.
however, you can probably identify the names of headers
that are likely to be that long and supply a list of
those names.
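
For comparison, selecting headers by length rather than by name is
fairly simple with Python's standard email package. A minimal sketch,
assuming the default (compat32) parsing policy, under which a folded
header value keeps its embedded line breaks. The names drop_long_headers,
max_lines and PROTECTED are made up for this illustration and have
nothing to do with textmail; the protected list is my own guess at
headers that should never be dropped.

```python
from email.message import Message

# Hypothetical safety list: headers needed to decode the body.
PROTECTED = {"content-type", "content-transfer-encoding", "mime-version"}


def drop_long_headers(msg: Message, max_lines: int = 5) -> None:
    """Remove each header whose folded value spans more than max_lines lines."""
    headers = msg.items()  # (name, value) pairs in original order

    kept = [
        (name, value)
        for name, value in headers
        if name.lower() in PROTECTED or str(value).count("\n") + 1 <= max_lines
    ]

    # "del" removes every header of that name, so clear them all and
    # re-add only the ones we decided to keep, in their original order.
    for name, _ in headers:
        del msg[name]
    for name, value in kept:
        msg[name] = value
```

Parsing a message with email.message_from_string() and passing it to
drop_long_headers() modifies the message in place.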

If you try it and run into problems, let me know (off list)
and we can probably get it to work.

cheers,
raf



Re: Filter script to remove html, fullquotes and header lines

2022-03-21 Thread Mark H. Wood
On Mon, Mar 21, 2022 at 08:46:52AM +1100, Cameron Simpson wrote:
> On 20Mar2022 13:36, Martin Trautmann wrote:
> >do you know about any mutt script that would go from message to message and
> >
> >1) remove an html part if a plain text part is given
> >
> >2) remove all trailing lines,
> >   starting with a quote sign ">"
> >   and at least e.g. 10 occurrences
> >
> >  such as (^>.*[\r\n]){9,} before the end of the message
> >
> >  Maybe I could append xzxzxzx to the end of the message first, delete
> >  a fullquote up to there and remove xzxzxzx again?
> >
> >  Bonus: Do not remove fullquotes for messages without in-reply-to or
> >  references headers.
> >
> >3) remove header lines which are longer than 5 lines
> >
> >I want to shrink the size of some mailboxes for archive purposes,
> >without throwing away too much.
> 
> I think you'll have to write your own.
> 
> At minimum you need a full mail message parser so that you are not 
> filtering, say, base64 or QP content incorrectly. So something which 
> scans a mailbox and for each message:
> - decodes it completely
> - applies your filters
> - assembles the new message
> and write this out to a new mailbox (so it isn't destructive and can be 
> compared to the original - you don't want to accidentally shred your 
> archive).
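
The loop described above can be sketched with Python's standard mailbox
and email modules. This is an illustration under assumptions, not a
finished tool: the names drop_html_alternative and filter_mbox are
invented here, only requirement (1) is implemented (dropping the HTML
part of a multipart/alternative when a text/plain part exists), and no
attempt is made to detect placeholder plain-text parts.

```python
import mailbox
from email.message import Message


def drop_html_alternative(msg: Message) -> None:
    """Remove text/html parts that have a text/plain sibling."""
    for part in msg.walk():
        if part.get_content_type() != "multipart/alternative":
            continue
        if not part.is_multipart():
            continue  # malformed: declared multipart but has no subparts
        subparts = part.get_payload()
        types = [p.get_content_type() for p in subparts]
        if "text/plain" in types and "text/html" in types:
            part.set_payload(
                [p for p in subparts if p.get_content_type() != "text/html"]
            )


def filter_mbox(src_path: str, dst_path: str, filters) -> None:
    """Copy src_path to a new mbox at dst_path, applying each filter.

    The source mailbox is only read, so the result can be compared
    with the original before anything is thrown away.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    dst.lock()
    try:
        for msg in src:
            for apply_filter in filters:
                apply_filter(msg)
            dst.add(msg)
        dst.flush()
    finally:
        dst.unlock()
        dst.close()
        src.close()
```

It might be called as filter_mbox("archive.mbox", "archive.small.mbox",
[drop_html_alternative]), where both file names are placeholders.
Further filters taking a message and modifying it in place can be added
to the list.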

If you want to offload some of the work to existing code, you might
look at things like GNU mailutils, or the tools that come with
maildrop, or some of the subcommands of https://github.com/djcb/mu

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu




Re: Filter script to remove html, fullquotes and header lines

2022-03-21 Thread Martin Trautmann

On 2022-03-21 at 12:56, raf wrote:
> textmail can probably do at least some of what you want:
>
>    https://raf.org/textmail
>    https://github.com/raforg/textmail
>
> and it has some extensibility so you can supply external
> translation programs for the bits it doesn't do.
>
> it can operate on individual mail messages or mbox files.
> check the output carefully. :-)

Thanks, that looks very helpful to strip attachments, to remove headers
and to convert message bodies - but it lacks the option to perform a
search and replace on message bodies!?


Re: Filter script to remove html, fullquotes and header lines

2022-03-21 Thread raf
On Mon, Mar 21, 2022 at 07:18:01AM +0100, Martin Trautmann wrote:

> On 2022-03-20 at 22:46, Cameron Simpson wrote:
> > I think you'll have to write your own.
> 
> I agree - but I hoped it could have been done with some fine tuning of an
> existing script.
> 
> 
> > At minimum you need a full mail message parser so that you are not
> > filtering, say, base64 or QP content incorrectly. So something which
> > scans a mailbox and for each message:
> 
> That's why I wondered about doing it with mutt - it can do all that stuff.
> 
> > - decodes it completely
> > - applies your filters
> > - assembles the new message
> > and write this out to a new mailbox (so it isn't destructive and can be
> > compared to the original - you don't want to accidentally shred your
> > archive).
> I would work on a copy of the mailbox file first, of course.
> 
> > I'd do this in Python myself - it has a good email library and you can
> > do all the things you describe fairly easily with it.
> So Python it is... Never programmed it before myself.
> 
> Thanks,
> Martin

i didn't see the original post but...

textmail can probably do at least some of what you want:

  https://raf.org/textmail
  https://github.com/raforg/textmail

and it has some extensibility so you can supply external
translation programs for the bits it doesn't do.

it can operate on individual mail messages or mbox files.
check the output carefully. :-)

cheers,
raf