Re: [Discuss] Please help with a sed script

E. William Horne Tue, 25 May 2021 15:36:26 -0700

THANK YOU for the scripts.

I apologize for not writing my request more clearly:


1. As with the discuss list, users who subscribe to the Telecom Digest
   mailing list can choose to receive either each email that is sent to
   the mailing list, or a "Digest" version, with all the emails for
   a day concatenated into a single "Digest" email. I receive a copy of
   The Telecom Digest's "Digest" edition, which is sent to me from a
   SYMPA email reflector at iecc.com in New York. The email message I
   used to test the scripts quoted here is at
   
http://telecom.csail.mit.edu/archives/back.issues/recent.single.issues/test-e145.txt
   
   - it is a verbatim copy of the email, taken from my mbox after it
   arrived, with a few edits to help prevent spam.
2. Since some viewers prefer to get the Telecom Digest online, I
   prepare an HTML version of the daily digest email. To do that, I've
   been doing a lot of edits by hand, and I need a more automated
   method. To that end, I'm asking for help to write either a sed or
   awk or whatever-works script, which will convert the daily "Digest"
   email into an HTML page with that day's messages on it.
     * I will write a table-of-contents with the subjects from all the
       emails in it.
     * The email User ID's and addresses are to be removed before
       outputting the table.
     * There are other edits, but they are not nearly as hard as the Table
       of Contents, so I'm asking for help with that.

Today's Telecom Digest Table of Contents looks like this:

Table of contents:

* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at
  just   $10 a month - "John Levine" <[email protected]>
* 2 - Re: [telecom] Cell phone bills too high? Here are some that start at
  just $10  a month - Bill Horne <[email protected]>
* 3 - [telecom] Opinion: CTL is going downhill fast - Moderator
<[email protected]>

I tried the sed script:

sed 's|^\* [0-9]* - \(.*\)\[telecom\] \(.*\) -.*$|<tr><td>\1\2</td></tr>|g' test.txt >t1.txt


The result was:

* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at
  just   $10 a month - "John Levine" <[email protected]>
* 2 - Re: [telecom] Cell phone bills too high? Here are some that start at
  just $10  a month - Bill Horne <[email protected]>
<tr><td>Opinion: CTL is going downhill fast</td></tr>
<[email protected]>

So, it looks like the sed option is going to need some refinement. ;-)

1. I could try testing if the line started with "* 0-9 - [telecom]" and
   ended with ">", and then figure out if there were extra hyphens in it
   and edit it using the last one as a delimiter.
2. If the line started with "* 0-9 - [telecom]", but didn't end with
   ">", then I'd try to write it out to the "hold" buffer, and read in
   the next line to see if /that/ /line/ ended with ">", and if it did,
   I'd like to combine the two lines in the hold area, move the hold
   area to the pattern space, and edit it there as if it were a single
   line.
3. I haven't thought of how to deal with three-line entries yet. I need
   a bigger thinking cap for this.
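
One way to sketch the plan above without the hold buffer is sed's N, P
and D commands: keep appending the next line to the pattern space until
a new "* number - " line shows up, then print the finished entry.
Because it loops, it should also cope with entries that wrap onto three
or more lines. This is only a sketch, and it assumes the input holds
nothing but the table-of-contents entries; toc_join and toc_rows are
made-up names, and the addresses in any test data would be placeholders.

```shell
# Join each wrapped table-of-contents entry onto a single line.
# Assumption: stdin holds only the TOC entries, nothing else.
toc_join() {
  sed '
    :a
    $!{
      N
      /\n *\* [0-9][0-9]* - /!{
        s/\n */ /
        ba
      }
      P
      D
    }
  '
}

# Turn each joined entry into a table row, cutting at the last " - ".
toc_rows() {
  toc_join |
    sed 's|^ *\* [0-9][0-9]* - \(.*\)\[telecom\] \(.*\) - .*$|<tr><td>\1\2</td></tr>|'
}
```

Usage would be something like "toc_rows < toc.txt", where toc.txt is a
hypothetical file holding just the entries.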

I then tried the awk script:

awk  '/^   \* [0-9]* - .*\[telecom\]/{if (NR>1) print ""} {printf $0} END{print ""}' <test-e145.txt

and got this (edited for brevity) output:

: 5783Lines: 144telecom digest Tue, 25 May 2021Table of contents:* 1 - Re: [telecom] Cell phone bills too 
high? Here are some that start at  just   $10 a month - "John Levine" <[email protected]>* 2 
- Re: [telecom] Cell phone bills too high? Here are some that start at  just $10  a month - Bill Horne 
<[email protected]>* 3 - [telecom] Opinion: CTL is going downhill fast - Moderator  
<[email protected]>----------------------------------------------------------------------

Which is better in a way: if awk can produce continuous output, without newline 
characters, then it can probably edit the input as if it were one continuous 
line, which would make things easier. I'll have to find the awk manual and do 
more studying.
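
A guess at why everything came out on one line: the awk pattern starts
with three spaces, but in test-e145.txt the entries seem to begin at the
left margin (the sed attempt above matched with "^\*"), so the pattern
never matched and no newline was ever printed. Here is a hedged sketch
that allows any leading whitespace and only looks between the "Table of
contents:" line and the row of dashes that follows the table. The name
toc_entries is made up, and treating the dashes as the end marker is an
assumption based on the output above.

```shell
# Print one line per table-of-contents entry from a whole digest email.
toc_entries() {
  awk '
    function emit() { if (entry != "") { print entry; entry = "" } }
    /^Table of contents:/ { in_toc = 1; next }    # table starts here
    !in_toc               { next }                # skip what comes before
    /^-+$/                { exit }                # assumed end of the table
                          { sub(/^[ \t]+/, "") }  # drop leading blanks
    /^$/                  { emit(); next }        # blank line ends an entry
    /^\* [0-9]+ - /       { emit(); entry = $0; next }
    entry != ""           { entry = entry " " $0 }  # continuation line
    END                   { emit() }
  '
}
```

Usage would be "toc_entries < test-e145.txt"; each entry then arrives on
a line of its own, ready for a single-line sed substitution.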

My thanks to Mr. Galperin for his help. I need all I can get!

Bill Horne


On 5/25/2021 9:54 AM, Gregory Galperin wrote:

awk  '/^   \* [0-9]* - .*\[telecom\]/{if (NR>1) print ""} {printf $0} END{print ""}' | \
sed 's|^   \* [0-9]* - \(.*\)\[telecom\] \(.*\) - .*$|<tr><td>\1\2</td></tr>|g'

notes:
  * I assumed the 3 spaces before the * were part of the data (rather than
    just formatting by you in this particular email)
  * other than that, whitespace is considered unimportant and left alone,
    since html doesn't care.  if you want to squeeze redundant whitespace,
    | tr -s ' '
  * hyphens (and even the string " - ") can be in the subject, no problem
  * if the string " - " is in the free text part of the email name, then any
    part of that free text before the " - " is considered as part of the subject
  * if the subject has the string [telecom] in it a second time or more,
    only the last [telecom] gets eaten -- so e.g. a subject
        Re: [telecom] Why do all subject lines have [telecom] at the front?
    becomes
        Re: [telecom] Why do all subject lines have at the front?
  * the string [telecom] can be in the email field, no problem
    (but note that if the email field has both [telecom] and a " - " after it,
     the " - " makes everything before that show up in the subject, and then
     in the subject only the last [telecom] before the " - " gets eaten)
  * on the off chance the wrapping breaks a subject so that the continuation
    starts with a *, has one space and then a number and then " - " and has
    the string [telecom] somewhere on that line, this will consider that
    continuation to instead be a new message.
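
The [telecom] caveat in the notes above comes from ".*" being greedy. A
sketch of one way around it is to split the job into three smaller
substitutions, so that the first [telecom] is the one removed while the
last " - " is still the cut point. It assumes each entry has already
been joined onto one line, and subject_only is a made-up name. The
caveat about " - " inside the sender's name still applies.

```shell
# Reduce a joined TOC entry to its subject text.
subject_only() {
  # 1. strip the "* N - " prefix
  # 2. cut everything from the last " - " onward (the sender)
  # 3. remove only the first "[telecom] " tag
  sed -e 's|^ *\* [0-9][0-9]* - ||' \
      -e 's|\(.*\) - .*$|\1|' \
      -e 's|\[telecom\] ||'
}
```

With this, a subject that mentions [telecom] a second time keeps that
second mention.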

maybe try it on a couple months of digests and look through the results?

--grg


On Tue, May 25, 2021 at 02:21:53AM -0400, Bill Horne wrote:

Thanks for reading this: I appreciate your time.

I'm the Moderator of The Telecom Digest, which is the oldest e-zine on the
Internet.

The readers send in pointers to articles of interest, and each day, other
readers who subscribe with the "digest" option receive an email with all
the previous day's stories.

Here's the table-of-contents from a typical day:

    * 1 - [telecom] Can robocalls be tracked? - "bob prohaska"
    <[email protected]>
    * 2 - Re: [telecom] Can robocalls be tracked? - Bill Horne
       <[email protected]>
    * 3 - [telecom] Verizon Media debuts ad-targeting solution without
    identifiers
       - Moderator<[email protected]>

And here's what I'd like to change it to, using (if possible) sed:

        (tr)(td)Can robocalls be tracked?(/td)(/tr)
        (tr)(td)Re: Can robocalls be tracked?(/td)(/tr)
        (tr)(td)Verizon Media debuts ad-targeting solution without
        identifiers(/td)(/tr)

        ("less-than" and "greater-than" symbols have been changed to
        parens here for obvious reasons.)

Things to note:

1. The Subject lines vary in length, and may contain hyphens.
2. The name and email of the contributor is also published with the
    actual post, further on in each digest, so it doesn't have to appear
    in the Table of Contents.
3. The "m" option of sed, which the manual says will do a multi-line
    "s" command, doesn't appear to work on the OS I'm using, which is
    Ubuntu 16 LTS.
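
A note on point 3: as far as I know, the "M" flag in GNU sed only
changes where ^ and $ can match inside a pattern space that already
holds more than one line. sed still reads one line per cycle, so a
substitution can't see the next line until something like N pulls it
in. A small sketch of the difference (the function names are made up):

```shell
# Without N, sed sees one line at a time, so "\n" never matches.
join_without_n() { sed 's/at\n *just/at just/'; }

# With N, the next line is appended first, so the newline is visible.
join_with_n() { sed 'N;s/at\n *just/at just/'; }
```

Feeding both the two lines "start at" and "  just $10" should leave
them untouched in the first case and joined in the second.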

Up until now, I've been doing this change every day, partly with emacs macros
and the rest by hand. I want to automate a lot more of the daily work, so I'm
hoping that there's a way to get Linux sed to do that. I don't need sed per
se: if awk or some other utility would be a better choice, please tell me
about that possible solution instead.

Thank you again.

Bill

--
Bill Horne

_______________________________________________
Discuss mailing list
[email protected]
http://lists.blu.org/mailman/listinfo/discuss
