Re: I HATE SPAM (was Re: Mouse swapping on a laptop)

Kevin D. Clark Mon, 04 Aug 2003 05:56:32 -0700

I vote that we:

1:  Not require people to strip email addresses from the headers and
    body of posts.  This is too much work.  Humans shouldn't have to
    do this.


2:  Keep the archive going, keep it world-accessible

3:  Obfuscate email addresses in the web-archives.  Many, many lists
    that I am on do this already -- it makes sense.


WRT (3) above, I propose that we either configure our web-archiver to
do this (if it already does this), or else we just do this ourselves.
To this end, I have written (***) a Perl filter that does this
automagically.  This filter could be used to filter our existing
archives, or it could be used from a procmail script inside the
web-archiving box in order to obfuscate all future postings.

*** This is basically Jeff Friedl's well-known email regexp, combined
    with around 3 lines of my own code at the bottom.

Regards,

--kevin
-- 
"Some people, when confronted with a problem, think ``I know,
 I'll use regular expressions.'' Now they have two problems."

      -- Jamie Zawinski

#!/usr/bin/perl

# Program to build a regex to match an internet email address,
# from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
# (http://www.ora.com/catalog/regexp/)
#
# Optimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.
#



# Some things for avoiding backslashitis later on.
$esc        = '\\\\';               $Period      = '\.';
$space      = '\040';               $tab         = '\t';
$OpenBR     = '\[';                 $CloseBR     = '\]';
$OpenParen  = '\(';                 $CloseParen  = '\)';
$NonASCII   = '\x80-\xff';          $ctrl        = '\000-\037';
$CRlist     = '\n\015';  # note: this should really be only \015.

# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/;               # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/;  # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character

##############################################################################
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of 
nesting.
$ctext   = qq< [^$esc$NonASCII$CRlist()] >;

# $Cnested matches one non-nested comment.
# It is unrolled, with normal of $ctext, special of $quoted_pair.
$Cnested = qq<
   $OpenParen                            #  (
      $ctext*                            #     normal*
      (?: $quoted_pair $ctext* )*        #     (special normal*)*
   $CloseParen                           #                       )
>;

# $comment allows one level of nested parentheses
# It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
$comment = qq<
   $OpenParen                              #  (
       $ctext*                             #     normal*
       (?:                                 #       (
          (?: $quoted_pair | $Cnested )    #         special
           $ctext*                         #         normal*
       )*                                  #            )*
   $CloseParen                             #                )
>;

##############################################################################

# $X is optional whitespace/comments.
$X = qq<
   [$space$tab]*                    # Nab whitespace.
   (?: $comment [$space$tab]* )*    # If comment found, allow more spaces.
>;



# Item 10: atom
$atom_char   = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
  $atom_char+    # some number of atom characters...
  (?!$atom_char) # ..not followed by something that could be part of an atom
>;

# Item 11: doublequoted string, unrolled.
$quoted_str = qq<
    \"                                     # "
       $qtext *                            #   normal
       (?: $quoted_pair $qtext * )*        #   ( special normal* )*
    \"                                     #        "
>;

# Item 7: word is an atom or quoted string
$word = qq<
    (?:
       $atom                 # Atom
       |                       #  or
       $quoted_str           # Quoted string
     )
>;

# Item 12: domain-ref is just an atom
$domain_ref  = $atom;

# Item 13: domain-literal is like a quoted string, but [...] instead of  "..."
$domain_lit  = qq<
    $OpenBR                            # [
    (?: $dtext | $quoted_pair )*     #    stuff
    $CloseBR                           #           ]
>;

# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain  = qq<
  (?:
    $domain_ref
    |
    $domain_lit
   )
   $X # optional trailing comments
>;

# Item 6: domain is a list of subdomains separated by dots.
$domain = qq<
     $sub_domain
     (?:
        $Period $X $sub_domain
     )*
>;

# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
$route = qq<
    \@ $X $domain
    (?: , $X \@ $X $domain )*  # additional domains
    :
    $X # optional trailing comments
>;

# Item 6: local-part is a bunch of $word separated by periods
$local_part = qq<
    $word $X
    (?:
        $Period $X $word $X # additional words
    )*
>;

# Item 2: addr-spec is [EMAIL PROTECTED]
$addr_spec  = qq<
  $local_part \@ $X $domain
>;

# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[
    < $X                 # <
       (?: $route )?     #       optional route
       $addr_spec        #       address spec
    >                    #                 >
];


# Item 3: phrase........
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab

# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
   qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;

# We've worked it so that $word, $comment, and $quoted_str to not consume trailing $X
# because we take care of it manually.
$phrase = qq<
   $word                        # leading word
   $phrase_char *               # "normal" atoms and/or spaces
   (?:
      (?: $comment | $quoted_str ) # "special" comment or quoted string
      $phrase_char *            #  more "normal"
   )*
>;

## Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq<
    $X                                  # optional leading comment
    (?:
            $addr_spec                  # address
            |                             #  or
            $phrase  $route_addr      # name and address
     )
>;



###########################################################################
# Here's a little snippet to test it.
# Addresses given on the commandline are described.
#

while (<>) {
  s{($X)                                 # optional leading comment
    (?:
            $addr_spec                   # address
            |                            #  or
            ($phrase)  $route_addr       # name and address
     )}{$1 $2 <EMAIL-HIDDEN>}gxo;
  print;
}

Re: I HATE SPAM (was Re: Mouse swapping on a laptop)

Reply via email to