Re: [HACKERS] Re: [BUGS] BUG #5021: ts_parse doesn't recognize email addresses with underscores

Bruce Momjian Fri, 12 Mar 2010 19:16:48 -0800

Steve Atkins wrote:
> 
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
> 
> > Bruce Momjian <[email protected]> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt). Wikipedia says these are
> >> all valid characters:
> > 
> >>    http://en.wikipedia.org/wiki/E-mail_address
> > 
> >>    * Uppercase and lowercase English letters (a-z, A-Z)
> >>    * Digits 0 to 9
> >>    * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >>    * Character . (dot, period, full stop) provided that it is not the
> >>      first or last character, and provided also that it does not appear two
> >>      or more times consecutively.
> > 
> > That's an awful lot of special characters.  For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address.  I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> > 
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
> 
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
> 
> I've found that /([a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without being
> too prone to grab things that just look vaguely like email addresses.
> Obviously there are some things it'll match that aren't email addresses,
> and some email addresses it won't match, but for indexing it's been really
> pretty good when combined with a good regex for domain parts[1].
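
For readers who want to try the local-part pattern quoted above, here is a
minimal Python sketch. It is only an illustration under stated assumptions:
the pattern is rewritten as a raw Python string with a balanced capture
group, the doubled backslashes are assumed to come from string-literal
escaping, matching is case-sensitive as written, and the helper name
find_local_parts and the sample text are hypothetical. It has nothing to do
with how the tsearch parser itself is implemented.

```python
import re

# Local-part heuristic from the message above, as a raw string.
# Assumption: the original was quoted inside a string literal, so
# \" and \\s are unescaped here to " and \s.
LOCAL_PART = re.compile(r'([a-z0-9_][^<"@\s]{0,80})@')


def find_local_parts(text):
    """Return candidate local parts (the text before each '@') in free text."""
    return [m.group(1) for m in LOCAL_PART.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the thread.
    sample = 'mail bob_smith@example.com or "jane+list@example.org" today'
    print(find_local_parts(sample))
```

The first character class keeps a candidate from starting on punctuation,
and the negated class stops at whitespace, quotes, and angle brackets, so
the match cannot run backwards into neighbouring words. That is the
"recognize, don't validate" trade-off described above: a plus sign is picked
up because it is simply not excluded, not because the RFC allows it.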


OK, based on your experience, I think we have gone far enough by
allowing underscores.  I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate.  Look what happens when a dash follows an underscore:

        test=> select ts_parse('default', ' [email protected] '   );
              ts_parse
        ---------------------
         (12," ")
         (4,[email protected])
         (12," ")
        (3 rows)
        
        test=> select ts_parse('default', ' [email protected] '   );
            ts_parse
        -----------------
         (12," ")
         (16,a-b)
         (11,a)
         (12,-)
         (11,b)
         (12,-_)
         (4,[email protected])
         (12," ")
        (8 rows)

-- 
  Bruce Momjian  <[email protected]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	14 Aug 2009 14:53:20 -0000	1.53
--- doc/src/sgml/textsearch.sgml	13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
      languages, token types <literal>word</> and <literal>asciiword</>
      should be treated alike.
     </para>
+ 
+    <para>
+     <literal>email</> does not support all valid email characters as
+     defined by RFC 5322.  Specifically, the only non-alphanumeric
+     characters supported for email user names are period, dash, and
+     underscore.
+    </para>
    </note>
  
    <para>
