Steve Atkins wrote:
>
> On Mar 12, 2010, at 5:18 PM, Tom Lane wrote:
>
> > Bruce Momjian <br...@momjian.us> writes:
> >> Well, I think the big question is whether we need to honor RFC 5322
> >> (http://www.rfc-editor.org/rfc/rfc5322.txt).  Wikipedia says these are
> >> all valid characters:
> >
> >>     http://en.wikipedia.org/wiki/E-mail_address
> >
> >>     * Uppercase and lowercase English letters (a-z, A-Z)
> >>     * Digits 0 to 9
> >>     * Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
> >>     * Character . (dot, period, full stop) provided that it is not the
> >>       first or last character, and provided also that it does not appear
> >>       two or more times consecutively.
> >
> > That's an awful lot of special characters.  For the RFC's purposes,
> > it's not hard to be flexible because in an email message there is
> > external context telling where to expect an address.  I think if we
> > tried to allow all of those in email addresses in tsearch, we'd have
> > "email addresses" gobbling up a whole lot of adjacent text, to nobody's
> > benefit.
> >
> > I can see the case for adding "+" because that's fairly common as Alvaro
> > notes, but I think we should be very circumspect about going farther.
>
> I've been working with recognizing email addresses in text for
> years, with many millions of documents processed. Recognizing
> them in text is a very different problem to validating them or sanitizing
> them. Using the RFC spec to match things that "might be an email
> address" isn't a great idea in the wild, so +1 on the circumspect.
>
> I've found that /[a-z0-9_][^<\"@\\s]{0,80})@/ is good at finding local parts
> of "real" email addresses in free text in the wild, without being
> too prone to grab things that just look vaguely like email addresses.
> Obviously there are some things it'll match that aren't email addresses,
> and some email addresses it won't match, but for indexing it's been
> really pretty good when combined with a good regex for domain parts[1].
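For anyone who wants to try the quoted heuristic outside of tsearch, here is
a minimal Python sketch. It rests on several assumptions: the posted pattern
looks like it lost the opening parenthesis of its capture group, so that is
restored here; case-insensitive matching is assumed; and since the
domain-part regex from footnote [1] is not included in this message, DOMAIN
below is a hypothetical, deliberately simple stand-in. The names LOCAL_PART,
DOMAIN, ADDRESS_RE and find_addresses are made up for illustration.

```python
import re

# Local-part pattern from the quoted message. Assumption: the opening
# parenthesis of the capture group was lost in the post, so it is restored.
LOCAL_PART = r'([a-z0-9_][^<"@\s]{0,80})@'

# Hypothetical placeholder for the domain part; this is NOT the regex
# referred to by footnote [1], just two or more dot-separated labels.
DOMAIN = r'([a-z0-9-]+(?:\.[a-z0-9-]+)+)'

# Assumption: matching is case-insensitive.
ADDRESS_RE = re.compile(LOCAL_PART + DOMAIN, re.IGNORECASE)


def find_addresses(text):
    """Return (local_part, domain) pairs for address-like strings in text."""
    return [(m.group(1), m.group(2)) for m in ADDRESS_RE.finditer(text)]


if __name__ == "__main__":
    sample = 'Contact joe+list@example.com or "Jane" <jane_doe@example.org>.'
    for local, domain in find_addresses(sample):
        print(local, domain)
```

Note that the local part is allowed to contain "+" and most other
punctuation after its first character, which is exactly the looseness
described above: fine for indexing, not a validator.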

OK, based on your experience, I think we have gone far enough by
allowing underscores.  I have applied the attached patch to document
what symbols we do allow.

Just for thrills, I want to point out that even the description is not
accurate.  Look what happens when a dash follows an underscore:

    test=> select ts_parse('default', ' a-...@yahoo.com ' );
          ts_parse
    ---------------------
     (12," ")
     (4,a-...@yahoo.com)
     (12," ")
    (3 rows)

    test=> select ts_parse('default', ' a-b...@yahoo.com ' );
        ts_parse
    -----------------
     (12," ")
     (16,a-b)
     (11,a)
     (12,-)
     (11,b)
     (12,-_)
     (4,c...@yahoo.com)
     (12," ")
    (8 rows)

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.53
diff -c -c -r1.53 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	14 Aug 2009 14:53:20 -0000	1.53
--- doc/src/sgml/textsearch.sgml	13 Mar 2010 03:03:24 -0000
***************
*** 1943,1948 ****
--- 1943,1955 ----
       languages, token types <literal>word</> and <literal>asciiword</>
       should be treated alike.
      </para>
+
+     <para>
+      <literal>email</> does not support all valid email characters as
+      defined by RFC 5322.  Specifically, the only non-alphanumeric
+      characters supported for email user names are period, dash, and
+      underscore.
+     </para>
     </note>

    <para>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers