Thanks for having a look at this bug.
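According to section 12.8.2 of the PostgreSQL manual, ts_parse is
supposed to recognize different types of tokens, one of which (tokid 4)
is an email address. The "tokid = 4" condition in my queries therefore
selects tokens of type email, not the fourth token in the string.
The list of token types recognized by the default parser can be
retrieved with this query: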
SELECT * FROM ts_token_type('default');
The example in the bug I reported is a valid email address according to
the RFC, but it isn't recognized as such by the full text search in
postgres. This bug will have a real impact on anybody using the ts
functions to locate email addresses, as only some of them are found by
the query.
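To see which token type the parser assigns to each piece of the input,
rather than relying on the numeric tokid alone, something along these
lines should work. This is only a sketch: the address is a made-up
placeholder, not the one from my report, and I have left out the output
since it depends on the parser version.

```sql
-- Join each parsed token to its token type so the alias is visible.
-- 'first_last@example.com' is a hypothetical placeholder address.
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('default', 'first_last@example.com') AS p
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
```

If the whole address comes back as a single row with alias "email", the
parser recognized it; if it comes back as several rows of other types,
it was split.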
Regards
Dan
On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <[email protected]> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <[email protected]> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference: 5021
>> Logged by: Dan O'Hara
>> Email address: [email protected]
>> PostgreSQL version: 8.3.7
>> Operating system: win32
>> Description: ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' [email protected] ' )
>> where tokid = 4
>>
>> ts_parse returns [email protected] rather than [email protected]. It seems
>> that any text prior to the underscore is truncated. If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' [email protected] ' )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> Section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses. If you do:
>
> select token from ts_parse('default', 'a_b');
>
> ...you get three tokens. In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>
--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
[email protected]
613 288-8733
--
Sent via pgsql-bugs mailing list ([email protected])