On Tue, Jan 03, 2012 at 06:04:23PM +0000, [email protected] wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      6375
> Logged by:          Valentine Gogichashvili
> Email address:      [email protected]
> PostgreSQL version: 9.1.1
> Operating system:   Debian 4.4.5-8
> Description:        
> 
> Hello, 
> 
> The default tsearch parser does not recognize all valid email addresses;
> it treats them as plain text and splits them into tokens.
> 
> For example:
> 
> postgres=# select to_tsquery('simple', '[email protected]' );
>      to_tsquery     
> ────────────────────
>  '[email protected]'
> (1 row)
> 
> here it behaves ok;
> 
> postgres=# select to_tsquery('simple', '[email protected]' );
>         to_tsquery        
> ──────────────────────────
>  '[email protected]'
> (1 row)
> 
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
> 
> postgres=# select to_tsquery('simple', '[email protected]');
>                                   to_tsquery
> ───────────────────────────────────────────────────────────────────────────────
>  'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
> 
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
> 
> Other valid email characters that are not handled correctly include at
> least '+' and '.'

Yep.  :-(

You can see the oddness here:

        test=> SELECT alias, description, token FROM ts_debug('[email protected]');
         alias |  description  |      token
        -------+---------------+------------------
         blank | Space symbols | -
         email | Email address | [email protected]
        (2 rows)
        
        test=> SELECT alias, description, token FROM ts_debug('[email protected]');
         alias |  description  |       token
        -------+---------------+-------------------
         blank | Space symbols | -
         email | Email address | [email protected]
        (2 rows)
        
        test=> SELECT alias, description, token FROM ts_debug('[email protected]');
              alias      |           description           |   token
        -----------------+---------------------------------+-----------
         blank           | Space symbols                   | -
         asciihword      | Hyphenated word, all ASCII      | myna-me
         hword_asciipart | Hyphenated word part, all ASCII | myna
         blank           | Space symbols                   | -
         hword_asciipart | Hyphenated word part, all ASCII | me
         blank           | Space symbols                   | -@
         host            | Host                            | gmail.com
        (7 rows)

The first and second show that the leading dash is split off as a
separate token.  The third one shows that a trailing dash causes the
middle dash to be split off as well, so the address is no longer
recognized as an email at all.
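
To illustrate why that split leads to the "super-set" matches described
in the report, here is a toy model in Python.  It is only a sketch, not
the PostgreSQL parser: it assumes a query built with '&' matches any
document that contains every query lexeme.  QUERY is copied from the
to_tsquery output in the report; the document lexeme sets and the
matches() helper are hypothetical.

```python
# Toy model of AND-matching between a tsquery and a document's lexemes.
# This is NOT the PostgreSQL parser; it only mimics '&' semantics.

# Lexemes copied from the to_tsquery output quoted in the bug report.
QUERY = {"not-normal-with-dash", "not", "normal", "with", "dash", "email.com"}


def matches(document_lexemes, query_lexemes):
    """Return True when every query lexeme occurs in the document."""
    return query_lexemes <= document_lexemes


# Hypothetical document containing exactly the searched-for lexemes.
doc_same = set(QUERY)

# Hypothetical document whose lexemes are a superset of the query,
# as could come from a different, longer piece of text.
doc_superset = QUERY | {"extra"}

# Hypothetical document holding one intact email lexeme.
doc_intact = {"someone@email.com"}

print(matches(doc_same, QUERY))      # True
print(matches(doc_superset, QUERY))  # True, although the text differs
print(matches(doc_intact, QUERY))    # False
```

When the address stays a single email lexeme, only an exact lexeme
match succeeds; once it is broken into words, any document containing
all of those words also matches.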

This email thread from 2010 has a similar problem:

        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What holds back a fix is that it would change existing behavior and
would break existing indexes carried over by pg_upgrade.

I have added your email to the existing TODO item:

        http://wiki.postgresql.org/wiki/Todo#Text_Search

        Improve handling of dash and plus signs in email address user names, and
        perhaps improve URL parsing
        
            http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
            tsearch does not recognize all valid emails 

-- 
  Bruce Momjian  <[email protected]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-bugs mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
