Re: [HACKERS] english parser in text search: support for multiple words in the same position

Sushant Sinha Thu, 06 Jan 2011 07:45:01 -0800

I am not sure whether this mail got lost in between or simply went unnoticed!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
> Just a reminder that this patch is discussing how to break urls, emails,
> etc. into their components.
> 
> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <[email protected]> wrote:
>         [ sorry for not responding on this sooner, it's been hectic the last
>          couple weeks ]
>         
>         Sushant Sinha <[email protected]> writes:
>         
>         >> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
>         >> inventing a brand new mechanism to do something the ts parser can
>         >> already do.  Why didn't you code the url-part mechanism using the
>         >> existing support for compound words?
>         
>         > I am not familiar with compound word implementation and so I am not sure
>         > how to split a url with compound word support. I looked into the
>         > documentation for compound words and that does not say much about how to
>         > identify components of a token.
>         
>         IIRC, the way that that works is associated with pushing a sub-state
>         of the state machine in order to scan each compound-word part.  I don't
>         have the details in my head anymore, though I recall having traced
>         through it in the past.  Look at the state machine actions that are
>         associated with producing the compound word tokens and sub-tokens.
>


I did look around for compound word support in Postgres. In particular,
I read the documentation and the code in tsearch/spell.c, which seems to
implement the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word carries the
prefix/suffix flags that apply to it

2. Specify an affix file that defines the prefix/suffix operations for
those flags

3. Flag z indicates that a word in the dictionary can participate in
compound word splitting

4. When a token matches words specified in the dictionary (after
applying prefix/suffix operations), the matching words are emitted as
sub-words of the token (i.e., of the compound word)
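
As a rough illustration of the four steps above, here is a toy sketch in
Python. It is not the code in tsearch/spell.c: the word list, the function
name split_compound and the omission of prefix/suffix handling are all
assumptions made only to show the idea that a token is split solely into
pieces that already exist in the dictionary.

```python
# Toy model of dictionary-driven compound splitting (NOT the spell.c code).
# COMPOUND_WORDS stands in for dictionary entries that carry the compound
# flag; the entries are made up for illustration.
COMPOUND_WORDS = {"fot", "boll", "klubb"}


def split_compound(token, words=COMPOUND_WORDS):
    """Return the parts of token if it is fully covered by known words,
    otherwise None. Prefix/suffix operations are left out of this sketch."""
    if token == "":
        return []
    for end in range(1, len(token) + 1):
        head = token[:end]
        if head in words:
            rest = split_compound(token[end:], words)
            if rest is not None:
                return [head] + rest
    return None
```

With this word list, split_compound("fotbollklubb") yields the three known
parts, while a token made of words that are not listed, such as
"examplehost", yields None.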

If my above understanding is correct, then I think it will not be
possible to implement url/email splitting using the compound word
support.

The main reason is that compound word support requires a
"PRE-DETERMINED" dictionary of words. So to split a url/email we would
need to provide a list of *all possible* host names and user names. I do
not think that is feasible.
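
For contrast, splitting a url/email into components needs no word list at
all, only the delimiters inside the token. The sketch below is a
hypothetical Python illustration of that point; the function name
split_on_delimiters and the choice of delimiters are assumptions, and it is
not how the patch or the ts parser implements it.

```python
import re


def split_on_delimiters(token):
    """Split a url/email-like token on runs of non-alphanumeric characters.
    No dictionary is consulted, so unseen host and user names still split."""
    return [part for part in re.split(r"[^0-9A-Za-z]+", token) if part]
```

For example, split_on_delimiters("sushant@example.com") gives the user name
and the two host name parts even though none of them appears in any
dictionary.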

Please correct me if I have misunderstood something.

-Sushant. 



