Do not know if this mail got lost in between or no one noticed it!

On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote:
Just a reminder that this patch is discussing how to break url, emails etc into its components.

> On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
> couple weeks ]
>
> Sushant Sinha <sushant...@gmail.com> writes:
> >> I looked at this patch a bit. I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do. Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
>
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part. I don't
> have the details in my head anymore, though I recall having traced
> through it in the past. Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.
I did look around for compound word support in postgres. In particular, I read the documentation and the code in tsearch/spell.c that seems to implement it. As I understand it, it works like this:

1. Specify a dictionary of words in which each word carries its applicable prefix/suffix flags.
2. Specify a flag file that defines the prefix/suffix operations for those flags.
3. Flag z indicates that a word in the dictionary can participate in compound word splitting.
4. When a token matches words in the dictionary (after applying the prefix/suffix operations), the matching words are emitted as sub-words of the token (i.e., of the compound word).

If my understanding above is correct, then I think it will not be possible to implement url/email splitting using the compound word support. The main reason is that compound word support requires a "PRE-DETERMINED" dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is feasible.

Please correct me if I have misunderstood something.

-Sushant.
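
To illustrate the point about pre-determined dictionaries, here is a toy sketch in Python. It is not the logic in tsearch/spell.c or in the patch. The word set, the function names and the delimiter set are all made up for illustration, and affix handling is left out. It only shows that a dictionary-driven splitter returns nothing for parts it has never seen, while a delimiter-driven splitter needs no word list.

```python
import re

# Hypothetical toy dictionary; a real ispell dictionary would also carry
# prefix/suffix flags and a compound flag per word.
COMPOUND_WORDS = {"foot", "ball"}


def split_compound(token, words=COMPOUND_WORDS):
    """Return dictionary words that concatenate to `token`, or None.

    Simplified stand-in for dictionary-based compound splitting:
    every part must already be present in `words`.
    """
    if token == "":
        return []
    for end in range(1, len(token) + 1):
        head = token[:end]
        if head in words:
            rest = split_compound(token[end:], words)
            if rest is not None:
                return [head] + rest
    return None


def split_by_delimiters(token):
    """Split a url/email on structural delimiters (assumed set: @ . / :).

    No dictionary is consulted, so unknown host and user names are fine.
    """
    return [part for part in re.split(r"[@./:]+", token) if part]


if __name__ == "__main__":
    print(split_compound("football"))
    print(split_compound("pgsql-hackers@postgresql.org"))
    print(split_by_delimiters("pgsql-hackers@postgresql.org"))
```

The dictionary approach would only split the address if every user name and host name were listed in advance, which is the problem described above.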