> On 08/01/2010 08:04 PM, Sushant Sinha wrote:
> > 1. We do not have separate tokens "wikipedia" and "org"
> > 2. If we have the two tokens we should have them at adjacent position so
> > that a phrase search for "wikipedia org" should work.
>
> This would needlessly increase the number of tokens. Instead you'd
> better make it work like compound word support, having just "wikipedia"
> and "org" as tokens.

The current text parser already returns url and url_path tokens, which
already increases the number of unique tokens. I am only asking that
normal English words be added as well, so that someone who types only
"wikipedia" gets a match.

> Searching for "wikipedia.org" or "wikipedia org" should then result in
> the same search query with the two tokens: "wikipedia" and "org".

People have previously expressed the need to index URLs and emails, and
the text parser currently does so. Reverting that would be a regression
in functionality. Further, a ranking function can take advantage of a
direct match on a token.

> > position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)
>
> IMO the differentiation between WORDs and URLs is not something the text
> search engine should have to take care a lot. Let it just do the
> searching and make it do that well.

The Postgres English parser already emits URLs as tokens. The only thing
I am asking for is improved tokenization and positioning.

> What does a token "wikipedia.org/search?q=sushant" buy you in terms of
> text searching? Or even result highlighting? I wouldn't expect anybody
> to want to search for a full URL, do you?

This need has been expressed in the past. An exact token match can also
enable better ranking. For example, a tf-idf ranking will rank a match on
such a rare token significantly higher.

-Sushant.

> Regards
>
> Markus Wanner

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
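
For illustration only, here is a minimal Python sketch of the positioning discussed in this thread. It is not Postgres parser code: `tokenize_url` and `phrase_match` are hypothetical names, and the scheme is an assumption based on the "position 0: WORD(wikipedia), URL(...)" example, with the words inside the URL placed at consecutive positions. It ignores url_path tokens for brevity.

```python
import re
from typing import Dict, List, Set, Tuple

# (position, token type, token text); a hypothetical representation
Token = Tuple[int, str, str]


def tokenize_url(url: str) -> List[Token]:
    """Emit the full URL at position 0, plus each word in it at
    consecutive positions starting from 0 (assumed scheme)."""
    tokens: List[Token] = [(0, "URL", url)]
    for pos, word in enumerate(re.findall(r"[A-Za-z0-9]+", url)):
        tokens.append((pos, "WORD", word.lower()))
    return tokens


def phrase_match(tokens: List[Token], phrase: str) -> bool:
    """Return True if the words of the phrase occur as WORD tokens
    at adjacent, increasing positions."""
    words = phrase.lower().split()
    if not words:
        return False
    positions: Dict[str, Set[int]] = {}
    for pos, kind, text in tokens:
        if kind == "WORD":
            positions.setdefault(text, set()).add(pos)
    return any(
        all(start + i in positions.get(w, set()) for i, w in enumerate(words))
        for start in positions.get(words[0], set())
    )


tokens = tokenize_url("wikipedia.org/search?q=sushant")
print(phrase_match(tokens, "wikipedia org"))  # adjacent positions 0 and 1
print(phrase_match(tokens, "org wikipedia"))  # wrong order, no match
```

With this layout a search for "wikipedia" alone, the phrase "wikipedia org", and the exact URL can all match the same document, and the exact URL token stays available to a ranking function.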