Re: [GENERAL] HTML tags and tsearch2

Oleg Bartunov Thu, 26 Jun 2008 05:06:54 -0700

On Thu, 26 Jun 2008, Joanna Sharman wrote:

Hi,
I have recently started experimenting with tsearch2 and it seems that the default behaviour is to ignore HTML tags and treat them as word separators. What I would like it to do is to ignore HTML tags within words, but instead of creating separate words, combine the characters separated by the tag into one word.
For example: in the database I have words like 'K<sub>ir</sub>' that need to be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML tags are ignored and two words are stored in the vector, 'k' and 'ir'. I would like only one word, 'kir', to be stored in the vector, so that searches using the word 'kir' will match the row.


Two options: write your own HTML parser, or preprocess the text before passing it to to_tsvector.
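
A minimal sketch of the preprocessing route, assuming the text is cleaned in application code before it reaches to_tsvector. Python is my choice here, not something from the thread, and the name strip_tags and the list of block-level tags are hypothetical. Inline tags such as <sub> are removed without leaving a gap, so 'K<sub>ir</sub>' becomes 'Kir', while tags that normally separate words are replaced by a space. This is a regex approximation, not a real HTML parser: text containing a literal '<' followed later by a '>' would be mangled.

```python
import re
from html import unescape

# Tags that normally separate words; replaced by a space.
# This list is an assumption, extend it for your own markup.
BLOCK_TAGS = re.compile(r"</?(?:p|br|div|li|td|tr|h[1-6])\b[^>]*>", re.IGNORECASE)

# Any remaining tag (e.g. <sub>, <i>) is removed without leaving a gap.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Return text with HTML tags removed and character entities decoded."""
    text = BLOCK_TAGS.sub(" ", text)   # keep word breaks at block-level tags
    text = ANY_TAG.sub("", text)       # glue together text split by inline tags
    return unescape(text)              # turn &amp; etc. into plain characters
```

The cleaned string is what would then be passed to to_tsvector, for example as a bound query parameter, so that the vector holds 'kir' as a single word.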

A second, related question is whether it is possible to cause tsearch2 to split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary or use dict_regex from http://vo.astronet.ru/arxiv/dict_regex.html
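
If a custom dictionary is more than you need, the same effect can be approximated by preprocessing, sketched below in Python. Note that this is a swapped-in technique, not dict_regex and not a tsearch2 dictionary; the name split_digits is hypothetical. It inserts a space at every letter/digit boundary, so 'TM8' becomes 'TM 8' before the text is handed to to_tsvector.

```python
import re

# Zero-width boundary between a letter and a digit, in either order.
# [^\W\d_] matches any Unicode letter (a word character that is not a digit or underscore).
BOUNDARY = re.compile(r"(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])")


def split_digits(text: str) -> str:
    """Insert a space wherever a letter is directly followed by a digit or vice versa."""
    return BOUNDARY.sub(" ", text)
```

Query strings need the same treatment as the indexed text, otherwise a search for 'TM8' will not line up with the stored 'tm' and '8'.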

I am not sure if this functionality is possible to implement using tsearch2 or if there might be a better way, so I would be grateful for any advice or pointers to further reading on how I might do this. (I am using PostgreSQL version 8.1.10)


Think about upgrading to 8.3, where full-text search is part of core PostgreSQL.


Many thanks in advance,
Joanna


        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
