On Thu, 26 Jun 2008, Joanna Sharman wrote:

Hi,

I have recently started experimenting with tsearch2 and it seems that the default behaviour is to ignore HTML tags and treat them as word-separators. What I would like it to do is to ignore HTML tags within words, but instead of creating separate words, combine the characters separated by the tag into one word.

For example: in the database I have words like 'K<sub>ir</sub>' that need to be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML tags are ignored and two words are stored in the vector, 'k' and 'ir'. I would like only one word, 'kir', to be stored in the vector, so that searches using the word 'kir' will match the row.

Two options: write your own HTML parser, or preprocess the text before passing it to to_tsvector.
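
A minimal sketch of the preprocessing route, assuming the markup consists of simple inline tags such as <sub>. The name strip_tags is hypothetical and a regular expression is not a full HTML parser. The point is that tags are deleted without inserting a separator, so the text on either side joins into one word before it reaches to_tsvector:

```python
import re

# Assumption: tags are simple, with no '>' inside attribute values.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

def strip_tags(text: str) -> str:
    """Delete tags without adding whitespace, so adjoining text merges."""
    return TAG_RE.sub("", text)

# Hypothetical usage: clean the text in the application, then store or index it.
print(strip_tags("K<sub>ir</sub> channel"))  # prints: Kir channel
```

Note that this also glues together text separated only by block-level tags (for example "</p><p>"), so those would need to be replaced by a space rather than removed.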


A second, related question is whether it is possible to cause tsearch2 to split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.

You can write your own dictionary, or use dict_regex (http://vo.astronet.ru/arxiv/dict_regex.html).
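
If a dictionary is more than you need, the same effect can be approximated by preprocessing. The sketch below is not a tsearch2 dictionary and does not use dict_regex; it is a plain text transformation, assuming ASCII letters and digits, with the hypothetical name split_letters_digits. It inserts a space at every letter/digit boundary so the parser sees separate tokens:

```python
import re

# Zero-width match at each boundary between an ASCII letter and a digit.
BOUNDARY_RE = re.compile(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])")

def split_letters_digits(text: str) -> str:
    """Insert a space wherever a letter is directly followed by a digit or vice versa."""
    return BOUNDARY_RE.sub(" ", text)

print(split_letters_digits("TM8"))  # prints: TM 8
```

The same transformation would have to be applied to the search terms as well, otherwise a query for 'TM8' is parsed differently from the stored text.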


I am not sure if this functionality is possible to implement using tsearch2 or if there might be a better way, so I would be grateful for any advice or pointers to further reading on how I might do this. (I am using PostgreSQL version 8.1.10)

Think about upgrading to 8.3, where full text search is integrated into the core server.


Many thanks in advance,
Joanna



        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
