On Thu, 26 Jun 2008, Joanna Sharman wrote:
> Hi,
> I have recently started experimenting with tsearch2 and it seems that the
> default behaviour is to ignore HTML tags and treat them as word-separators.
> What I would like it to do is to ignore HTML tags within words, but instead
> of creating separate words, combine the characters separated by the tag into
> one word.
> For example: in the database I have words like 'K<sub>ir</sub>' that need to
> be searched using the term without HTML tags, i.e. 'Kir'. Currently, the HTML
> tags are ignored and two words are stored in the vector, 'k' and 'ir'. I
> would like only one word, 'kir', to be stored in the vector, so that searches
> using the word 'kir' will match the row.
2 options - write an HTML parser and preprocess the text before calling to_tsvector.
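The preprocessing route could look roughly like the sketch below, written in Python with the standard-library `html.parser` module. The names `TagStripper`, `strip_tags`, and `INLINE_TAGS` are hypothetical, and the choice of which tags count as "inside a word" is an assumption to adjust for the data at hand. Inline tags are dropped without a separator so that 'K<sub>ir</sub>' becomes 'Kir'; every other tag is turned into a space so that neighbouring paragraphs do not merge.

```python
from html.parser import HTMLParser

# Assumption: tags in this set may occur inside a word and are removed
# without inserting a separator. All other tags become a space.
INLINE_TAGS = {"sub", "sup", "b", "i", "em", "strong", "span"}


class TagStripper(HTMLParser):
    """Collects text content, joining text split only by inline tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text without tags, with whitespace runs collapsed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_tags("K<sub>ir</sub> channel"))
```

The cleaned string, rather than the raw HTML, would then be what is handed to to_tsvector, either in the application before the INSERT or in whatever step populates the vector column.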
> A second, related question is whether it is possible to cause tsearch2 to
> split up words when it encounters digits, e.g. 'TM8' into 'TM' and '8'.
You can write your own dictionary or use dict_regex from
http://vo.astronet.ru/arxiv/dict_regex.html
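If writing a dictionary is more than is needed, the same split can also be approximated in a preprocessing step before to_tsvector. The sketch below is that alternative, not dict_regex and not a tsearch2 dictionary; the function name `split_letters_and_digits` is hypothetical, and it assumes only ASCII letters need to be handled. It inserts a space at every boundary between a letter and a digit.

```python
import re

# Zero-width match at every letter/digit or digit/letter boundary.
# Assumption: only ASCII letters matter; extend the classes if needed.
_BOUNDARY = re.compile(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])")


def split_letters_and_digits(text):
    """Insert a space wherever a letter is directly adjacent to a digit."""
    return _BOUNDARY.sub(" ", text)


if __name__ == "__main__":
    print(split_letters_and_digits("TM8"))
```

Note that this changes the stored text for every such token, so the same function has to be applied to the search terms before they are turned into a query.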
> I am not sure if this functionality is possible to implement using tsearch2
> or if there might be a better way, so I would be grateful for any advice or
> pointers to further reading on how I might do this. (I am using PostgreSQL
> version 8.1.10)
Think about upgrading to 8.3.
> Many thanks in advance,
> Joanna
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)