On Tue, 17 Jul 2007, Bruce Momjian wrote:
Oleg Bartunov wrote:
On Tue, 17 Jul 2007, Bruce Momjian wrote:
I think the tsearch documentation is nearing completion:
http://momjian.us/expire/fulltext/HTML/textsearch.html
but I am not happy with how tsearch is enabled in a user table:
http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.
That seems quite inflexible. Is there a way to avoid having a separate
tsvector column? What happens if the table is dynamic? How is that
column updated based on table changes? Triggers? Where are the
examples? Can you create an index like this:
I agree that there could be more examples, but text search doesn't
require anything special!
An *example* of a trigger function is documented at
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
Yes, I see that in tsearch() here:
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
I assume my_filter_name is optional, right? I have updated the prototype
to be:
tsearch([vector_column_name], [my_filter_name], text_column_name [, ...
])
Is this accurate? What does this text below it mean?
No, this is inaccurate. First, vector_column_name is not an optional argument;
it is the name of the tsvector column.
There can be many functions and text columns specified in a tsearch()
trigger. The following rule is used: a function is applied to all
subsequent TEXT columns until the next matching column occurs.
The idea is to let the user preprocess text before applying the
tsearch machinery. my_filter_name() preprocesses text_column_name1,
text_column_name2,....
The original syntax allows a preprocessing function to be specified
for each text column.
So I suggest keeping the original syntax and changing 'vector_column_name' to
'tsvector_column_name'.
Why are we allowing my_filter_name here? Isn't that something for a
custom trigger? Is calling it tsearch() a good idea? Why not
tsvector_trigger()?
I don't see any benefit in the tsvector_trigger() name. If you want a more
meaningful name, then tsvector_update_trigger() would be better. Anyway,
this trigger is only an illustration.
CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
That avoids having to have a separate column because you can just say:
WHERE to_tsquery('XXX') @@ to_tsvector(column)
Yes, it's possible, but without ranking, since currently it's impossible
to store any additional information in an index (that's a property of
PostgreSQL). BTW, this should work for a GiST index as well.
What if they use @@@? Wouldn't that work because it is going to check
the heap?
It would work; it would recalculate to_tsvector(column) for the rows found
(for GiST, to remove false hits and to get weight information; for
GIN, for weight information only).
That kind of search is useful if there is another natural ordering of search
results, for example, by timestamp.
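A minimal sketch of such a query, assuming a hypothetical pgweb table with columns title, body and last_mod_date (none of these column names appear in this thread), and assuming the two-argument forms of to_tsvector() and to_tsquery() with an explicit 'english' configuration:

```sql
-- Hypothetical columns: title (text), body (text), last_mod_date (timestamp).
-- No rank is computed; the matching rows are ordered by timestamp instead.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
```

Because the ordering comes from an ordinary column, the lack of weight information in the index does not matter for this kind of query.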
How do we make sure that the to_tsquery is using the same text search
configuration as the 'column' or index? Perhaps we should suggest:
Please keep in mind that it's not mandatory to use the same configuration
at search time as was used at index creation.
Well, sort of. If you have stop words in the tsquery configuration, you
aren't going to hit any matches in the tsvector, right? Same for
synonyms, I suppose. I can see that stemming would work if there was a
mismatch between tsquery and tsvector.
CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
so that at least the configuration is documented in the index.
Yes, it's better to always specify the configuration name explicitly and not
rely on the default configuration.
Unfortunately, the configuration name isn't saved in the index.
As Teodor corrected me, the index doesn't know about the configuration at all!
What a careful user could do is record the configuration name in a
comment on the tsvector column. The configuration name belongs to the
to_tsvector() function.
In principle, a tsvector, like any data type, could be produced in other ways;
for example, OpenFTS constructs tsvectors following its own rules.
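As a small sketch of that suggestion, assuming a hypothetical tsvector column named textsearch_vector on the pgweb table (the column name is not from this thread):

```sql
-- Hypothetical column name: textsearch_vector.
-- The comment is free text; the server neither checks nor uses it.
COMMENT ON COLUMN pgweb.textsearch_vector IS
    'tsvector built with to_tsvector(''english'', ...)';
```

The comment then shows up next to the column in psql's \d+ pgweb output, which documents the configuration for whoever writes queries against the column.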
I was more concerned that there is nothing documenting the configuration
used by the index or the tsvector table column trigger. By doing:
Again, the index has nothing to do with the configuration name.
Our trigger function is an example, which uses the default configuration name.
A user could easily write their own trigger to keep the tsvector column up to date
and take the configuration name as a parameter.
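A rough sketch of such a user-written trigger follows. It assumes PL/pgSQL is installed, that the regconfig type and the two-argument to_tsvector() are available as in released PostgreSQL, and that pgweb has hypothetical columns title, body and textsearch_vector; the function and trigger names are made up for illustration.

```sql
-- Sketch only: column, function and trigger names are hypothetical.
CREATE FUNCTION pgweb_tsvector_update() RETURNS trigger AS $$
BEGIN
    -- TG_ARGV[0] carries the configuration name given in CREATE TRIGGER.
    NEW.textsearch_vector :=
        to_tsvector(TG_ARGV[0]::regconfig,
                    coalesce(NEW.title, '') || ' ' || coalesce(NEW.body, ''));
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

-- The configuration name is fixed here, once, as the trigger argument.
CREATE TRIGGER pgweb_tsvector_trg
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE pgweb_tsvector_update('english');
```

With this shape the configuration is visible in the trigger definition rather than depending on the session's default.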
CREATE INDEX textsearch_idx ON pgweb USING
gin(to_tsvector('english',column));
you guarantee that the index uses 'english' for all its entries. If you
omit the 'english' or use a different configuration, it will heap scan
the table, which at least gives the right answer.
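To illustrate the point, here is a sketch that assumes a hypothetical text column named body (the thread uses "column" as a placeholder, which is a reserved word in SQL):

```sql
-- Index built with an explicit configuration.
CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english', body));

-- Same expression as in the index, so the planner can consider the index.
SELECT * FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Different expression (no configuration given): it does not match the
-- index expression, so textsearch_idx cannot be used for this query.
SELECT * FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```

An expression index is only considered when the query contains the same expression, which is why the configuration argument has to be spelled the same way in both places.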
Sometimes it's useful not to specify the configuration name explicitly,
so that the index can be used with a different configuration. Just change
tsearch_conf_name.
Also, how do you guarantee that tsearch() triggers always use the same
configuration? The existing tsearch() API seems to make that
impossible. I am wondering if we need to add the configuration name as
a mandatory parameter to tsearch().
By using the same tsearch_conf_name, which can be set in many ways,
you guarantee that the same configuration is used.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83