Re: [HACKERS] Updated tsearch documentation

Oleg Bartunov Wed, 18 Jul 2007 12:51:46 -0700

On Wed, 18 Jul 2007, Bruce Momjian wrote:

Why are we allowing my_filter_name here?  Isn't that something for a
custom trigger.  Is calling it tsearch() a good idea?  Why not
tsvector_trigger().


I don't see any benefit from the tsvector_trigger() name. If you want to add
some semantic, than tsvector_update_trigger() would be better.  Anyway,
this trigger is an illustration.


Well, the filter that removes '@' might be an example, but tsearch() is
indeed sort of built-in trigger function to be used for simple cases.
My point is that because it is only for simple cases, why add complexity
and allow a filter?  It seems best to just remove the filter idea and
let people write their own triggers if they want that functionality.

If you aware about documentation simplicity than we could just documenttwo versions:

1. without filter function - simple, well understood syntax
2. with filter function - for advanced users

I don't want to remove the feature which works for year without any problem.

        CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

        WHERE to_query('XXX') @@ to_tsvector(column)


yes, it's possible, but without ranking, since currently it's impossible
to store any information in index (it's pg's feature). btw, this should
works and for GiST index also.


What if they use @@@.  Wouldn't that work because it is going to check
the heap?


It would work, it'd recalculate to_tsvector(column) for rows found
( for GiST - to remove false hits and for weight information, for
GIN - for weight information only).


Right.  Currently to use text search on a table, you have to do three
things:

        o  add a tsvector column to the table
        o  add a trigger to keep the tsvector column current
        o  add an index to the tsvector column

My question is why bother with the first two steps?  If you do:

CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current.  The
index is kept current as part of normal query processing.  The only
downside is that you have to do to_tsvector() in the heap to avoid false
hits, but that seems minor compared to the disk savings of not having
the separate column.  Is to_tsvector() an expensive function?


Bruce, you oversimplify the text search, the document could be fully virtual,

not a column(s), it could be a result of any SQL commands, so it could bevery expensive just to obtain document, and yes, to_tsvector could be

very expensive, depending on the document size, parser and dictionaries used.

And, again, current postgres architecture forces to use heap to store
positional and weight information for ranking.

The use case for what you described is very limited - simple text search
on one/several column of the same table without ranking.

 CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.


yes, it's better to always explicitly specify configuration name and not
rely on default configuration.
Unfortunately, configuration name doesn't saved in the index.


as Teodor corrected me, index doesn't know about configuration at all !
What accurate user could do, is to provide configuration name in the
comment for tsvector column. Configuration name is an accessory of
to_tsvector() function.


Well, if you create the index with the configuration name it is
guaranteed to match:

CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
                                                             -------
And if someone does:

        WHERE 'friend'::tsquery @@ to_tsvector('english',column))

the index is used.  Now if the default configuration is 'english' and
they use:

        WHERE 'friend'::tsquery @@ to_tsvector(column))

the index is not used, but this just a good example of why default
configurations aren't that useful.  One problem I see is that if the
default configuration is not 'english', then when the index consults the
heap, it would be using a different configuration and yield incorrect
results.  I am unsure how to fix that.

again, you consider very simple case and actually, your example is agood example of usefulness of default configuration ! Just think before

you develop your application, but this is very general rule. There are
zillions situations you could do bad things, after all.

Moreover, consider text search on text column, there is no way to specifyconfiguration at all ! We rely on default configuration here


CREATE INDEX textsearch_idx ON pgweb USING gin(title);


With the trigger idea, you have to be sure your configuration is the same
every time you INSERT/UPDATE the table or the index will have mixed
configuration entries and it will yield incorrect results, aside from
the heap configuration lookup not matching the index.

Once we nail this down we will have to have a documentation section
about configuration mismatches.


So what is your proposal ? I'm lost a bit. I suggest to begin new thread :)

In principle, tsvector as any data type could be obtained by any other ways,
for example, OpenFTS construct tsvector following its own rules.


I was more concerned that there is nothing documenting the configuration
used by the index or the tsvector table column trigger.  By doing:


again, index has nothing with configuration name.
Our trigger function is an example, which uses default configuration name.
User could easily write it's own trigger to keep tsvector column up to date
and use configuration name as a parameter.


Right. I am thinking beyond that issue.

        CREATE INDEX textsearch_idx ON pgweb USING 
gin(to_tsvector('english',column));

you guarantee that the index uses 'english' for all its entries.  If you
omit the 'english' or use a different configuration, it will heap scan
the table, which at least gives the right answer.


sometimes it's useful not to use explicitly configuration name
to be able to use index with different configuration. Just change
tsearch_conf_name.


I assume you are saying the benefit is for tsquery to use a different
configuration, not having some tsvector index entries using different
configurations than others.

sure, but not necessarily. You can mix different configurations if theyare 'compatible'. Documents could be created from different sources and

could require different dictionaries. I don't want to limit people

in creating complex applications. If you unsure, then always useexplicit configuration name. What's the problem ?

Also, how do you guarantee that tsearch() triggers always uses the same
configuration?  The existing tsearch() API seems to make that
impossible.  I am wondering if we need to add the configuration name as
a mandatory parameter to tsearch().


Using the same tsearch_conf_name, which could be defined by many ways,
you guarantee to use the same configuration.


Yea, I am sure you _can_ do it.  The question is how can we make it less
error-prone.

We have only ONE variable - configuration name. It could be explicitlydefined and then there is no problem at all, or specified byGUC variable tsearch_conf_name. What could be simpler !


The question is do we need to define tsearch_conf_name automagically if
it's not defined ? I inclined don't do that and just issue error.

I believe this is enough error-prone.

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Re: [HACKERS] Updated tsearch documentation

Reply via email to