Re: [HACKERS] Updated tsearch documentation

Oleg Bartunov Thu, 21 Jun 2007 03:14:21 -0700

On Wed, 20 Jun 2007, Bruce Momjian wrote:

Oleg Bartunov wrote:

On Wed, 20 Jun 2007, Bruce Momjian wrote:

Comments on the editorial work of Bruce Momjian.


fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.

Bruce, there should be a list of the types of lexemes here!


Agreed.  Is the list of lexemes parser-specific?


Yes, it is the parser which defines the types of lexemes.


OK, how will users get a list of supported lexemes?  Do we need a list
per supported parser?


It's documented; see token_type() under "Parser functions":

postgres=# select * from token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity

The integer option controls several behaviors which is done using bit-wise
fields and <literal>|</literal> (for example, <literal>2|4</literal>):
<!-- why so complex? -->

to avoid 2 arguments


But I don't see why you would want to set two of those values --- they
seem mutually exclusive, e.g.

        1 divides the rank by the 1 + logarithm of the document length
        2 divides the rank by the length itself

I assume you do either one, not both.


But what about the other variants?


OK, here is the full list:

        0 (the default) ignores document length
        1 divides the rank by the 1 + logarithm of the document length
        2 divides the rank by the length itself
        4 divides the rank by the mean harmonic distance between extents
        8 divides the rank by the number of unique words in document
        16 divides the rank by 1 + logarithm of the number of unique words in
           document

so which ones would be both enabled?

None! This is a list of the possible values of the rank normalization flag,
which can be ORed together.


=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1);
  rank_cd
-----------
 0.0279055
=# select rank_cd('1:1,2,3 4:5 6:7', '1&4',1|16);
  rank_cd
-----------
 0.0139528
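
To illustrate what "ORed together" means for the flag list above, here is a minimal Python sketch that decodes a combined option back into its individual flags. The flag values and descriptions are copied from this thread; the names NORMALIZATION_FLAGS and decode_normalization are made up for the illustration and are not part of tsearch. The last lines only compare the two rank_cd results quoted above and do not reimplement rank_cd.

```python
# Flag values and meanings as listed in this thread.
# NORMALIZATION_FLAGS and decode_normalization are hypothetical names.
NORMALIZATION_FLAGS = {
    1: "divide by 1 + logarithm of the document length",
    2: "divide by the document length",
    4: "divide by the mean harmonic distance between extents",
    8: "divide by the number of unique words",
    16: "divide by 1 + logarithm of the number of unique words",
}


def decode_normalization(option):
    """Return the individual flag values set in `option`, smallest first."""
    return [flag for flag in sorted(NORMALIZATION_FLAGS) if option & flag]


if __name__ == "__main__":
    for option in (0, 1, 1 | 16, 2 | 4):
        flags = decode_normalization(option)
        print(option, "->", flags or "ignore document length")

    # Ratio of the two results quoted in this mail (option 1 vs. option 1|16).
    print(0.0279055 / 0.0139528)
```

In the two queries above, adding flag 16 to flag 1 roughly halves the result, which is consistent with the flags being applied one after another rather than being mutually exclusive.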


What I missed is the definition of extent.

From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking

An extent is a shortest, non-nested sequence of words that satisfies a query.


I don't understand how that relates to this.

because of "4 divides the rank by the mean harmonic distance between extents"
                                                                     ^^^^^^^
It reflects how densely the extents that satisfy the query are packed in the document.
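
To make the definition of an extent concrete, here is a rough Python sketch that finds the shortest, non-nested position ranges containing every term of an AND query. It illustrates the definition only and is not the code used by rank_cd; the function name find_extents and the dictionary input format are assumptions made for this example.

```python
def find_extents(positions):
    """Find shortest, non-nested ranges that contain every query term.

    `positions` maps each query term to the list of word positions at
    which it occurs in the document. Returns a list of (start, end) pairs.
    """
    # All occurrences of all query terms, in document order.
    occurrences = sorted(
        (pos, term) for term, plist in positions.items() for pos in plist
    )
    last_seen = {}      # most recent position of each term so far
    extents = []
    prev_start = None
    for pos, term in occurrences:
        last_seen[term] = pos
        if len(last_seen) < len(positions):
            continue    # some term has not appeared yet
        # Shortest range ending at `pos` that contains every term.
        start = min(last_seen.values())
        # If the start did not move, an earlier, shorter range with the
        # same start is nested inside this one, so skip it.
        if start != prev_start:
            extents.append((start, pos))
            prev_start = start
    return extents


if __name__ == "__main__":
    # Query '1&4' against the tsvector '1:1,2,3 4:5 6:7' from this thread.
    print(find_extents({"1": [1, 2, 3], "4": [5]}))
```

For the tsvector used in the rank_cd example, the only extent for '1&4' under this definition is the range from position 3 to position 5.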

its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
if none is specified that the current configuration is used.

I don't understand this question


Same issue as above --- why allow a number here when the name works just
fine.  We don't allow tables to be specified by number, so why
configurations?

<para>
<!-- why?  -->
Note that the cascade dropping of the <function>headline</function> function
cause dropping of the <literal>parser</literal> used in fulltext configuration
<replaceable>tsname</replaceable>.
</para>

Hmm, it should probably be reversed: a cascading drop of the parser causes
the headline function to be dropped.


Agreed.


In example below, <literal>fulltext_idx</literal> is
a GIN index:<!-- why isn't this automatic -->

It's explained above. The problem is that the current index API has no way
to say whether a search was lossy or exact, so to preserve the performance of
the GIN index we had to introduce the @@@ operator, which is the same as @@ but
lossy.


Well, then we have to fix the API.  Telling users to use a different
operator based on what index is defined is just bad style.


This was raised by Heikki and we discussed it a bit in Ottawa, but it's
unclear whether it's doable for 8.3.  The @@@ operator is rarely used, so we
could say it will be improved in future versions.


Uh, I am wondering if we just have to force heap access in all cases
until it is fixed.


No, no! We would lose the performance of the GIN index, which isn't lossy and
doesn't need heap access. I don't see what's wrong with saying that some
features aren't supported by the text search operator with a GIN index.

We need to decide whether we need OIDs as a user-visible argument. I don't see
any value in it, but Teodor probably thinks otherwise.


This is a good time to clean up the API because there are going to be
user-visible changes anyway.


I agree. Let's keep this in mind until we get the more serious tasks done.

        Regards,
                Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
