[GENERAL] Normalization in text search ranking

Tim van der Linden Sat, 03 May 2014 18:28:26 -0700

Hi all

Another question regarding full text search, this time about ranking.
The ts_rank() and ts_rank_cd() functions accept a normalization integer (a bit mask).


The documentation lays out the different values and says that some take the 
document length into account (1 and 2) while others take the number of unique 
words into account (8 and 16).

To illustrate my following questions, take this tsvector:

'ate':9 'cat':3 'fat':2,11

Now, I was wondering how document length and unique words are calculated (from 
a high level perspective). 

Am I correct in saying that, when counting the document length, the total 
number of pointers is summed up, meaning that the above tsvector has 4 words 
(so the rank float would be divided by 4)?

And when counting unique words, would the result for the above tsvector be 3, 
counting only the lexemes themselves and not the number of pointers?
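
To make the two counts concrete, here is a small Python sketch of how I imagine 
them being computed. This is only my assumption about the behaviour, not 
PostgreSQL's actual code; parse_tsvector, document_length and unique_words are 
names I made up, and the parser ignores weight labels and escaped quotes.

```python
import re


def parse_tsvector(text):
    """Parse a simple tsvector literal into {lexeme: [positions]}.

    Hypothetical helper: handles only plain 'lexeme':pos,pos entries,
    not weight labels (e.g. 2A) or quotes inside lexemes.
    """
    entries = {}
    for lexeme, positions in re.findall(r"'([^']+)'(?::([\d,]+))?", text):
        entries[lexeme] = [int(p) for p in positions.split(",")] if positions else []
    return entries


def document_length(entries):
    # Assumption: every stored position counts as one word;
    # a lexeme stored without positions still counts once.
    return sum(len(positions) or 1 for positions in entries.values())


def unique_words(entries):
    # Assumption: one per distinct lexeme, regardless of how often it occurs.
    return len(entries)


vec = parse_tsvector("'ate':9 'cat':3 'fat':2,11")
print(document_length(vec))  # 4
print(unique_words(vec))     # 3
```

If my understanding is right, 'fat' contributes 2 to the document length but 
only 1 to the unique word count.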

Also, a final question: if you use integer 8 or 16 to influence the calculated 
ranking float, would you actually "punish" documents that contain more unique 
words? Meaning that this is just another way of giving shorter documents 
precedence over longer ones?
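
To show what I mean by "punish", here is a rough Python sketch of how I read 
the documentation for flags 8 and 16 (8 divides the rank by the number of 
unique words, 16 divides it by 1 + the logarithm of the number of unique 
words). The normalize function is a made-up name, the natural logarithm is my 
assumption, and the raw rank of 0.1 is just an illustrative number.

```python
import math


def normalize(rank, unique_word_count, flags):
    """Apply only the unique-word normalization flags (8 and 16) to a raw rank.

    Hypothetical sketch based on the documentation wording,
    not on PostgreSQL's source code.
    """
    if flags & 8:
        # Flag 8: divide by the number of unique words.
        rank /= unique_word_count
    if flags & 16:
        # Flag 16: divide by 1 + log(number of unique words).
        # The base of the logarithm is an assumption here.
        rank /= 1 + math.log(unique_word_count)
    return rank


raw_rank = 0.1  # illustrative value, same for both documents
for count in (3, 300):
    print(count, normalize(raw_rank, count, 8), normalize(raw_rank, count, 16))
```

With the same raw rank, the document with 300 unique words ends up far lower 
than the one with 3, and flag 16 softens that effect compared to flag 8.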

Thanks again!

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
