Re: [HACKERS] TS: Limited cover density ranking
I suggest you work on more general approach, see http://www.sai.msu.su/~megera/wiki/2009-08-12 for example. btw, I don't like you changed ts_rank_cd arguments. Oleg On Fri, 27 Jan 2012, karave...@mail.bg wrote: Hello, I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage: -- normal cover density ranking : not changed luben= select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) -- limited to 2 luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0 (1 row) luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.1 (1 row) -- limited to 3 luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.13 (1 row) Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension. I have some questions: 1. Is it interesting to develop it further (documentation, cleanup, etc) for inclusion in one of the next versions? If this is the case, there are some further questions: - should I overload ts_rank_cd (as in examples above and the patch) or should I define new set of functions, for example ts_rank_lcd ? - should I define define this new sql level functions in core or should I go only with this 2 lines change in calc_rank_cd() and define the new functions as an extension? If we prefer the later, could I overload core functions with functions defined in extensions? - and finally there is always the possibility to duplicate the code and make an independent extension. 2. If I run the patched version on cluster that was initialized with unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster? Best regards luben -- Luben Karavelov Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TS: Limited cover density ranking
- Цитат от Oleg Bartunov (o...@sai.msu.su), на 28.01.2012 в 21:04 - I suggest you work on more general approach, see http://www.sai.msu.su/~megera/wiki/2009-08-12 for example. btw, I don't like you changed ts_rank_cd arguments. Hello Oleg, Thanks for the feedback. Is it OK to begin with adding an exta argument and check in calc_rank_cd? I could change the function names in order not to overload ts_rank_cd arguments. My proposition is : at sql level: ts_rank_lcd([weights], tsvector, tsquery, limit, [method]) at C level: ts_ranklcd_wttlf ts_ranklcd_wttl ts_ranklcd_ttlf ts_ranklcd_ttl Adding the functions could be done as an extension but they are just trampolines into calc_rank_cd(). I agree that what you describe in the wiki page is more general approach. So this : SELECT ts_rank_lcd(to_tsvector('a b c'), to_tsquery('ac'),2 )0; could be replaced with SELECT to_tsvector('a b c') @@ to_tsquery('(a ?2 c)|(c ?2 a) '); but if we need to look for 3 or more nearby terms without order the tsquery with '?' operator will became quite complicated. For example SELECT tsvec @@ '(a ? b ? c) | (a ? c ? b) | (b ? a ? c) | (b ? c ? a) | (c ? a ? b) | (c ? b ? a)'::tsquery; is the same as SELECT ts_rank_lcd(tsvec, 'abc'::tsquery,2)0; So this is the reason to think that the general approach does not exclude the the usefulness of the approach that I am proposing. Best regards -- Luben Karavelov
[HACKERS] TS: Limited cover density ranking
Hello, I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage: -- normal cover density ranking : not changed luben= select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) -- limited to 2 luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0 (1 row) luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.1 (1 row) -- limited to 3 luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.13 (1 row) Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension. I have some questions: 1. Is it interesting to develop it further (documentation, cleanup, etc) for inclusion in one of the next versions? If this is the case, there are some further questions: - should I overload ts_rank_cd (as in examples above and the patch) or should I define new set of functions, for example ts_rank_lcd ? - should I define define this new sql level functions in core or should I go only with this 2 lines change in calc_rank_cd() and define the new functions as an extension? If we prefer the later, could I overload core functions with functions defined in extensions? - and finally there is always the possibility to duplicate the code and make an independent extension. 2. If I run the patched version on cluster that was initialized with unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster? Best regards luben -- Luben Karavelov
Re: [HACKERS] TS: Limited cover density ranking
And here is the patch, that I forgot to attach Hello, I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage: ... Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension. -- Luben Karavelovdiff -pur postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c --- postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c 2011-12-01 23:47:20.0 +0200 +++ /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c 2012-01-27 07:45:34.558028176 +0200 @@ -724,7 +724,7 @@ get_docrep(TSVector txt, QueryRepresenta } static float4 -calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method) +calc_rank_cd(int limit, float4 *arrdata, TSVector txt, TSQuery query, int method) { DocRepresentation *doc; int len, @@ -768,6 +768,9 @@ calc_rank_cd(float4 *arrdata, TSVector t int nNoise; DocRepresentation *ptr = ext.begin; +if (limit 0 ext.end-pos - ext.begin-pos limit) +continue; + while (ptr = ext.end) { InvSum += invws[ptr-wclass]; @@ -834,7 +837,7 @@ ts_rankcd_wttf(PG_FUNCTION_ARGS) int method = PG_GETARG_INT32(3); float res; - res = calc_rank_cd(getWeights(win), txt, query, method); + res = calc_rank_cd(0, getWeights(win), txt, query, method); PG_FREE_IF_COPY(win, 0); PG_FREE_IF_COPY(txt, 1); @@ -850,7 +853,7 @@ ts_rankcd_wtt(PG_FUNCTION_ARGS) TSQuery query = PG_GETARG_TSQUERY(2); float res; - res = calc_rank_cd(getWeights(win), txt, query, DEF_NORM_METHOD); + res = calc_rank_cd(0, getWeights(win), txt, query, DEF_NORM_METHOD); PG_FREE_IF_COPY(win, 0); PG_FREE_IF_COPY(txt, 1); @@ -866,7 +869,7 @@ ts_rankcd_ttf(PG_FUNCTION_ARGS) int method = PG_GETARG_INT32(2); float res; - res = calc_rank_cd(getWeights(NULL), txt, query, method); + res = calc_rank_cd(0, getWeights(NULL), txt, query, method); PG_FREE_IF_COPY(txt, 0); PG_FREE_IF_COPY(query, 1); @@ -880,9 +883,75 @@ ts_rankcd_tt(PG_FUNCTION_ARGS) TSQuery query = PG_GETARG_TSQUERY(1); float res; - res = calc_rank_cd(getWeights(NULL), txt, query, DEF_NORM_METHOD); + res = calc_rank_cd(0, getWeights(NULL), txt, query, DEF_NORM_METHOD); PG_FREE_IF_COPY(txt, 0); PG_FREE_IF_COPY(query, 1); PG_RETURN_FLOAT4(res); } + +Datum +ts_rankcd_lwttf(PG_FUNCTION_ARGS) +{ +int limit = PG_GETARG_INT32(0); + ArrayType *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1)); + TSVector txt = PG_GETARG_TSVECTOR(2); + TSQuery query = PG_GETARG_TSQUERY(3); + int method = PG_GETARG_INT32(4); + float res; + + res = calc_rank_cd(limit, getWeights(win), txt, query, method); + + PG_FREE_IF_COPY(win, 1); + PG_FREE_IF_COPY(txt, 2); + PG_FREE_IF_COPY(query, 3); + PG_RETURN_FLOAT4(res); +} + +Datum +ts_rankcd_lwtt(PG_FUNCTION_ARGS) +{ +int limit = PG_GETARG_INT32(0); + ArrayType *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1)); + TSVector txt = PG_GETARG_TSVECTOR(2); + TSQuery query = PG_GETARG_TSQUERY(3); + float res; + + res = calc_rank_cd(limit, getWeights(win), txt, query, DEF_NORM_METHOD); + + PG_FREE_IF_COPY(win, 1); + PG_FREE_IF_COPY(txt, 2); + PG_FREE_IF_COPY(query, 3); + PG_RETURN_FLOAT4(res); +} + +Datum +ts_rankcd_lttf(PG_FUNCTION_ARGS) +{ +int limit = PG_GETARG_INT32(0); + TSVector txt = PG_GETARG_TSVECTOR(1); + TSQuery query = PG_GETARG_TSQUERY(2); + int method = PG_GETARG_INT32(3); + float res; + + res = calc_rank_cd(limit, getWeights(NULL), txt, query, method); + + PG_FREE_IF_COPY(txt, 1); + PG_FREE_IF_COPY(query, 2); + PG_RETURN_FLOAT4(res); +} + +Datum +ts_rankcd_ltt(PG_FUNCTION_ARGS) +{ +int limit = PG_GETARG_INT32(0); + TSVector txt = PG_GETARG_TSVECTOR(1); + TSQuery query = PG_GETARG_TSQUERY(2); + float res; + + res = calc_rank_cd(limit, getWeights(NULL), txt, query, DEF_NORM_METHOD); + + PG_FREE_IF_COPY(txt, 1); + PG_FREE_IF_COPY(query, 2); + PG_RETURN_FLOAT4(res); +} diff -pur postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h --- postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h 2011-12-01 23:47:20.0 +0200 +++ /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h 2012-01-27 05:45:53.944979678 +0200 @@ -4159,6 +4159,15 @@ DATA(insert OID = 3709 ( ts_rank_cd PGN DESCR(relevance); DATA(insert OID = 3710 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 2 0 700 3614 3615 _null_ _null_ _null_ _null_ ts_rankcd_tt _null_ _null_ _null_ )); DESCR(relevance); +DATA(insert OID = 3675 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 5 0 700 23 1021 3614
Re: [HACKERS] TS: Limited cover density ranking
The rank counts 1/coversize. So bigger covers will not have much impact anyway. What is the need of the patch? -Sushant. On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote: Hello, I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage: -- normal cover density ranking : not changed luben= select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) -- limited to 2 luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0 (1 row) luben= select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.1 (1 row) -- limited to 3 luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('ad')); ts_rank_cd 0.033 (1 row) luben= select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('ad')); ts_rank_cd 0.13 (1 row) Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension. I have some questions: 1. Is it interesting to develop it further (documentation, cleanup, etc) for inclusion in one of the next versions? If this is the case, there are some further questions: - should I overload ts_rank_cd (as in examples above and the patch) or should I define new set of functions, for example ts_rank_lcd ? - should I define define this new sql level functions in core or should I go only with this 2 lines change in calc_rank_cd() and define the new functions as an extension? If we prefer the later, could I overload core functions with functions defined in extensions? - and finally there is always the possibility to duplicate the code and make an independent extension. 2. If I run the patched version on cluster that was initialized with unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster? Best regards luben -- Luben Karavelov -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TS: Limited cover density ranking
- Цитат от Sushant Sinha (sushant...@gmail.com), на 27.01.2012 в 18:32 - The rank counts 1/coversize. So bigger covers will not have much impact anyway. What is the need of the patch? -Sushant. If you want to find only combinations of words that are close one to another, with the patch you could use something as: WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('ad') AS query) SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)0; I could not find another way to make this type of queries. If there is an alternative, I am open to suggestions Best regards -- Luben Karavelov
Re: [HACKERS] TS: Limited cover density ranking
- Цитат от karave...@mail.bg, на 27.01.2012 в 18:48 - - Цитат от Sushant Sinha (sushant...@gmail.com), на 27.01.2012 в 18:32 - The rank counts 1/coversize. So bigger covers will not have much impact anyway. What is the need of the patch? -Sushant. If you want to find only combinations of words that are close one to another, with the patch you could use something as: WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('ad') AS query) SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)0; Another example, if you want to match 'b c d' only, you could use: WITH A AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('bcd') AS query) SELECT * FROM A WHERE vec @@ query AND ts_rank_cd(2,vec,query)0; The catch is that it will match also 'b d c', 'd c b', 'd b c', 'c d b' and 'd b d', so it is not a replacement for exact phrase match but something that I find useful -- Luben Karavelov