Re: [GENERAL] Fragments in tsearch2 headline
[moved to -hackers, because the talk is about implementation details]

> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)

Thank you.

1. > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
   contrib/tsearch2 is now a compatibility layer for old applications, and they don't know about new features. So this part isn't needed.

2. Compiling the function (ts_headline_with_fragments) into core but using it only from the contrib module looks very odd: the new feature could be used only through the compatibility layer for the old release :)

3. headline_with_fragments() is hardcoded to use the default parser, but what happens when the configuration uses another parser? For example, for the Japanese language.

4. I would prefer the signature
   ts_headline( [regconfig,] text, tsquery [,text] )
   and the function should accept 'NumFragments=>N' for the default parser. Other parsers may use other options.

5. It just doesn't work correctly, because the new code doesn't take care of parser-specific lexeme types:

   contrib_regression=# select headline_with_fragments('english', 'wow asd-wow wow', 'asd', '');
     headline_with_fragments
   ---------------------------
    ...wow asd-wowasd-wow wow
   (1 row)

So I'm inclined to use the existing framework/infrastructure, although it may be subject to change.

Some description:
1. ts_headline determines the correct parser to use.
2. It calls hlparsetext to split the text into a structure suitable for both goals: finding the best fragment(s) and concatenating those fragment(s) back into a text representation.
3. It calls the parser-specific method prsheadline, which works with the preparsed text (parsing was done in hlparsetext). The method should mark the needed words/parts/lexemes etc.
4. ts_headline glues the fragments into text and returns that.

We need a parser's headline method because only the parser knows everything about its lexemes.
--
Teodor Sigaev                      E-mail: [EMAIL PROTECTED]
                                      WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
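
The four steps Teodor describes can be sketched outside the backend. This is only a toy model in Python, not the PostgreSQL code: `Word`, `toy_parse`, `toy_mark`, `glue` and `headline` are hypothetical names, the "parser" just splits on non-word characters, and the `<b>`/`</b>` markers and the `" ... "` separator are assumptions made for the illustration. The point is the division of labour: parsing keeps every piece of the text, the parser-specific step only marks pieces, and the final step only concatenates what was marked.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str                  # original piece of the document, blanks included
    is_word: bool              # False for blanks and punctuation
    in_headline: bool = False  # set by the marking step
    selected: bool = False     # matches the query, gets StartSel/StopSel


def toy_parse(text):
    """Stand-in for step 2: split the text without losing any character."""
    return [Word(m.group(0), m.group(1) is not None)
            for m in re.finditer(r"(\w+)|(\W+)", text)]


def toy_mark(words, query_terms, context=1):
    """Stand-in for step 3: mark each hit plus `context` words on each side."""
    word_idx = [i for i, w in enumerate(words) if w.is_word]
    for k, i in enumerate(word_idx):
        if words[i].text.lower() not in query_terms:
            continue
        words[i].selected = True
        lo = word_idx[max(0, k - context)]
        hi = word_idx[min(len(word_idx) - 1, k + context)]
        for j in range(lo, hi + 1):
            words[j].in_headline = True


def glue(words, start_sel="<b>", stop_sel="</b>", sep=" ... "):
    """Stand-in for step 4: concatenate marked pieces, separating fragments."""
    out, prev = [], None
    for i, w in enumerate(words):
        if not w.in_headline:
            continue
        if prev is not None and i != prev + 1:
            out.append(sep)
        out.append(f"{start_sel}{w.text}{stop_sel}" if w.selected else w.text)
        prev = i
    return "".join(out)


def headline(text, query_terms):
    words = toy_parse(text)       # steps 1-2 (the parser is fixed in this toy)
    toy_mark(words, query_terms)  # step 3
    return glue(words)            # step 4


print(headline("a b c d e f g", {"a", "g"}))  # <b>a</b> b ... f <b>g</b>
```

Because the marking step never builds text itself, a different parser could replace `toy_mark` without touching `glue`, which is the layering the message argues for.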
Re: [GENERAL] Fragments in tsearch2 headline
On Fri, May 23, 2008 at 7:10 AM, Sushant Sinha <[EMAIL PROTECTED]> wrote:
> Teodor did not want a separate function. He wanted it as an extension to
> ts_headline. One way to do this will be to invoke it only when options
> like MaxCoverSize is used. It will be slightly ugly though.

What I understand is that Teodor wants hlparsetext to be used instead of hlparsetext_with_covers. This is not too hard to do if hlparsetext gives the positions of the lexemes. I started writing the patch, but I'm stuck on the function LexizeExec, which I do not fully understand (... and which is not well documented either :) )
Re: [GENERAL] Fragments in tsearch2 headline
Thanks Pierre for porting this! I just tested this for my application and it works. There was a small bug in that startHL has to be initialized to 0 for each chosen cover. I fixed that and attached the new patch.

Teodor did not want a separate function. He wanted it as an extension to ts_headline. One way to do this will be to invoke it only when options like MaxCoverSize is used. It will be slightly ugly though.

It still seems to have bugs. I will try to clean that up.

-Sushant.

On Thu, 2008-05-22 at 13:31 +0200, Pierre-Yves Strub wrote:
> Hi,
>
> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)
>
> W.r.t, http://archives.postgresql.org/pgsql-general/2008-03/msg00806.php
> I can continue the work until this becomes an acceptable patch for pg.
>
> Pierre-yves.

diff -Nurb postgresql-8.3.1/contrib/tsearch2/tsearch2.c postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
--- postgresql-8.3.1/contrib/tsearch2/tsearch2.c 2008-01-01 14:45:45.0 -0500
+++ postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c 2008-05-22 18:44:16.0 -0400
@@ -82,6 +82,7 @@
 Datum tsa_to_tsquery_name(PG_FUNCTION_ARGS);
 Datum tsa_plainto_tsquery_name(PG_FUNCTION_ARGS);
 Datum tsa_headline_byname(PG_FUNCTION_ARGS);
+Datum tsa_headline_with_fragments(PG_FUNCTION_ARGS);
 Datum tsa_ts_stat(PG_FUNCTION_ARGS);
 Datum tsa_tsearch2(PG_FUNCTION_ARGS);
 Datum tsa_rewrite_accum(PG_FUNCTION_ARGS);
@@ -101,6 +102,7 @@
 PG_FUNCTION_INFO_V1(tsa_to_tsquery_name);
 PG_FUNCTION_INFO_V1(tsa_plainto_tsquery_name);
 PG_FUNCTION_INFO_V1(tsa_headline_byname);
+PG_FUNCTION_INFO_V1(tsa_headline_with_fragments);
 PG_FUNCTION_INFO_V1(tsa_ts_stat);
 PG_FUNCTION_INFO_V1(tsa_tsearch2);
 PG_FUNCTION_INFO_V1(tsa_rewrite_accum);
@@ -358,6 +360,24 @@
     return result;
 }
 
+/* tsa_headline_with_fragments(text, tsvector, text, tsquery, text) */
+Datum
+tsa_headline_with_fragments(PG_FUNCTION_ARGS)
+{
+    text *cfgname = PG_GETARG_TEXT_P(0);
+    Datum arg1 = PG_GETARG_DATUM(1);
+    Datum arg2 = PG_GETARG_DATUM(2);
+    Datum arg3 = PG_GETARG_DATUM(3);
+    Datum arg4 = PG_GETARG_DATUM(4);
+    Oid config_oid;
+
+    config_oid = TextGetObjectId(regconfigin, cfgname);
+
+    return DirectFunctionCall5(ts_headline_with_fragments,
+                               ObjectIdGetDatum(config_oid),
+                               arg1, arg2, arg3, arg4);
+}
+
 /*
  * tsearch2 version of update trigger
  *
diff -Nurb postgresql-8.3.1/contrib/tsearch2/tsearch2.sql.in postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.sql.in
--- postgresql-8.3.1/contrib/tsearch2/tsearch2.sql.in 2007-11-28 14:33:04.0 -0500
+++ postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.sql.in 2008-05-22 18:44:16.0 -0400
@@ -384,6 +384,11 @@
 LANGUAGE INTERNAL
 RETURNS NULL ON NULL INPUT IMMUTABLE;
 
+CREATE FUNCTION headline_with_fragments(text, text, tsquery, text)
+RETURNS text
+AS 'MODULE_PATHNAME', 'tsa_headline_with_fragments'
+LANGUAGE C RETURNS NULL ON NULL INPUT IMMUTABLE;
+
 -- CREATE the OPERATOR class
 CREATE OPERATOR CLASS gist_tsvector_ops
 FOR TYPE tsvector USING gist
diff -Nurb postgresql-8.3.1/src/backend/tsearch/ts_parse.c postgresql-8.3.1-orig/src/backend/tsearch/ts_parse.c
--- postgresql-8.3.1/src/backend/tsearch/ts_parse.c 2008-01-01 14:45:52.0 -0500
+++ postgresql-8.3.1-orig/src/backend/tsearch/ts_parse.c 2008-05-22 18:49:53.0 -0400
@@ -578,6 +578,112 @@
     FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
 
+#define COVER_SEP "..."
+#define COVER_SEP_LEN (sizeof(COVER_SEP)-1)
+
+void
+hlparsetext_with_covers(Oid cfgId,
+                        HeadlineParsedText *prs,
+                        TSQuery query,
+                        text *in,
+                        struct coverpos *covers,
+                        int4 numcovers)
+{
+    TSParserCacheEntry *prsobj;
+    TSConfigCacheEntry *cfg;
+    void *prsdata;
+    LexizeData ldata;
+    int4 icover, startpos, endpos, currentpos = 0;
+
+    char *lemm = NULL;
+    int4 lenlemm = 0;
+    ParsedLex *lexs;
+    int4 type, startHL = 0;
+    TSLexeme *norms;
+    int4 oldnumwords, newnumwords, i;
+
+    cfg = lookup_ts_config_cache(cfgId);
+    prsobj = lookup_ts_parser_cache(cfg->prsId);
+
+    prsdata = (void*) DatumGetPointer(FunctionCall2(&(prsobj->prsstart),
+                                      PointerGetDatum(VARDATA(in)),
+                                      Int32GetDatum(VARSIZE(in) - VARHDRSZ)));
+
+    LexizeInit(&ldata, cfg);
+
+    for (icover = 0; icover < numcovers; icover++)
+    {
+        if (!covers[icover].in)
+            continue;
+
+        startpos = covers[icover].startpos;
+        endpos = covers[icover].endpos;
+
+        if (curre
Re: [GENERAL] Fragments in tsearch2 headline
Hi,

I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
(http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)

W.r.t, http://archives.postgresql.org/pgsql-general/2008-03/msg00806.php
I can continue the work until this becomes an acceptable patch for pg.

Pierre-yves.

diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c postgresql-8.3.1/contrib/tsearch2/tsearch2.c
--- postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c 2008-01-01 20:45:45.0 +0100
+++ postgresql-8.3.1/contrib/tsearch2/tsearch2.c 2008-05-22 11:35:07.0 +0200
@@ -82,6 +82,7 @@
 Datum tsa_to_tsquery_name(PG_FUNCTION_ARGS);
 Datum tsa_plainto_tsquery_name(PG_FUNCTION_ARGS);
 Datum tsa_headline_byname(PG_FUNCTION_ARGS);
+Datum tsa_headline_with_fragments(PG_FUNCTION_ARGS);
 Datum tsa_ts_stat(PG_FUNCTION_ARGS);
 Datum tsa_tsearch2(PG_FUNCTION_ARGS);
 Datum tsa_rewrite_accum(PG_FUNCTION_ARGS);
@@ -101,6 +102,7 @@
 PG_FUNCTION_INFO_V1(tsa_to_tsquery_name);
 PG_FUNCTION_INFO_V1(tsa_plainto_tsquery_name);
 PG_FUNCTION_INFO_V1(tsa_headline_byname);
+PG_FUNCTION_INFO_V1(tsa_headline_with_fragments);
 PG_FUNCTION_INFO_V1(tsa_ts_stat);
 PG_FUNCTION_INFO_V1(tsa_tsearch2);
 PG_FUNCTION_INFO_V1(tsa_rewrite_accum);
@@ -358,6 +360,24 @@
     return result;
 }
 
+/* tsa_headline_with_fragments(text, tsvector, text, tsquery, text) */
+Datum
+tsa_headline_with_fragments(PG_FUNCTION_ARGS)
+{
+    text *cfgname = PG_GETARG_TEXT_P(0);
+    Datum arg1 = PG_GETARG_DATUM(1);
+    Datum arg2 = PG_GETARG_DATUM(2);
+    Datum arg3 = PG_GETARG_DATUM(3);
+    Datum arg4 = PG_GETARG_DATUM(4);
+    Oid config_oid;
+
+    config_oid = TextGetObjectId(regconfigin, cfgname);
+
+    return DirectFunctionCall5(ts_headline_with_fragments,
+                               ObjectIdGetDatum(config_oid),
+                               arg1, arg2, arg3, arg4);
+}
+
 /*
  * tsearch2 version of update trigger
  *
diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.sql.in postgresql-8.3.1/contrib/tsearch2/tsearch2.sql.in
--- postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.sql.in 2007-11-28 20:33:04.0 +0100
+++ postgresql-8.3.1/contrib/tsearch2/tsearch2.sql.in 2008-05-22 11:55:51.0 +0200
@@ -384,6 +384,11 @@
 LANGUAGE INTERNAL
 RETURNS NULL ON NULL INPUT IMMUTABLE;
 
+CREATE FUNCTION headline_with_fragments(text, text, tsquery, text)
+RETURNS text
+AS 'MODULE_PATHNAME', 'tsa_headline_with_fragments'
+LANGUAGE C RETURNS NULL ON NULL INPUT IMMUTABLE;
+
 -- CREATE the OPERATOR class
 CREATE OPERATOR CLASS gist_tsvector_ops
 FOR TYPE tsvector USING gist
diff -Nrub postgresql-8.3.1-orig/src/backend/tsearch/ts_parse.c postgresql-8.3.1/src/backend/tsearch/ts_parse.c
--- postgresql-8.3.1-orig/src/backend/tsearch/ts_parse.c 2008-01-01 20:45:52.0 +0100
+++ postgresql-8.3.1/src/backend/tsearch/ts_parse.c 2008-05-22 13:02:39.0 +0200
@@ -578,6 +578,111 @@
     FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
 
+#define COVER_SEP "..."
+#define COVER_SEP_LEN (sizeof(COVER_SEP)-1)
+
+void
+hlparsetext_with_covers(Oid cfgId,
+                        HeadlineParsedText *prs,
+                        TSQuery query,
+                        text *in,
+                        struct coverpos *covers,
+                        int4 numcovers)
+{
+    TSParserCacheEntry *prsobj;
+    TSConfigCacheEntry *cfg;
+    void *prsdata;
+    LexizeData ldata;
+    int4 icover, startpos, endpos, currentpos = 0;
+
+    char *lemm = NULL;
+    int4 lenlemm = 0;
+    ParsedLex *lexs;
+    int4 type, startHL = 0;
+    TSLexeme *norms;
+    int4 oldnumwords, newnumwords, i;
+
+    cfg = lookup_ts_config_cache(cfgId);
+    prsobj = lookup_ts_parser_cache(cfg->prsId);
+
+    prsdata = (void*) DatumGetPointer(FunctionCall2(&(prsobj->prsstart),
+                                      PointerGetDatum(VARDATA(in)),
+                                      Int32GetDatum(VARSIZE(in) - VARHDRSZ)));
+
+    LexizeInit(&ldata, cfg);
+
+    for (icover = 0; icover < numcovers; icover++)
+    {
+        if (!covers[icover].in)
+            continue;
+
+        startpos = covers[icover].startpos;
+        endpos = covers[icover].endpos;
+
+        if (currentpos > endpos)
+        {
+            /* XXX - something wrong ... we have gone past the cover */
+            continue;
+        }
+
+        /* see if we need to add a cover seperator */
+        if (currentpos < startpos && startpos > 0)
+        {
+            hladdword(prs, COVER_SEP, COVER_SEP_LEN, 3);
+            prs->words[prs->curwords - 1].in = 1;
+        }
+
+        do
+        {
+            type = DatumGetInt32(FunctionCall3(&(prsobj->prstoken),
Re: [GENERAL] Fragments in tsearch2 headline
Where are we on this?

---------------------------------------------------------------------------

Teodor Sigaev wrote:
> > The patch takes into account the corner case of overlap. Here is the
> > code for that
> > // start check
> > if (!startHL && *currentpos >= startpos)
> >    startHL = 1;
> >
> > The headline generation will not start until currentpos has gone past
> > startpos.
> Ok
>
> > You can also check how this headline function is working at my website
> > indiankanoon.com. Some example queries are murder, freedom of speech,
> > freedom of press etc.
> Looks good.
>
> > Should I develop the patch for the current cvs head of postgres?
>
> I'd like to commit your patch, but if it should be:
>   - for current HEAD
>   - as extension of existing ts_headline.
>
> --
> Teodor Sigaev                      E-mail: [EMAIL PROTECTED]
>                                       WWW: http://www.sigaev.ru/

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [GENERAL] Fragments in tsearch2 headline
> The patch takes into account the corner case of overlap. Here is the
> code for that
> // start check
> if (!startHL && *currentpos >= startpos)
>    startHL = 1;
>
> The headline generation will not start until currentpos has gone past
> startpos.

Ok

> You can also check how this headline function is working at my website
> indiankanoon.com. Some example queries are murder, freedom of speech,
> freedom of press etc.

Looks good.

> Should I develop the patch for the current cvs head of postgres?

I'd like to commit your patch, but it should be:
  - for current HEAD
  - an extension of the existing ts_headline.

--
Teodor Sigaev                      E-mail: [EMAIL PROTECTED]
                                      WWW: http://www.sigaev.ru/
Re: [GENERAL] Fragments in tsearch2 headline
Ah, I missed this email. I agree with Teodor that this is not the best way to implement this functionality. At the time I was in a bit of a hurry to have something better than the default one and just hacked this together. And if we want to have this functionality across languages and parsers, it will be better to implement it in the general framework.

The patch takes into account the corner case of overlap. Here is the code for that:

// start check
if (!startHL && *currentpos >= startpos)
   startHL = 1;

The headline generation will not start until currentpos has gone past startpos.

You can also check how this headline function is working at my website indiankanoon.com. Some example queries are murder, freedom of speech, freedom of press etc.

Should I develop the patch for the current cvs head of postgres?

Thanks,
-Sushant.

On Mon, 2008-03-17 at 22:00 +0300, Teodor Sigaev wrote:
> > Teodor, Oleg, do we want this?
> > http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php
>
> I suppose, we want it. But there are a questions/issues:
> - Is it needed to introduce new function? may be it will be better to add
>   option to existing headline function. I'd like to keep current layout:
>   ts_headline provides some common interface to headline generation.
>   Finding and marking fragments is deal of parser's headline method and
>   generation of exact pieces of text is made by ts_headline.
> - Covers may be overlapped. So, overlapped fragments will be looked odd.
>
> In any case, the patch was developed for contrib version of tsearch.
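
The overlap handling described in this message can be illustrated with a small stand-alone sketch. It is not the patch code: `emit_covers` is a hypothetical helper, words are addressed by list index rather than by parser position, and covers are assumed to arrive sorted by start. It keeps a running position so that a word shared by two overlapping covers is emitted only once, and it adds the separator only when there is a real gap before a cover.

```python
def emit_covers(words, covers, sep="..."):
    """Join the words of several covers, emitting each word at most once.

    words  -- list of tokens of the document
    covers -- (start, end) inclusive index pairs, sorted by start
    """
    out = []
    current = 0  # index of the next word that has not been emitted yet
    for start, end in covers:
        if current > end:
            continue            # this cover was already emitted completely
        if current < start and start > 0:
            out.append(sep)     # there is a gap before this cover
        begin = max(current, start)  # skip the part shared with the previous cover
        out.extend(words[begin:end + 1])
        current = end + 1
    return " ".join(out)


words = "1 2 3 4 5 3 4 abc x y z 2 3".split()
print(emit_covers(words, [(1, 4), (3, 8)]))  # ... 2 3 4 5 3 4 abc x
```

In the second cover of the usage line, indexes 3 and 4 were already emitted by the first one, so output resumes at index 5 with no separator in between.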
Re: [GENERAL] Fragments in tsearch2 headline
> Teodor, Oleg, do we want this?
> http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php

I suppose we want it. But there are some questions/issues:
- Is it necessary to introduce a new function? Maybe it would be better to add an option to the existing headline function. I'd like to keep the current layout: ts_headline provides a common interface to headline generation. Finding and marking fragments is the job of the parser's headline method, and generating the exact pieces of text is done by ts_headline.
- Covers may overlap, so overlapping fragments will look odd.

In any case, the patch was developed for the contrib version of tsearch.

--
Teodor Sigaev                      E-mail: [EMAIL PROTECTED]
                                      WWW: http://www.sigaev.ru/
Re: [GENERAL] Fragments in tsearch2 headline
Teodor, Oleg, do we want this?

	http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php

---------------------------------------------------------------------------

Sushant Sinha wrote:
> I wrote a headline generation function for my app and I have attached
> the patch (against the cvs head). It generates multiple contexts in
> which the query appears. Essentially, it uses the cover function to
> generate all covers, chooses smallest covers and stretches each
> selected cover according to the chosen parameters. I think ideally
> changes should be made to prsd_headline function but I couldn't
> understand that segment of code well.
>
> The sql interface is
>
> headline_with_fragments(text parser, tsvector docvector, text doc,
> tsquery queryin, int4 maxcoverSize, int4 mincoverSize, int4 maxWords)
> RETURNS text
>
> This will generate headline that contain maxWords and each cover
> stretched to maxcoverSize. It will not add any fragment with less than
> mincoverSize.
> I am running my app with maxcoverSize = 20, mincoverSize = 5, maxWords = 40.
> So it shows roughly two fragments per query.
>
> If Teoder or Oleg want to add this to main branch, I will be happy to
> clean it up and test it better.
>
> -Sushant.
>
> On Oct 31, 2007 6:26 PM, Catalin Marinas <[EMAIL PROTECTED]> wrote:
> > On 30/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> > > ok, then you have to formalize many things - how long should be excerpts,
> > > how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> > > which produces all excerpts like:
> > >
> > > =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'),
> > > '2&3'::tsquery);
> > >                    get_covers
> > > ------------------------------------------------
> > >  1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> > > (1 row)
> >
> > This function generates the lexemes, so cannot be used directly, but
> > it is probably a good starting point.
> >
> > > Once you formalize your requirements, you can look on it and adapt to your
> > > needs (and share with people). I think it could be nice contrib module.
> >
> > It seems that Sushant already wants to implement this function. He
> > would probably be faster than me :-) (I'm relatively new to db stuff).
> > Since I mainly rely on whatever a web hosting company provides, I'll
> > probably stick with a Python implementation outside the SQL query.
> >
> > Thanks for your answers.
> >
> > --
> > Catalin
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 5: don't forget to increase your free space map settings

[ Attachment, skipping... ]

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [GENERAL] Fragments in tsearch2 headline
This has been saved for the 8.4 release:

	http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Sushant Sinha wrote:
> I wrote a headline generation function for my app and I have attached
> the patch (against the cvs head). It generates multiple contexts in
> which the query appears. Essentially, it uses the cover function to
> generate all covers, chooses smallest covers and stretches each
> selected cover according to the chosen parameters. I think ideally
> changes should be made to prsd_headline function but I couldn't
> understand that segment of code well.
>
> The sql interface is
>
> headline_with_fragments(text parser, tsvector docvector, text doc,
> tsquery queryin, int4 maxcoverSize, int4 mincoverSize, int4 maxWords)
> RETURNS text
>
> This will generate headline that contain maxWords and each cover
> stretched to maxcoverSize. It will not add any fragment with less than
> mincoverSize.
> I am running my app with maxcoverSize = 20, mincoverSize = 5, maxWords = 40.
> So it shows roughly two fragments per query.
>
> If Teoder or Oleg want to add this to main branch, I will be happy to
> clean it up and test it better.
>
> -Sushant.
>
> On Oct 31, 2007 6:26 PM, Catalin Marinas <[EMAIL PROTECTED]> wrote:
> > On 30/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> > > ok, then you have to formalize many things - how long should be excerpts,
> > > how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> > > which produces all excerpts like:
> > >
> > > =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'),
> > > '2&3'::tsquery);
> > >                    get_covers
> > > ------------------------------------------------
> > >  1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> > > (1 row)
> >
> > This function generates the lexemes, so cannot be used directly, but
> > it is probably a good starting point.
> >
> > > Once you formalize your requirements, you can look on it and adapt to your
> > > needs (and share with people). I think it could be nice contrib module.
> >
> > It seems that Sushant already wants to implement this function. He
> > would probably be faster than me :-) (I'm relatively new to db stuff).
> > Since I mainly rely on whatever a web hosting company provides, I'll
> > probably stick with a Python implementation outside the SQL query.
> >
> > Thanks for your answers.
> >
> > --
> > Catalin

[ Attachment, skipping... ]

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [GENERAL] Fragments in tsearch2 headline
I wrote a headline generation function for my app and I have attached the patch (against the cvs head). It generates multiple contexts in which the query appears. Essentially, it uses the cover function to generate all covers, chooses the smallest covers and stretches each selected cover according to the chosen parameters. I think ideally changes should be made to the prsd_headline function, but I couldn't understand that segment of code well.

The sql interface is

headline_with_fragments(text parser, tsvector docvector, text doc,
tsquery queryin, int4 maxcoverSize, int4 mincoverSize, int4 maxWords)
RETURNS text

This will generate a headline that contains maxWords, with each cover stretched to maxcoverSize. It will not add any fragment with less than mincoverSize.
I am running my app with maxcoverSize = 20, mincoverSize = 5, maxWords = 40. So it shows roughly two fragments per query.

If Teodor or Oleg want to add this to the main branch, I will be happy to clean it up and test it better.

-Sushant.

On Oct 31, 2007 6:26 PM, Catalin Marinas <[EMAIL PROTECTED]> wrote:
> On 30/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> > ok, then you have to formalize many things - how long should be excerpts,
> > how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> > which produces all excerpts like:
> >
> > =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'),
> > '2&3'::tsquery);
> >                    get_covers
> > ------------------------------------------------
> >  1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> > (1 row)
>
> This function generates the lexemes, so cannot be used directly, but
> it is probably a good starting point.
>
> > Once you formalize your requirements, you can look on it and adapt to your
> > needs (and share with people). I think it could be nice contrib module.
>
> It seems that Sushant already wants to implement this function. He
> would probably be faster than me :-) (I'm relatively new to db stuff).
> Since I mainly rely on whatever a web hosting company provides, I'll
> probably stick with a Python implementation outside the SQL query.
>
> Thanks for your answers.
>
> --
> Catalin

? GNUmakefile
? config.log
? config.status
? contrib/tsearch2/rank.h
? src/Makefile.global
? src/include/pg_config.h
? src/include/stamp-h
? src/interfaces/ecpg/include/ecpg_config.h
Index: contrib/tsearch2/rank.c
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/tsearch2/rank.c,v
retrieving revision 1.23
diff -c -r1.23 rank.c
*** contrib/tsearch2/rank.c 27 Feb 2007 23:48:06 - 1.23
--- contrib/tsearch2/rank.c 12 Nov 2007 03:28:20 -
***************
*** 21,26 ****
--- 21,27 ----
  #include "tsvector.h"
  #include "query.h"
  #include "common.h"
+ #include "rank.h"
  
  PG_FUNCTION_INFO_V1(rank);
  Datum rank(PG_FUNCTION_ARGS);
***************
*** 419,433 ****
  }
  
  
- typedef struct
- {
-     ITEM **item;
-     int16 nitem;
-     bool needfree;
-     uint8 wclass;
-     int32 pos;
- } DocRepresentation;
- 
  static int
  compareDocR(const void *a, const void *b)
  {
--- 420,425 ----
***************
*** 457,473 ****
      }
  }
  
! typedef struct
! {
!     int pos;
!     int p;
!     int q;
!     DocRepresentation *begin;
!     DocRepresentation *end;
! } Extention;
! 
! 
! static bool
  Cover(DocRepresentation * doc, int len, QUERYTYPE * query, Extention * ext)
  {
      DocRepresentation *ptr;
--- 449,455 ----
      }
  }
  
! bool
  Cover(DocRepresentation * doc, int len, QUERYTYPE * query, Extention * ext)
  {
      DocRepresentation *ptr;
***************
*** 538,544 ****
      return Cover(doc, len, query, ext);
  }
  
! static DocRepresentation *
  get_docrep(tsvector * txt, QUERYTYPE * query, int *doclen)
  {
      ITEM *item = GETQUERY(query);
--- 520,526 ----
      return Cover(doc, len, query, ext);
  }
  
! DocRepresentation *
  get_docrep(tsvector * txt, QUERYTYPE * query, int *doclen)
  {
      ITEM *item = GETQUERY(query);
Index: contrib/tsearch2/ts_cfg.c
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/tsearch2/ts_cfg.c,v
retrieving revision 1.24
diff -c -r1.24 ts_cfg.c
*** contrib/tsearch2/ts_cfg.c 6 Apr 2007 04:21:41 - 1.24
--- contrib/tsearch2/ts_cfg.c 12 Nov 2007 03:28:20 -
***************
*** 646,648 ****
--- 646,715 ----
      ts_error(NOTICE, "TSearch cache cleaned");
      PG_RETURN_VOID();
  }
+ 
+ void
+ add_cover_to_hl(WParserInfo *prsobj, HLPRSTEXT *prs, QUERYTYPE *query, LexizeData *ldata, int4 *currentpos, int4 startpos, int4 endpos)
+ {
+     char *lemm = NULL;
+     int4 lenlemm = 0;
+     ParsedLex *lexs;
+     int4 type, startHL = 0;
+     TSLexeme *norms;
+     char *coversep = " ... ";
+     int4 coverseplen = strlen(coversep);
+     int4 oldnumwords, newnumwords, i;
+     if (*cur
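
The selection strategy described at the top of this message (smallest covers first, each stretched to maxcoverSize, stop at maxWords) can be sketched as follows. This is one possible reading of the prose, not a transcription of the truncated patch: `select_fragments` is a hypothetical helper, positions are 0-based word indexes, stretching is symmetric, "will not add any fragment with less than mincoverSize" is interpreted as "stop once fewer than min_cover words of budget remain", and overlap between stretched fragments is not handled here.

```python
def select_fragments(covers, doc_len, max_cover=20, min_cover=5, max_words=40):
    """Pick the smallest covers and stretch each one to at most max_cover words.

    covers  -- (start, end) inclusive word-index pairs
    doc_len -- number of words in the document
    Returns the stretched fragments in document order.
    """
    fragments = []
    budget = max_words
    # Smallest covers first: they are the tightest matches of the query.
    for start, end in sorted(covers, key=lambda c: c[1] - c[0]):
        if budget < min_cover:
            break                      # not enough room for a useful fragment
        target = min(max_cover, budget)
        size = end - start + 1
        if size > target:
            continue                   # cover itself is too long to show
        # Stretch symmetrically, then clip to the document boundaries.
        new_start = max(0, start - (target - size) // 2)
        new_end = min(doc_len - 1, new_start + target - 1)
        new_start = max(0, new_end - target + 1)
        fragments.append((new_start, new_end))
        budget -= new_end - new_start + 1
    return sorted(fragments)


# Defaults mirror the settings quoted in the message (20 / 5 / 40).
print(select_fragments([(10, 12), (50, 51), (80, 95)], doc_len=100))
# [(2, 21), (41, 60)]
```

With the quoted settings the 40-word budget is used up by two 20-word fragments, which is consistent with the "roughly two fragments per query" remark.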
Re: [GENERAL] Fragments in tsearch2 headline
On 30/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> ok, then you have to formalize many things - how long should be excerpts,
> how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> which produces all excerpts like:
>
> =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'),
> '2&3'::tsquery);
>                    get_covers
> ------------------------------------------------
>  1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> (1 row)

This function generates the lexemes, so cannot be used directly, but it is probably a good starting point.

> Once you formalize your requirements, you can look on it and adapt to your
> needs (and share with people). I think it could be nice contrib module.

It seems that Sushant already wants to implement this function. He would probably be faster than me :-) (I'm relatively new to db stuff). Since I mainly rely on whatever a web hosting company provides, I'll probably stick with a Python implementation outside the SQL query.

Thanks for your answers.

--
Catalin
Re: [GENERAL] Fragments in tsearch2 headline
On Tue, 30 Oct 2007, Tom Lane wrote:

>> On 10/30/07, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
>>> ... In tsearch2 we have get_covers() function,
>>> which produces all excerpts like:
>
> I had not realized till just now that the 8.3 core version of tsearch
> omitted any material feature of contrib/tsearch2. Why was get_covers()
> left out?

At that time we considered it a developers' function, useful only for debugging.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [GENERAL] Fragments in tsearch2 headline
> On 10/30/07, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
>> ... In tsearch2 we have get_covers() function,
>> which produces all excerpts like:

I had not realized till just now that the 8.3 core version of tsearch omitted any material feature of contrib/tsearch2. Why was get_covers() left out?

			regards, tom lane
Re: [GENERAL] Fragments in tsearch2 headline
This is a nice idea and seems easy to implement. I will try to write it down and send a patch to the mailing list.

I was also working to add support for phrase search. Currently, to check for a phrase you have to match the entire document. It will be better if a filter like are_words_consecutive(tsvector *t, tsquery *q) can be added to reduce the number of matching documents before we actually do the phrase search. Do you think this will improve the performance of phrase search? If so, I would like to write this function and send a patch.

-Sushant.

On 10/30/07, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> On Tue, 30 Oct 2007, Catalin Marinas wrote:
>
> > On 30/10/2007, Richard Huxton <[EMAIL PROTECTED]> wrote:
> >> Oleg Bartunov wrote:
> >>> Catalin,
> >>>
> >>> what is your need ? What's wrong with this ?
> >>>
> >>> postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
> >>> xyz','2'::tsquery, 'StartSel=...,StopSel=...')
> >>> ;
> >>>                    ts_headline
> >>> -------------------------------------------------
> >>>  1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz
> >>
> >> I think he want's something like: "1 2 3 ... abc 2 3 ..."
> >>
> >> A few characters of context around each match and then ... between. Kind
> >> of like grep -C.
> >
> > That's pretty much correct (with the difference that I'd like context
> > of words rather than lines as in "grep" and StartSel=,
> > StopSel=).
> >
> > Since the text I want a headline for might be pretty long (tens of
> > lines), I'd like to only show the excerpts around the matching words.
> > Similar to the above example:
> >
> > select ts_headline('1 2 3 4 5 3 4 abc x y z 2 3', '2 & abc'::tsquery);
> >
> > should give:
> >
> > '1 2 3 4 ... 3 4 abc x y'
> >
> > Currently, if you limit the maximum words so that 'abc' is too far, it
> > only highlights the first match.
>
> ok, then you have to formalize many things - how long should be excerpts,
> how much excerpts to show, etc. In tsearch2 we have get_covers() function,
> which produces all excerpts like:
>
> =# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'),
> '2&3'::tsquery);
>                    get_covers
> ------------------------------------------------
>  1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
> (1 row)
>
> Once you formalize your requirements, you can look on it and adapt to your
> needs (and share with people). I think it could be nice contrib module.
>
> > Many of the search engines (including google) show the headline this
> > way. I think Lucene can do this as well but I've never used it to be
> > sure.
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
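
The proposed are_words_consecutive() filter can be sketched on plain position lists. This is only an illustration of the idea, not a backend function: the Python signature is hypothetical, the tsvector is modelled as a dict from lexeme to a set of positions, and the phrase is a list of lexemes in order. It also assumes consecutive words have consecutive positions, which is not true when stop words were dropped between them.

```python
def are_words_consecutive(positions, phrase):
    """Return True if the lexemes of `phrase` occur at consecutive positions.

    positions -- dict mapping lexeme -> set of word positions in the document
    phrase    -- list of lexemes in the order they must appear
    """
    if not phrase:
        return False
    # Try every occurrence of the first lexeme as the start of the phrase.
    return any(
        all(start + offset in positions.get(word, ())
            for offset, word in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )


# Hand-written positions for the document "a b c a c".
doc = {"a": {1, 4}, "b": {2}, "c": {3, 5}}
print(are_words_consecutive(doc, ["a", "c"]))  # True (positions 4 and 5)
print(are_words_consecutive(doc, ["b", "a"]))  # False
```

Since the check touches only the position lists and never the document text, it could run as a cheap pre-filter before the exact phrase comparison.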
Re: [GENERAL] Fragments in tsearch2 headline
On Tue, 30 Oct 2007, Catalin Marinas wrote:

> On 30/10/2007, Richard Huxton <[EMAIL PROTECTED]> wrote:
>> Oleg Bartunov wrote:
>>> Catalin,
>>>
>>> what is your need ? What's wrong with this ?
>>>
>>> postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
>>> xyz','2'::tsquery, 'StartSel=...,StopSel=...')
>>> ;
>>>                    ts_headline
>>> -------------------------------------------------
>>>  1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz
>>
>> I think he want's something like: "1 2 3 ... abc 2 3 ..."
>>
>> A few characters of context around each match and then ... between. Kind
>> of like grep -C.
>
> That's pretty much correct (with the difference that I'd like context
> of words rather than lines as in "grep" and StartSel=,
> StopSel=).
>
> Since the text I want a headline for might be pretty long (tens of
> lines), I'd like to only show the excerpts around the matching words.
> Similar to the above example:
>
> select ts_headline('1 2 3 4 5 3 4 abc x y z 2 3', '2 & abc'::tsquery);
>
> should give:
>
> '1 2 3 4 ... 3 4 abc x y'
>
> Currently, if you limit the maximum words so that 'abc' is too far, it
> only highlights the first match.

ok, then you have to formalize many things - how long should be excerpts,
how much excerpts to show, etc. In tsearch2 we have get_covers() function,
which produces all excerpts like:

=# select get_covers(to_tsvector('1 2 3 4 5 3 4 abc x y z 2 3'), '2&3'::tsquery);
                   get_covers
------------------------------------------------
 1 {1 2 3 }1 4 5 {2 3 4 abc x y z {3 2 }2 3 }3
(1 row)

Once you formalize your requirements, you can look on it and adapt to your
needs (and share with people). I think it could be nice contrib module.

> Many of the search engines (including google) show the headline this
> way. I think Lucene can do this as well but I've never used it to be
> sure.

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
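
To make the get_covers() output above easier to read, here is a small sketch that finds the same three covers for the query '2&3'. It is not the tsearch2 implementation: `minimal_covers` is a hypothetical helper that handles only AND queries, works on a plain token list with 0-based indexes, and ignores normalisation and stop words. A cover here is a shortest window that contains every query term at least once.

```python
def minimal_covers(tokens, terms):
    """Return (start, end) index pairs of the minimal windows holding all terms."""
    terms = set(terms)
    last_seen = {}   # term -> index of its most recent occurrence
    covers = []
    for end, tok in enumerate(tokens):
        if tok not in terms:
            continue
        last_seen[tok] = end
        if len(last_seen) < len(terms):
            continue                     # some term has not appeared yet
        start = min(last_seen.values())  # tightest window ending at `end`
        # A window whose start did not move contains the previous cover,
        # so it is not minimal and is skipped.
        if not covers or start > covers[-1][0]:
            covers.append((start, end))
    return covers


tokens = "1 2 3 4 5 3 4 abc x y z 2 3".split()
print(minimal_covers(tokens, {"2", "3"}))  # [(1, 2), (5, 11), (11, 12)]
```

The three pairs correspond to the `{1 ... }1`, `{2 ... }2` and `{3 ... }3` brackets in the psql output: "2 3", then "3 4 abc x y z 2", then the final "2 3".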
Re: [GENERAL] Fragments in tsearch2 headline
On 30/10/2007, Richard Huxton <[EMAIL PROTECTED]> wrote:
> Oleg Bartunov wrote:
> > Catalin,
> >
> > what is your need ? What's wrong with this ?
> >
> > postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3
> > xyz','2'::tsquery, 'StartSel=...,StopSel=...')
> > ;
> >                 ts_headline
> > --------------------------------------------
> >  1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz
>
> I think he wants something like: "1 2 3 ... abc 2 3 ..."
>
> A few characters of context around each match and then ... between. Kind
> of like grep -C.

That's pretty much correct (with the difference that I'd like the context measured in words rather than lines as in "grep", and StartSel=<b>, StopSel=</b>). Since the text I want a headline for might be pretty long (tens of lines), I'd like to show only the excerpts around the matching words. Similar to the above example:

select ts_headline('1 2 3 4 5 3 4 abc x y z 2 3', '2 & abc'::tsquery);

should give:

'1 2 3 4 ... 3 4 abc x y'

Currently, if you limit the maximum number of words so that 'abc' falls outside the headline, only the first match is highlighted.

Many of the search engines (including google) show the headline this way. I think Lucene can do this as well but I've never used it to be sure.

--
Catalin
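To pin down the requested behaviour, here is a small Python sketch of a word-based "grep -C" headline. It is only an illustration, not a ts_headline feature: fragment_headline, context and sep are hypothetical names, matching is by exact string with no stemming, and only the first occurrence of each query term gets an excerpt, so the same term does not fill every fragment.

```python
def fragment_headline(text, terms, context=2, sep=" ... "):
    """Build a headline from excerpts of `context` words on each side of
    the first occurrence of every query term. Overlapping or adjacent
    excerpts are merged; the rest are joined with `sep`."""
    words = text.split()
    first_hit = {}  # term -> 0-based index of its first occurrence
    for i, word in enumerate(words):
        if word in terms and word not in first_hit:
            first_hit[word] = i

    spans = []  # [lo, hi) slices into `words`, kept in document order
    for i in sorted(first_hit.values()):
        lo = max(0, i - context)
        hi = min(len(words), i + context + 1)
        if spans and lo <= spans[-1][1]:
            # Touches the previous excerpt: extend it instead of
            # inserting a separator between neighbouring words.
            spans[-1][1] = max(spans[-1][1], hi)
        else:
            spans.append([lo, hi])
    return sep.join(" ".join(words[lo:hi]) for lo, hi in spans)


if __name__ == "__main__":
    print(fragment_headline("1 2 3 4 5 3 4 abc x y z 2 3", {"2", "abc"}))
    # 1 2 3 4 ... 3 4 abc x y
```

With two words of context the sketch returns exactly the string asked for above. Wrapping the matched words in StartSel/StopSel markers is left out to keep the example short.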
Re: [GENERAL] Fragments in tsearch2 headline
Oleg Bartunov wrote:
> Catalin,
>
> what is your need ? What's wrong with this ?
>
> postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3 xyz','2'::tsquery, 'StartSel=...,StopSel=...')
> ;
>                 ts_headline
> --------------------------------------------
>  1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz

I think he wants something like: "1 2 3 ... abc 2 3 ..."

A few characters of context around each match and then ... between. Kind of like grep -C.

--
Richard Huxton
Archonet Ltd
Re: [GENERAL] Fragments in tsearch2 headline
Catalin,

what is your need? What's wrong with this?

postgres=# select ts_headline('1 2 3 4 5 3 4 abc abc 2 3 xyz','2'::tsquery, 'StartSel=...,StopSel=...')
;
                ts_headline
--------------------------------------------
 1 ...2... 3 4 5 3 4 abc abc ...2... 3 xyz

Oleg

On Tue, 30 Oct 2007, Catalin Marinas wrote:

> On 28/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> > On Sat, 27 Oct 2007, Tom Lane wrote:
> >
> > > "Catalin Marinas" <[EMAIL PROTECTED]> writes:
> > >> Is there an easy way to generate a headline from separate fragments
> > >> containing the search words and maybe separated by "..."?
> > >
> > > Hmm, the documentation for ts_headline claims it does this already: [...]
> > > However, a quick look at the code suggests this is a lie --- I see no
> > > evidence whatever that there's any smarts for putting in ellipses.
> >
> > Probably documentation is not correct here. 'ellipsis-separated' should be
> > treated as a general wording. Default highlighting is <b>..</b> as it
> > stated below in docs.
>
> It seems that I'll have to implement the headline outside the query
> (Python, in my case). I would use to_tsvector and to_tsquery to
> generate the lexemes and the word position, add them to a hash table
> and use the position of the matching lexemes to generate the headline.
>
> I could also highlight the full text and generate the headline I want
> based on it but if I limit the number of excerpts, it gets complicated
> to avoid the same lexeme being shown in all excerpts. Is a lexeme
> always a substring of the corresponding token (so that I can use
> simple regexp)?
>
> Any other ideas?
>
> Thanks.

Regards,
Oleg
Re: [GENERAL] Fragments in tsearch2 headline
On 28/10/2007, Oleg Bartunov <[EMAIL PROTECTED]> wrote:
> On Sat, 27 Oct 2007, Tom Lane wrote:
>
> > "Catalin Marinas" <[EMAIL PROTECTED]> writes:
> >> Is there an easy way to generate a headline from separate fragments
> >> containing the search words and maybe separated by "..."?
> >
> > Hmm, the documentation for ts_headline claims it does this already: [...]
> > However, a quick look at the code suggests this is a lie --- I see no
> > evidence whatever that there's any smarts for putting in ellipses.
>
> Probably documentation is not correct here. 'ellipsis-separated' should be
> treated as a general wording. Default highlighting is <b>..</b> as it
> stated below in docs.

It seems that I'll have to implement the headline outside the query (Python, in my case). I would use to_tsvector and to_tsquery to generate the lexemes and their word positions, add them to a hash table, and use the positions of the matching lexemes to generate the headline.

I could also highlight the full text and generate the headline I want from that, but if I limit the number of excerpts, it gets complicated to avoid the same lexeme being shown in all excerpts. Is a lexeme always a substring of the corresponding token (so that I can use a simple regexp)?

Any other ideas?

Thanks.

--
Catalin
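The hash-table step described above could look roughly like the Python sketch below. It works on the text form of a tsvector ('lexeme':pos,pos ...) as it would arrive in a client after casting to text. The function names are hypothetical, the sample string is hand-written in that format rather than captured from a server, and the regular expression covers only the common case (doubled single quotes inside a lexeme, optional A-D weight letters after a position).

```python
import re

# One tsvector entry in text form: 'lexeme':pos[,pos...], where each
# position may carry a weight letter (A-D) and a quote inside the
# lexeme is written as two quotes.
_ENTRY = re.compile(r"'((?:[^']|'')*)':(\d+[A-D]?(?:,\d+[A-D]?)*)")


def positions_by_lexeme(tsvector_text):
    """Map each lexeme to the list of word positions where it occurs."""
    table = {}
    for lexeme, pos_list in _ENTRY.findall(tsvector_text):
        lexeme = lexeme.replace("''", "'")
        table[lexeme] = [int(p.rstrip("ABCD")) for p in pos_list.split(",")]
    return table


def matching_positions(tsvector_text, query_lexemes):
    """Return {position: lexeme} for the query lexemes found in the
    document, ordered by position."""
    table = positions_by_lexeme(tsvector_text)
    hits = {}
    for lexeme in query_lexemes:
        for pos in table.get(lexeme, []):
            hits[pos] = lexeme
    return dict(sorted(hits.items()))


if __name__ == "__main__":
    sample = "'1':1 '2':2,12 '3':3,6,13 'abc':8A"  # hand-written sample
    print(matching_positions(sample, ["2", "abc"]))
    # {2: '2', 8: 'abc', 12: '2'}
```

Two caveats when mapping these positions back onto the original text: positions count the words the parser indexes, with stop words still consuming a position, so the client has to split the text the same way the parser does; and a stemmed lexeme is not always a substring of its token (the English Snowball stemmer turns "ponies" into "poni"), so a plain regexp search for the lexeme can miss matches.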
Re: [GENERAL] Fragments in tsearch2 headline
On Sat, 27 Oct 2007, Tom Lane wrote:

> "Catalin Marinas" <[EMAIL PROTECTED]> writes:
> > Is there an easy way to generate a headline from separate fragments
> > containing the search words and maybe separated by "..."?
>
> Hmm, the documentation for ts_headline claims it does this already:
>
>     ts_headline accepts a document along with a query, and returns one or
>     more ellipsis-separated excerpts from the document in which terms from
>     the query are highlighted.
>
> However, a quick look at the code suggests this is a lie --- I see no
> evidence whatever that there's any smarts for putting in ellipses.
>
> Oleg, Teodor, is there something I missed here? Or do we need to change
> the documentation?

The documentation is probably not correct here. 'ellipsis-separated' should be treated as general wording. The default highlighting is <b>..</b>, as stated further down in the docs.

postgres=# select ts_headline('this is a highlighted text','highlight'::tsquery, 'StartSel=...,StopSel=...')
postgres-# ;
           ts_headline
----------------------------------
 this is a ...highlighted... text

Regards,
Oleg
Re: [GENERAL] Fragments in tsearch2 headline
"Catalin Marinas" <[EMAIL PROTECTED]> writes:
> Is there an easy way to generate a headline from separate fragments
> containing the search words and maybe separated by "..."?

Hmm, the documentation for ts_headline claims it does this already:

    ts_headline accepts a document along with a query, and returns one or
    more ellipsis-separated excerpts from the document in which terms from
    the query are highlighted.

However, a quick look at the code suggests this is a lie --- I see no evidence whatever that there's any smarts for putting in ellipses.

Oleg, Teodor, is there something I missed here? Or do we need to change the documentation?

regards, tom lane
[GENERAL] Fragments in tsearch2 headline
Hi,

(I first posted this via google groups and realised that I have to be subscribed; now posting directly.)

I searched the list but couldn't find anyone raising the issue (or it might simply be my way of using the tool).

I'd like to search through some text documents for words and generate headlines. The search works fine but, if the words are far apart in the document, the headline only highlights one of them. Enlarging the headline with MinWords/MaxWords is not an option as I have limited space for displaying it.

Is there an easy way to generate a headline from separate fragments containing the search words and maybe separated by "..."? One option is to generate separate headlines and concatenate them before displaying (which is a problem when the words fall in the same fragment). Another option would be to somehow get the positions of the found words in the document (does anyone know how?) and generate the headline myself.

Any other ideas?

Thanks.

--
Catalin