Given a document and a query, the goal of headline generation is to produce text excerpts from the document in which the query terms appear. Currently, headline generation in Postgres follows these steps:
1. Tokenize the document and obtain the lexemes.
2. Decide which lexemes should be part of the headline.
3. Generate the headline.

The time taken by headline generation is therefore directly dependent on the size of the document: the longer the document, the longer tokenization takes and the more lexemes there are to operate on. Most of the time is spent in the tokenization phase, so for very large documents headline generation is very expensive.

Here is a simple patch that limits the number of words processed during the tokenization phase and thereby puts an upper bound on the cost of headline generation. The headline function takes a parameter, MaxParsedWords. If this parameter is negative or not supplied, the entire document is tokenized and operated on (the current behavior). If MaxParsedWords is a positive number, tokenization stops once MaxParsedWords tokens have been obtained, and the rest of the headline generation operates only on the tokens obtained up to that point.

The patch applies to 9.1rc1. It lacks documentation changes and test cases; I will add them if you folks agree on the functionality.

-Sushant.
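With the patch applied, the new option would be passed through the existing ts_headline options string, alongside the current MaxWords/MinWords options, since the patch reads it out of the same deserialized option list. A sketch of the intended usage (the table articles and column body are hypothetical, just for illustration):

```sql
-- Cap tokenization at the first 5000 words of each document; longer
-- documents contribute only their leading tokens to headline selection.
SELECT ts_headline('english', body,
                   to_tsquery('english', 'postgres & headline'),
                   'MaxParsedWords=5000, MaxWords=35, MinWords=15')
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & headline');
```

Omitting MaxParsedWords (or passing a negative value) keeps today's behavior of parsing the whole document.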
diff -ru postgresql-9.1rc1/src/backend/tsearch/ts_parse.c postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c
--- postgresql-9.1rc1/src/backend/tsearch/ts_parse.c	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c	2011-08-23 21:27:10.000000000 +0530
@@ -525,10 +525,11 @@
 }
 
 void
-hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen)
+hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen, int max_parsed_words)
 {
 	int			type,
-				lenlemm;
+				lenlemm,
+				numparsed = 0;
 	char	   *lemm = NULL;
 	LexizeData	ldata;
 	TSLexeme   *norms;
@@ -580,8 +581,8 @@
 			else
 				addHLParsedLex(prs, query, lexs, NULL);
 		} while (norms);
-
-	} while (type > 0);
+		numparsed += 1;
+	} while (type > 0 && (max_parsed_words < 0 || numparsed < max_parsed_words));
 
 	FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
diff -ru postgresql-9.1rc1/src/backend/tsearch/wparser.c postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c
--- postgresql-9.1rc1/src/backend/tsearch/wparser.c	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c	2011-08-23 21:30:12.000000000 +0530
@@ -304,6 +304,8 @@
 	text	   *out;
 	TSConfigCacheEntry *cfg;
 	TSParserCacheEntry *prsobj;
+	ListCell   *l;
+	int			max_parsed_words = -1;
 
 	cfg = lookup_ts_config_cache(PG_GETARG_OID(0));
 	prsobj = lookup_ts_parser_cache(cfg->prsId);
@@ -317,13 +319,21 @@
 	prs.lenwords = 32;
 	prs.words = (HeadlineWordEntry *) palloc(sizeof(HeadlineWordEntry) * prs.lenwords);
 
-	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ);
 
 	if (opt)
 		prsoptions = deserialize_deflist(PointerGetDatum(opt));
 	else
 		prsoptions = NIL;
 
+	foreach(l, prsoptions)
+	{
+		DefElem    *defel = (DefElem *) lfirst(l);
+		char	   *val = defGetString(defel);
+		if (pg_strcasecmp(defel->defname, "MaxParsedWords") == 0)
+			max_parsed_words = pg_atoi(val, sizeof(int32), 0);
+	}
+
+	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ, max_parsed_words);
 	FunctionCall3(&(prsobj->prsheadline),
 				  PointerGetDatum(&prs),
 				  PointerGetDatum(prsoptions),
diff -ru postgresql-9.1rc1/src/include/tsearch/ts_utils.h postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h
--- postgresql-9.1rc1/src/include/tsearch/ts_utils.h	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h	2011-08-23 21:04:14.000000000 +0530
@@ -98,7 +98,7 @@
  */
 extern void hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query,
-			char *buf, int4 buflen);
+			char *buf, int4 buflen, int max_parsed_words);
 extern text *generateHeadline(HeadlineParsedText *prs);
 
 /*
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers