Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2014-08-06 Thread Bruce Momjian

FYI, I have kept this email from 2011 about the poor performance of
parsing words during headline generation.  If someone wants to research
it, please do so:

http://www.postgresql.org/message-id/1314117620.3700.12.camel@dragflick

---

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha  writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
> 
> > In headline generation function, we don't have any index or knowledge of
> > where the match is. We discover the matches by first tokenizing and then
> > comparing the matches with the query tokens. So it is hard to do
> > anything better than first N words.
> 
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
> 
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
> 
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
> 
>   regards, tom lane
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2013-01-24 Thread Bruce Momjian
On Wed, Aug 15, 2012 at 11:09:18PM +0530, Sushant Sinha wrote:
> I will do the profiling and present the results.

Sushant, do you have any profiling results on this issue from August?

---


> 
> On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > Is this a TODO?
> > 
> > AFAIR nothing's been done about the speed issue, so yes.  I didn't
> > like the idea of creating a user-visible knob when the speed issue
> > might be fixable with internal algorithm improvements, but we never
> > followed up on this in either fashion.
> > 
> > regards, tom lane
> > 
> > > ---
> > 
> > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> > >> Sushant Sinha  writes:
> > >>> Doesn't this force the headline to be taken from the first N words of
> > >>> the document, independent of where the match was?  That seems rather
> > >>> unworkable, or at least unhelpful.
> > >> 
> > >>> In headline generation function, we don't have any index or knowledge of
> > >>> where the match is. We discover the matches by first tokenizing and then
> > >>> comparing the matches with the query tokens. So it is hard to do
> > >>> anything better than first N words.
> > >> 
> > >> After looking at the code in wparser_def.c a bit more, I wonder whether
> > >> this patch is doing what you think it is.  Did you do any profiling to
> > >> confirm that tokenization is where the cost is?  Because it looks to me
> > >> like the match searching in hlCover() is at least O(N^2) in the number
> > >> of tokens in the document, which means it's probably the dominant cost
> > >> for any long document.  I suspect that your patch helps not so much
> > >> because it saves tokenization costs as because it bounds the amount of
> > >> effort spent in hlCover().
> > >> 
> > >> I haven't tried to do anything about this, but I wonder whether it
> > >> wouldn't be possible to eliminate the quadratic blowup by saving more
> > >> state across the repeated calls to hlCover().  At the very least, it
> > >> shouldn't be necessary to find the last query-token occurrence in the
> > >> document from scratch on each and every call.
> > >> 
> > >> Actually, this code seems probably flat-out wrong: won't every
> > >> successful call of hlCover() on a given document return exactly the same
> > >> q value (end position), namely the last token occurrence in the
> > >> document?  How is that helpful?
> > >> 
> > >> regards, tom lane
> > >> 
> > 
> > > -- 
> > >   Bruce Momjian  http://momjian.us
> > >   EnterpriseDB http://enterprisedb.com
> > 
> > >   + It's impossible for everything to be true. +
> > 
> > 
> 
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Sushant Sinha
I will do the profiling and present the results.

On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Is this a TODO?
> 
> AFAIR nothing's been done about the speed issue, so yes.  I didn't
> like the idea of creating a user-visible knob when the speed issue
> might be fixable with internal algorithm improvements, but we never
> followed up on this in either fashion.
> 
>   regards, tom lane
> 
> > ---
> 
> > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> >> Sushant Sinha  writes:
> >>> Doesn't this force the headline to be taken from the first N words of
> >>> the document, independent of where the match was?  That seems rather
> >>> unworkable, or at least unhelpful.
> >> 
> >>> In headline generation function, we don't have any index or knowledge of
> >>> where the match is. We discover the matches by first tokenizing and then
> >>> comparing the matches with the query tokens. So it is hard to do
> >>> anything better than first N words.
> >> 
> >> After looking at the code in wparser_def.c a bit more, I wonder whether
> >> this patch is doing what you think it is.  Did you do any profiling to
> >> confirm that tokenization is where the cost is?  Because it looks to me
> >> like the match searching in hlCover() is at least O(N^2) in the number
> >> of tokens in the document, which means it's probably the dominant cost
> >> for any long document.  I suspect that your patch helps not so much
> >> because it saves tokenization costs as because it bounds the amount of
> >> effort spent in hlCover().
> >> 
> >> I haven't tried to do anything about this, but I wonder whether it
> >> wouldn't be possible to eliminate the quadratic blowup by saving more
> >> state across the repeated calls to hlCover().  At the very least, it
> >> shouldn't be necessary to find the last query-token occurrence in the
> >> document from scratch on each and every call.
> >> 
> >> Actually, this code seems probably flat-out wrong: won't every
> >> successful call of hlCover() on a given document return exactly the same
> >> q value (end position), namely the last token occurrence in the
> >> document?  How is that helpful?
> >> 
> >> regards, tom lane
> >> 
> 
> > -- 
> >   Bruce Momjian  http://momjian.us
> >   EnterpriseDB http://enterprisedb.com
> 
> >   + It's impossible for everything to be true. +
> 
> 






Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Tom Lane
Bruce Momjian  writes:
> Is this a TODO?

AFAIR nothing's been done about the speed issue, so yes.  I didn't
like the idea of creating a user-visible knob when the speed issue
might be fixable with internal algorithm improvements, but we never
followed up on this in either fashion.

regards, tom lane

> ---

> On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
>> Sushant Sinha  writes:
>>> Doesn't this force the headline to be taken from the first N words of
>>> the document, independent of where the match was?  That seems rather
>>> unworkable, or at least unhelpful.
>> 
>>> In headline generation function, we don't have any index or knowledge of
>>> where the match is. We discover the matches by first tokenizing and then
>>> comparing the matches with the query tokens. So it is hard to do
>>> anything better than first N words.
>> 
>> After looking at the code in wparser_def.c a bit more, I wonder whether
>> this patch is doing what you think it is.  Did you do any profiling to
>> confirm that tokenization is where the cost is?  Because it looks to me
>> like the match searching in hlCover() is at least O(N^2) in the number
>> of tokens in the document, which means it's probably the dominant cost
>> for any long document.  I suspect that your patch helps not so much
>> because it saves tokenization costs as because it bounds the amount of
>> effort spent in hlCover().
>> 
>> I haven't tried to do anything about this, but I wonder whether it
>> wouldn't be possible to eliminate the quadratic blowup by saving more
>> state across the repeated calls to hlCover().  At the very least, it
>> shouldn't be necessary to find the last query-token occurrence in the
>> document from scratch on each and every call.
>> 
>> Actually, this code seems probably flat-out wrong: won't every
>> successful call of hlCover() on a given document return exactly the same
>> q value (end position), namely the last token occurrence in the
>> document?  How is that helpful?
>> 
>> regards, tom lane
>> 

> -- 
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com

>   + It's impossible for everything to be true. +






Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Bruce Momjian

This might indicate that the hlCover() item is resolved.

---

On Wed, Aug 24, 2011 at 10:08:11AM +0530, Sushant Sinha wrote:
> 
> 
> > Actually, this code seems probably flat-out wrong: won't every
> > successful call of hlCover() on a given document return exactly the same
> > q value (end position), namely the last token occurrence in the
> > document?  How is that helpful?
> > 
> >    regards, tom lane
> 
> 
> There is a line that saves the computation state from the previous call, and
> the search only starts from there:
> 
> int pos = *p;
> 
> 
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2012-08-15 Thread Bruce Momjian

Is this a TODO?

---

On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> Sushant Sinha  writes:
> >> Doesn't this force the headline to be taken from the first N words of
> >> the document, independent of where the match was?  That seems rather
> >> unworkable, or at least unhelpful.
> 
> > In headline generation function, we don't have any index or knowledge of
> > where the match is. We discover the matches by first tokenizing and then
> > comparing the matches with the query tokens. So it is hard to do
> > anything better than first N words.
> 
> After looking at the code in wparser_def.c a bit more, I wonder whether
> this patch is doing what you think it is.  Did you do any profiling to
> confirm that tokenization is where the cost is?  Because it looks to me
> like the match searching in hlCover() is at least O(N^2) in the number
> of tokens in the document, which means it's probably the dominant cost
> for any long document.  I suspect that your patch helps not so much
> because it saves tokenization costs as because it bounds the amount of
> effort spent in hlCover().
> 
> I haven't tried to do anything about this, but I wonder whether it
> wouldn't be possible to eliminate the quadratic blowup by saving more
> state across the repeated calls to hlCover().  At the very least, it
> shouldn't be necessary to find the last query-token occurrence in the
> document from scratch on each and every call.
> 
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
> 
>   regards, tom lane
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha
>
> Actually, this code seems probably flat-out wrong: won't every
> successful call of hlCover() on a given document return exactly the same
> q value (end position), namely the last token occurrence in the
> document?  How is that helpful?
>
>regards, tom lane
>

There is a line that saves the computation state from the previous call, and
the search only starts from there:

int pos = *p;


Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Tom Lane
Sushant Sinha  writes:
>> Doesn't this force the headline to be taken from the first N words of
>> the document, independent of where the match was?  That seems rather
>> unworkable, or at least unhelpful.

> In headline generation function, we don't have any index or knowledge of
> where the match is. We discover the matches by first tokenizing and then
> comparing the matches with the query tokens. So it is hard to do
> anything better than first N words.

After looking at the code in wparser_def.c a bit more, I wonder whether
this patch is doing what you think it is.  Did you do any profiling to
confirm that tokenization is where the cost is?  Because it looks to me
like the match searching in hlCover() is at least O(N^2) in the number
of tokens in the document, which means it's probably the dominant cost
for any long document.  I suspect that your patch helps not so much
because it saves tokenization costs as because it bounds the amount of
effort spent in hlCover().

I haven't tried to do anything about this, but I wonder whether it
wouldn't be possible to eliminate the quadratic blowup by saving more
state across the repeated calls to hlCover().  At the very least, it
shouldn't be necessary to find the last query-token occurrence in the
document from scratch on each and every call.
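
For illustration only, here is a standalone C sketch of that kind of saved
state (the struct and helper names are hypothetical and do not correspond to
anything in wparser_def.c): the rightmost query-token position is computed
once and then reused, instead of being rediscovered by every cover search.

#include <stdbool.h>

/* Hypothetical illustration, not PostgreSQL code: match[i] is nonzero when
 * token i of the document matches some query item. */
typedef struct
{
	const int  *match;
	int			ntokens;
	bool		have_last;		/* has last_match been computed yet? */
	int			last_match;		/* rightmost matching token, or -1 if none */
} TokenIndex;

/* Return the rightmost matching token.  The O(ntokens) scan runs at most
 * once, however many cover searches ask for it afterwards. */
static int
last_query_match(TokenIndex *ti)
{
	if (!ti->have_last)
	{
		int			i;

		ti->last_match = -1;
		for (i = ti->ntokens - 1; i >= 0; i--)
		{
			if (ti->match[i])
			{
				ti->last_match = i;
				break;
			}
		}
		ti->have_last = true;
	}
	return ti->last_match;
}

A cover search could then consult this cached position instead of
rediscovering it from scratch on each call, which is the repeated scanning
being pointed at above.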

Actually, this code seems probably flat-out wrong: won't every
successful call of hlCover() on a given document return exactly the same
q value (end position), namely the last token occurrence in the
document?  How is that helpful?

regards, tom lane



Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha

> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper-bound on the headline generation.
> 
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was?  That seems rather
> unworkable, or at least unhelpful.
> 
>   regards, tom lane

In headline generation function, we don't have any index or knowledge of
where the match is. We discover the matches by first tokenizing and then
comparing the matches with the query tokens. So it is hard to do
anything better than first N words.


One option could be to start looking for a "good match" while tokenizing
and then stop once we have found one. Currently the algorithms that
decide what counts as a good match operate independently of the
tokenization, and there are two of them, so integrating them would not
be easy.
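
For what it's worth, here is a rough standalone sketch of that idea (plain C,
not PostgreSQL code, under the simplifying assumption that a "good match"
just means every query term has been seen at least once; the real cover
selection is more involved):

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Standalone sketch: a trivial whitespace tokenizer stands in for the real
 * text search parser.  Tokenization stops as soon as every query term has
 * been seen once, instead of running over the whole document.
 * Assumes nterms <= 8 for brevity. */
static int
parse_until_covered(const char *doc, const char *terms[], int nterms)
{
	bool		seen[8] = {false};
	int			seen_count = 0;
	int			ntokens = 0;
	const char *p = doc;

	while (*p && seen_count < nterms)
	{
		char		tok[64];
		int			len = 0;
		int			i;

		while (*p && isspace((unsigned char) *p))
			p++;
		while (*p && !isspace((unsigned char) *p) && len < (int) sizeof(tok) - 1)
			tok[len++] = (char) tolower((unsigned char) *p++);
		tok[len] = '\0';
		if (len == 0)
			break;
		ntokens++;

		for (i = 0; i < nterms; i++)
		{
			if (!seen[i] && strcmp(tok, terms[i]) == 0)
			{
				seen[i] = true;
				seen_count++;
			}
		}
	}
	return ntokens;				/* tokens parsed before the terms were covered */
}

int
main(void)
{
	const char *terms[] = {"headline", "generation"};
	const char *doc = "restricting the number of parsed words in headline "
		"generation keeps long documents cheap";

	printf("parsed %d tokens\n", parse_until_covered(doc, terms, 2));
	return 0;
}

A real integration would have to feed the partial token stream into the
existing cover/excerpt scoring rather than this toy coverage test.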

The patch is very helpful if you believe in the common-case assumption
that most of the time a good match is at the top of the document.
Typically a search application generates headlines for the top matches
of a query, i.e., those in which the query terms appear frequently, so
there should be at least one or two good text excerpts at the top of
the document.



-Sushant.




Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Tue Aug 23 15:59:18 -0300 2011:
> Sushant Sinha  writes:
> > Given a document and a query, the goal of headline generation is to
> > produce text excerpts in which the query appears.
> 
> ... right ...
> 
> > Here is a simple patch that limits the number of words during the
> > tokenization phase and puts an upper-bound on the headline generation.
> 
> Doesn't this force the headline to be taken from the first N words of
> the document, independent of where the match was?  That seems rather
> unworkable, or at least unhelpful.

Yeah ...

Doesn't a search result include the positions at which the tokens were
found within the document?  Wouldn't it make more sense to improve the
system somehow so that it restricts the search for headlines to the
general area where the tokens were found?
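
As a toy sketch of that kind of windowing (standalone C with hypothetical
inputs; nothing here corresponds to an existing PostgreSQL API), approximate
match positions could be turned into a bounded token range for the headline
parser:

/* Hypothetical sketch: match_pos[] holds approximate token positions where
 * the query matched; compute a bounded window of tokens to parse for the
 * headline instead of handing over the whole document. */
typedef struct
{
	int			start;			/* first token index to parse (inclusive) */
	int			end;			/* last token index to parse (inclusive) */
} ParseWindow;

static ParseWindow
window_around_matches(const int *match_pos, int nmatches,
					  int ntokens, int radius)
{
	ParseWindow w = {0, ntokens - 1};	/* fall back to the whole document */
	int			i;

	if (nmatches > 0)
	{
		int			lo = match_pos[0];
		int			hi = match_pos[0];

		for (i = 1; i < nmatches; i++)
		{
			if (match_pos[i] < lo)
				lo = match_pos[i];
			if (match_pos[i] > hi)
				hi = match_pos[i];
		}
		w.start = (lo > radius) ? lo - radius : 0;
		w.end = (hi + radius < ntokens - 1) ? hi + radius : ntokens - 1;
	}
	return w;
}

The open question is where those positions would come from, since the
headline function today only receives the raw document text and the query.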

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Tom Lane
Sushant Sinha  writes:
> Given a document and a query, the goal of headline generation is to
> produce text excerpts in which the query appears.

... right ...

> Here is a simple patch that limits the number of words during the
> tokenization phase and puts an upper-bound on the headline generation.

Doesn't this force the headline to be taken from the first N words of
the document, independent of where the match was?  That seems rather
unworkable, or at least unhelpful.

regards, tom lane



[HACKERS] text search: restricting the number of parsed words in headline generation

2011-08-23 Thread Sushant Sinha
Given a document and a query, the goal of headline generation is to
produce text excerpts in which the query appears. Currently, headline
generation in Postgres follows these steps:

1. Tokenize the documents and obtain the lexemes
2. Decide on lexemes that should be the part of the headline
3. Generate the headline

So the time taken by headline generation is directly dependent on the
size of the document: the longer the document, the more time it takes to
tokenize and the more lexemes there are to operate on.

Most of the time is spent in the tokenization phase, so for very big
documents headline generation is very expensive.

Here is a simple patch that limits the number of words during the
tokenization phase and puts an upper-bound on the headline generation.
The headline function takes a parameter, MaxParsedWords. If this
parameter is negative or not supplied, the entire document is tokenized
and operated on (the current behavior). However, if MaxParsedWords is a
positive number, tokenization stops once MaxParsedWords tokens have been
obtained, and the rest of headline generation operates on the tokens
gathered up to that point.

The current patch can be applied to 9.1rc1. It lacks changes to the
documentation and test cases. I will add them if you folks agree on the
functionality.

-Sushant.
diff -ru postgresql-9.1rc1/src/backend/tsearch/ts_parse.c postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c
--- postgresql-9.1rc1/src/backend/tsearch/ts_parse.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c	2011-08-23 21:27:10.0 +0530
@@ -525,10 +525,11 @@
 }
 
 void
-hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen)
+hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen, int max_parsed_words)
 {
 	int			type,
-				lenlemm;
+				lenlemm,
+				numparsed = 0;
 	char	   *lemm = NULL;
 	LexizeData	ldata;
 	TSLexeme   *norms;
@@ -580,8 +581,8 @@
 			else
 				addHLParsedLex(prs, query, lexs, NULL);
 		} while (norms);
-
-	} while (type > 0);
+		numparsed += 1;
+	} while (type > 0 && (max_parsed_words < 0 || numparsed < max_parsed_words));
 
 	FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
diff -ru postgresql-9.1rc1/src/backend/tsearch/wparser.c postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c
--- postgresql-9.1rc1/src/backend/tsearch/wparser.c	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c	2011-08-23 21:30:12.0 +0530
@@ -304,6 +304,8 @@
 	text	   *out;
 	TSConfigCacheEntry *cfg;
 	TSParserCacheEntry *prsobj;
+	ListCell   *l;
+	int			max_parsed_words = -1;
 
 	cfg = lookup_ts_config_cache(PG_GETARG_OID(0));
 	prsobj = lookup_ts_parser_cache(cfg->prsId);
@@ -317,13 +319,21 @@
 	prs.lenwords = 32;
 	prs.words = (HeadlineWordEntry *) palloc(sizeof(HeadlineWordEntry) * prs.lenwords);
 
-	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ);
 
 	if (opt)
 		prsoptions = deserialize_deflist(PointerGetDatum(opt));
 	else
 		prsoptions = NIL;
 
+	foreach(l, prsoptions)
+	{
+		DefElem    *defel = (DefElem *) lfirst(l);
+		char	   *val = defGetString(defel);
+		if (pg_strcasecmp(defel->defname, "MaxParsedWords") == 0)
+			max_parsed_words = pg_atoi(val, sizeof(int32), 0);
+	}
+
+	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ, max_parsed_words);
 	FunctionCall3(&(prsobj->prsheadline),
 				  PointerGetDatum(&prs),
 				  PointerGetDatum(prsoptions),
diff -ru postgresql-9.1rc1/src/include/tsearch/ts_utils.h postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h
--- postgresql-9.1rc1/src/include/tsearch/ts_utils.h	2011-08-19 02:53:13.0 +0530
+++ postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h	2011-08-23 21:04:14.0 +0530
@@ -98,7 +98,7 @@
  */
 
 extern void hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query,
-			char *buf, int4 buflen);
+			char *buf, int4 buflen, int max_parsed_words);
 extern text *generateHeadline(HeadlineParsedText *prs);
 
 /*
