On Sunday 01 November 2009 16:19:43 Andres Freund wrote: > While playing around/evaluating tsearch I notices that to_tsvector is > obscenely slow for some files. After some profiling I found that this is > due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If > a multibyte encoding is in use TParserInit copies the whole remaining > input and converts it to wchar_t or pg_wchar - for every email or protocol > prefixed url in the the document. Which obviously is bad. > > I solved the issue by having a seperate TParserCopyInit/TParserCopyClose > which reuses the the already converted strings of the original TParser - > only at different offsets. > > Another approach would be to get rid of the separate parser invocations - > requiring a bunch of additional states. This seemed more complex to me, so > I wanted to get some feedback first. > > Without patch: > andres=# SELECT to_tsvector('english', document) FROM document WHERE > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > ────────────────────────────────────────────────────────────────────────── > ─────────────────────────── ... > (1 row) > > Time: 5835.676 ms > > With patch: > andres=# SELECT to_tsvector('english', document) FROM document WHERE > filename = '/usr/share/doc/libdrm-nouveau1/changelog'; > > ────────────────────────────────────────────────────────────────────────── > ─────────────────────────── ... > (1 row) > > Time: 395.341 ms > > Ill cleanup the patch if it seems like a sensible solution... As nobody commented here is a corrected (stupid thinko) and cleaned up version. Anyone cares to comment whether I am the only one thinking this is an issue?
Andres
From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Sun, 8 Nov 2009 16:30:42 +0100 Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings --- src/backend/tsearch/wparser_def.c | 63 ++++++++++++++++++++++++++++++++++-- 1 files changed, 59 insertions(+), 4 deletions(-) diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c index 301c1eb..7bbd826 100644 --- a/src/backend/tsearch/wparser_def.c +++ b/src/backend/tsearch/wparser_def.c @@ -328,6 +328,41 @@ TParserInit(char *str, int len) return prs; } +/* + * As an alternative to a full TParserInit one can create a + * TParserCopy which basically is a normally TParser without a private + * copy of the string - instead it uses the one from another TParser. + * This is usefull because at some places TParsers are created + * recursively and the repeated copying around of the strings can + * cause major inefficiency. + * Obviously one may not close the original TParser before the copy. + */ +static TParser * +TParserCopyInit(const TParser const* orig) +{ + TParser *prs = (TParser *) palloc0(sizeof(TParser)); + + prs->charmaxlen = orig->charmaxlen; + prs->usewide = orig->usewide; + prs->lenstr = orig->lenstr - orig->state->posbyte; + + prs->str = orig->str + orig->state->posbyte; + if(orig->pgwstr) + prs->pgwstr = orig->pgwstr + orig->state->poschar; + if(orig->wstr) + prs->wstr = orig->wstr + orig->state->poschar; + + prs->state = newTParserPosition(NULL); + prs->state->state = TPS_Base; + +#ifdef WPARSER_TRACE + fprintf(stderr, "parsing copy \"%.*s\"\n", len, str); +#endif + + return prs; +} + + static void TParserClose(TParser *prs) { @@ -350,6 +385,26 @@ TParserClose(TParser *prs) } /* + * See TParserCopyInit + */ +static void +TParserCopyClose(TParser *prs) +{ + while (prs->state) + { + TParserPosition *ptr = prs->state->prev; + + pfree(prs->state); + prs->state = ptr; + } +#ifdef WPARSER_TRACE + fprintf(stderr, "closing parser copy"); +#endif + pfree(prs); +} + + +/* * Character-type support functions, equivalent to is* macros, but * working with any possible encodings and locales. Notes: * - with multibyte encoding and C-locale isw* function may fail @@ -617,7 +672,7 @@ p_isignore(TParser *prs) static int p_ishost(TParser *prs) { - TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); + TParser *tmpprs = TParserCopyInit(prs); int res = 0; tmpprs->wanthost = true; @@ -631,7 +686,7 @@ p_ishost(TParser *prs) prs->state->charlen = tmpprs->state->charlen; res = 1; } - TParserClose(tmpprs); + TParserCopyClose(tmpprs); return res; } @@ -639,7 +694,7 @@ p_ishost(TParser *prs) static int p_isURLPath(TParser *prs) { - TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte); + TParser *tmpprs = TParserCopyInit(prs); int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); @@ -654,7 +709,7 @@ p_isURLPath(TParser *prs) prs->state->charlen = tmpprs->state->charlen; res = 1; } - TParserClose(tmpprs); + TParserCopyClose(tmpprs); return res; } -- 1.6.5.12.gd65df24
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers