Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around/evaluating tsearch I noticed that to_tsvector is
> obscenely slow for some files. After some profiling I found that this is
> due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c.
> If a multibyte encoding is in use, TParserInit copies the whole remaining
> input and converts it to wchar_t or pg_wchar - for every email or
> protocol-prefixed url in the document. Which obviously is bad.
>
> I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> which reuses the already converted strings of the original TParser - only
> at different offsets. Another approach would be to get rid of the separate
> parser invocations - requiring a bunch of additional states. This seemed
> more complex to me, so I wanted to get some feedback first.
>
> Without patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> ...
> (1 row)
> Time: 5835.676 ms
>
> With patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> ...
> (1 row)
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...

As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one thinking
this is an issue?
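For readers skimming the patch below, the core idea - point the temporary parser at the parent's already-converted wide-character buffer at an offset, instead of reconverting the remaining input every time - can be sketched in isolation roughly like this (a simplified, hypothetical mock-up with made-up names; the fields only loosely mirror wparser_def.c, and this is not the actual backend code):

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Simplified, hypothetical stand-in for the backend's TParser. */
typedef struct MiniParser
{
	const char *str;		/* multibyte input (not owned) */
	int			lenstr;		/* bytes remaining */
	wchar_t    *wstr;		/* wide-char conversion of str */
	int			posbyte;	/* current byte offset */
	int			poschar;	/* current character offset */
	int			owns_wstr;	/* did we allocate wstr ourselves? */
} MiniParser;

/* Full init: converts the whole input - cost proportional to len. */
static MiniParser *
parser_init(const char *str, int len)
{
	MiniParser *prs = calloc(1, sizeof(*prs));

	prs->str = str;
	prs->lenstr = len;
	prs->wstr = calloc(len + 1, sizeof(wchar_t));
	mbstowcs(prs->wstr, str, len);	/* the expensive conversion */
	prs->owns_wstr = 1;
	return prs;
}

/* Copy init: reuse the parent's already-converted buffer at the
 * parent's current offset - no conversion, no second copy. */
static MiniParser *
parser_copy_init(const MiniParser *orig)
{
	MiniParser *prs = calloc(1, sizeof(*prs));

	prs->str = orig->str + orig->posbyte;
	prs->lenstr = orig->lenstr - orig->posbyte;
	prs->wstr = orig->wstr + orig->poschar;	/* shared, offset view */
	prs->owns_wstr = 0;
	return prs;
}

/* Close either kind; only the root parser frees the converted buffer,
 * so a copy must be closed before its parent. */
static void
parser_close(MiniParser *prs)
{
	if (prs->owns_wstr)
		free(prs->wstr);
	free(prs);
}
```

Since the full init's conversion cost is proportional to the remaining input, re-running it at every host/URL-path probe makes parsing quadratic in the worst case; the copy variant does no conversion at all, which is where the roughly 15x speedup above comes from.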
Andres

From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sun, 8 Nov 2009 16:30:42 +0100
Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings

---
 src/backend/tsearch/wparser_def.c |   63 ++--
 1 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 301c1eb..7bbd826 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -328,6 +328,41 @@ TParserInit(char *str, int len)
 	return prs;
 }
 
+/*
+ * As an alternative to a full TParserInit one can create a
+ * TParserCopy which basically is a normal TParser without a private
+ * copy of the string - instead it uses the one from another TParser.
+ * This is useful because in some places TParsers are created
+ * recursively and the repeated copying around of the strings can
+ * cause major inefficiency.
+ * Obviously one may not close the original TParser before the copy.
+ */
+static TParser *
+TParserCopyInit(const TParser *orig)
+{
+	TParser    *prs = (TParser *) palloc0(sizeof(TParser));
+
+	prs->charmaxlen = orig->charmaxlen;
+	prs->usewide = orig->usewide;
+	prs->lenstr = orig->lenstr - orig->state->posbyte;
+
+	prs->str = orig->str + orig->state->posbyte;
+	if (orig->pgwstr)
+		prs->pgwstr = orig->pgwstr + orig->state->poschar;
+	if (orig->wstr)
+		prs->wstr = orig->wstr + orig->state->poschar;
+
+	prs->state = newTParserPosition(NULL);
+	prs->state->state = TPS_Base;
+
+#ifdef WPARSER_TRACE
+	fprintf(stderr, "parsing copy \"%.*s\"\n", prs->lenstr, prs->str);
+#endif
+
+	return prs;
+}
+
+
 static void
 TParserClose(TParser *prs)
 {
@@ -350,6 +385,26 @@ TParserClose(TParser *prs)
 }
 
 /*
+ * See TParserCopyInit
+ */
+static void
+TParserCopyClose(TParser *prs)
+{
+	while (prs->state)
+	{
+		TParserPosition *ptr = prs->state->prev;
+
+		pfree(prs->state);
+		prs->state = ptr;
+	}
+#ifdef WPARSER_TRACE
+	fprintf(stderr, "closing parser copy\n");
+#endif
+	pfree(prs);
+}
+
+
+/*
  * Character-type support functions, equivalent to is* macros, but
  * working with any possible encodings and locales. Notes:
  *	- with multibyte encoding and C-locale isw* function may fail
@@ -617,7 +672,7 @@ p_isignore(TParser *prs)
 static int
 p_ishost(TParser *prs)
 {
-	TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+	TParser    *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs->wanthost = true;
@@ -631,7 +686,7 @@ p_ishost(TParser *prs)
 		prs->state->charlen = tmpprs->state->charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 	return res;
 }
 
@@ -639,7 +694,7 @@ p_ishost(TParser *prs)
 static int
 p_isURLPath(TParser *prs)
 {
-	TParser    *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+	TParser    *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs->state = newTParserPosition(tmpprs->state);
@@ -654,7 +709,7 @@ p_isURLPath(TParser *prs)
 		prs->state->charlen = tmpprs->state->charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 	return res;
 }
-- 
1.6.5.12.gd65df24

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> As nobody commented, here is a corrected (stupid thinko) and cleaned-up
> version. Does anyone care to comment on whether I am the only one
> thinking this is an issue?
>
> Andres

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken
Re: [HACKERS] Why do OLD and NEW have special internal names?
Tom,

> BTW, this brings up another point, which is that up to now it's often
> been possible to use plpgsql variable names that conflict with
> core-parser reserved words, so long as you didn't need to use the
> reserved word with its special meaning. That will stop working when
> this patch goes in. Doesn't bother me any, but if anyone thinks it's a
> serious problem, speak now.

Sounds like a potential *big* blocker to upgrading; anyone with several
thousand lines of plpgsql can't really afford to refactor away all of the
accidental uses of reserved words. That being said, reusing reserved words
in this way was always wonky, so I'm not sure how many people will have
done so.

Best way is to commit it to alpha3, and try to get people to test.

--Josh Berkus
Re: [HACKERS] Typed tables
On Thu, 2009-11-05 at 19:24 +0200, Peter Eisentraut wrote:
> This is useful in conjunction with PL/Proxy and similar RPC-type setups.
> On the frontend/proxy instances you only create the type, and on the
> backend instances you create the storage for the type, and the database
> system would give you a little support keeping them in sync. Think
> interface and implementation.

Not sure I see why this is good. Why is issuing CREATE TYPE so much easier
than using CREATE TABLE? Is it worth the extra syntax and code to support
it? Can we do anything additional as a result of this? Is this required by
the standard or are we going past the standard?

-- 
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] operator exclusion constraints
On Sun, 2009-11-08 at 13:41 -0800, Jeff Davis wrote:
> On Sat, 2009-11-07 at 10:56 -0800, Jeff Davis wrote:
>> EXCLUDE probably flows most nicely with the optional USING clause or
>> without. My only complaint was that it's a transitive verb, so it
>> seems to impart more meaning than it actually can. I doubt anyone
>> would actually be more confused in practice, though. If a couple of
>> people agree, I'll change it to EXCLUDE.
>
> It looks like EXCLUDE is the winner. Updated patch attached. The
> feature is still called "operator exclusion constraints", and the docs
> still make reference to that name, but the syntax specification has
> been updated.

Don't think that name is very useful either... sounds like you want to
exclude operators, which is why I got lost in the first place. I'd call
them "generic exclusion constraints" or "user-defined exclusion
constraints". Sorry for this.

-- 
Simon Riggs
www.2ndQuadrant.com
Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
> On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
>> As nobody commented, here is a corrected (stupid thinko) and
>> cleaned-up version. Does anyone care to comment on whether I am the
>> only one thinking this is an issue?
>
> +1
>
> As a user of tsearch, I can certainly appreciate the speed-up in
> parsing -- more CPU for everyone else.

Please note that this is mostly an issue when using rather long documents
including either email addresses (i.e. an @) or links with protocol
prefixes (like ?+://) - so it might not give you personally a benefit :-(

Andres
Re: [HACKERS] Specific names for plpgsql variable-resolution control options?
Tom Lane wrote:
> David E. Wheeler <da...@kineticode.com> writes:
>> That works.
>> plpgsql_variable_conflict = fatal | variable-first | column-first
>
> If we do that, presumably the per-function syntax would be
> #variable_conflict variable_first
> and so on, which is clear enough but might be thought a bit verbose for
> something people might be pasting into hundreds of functions.

If there's some way to paste it automatically (like, say, an appropriate
UPDATE incantation on pg_proc) then that doesn't seem like an important
problem.

-- 
Alvaro Herrera                       http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Specific names for plpgsql variable-resolution control options?
Alvaro Herrera <alvhe...@commandprompt.com> writes:
> If there's some way to paste it automatically (like, say, an
> appropriate UPDATE incantation on pg_proc) then that doesn't seem like
> an important problem.

True, you could do UPDATE pg_proc SET prosrc = 'foo' || prosrc WHERE
something-appropriate.

		regards, tom lane
Re: [HACKERS] Specific names for plpgsql variable-resolution control options?
On Fri, Nov 06, 2009 at 07:09:46PM -0800, David E. Wheeler wrote:
> On Nov 6, 2009, at 6:57 PM, Tom Lane wrote:
>> If we do that, presumably the per-function syntax would be
>> #variable_conflict variable_first
>> and so on, which is clear enough but might be thought a bit verbose
>> for something people might be pasting into hundreds of functions.
>
> I suspect that most folks will set the GUC and few will actually use it
> in functions.

Just to be clear about the semantics; what should happen if the user
doesn't specify a value for the function? Should PG remember the GUC value
at creation time, or pull it in at invocation time? I'd lean towards
fixing it at function creation time as it'd be one more caveat for
security definer functions otherwise.

-- 
Sam  http://samason.me.uk/
[HACKERS] next CommitFest
The next CommitFest is scheduled to start in a week. So far, it doesn't
look too bad, though a lot could change between now and then.

I would personally prefer not to be involved in the management of the next
CommitFest. Having done all of the July CommitFest and a good chunk of the
September CommitFest, I am feeling a bit burned out.

One pretty major fly in the ointment is that neither Hot Standby nor
Streaming Replication has been committed or shows much sign of being about
to be committed. I think this is bad. These are big features that figure
to have some bugs and break some things. If they're not committed in time
for alpha3, then there won't be any significant testing of these prior to
alpha4/beta1, at the earliest. I think that's likely to lead to either (1)
a very long beta period followed by a late release or (2) a buggy release.
I feel like Simon Riggs and Fujii Masao really pulled out all the stops to
get these ready in time for the September CommitFest, and while I'm not in
a hurry to break the world, I think the sooner these can hit the tree, the
better off we'll be in terms of releasing 8.5.

Just my $0.02,

...Robert
Re: [HACKERS] next CommitFest
Robert Haas <robertmh...@gmail.com> writes:
> I would personally prefer not to be involved in the management of the
> next CommitFest. Having done all of the July CommitFest and a good
> chunk of the September CommitFest, I am feeling a bit burned out.

You did yeoman work on both --- thanks for that!

Do we have another volunteer to do this for the November fest?

		regards, tom lane
Re: [HACKERS] Typed tables
>> This is useful in conjunction with PL/Proxy and similar RPC-type
>> setups. On the frontend/proxy instances you only create the type, and
>> on the backend instances you create the storage for the type, and the
>> database system would give you a little support keeping them in sync.
>> Think interface and implementation.
>
> Not sure I see why this is good. Why is issuing CREATE TYPE so much
> easier than using CREATE TABLE? Is it worth the extra syntax and code
> to support it? Can we do anything additional as a result of this? Is
> this required by the standard or are we going past the standard?

+1. I'd like to hear from Peter why this is necessary in the first place.

-- 
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] operator exclusion constraints
On Sun, 2009-11-08 at 22:03 +0000, Simon Riggs wrote:
> Don't think that name is very useful either... sounds like you want to
> exclude operators, which is why I got lost in the first place. I'd call
> them "generic exclusion constraints" or "user-defined exclusion
> constraints". Sorry for this.

Either of those names are fine with me, too. The current name is a
somewhat shortened version of the name "operator-based exclusion
constraints", so we can also just use that name. Or, just "exclusion
constraints".

I'll leave the current patch as-is for now, and wait for some reviewer
comments. This is purely a documentation issue, so there are bound to be a
few of these things that I can clarify at once.

Regards,
	Jeff Davis